You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@metron.apache.org by mmiklavc <gi...@git.apache.org> on 2017/06/08 16:57:34 UTC

[GitHub] metron pull request #614: METRON-992: Create performance tuning guide

GitHub user mmiklavc opened a pull request:

    https://github.com/apache/metron/pull/614

    METRON-992: Create performance tuning guide

    https://issues.apache.org/jira/browse/METRON-992
    
    This guide covers performance tuning the Metron topologies. I will be leaving this up for a month or so to gather additional insight and feedback from the community.
    
    I have a few additional tweaks to make around formatting and links to other README files, but I wanted to get this in front of the committers sooner than later.
    
    ## Pull Request Checklist
    
    Thank you for submitting a contribution to Apache Metron.  
    Please refer to our [Development Guidelines](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61332235) for the complete guide to follow for contributions.  
    Please refer also to our [Build Verification Guidelines](https://cwiki.apache.org/confluence/display/METRON/Verifying+Builds?show-miniview) for complete smoke testing guides.  
    
    
    In order to streamline the review of the contribution we ask you follow these guidelines and ask you to double check the following:
    
    ### For all changes:
    - [x] Is there a JIRA ticket associated with this PR? If not one needs to be created at [Metron Jira](https://issues.apache.org/jira/browse/METRON/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel). 
    - [x] Does your PR title start with METRON-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
    - [x] Has your PR been rebased against the latest commit within the target branch (typically master)?
    
    ### For documentation related changes:
    - [ ] Have you ensured that format looks appropriate for the output in which it is rendered by building and verifying the site-book? If not then run the following commands and the verify changes via `site-book/target/site/index.html`:
    
      ```
      cd site-book
      mvn site
      ```
    
    #### Note:
    Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.
    It is also recommended that [travis-ci](https://travis-ci.org) is set up for your personal repository such that your branches are built there before submitting a pull request.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mmiklavc/metron METRON-992

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/metron/pull/614.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #614
    
----
commit 2f2386c57750cdbf2da2780f001eae0db9f28a9c
Author: Michael Miklavcic <mi...@gmail.com>
Date:   2017-06-08T16:53:36Z

    METRON-992: Create performance tuning guide

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] metron issue #614: METRON-992: Create performance tuning guide

Posted by dlyle65535 <gi...@git.apache.org>.

Github user dlyle65535 commented on the issue:

    https://github.com/apache/metron/pull/614
  
    Still +1 after the recent commits. Thanks again, Mike!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] metron pull request #614: METRON-992: Create performance tuning guide

Posted by mattf-horton <gi...@git.apache.org>.

Github user mattf-horton commented on a diff in the pull request:

    https://github.com/apache/metron/pull/614#discussion_r121001410
  
    --- Diff: metron-platform/Performance-tuning-guide.md ---
    @@ -0,0 +1,326 @@
    +# Metron Performance Tunining Guide
    +
    +## Overview
    +
    +This document provides guidance from our experiences tuning the Apache Metron Storm topologies for maximum performance. You'll find
    +suggestions for optimum configurations under a 1 gbps load along with some guidance around the tooling we used to monitor and assess
    +our throughput.
    +
    +In the simplest terms, Metron is a streaming architecture created on top of Kafka and three main types of Storm topologies: parsers,
    +enrichment, and indexing. Each parser has it's own topology and there is also a highly performant, specialized spout-only topology
    +for streaming PCAP data to HDFS. We found that the architecture can be tuned almost exclusively through using a few primary Storm and
    +Kafka parameters along with a few Metron-specific options. You can think of the data flow as being similar to water flowing through a
    +pipe, and the majority of these options assist in tweaking the various pipe widths in the system.
    +
    +## General Suggestions
    +
    +Note that there is currently no method for specifying the number of tasks from the number of executors in Flux topologies (enrichment,
    + indexing). By default, the number of tasks will equal the number of executors. Logically, setting the number of tasks equal to the number
    +of executors is sensible. Storm enforces # executors <= # tasks. The reason you might set the number of tasks higher than the number of
    +executors is for future performance tuning and rebalancing without the need to bring down your topologies. The number of tasks is fixed
    +at topology startup time whereas the number of executors can be increased up to a maximum value equal to the number of tasks.
    +
    +We found that the default values for poll.timeout.ms, offset.commit.period.ms, and max.uncommitted.offsets worked well in nearly all cases.
    +As a general rule, it was optimal to set spout parallelism equal to the number of partitions used in your Kafka topic. Any greater
    +parallelism will leave you with idle consumers since Kafka limits the max number of consumers to the number of partitions. This is
    +important because Kafka has certain ordering guarantees for message delivery per partition that would not be possible if more than
    +one consumer in a given consumer group were able to read from that partition.
    +
    +## Tooling
    +
    +Before we get to the actual tooling used to monitor performance, it helps to describe what we might actually want to monitor and potential
    +pain points. Prior to switching over to the new Storm Kafka client, which leverages the new Kafka consumer API under the hood, offsets
    +were stored in Zookeeper. While the broker hosts are still stored in Zookeeper, this is no longer true for the offsets which are now
    +stored in Kafka itself. This is a configurable option, and you may switch back to Zookeeper if you choose, but Metron is currently using
    +the new defaults. This is useful to know as you're investigating both correctness as well as throughput performance.
    +
    +First we need to setup some environment variables
    +```
    +export BROKERLIST=<your broker comma-delimated list of host:ports>
    +export ZOOKEEPER=<your zookeeper comma-delimated list of host:ports>
    +export KAFKA_HOME=<kafka home dir>
    +export METRON_HOME=<your metron home>
    +export HDP_HOME=<your HDP home>
    +```
    +
    +If you have Kerberos enabled, setup the security protocol
    +```
    +$ cat /tmp/consumergroup.config
    +security.protocol=SASL_PLAINTEXT
    +```
    +
    +Now run the following command for a running topology's consumer group. In this example we are using enrichments.
    +```
    +${KAFKA_HOME}/bin/kafka-consumer-groups.sh \
    +    --command-config=/tmp/consumergroup.config \
    +    --describe \
    +    --group enrichments \
    +    --bootstrap-server $BROKERLIST \
    +    --new-consumer
    +```
    +
    +This will return a table with the following output depicting offsets for all partitions and consumers associated with the specified
    +consumer group:
    +```
    +GROUP                          TOPIC              PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             OWNER
    +enrichments                    enrichments        9          29746066        29746067        1               consumer-2_/xxx.xxx.xxx.xxx
    +enrichments                    enrichments        3          29754325        29754326        1               consumer-1_/xxx.xxx.xxx.xxx
    +enrichments                    enrichments        43         29754331        29754332        1               consumer-6_/xxx.xxx.xxx.xxx
    +...
    +```
    +
    +_Note_: You won't see any output until a topology is actually running because the consumer groups only exist while consumers in the
    +spouts are up and running.
    +
    +The primary column we're concerned with paying attention to is the LAG column, which is the current delta calculation between the
    +current and end offset for the partition. This tells us how close we are to keeping up with incoming data. And, as we found through
    +multiple trials, whether there are any problems with specific consumers getting stuck.
    +
    +Taking this one step further, it's probably more useful if we can watch the offsets and lags change over time. In order to do this
    +we'll add a "watch" command and set the refresh rate to 10 seconds.
    +
    +```
    +watch -n 10 -d ${KAFKA_HOME}/bin/kafka-consumer-groups.sh \
    +    --command-config=/tmp/consumergroup.config \
    +    --describe \
    +    --group enrichments \
    +    --bootstrap-server $BROKERLIST \
    +    --new-consumer
    +```
    +
    +Every 10 seconds the command will re-run and the screen will be refreshed with new information. The most useful bit is that the
    +watch command will highlight the differences from the current output and the last output screens.
    +
    +We can also monitor our Storm topologies by using the Storm UI - see http://www.malinga.me/reading-and-understanding-the-storm-ui-storm-ui-explained/
    +
    +And lastly, you can leverage some GUI tooling to make creating and modifying your Kafka topics a bit easier -
    +see https://github.com/yahoo/kafka-manager
    +
    +## General Knobs and Levers
    --- End diff --
    
    BTW, it may not be obvious to all reviewers that if you press or right-click the "View" button (to the right of the filename in the Files Changed tab) on an .md file under review, you can review the WYSIWYG (Github-MD formatted) version of the file.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] metron pull request #614: METRON-992: Create performance tuning guide

Posted by simonellistonball <gi...@git.apache.org>.

Github user simonellistonball commented on a diff in the pull request:

https://github.com/apache/metron/pull/614#discussion_r120961066

--- Diff: metron-platform/Performance-tuning-guide.md ---
@@ -0,0 +1,326 @@
+# Metron Performance Tunining Guide
+
+## Overview
+
+This document provides guidance from our experiences tuning the Apache Metron Storm topologies for maximum performance. You'll find
+suggestions for optimum configurations under a 1 gbps load along with some guidance around the tooling we used to monitor and assess
+our throughput.
+
+In the simplest terms, Metron is a streaming architecture created on top of Kafka and three main types of Storm topologies: parsers,
+enrichment, and indexing. Each parser has it's own topology and there is also a highly performant, specialized spout-only topology
+for streaming PCAP data to HDFS. We found that the architecture can be tuned almost exclusively through using a few primary Storm and
+Kafka parameters along with a few Metron-specific options. You can think of the data flow as being similar to water flowing through a
+pipe, and the majority of these options assist in tweaking the various pipe widths in the system.
+
+## General Suggestions
+
+Note that there is currently no method for specifying the number of tasks from the number of executors in Flux topologies (enrichment,
+ indexing). By default, the number of tasks will equal the number of executors. Logically, setting the number of tasks equal to the number
+of executors is sensible. Storm enforces # executors <= # tasks. The reason you might set the number of tasks higher than the number of
+executors is for future performance tuning and rebalancing without the need to bring down your topologies. The number of tasks is fixed
+at topology startup time whereas the number of executors can be increased up to a maximum value equal to the number of tasks.
+
+We found that the default values for poll.timeout.ms, offset.commit.period.ms, and max.uncommitted.offsets worked well in nearly all cases.
+As a general rule, it was optimal to set spout parallelism equal to the number of partitions used in your Kafka topic. Any greater
+parallelism will leave you with idle consumers since Kafka limits the max number of consumers to the number of partitions. This is
+important because Kafka has certain ordering guarantees for message delivery per partition that would not be possible if more than
+one consumer in a given consumer group were able to read from that partition.
+
+## Tooling
+
+Before we get to the actual tooling used to monitor performance, it helps to describe what we might actually want to monitor and potential
+pain points. Prior to switching over to the new Storm Kafka client, which leverages the new Kafka consumer API under the hood, offsets
+were stored in Zookeeper. While the broker hosts are still stored in Zookeeper, this is no longer true for the offsets which are now
+stored in Kafka itself. This is a configurable option, and you may switch back to Zookeeper if you choose, but Metron is currently using
+the new defaults. This is useful to know as you're investigating both correctness as well as throughput performance.
+
+First we need to setup some environment variables
+```
+export BROKERLIST=<your broker comma-delimated list of host:ports>
+export ZOOKEEPER=<your zookeeper comma-delimated list of host:ports>
+export KAFKA_HOME=<kafka home dir>
+export METRON_HOME=<your metron home>
+export HDP_HOME=<your HDP home>
+```
+
+If you have Kerberos enabled, setup the security protocol
+```
+$ cat /tmp/consumergroup.config
+security.protocol=SASL_PLAINTEXT
+```
+
+Now run the following command for a running topology's consumer group. In this example we are using enrichments.
+```
+${KAFKA_HOME}/bin/kafka-consumer-groups.sh \
+ --command-config=/tmp/consumergroup.config \
+ --describe \
+ --group enrichments \
+ --bootstrap-server $BROKERLIST \
+ --new-consumer
+```
+
+This will return a table with the following output depicting offsets for all partitions and consumers associated with the specified
+consumer group:
+```
+GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG OWNER
+enrichments enrichments 9 29746066 29746067 1 consumer-2_/xxx.xxx.xxx.xxx
+enrichments enrichments 3 29754325 29754326 1 consumer-1_/xxx.xxx.xxx.xxx
+enrichments enrichments 43 29754331 29754332 1 consumer-6_/xxx.xxx.xxx.xxx
+...
+```
+
+_Note_: You won't see any output until a topology is actually running because the consumer groups only exist while consumers in the
+spouts are up and running.
+
+The primary column we're concerned with paying attention to is the LAG column, which is the current delta calculation between the
+current and end offset for the partition. This tells us how close we are to keeping up with incoming data. And, as we found through
+multiple trials, whether there are any problems with specific consumers getting stuck.
+
+Taking this one step further, it's probably more useful if we can watch the offsets and lags change over time. In order to do this
+we'll add a "watch" command and set the refresh rate to 10 seconds.
+
+```
+watch -n 10 -d ${KAFKA_HOME}/bin/kafka-consumer-groups.sh \
+ --command-config=/tmp/consumergroup.config \
+ --describe \
+ --group enrichments \
+ --bootstrap-server $BROKERLIST \
+ --new-consumer
+```
+
+Every 10 seconds the command will re-run and the screen will be refreshed with new information. The most useful bit is that the
+watch command will highlight the differences from the current output and the last output screens.
+
+We can also monitor our Storm topologies by using the Storm UI - see http://www.malinga.me/reading-and-understanding-the-storm-ui-storm-ui-explained/
+
+And lastly, you can leverage some GUI tooling to make creating and modifying your Kafka topics a bit easier -
+see https://github.com/yahoo/kafka-manager
+
+## General Knobs and Levers
+
+Kafka
+ - # partitions
+Storm
+ Kafka
+ - polling frequency and timeouts
+ - # workers
+ - ackers
+ - max spout pending
+ - spout parallelism
+ - bolt parallelism
+ - # executors
+Metron
+ - bolt cache size - handles how many messages can be cached. This cache is used while waiting for all parts of the message to be rejoined.
+
+## Topologies
+
+### Parsers
+
+The parsers and PCAP use a builder utility, as opposed to enrichments and indexing, which use Flux.
+
+We set the number of partitions for our inbound Kafka topics to 48.
+
+```
+$ cat ~metron/.storm/storm-bro.config
+
+{
+ ...
+ "topology.max.spout.pending" : 2000
+ ...
+}
+```
+
+These are the spout recommended defaults from Storm and are currently the defaults provided in the Kafka spout itself.
+In fact, if you find the recommended defaults work fine for you, then this file might not be necessary at all.
+```
+$ cat ~/.storm/spout-bro.config
+{
+ ...
+ "spout.pollTimeoutMs" : 200,
+ "spout.maxUncommittedOffsets" : 10000000,
+ "spout.offsetCommitPeriodMs" : 30000
+}
+```
+
+We ran our bro parser topology with the following options
--- End diff --

Is there any explanation of how we got to these numbers? There will be heavily dependent on hardware, particularly cpu and disk configuration, so the context could be very important.

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] metron issue #614: METRON-992: Create performance tuning guide

Posted by dlyle65535 <gi...@git.apache.org>.

Github user dlyle65535 commented on the issue:

    https://github.com/apache/metron/pull/614
  
    +1. Used this with a largish instance, worked well, thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] metron pull request #614: METRON-992: Create performance tuning guide

Posted by mmiklavc <gi...@git.apache.org>.

Github user mmiklavc commented on a diff in the pull request:

https://github.com/apache/metron/pull/614#discussion_r120979346

Woops, yes.

[GitHub] metron pull request #614: METRON-992: Create performance tuning guide

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/metron/pull/614


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] metron pull request #614: METRON-992: Create performance tuning guide

Posted by justinleet <gi...@git.apache.org>.

Github user justinleet commented on a diff in the pull request:

    https://github.com/apache/metron/pull/614#discussion_r120964036
  
    --- Diff: metron-platform/Performance-tuning-guide.md ---
    @@ -0,0 +1,326 @@
    +# Metron Performance Tunining Guide
    --- End diff --
    
    "Tunining" -> "Tuning"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] metron pull request #614: METRON-992: Create performance tuning guide

Posted by simonellistonball <gi...@git.apache.org>.

Github user simonellistonball commented on a diff in the pull request:

    https://github.com/apache/metron/pull/614#discussion_r120961228
  
    --- Diff: metron-platform/Performance-tuning-guide.md ---
    @@ -0,0 +1,326 @@
    +# Metron Performance Tunining Guide
    +
    +## Overview
    +
    +This document provides guidance from our experiences tuning the Apache Metron Storm topologies for maximum performance. You'll find
    +suggestions for optimum configurations under a 1 gbps load along with some guidance around the tooling we used to monitor and assess
    +our throughput.
    +
    +In the simplest terms, Metron is a streaming architecture created on top of Kafka and three main types of Storm topologies: parsers,
    +enrichment, and indexing. Each parser has it's own topology and there is also a highly performant, specialized spout-only topology
    +for streaming PCAP data to HDFS. We found that the architecture can be tuned almost exclusively through using a few primary Storm and
    +Kafka parameters along with a few Metron-specific options. You can think of the data flow as being similar to water flowing through a
    +pipe, and the majority of these options assist in tweaking the various pipe widths in the system.
    +
    +## General Suggestions
    +
    +Note that there is currently no method for specifying the number of tasks from the number of executors in Flux topologies (enrichment,
    + indexing). By default, the number of tasks will equal the number of executors. Logically, setting the number of tasks equal to the number
    +of executors is sensible. Storm enforces # executors <= # tasks. The reason you might set the number of tasks higher than the number of
    +executors is for future performance tuning and rebalancing without the need to bring down your topologies. The number of tasks is fixed
    +at topology startup time whereas the number of executors can be increased up to a maximum value equal to the number of tasks.
    +
    +We found that the default values for poll.timeout.ms, offset.commit.period.ms, and max.uncommitted.offsets worked well in nearly all cases.
    +As a general rule, it was optimal to set spout parallelism equal to the number of partitions used in your Kafka topic. Any greater
    +parallelism will leave you with idle consumers since Kafka limits the max number of consumers to the number of partitions. This is
    +important because Kafka has certain ordering guarantees for message delivery per partition that would not be possible if more than
    +one consumer in a given consumer group were able to read from that partition.
    +
    +## Tooling
    +
    +Before we get to the actual tooling used to monitor performance, it helps to describe what we might actually want to monitor and potential
    +pain points. Prior to switching over to the new Storm Kafka client, which leverages the new Kafka consumer API under the hood, offsets
    +were stored in Zookeeper. While the broker hosts are still stored in Zookeeper, this is no longer true for the offsets which are now
    +stored in Kafka itself. This is a configurable option, and you may switch back to Zookeeper if you choose, but Metron is currently using
    +the new defaults. This is useful to know as you're investigating both correctness as well as throughput performance.
    +
    +First we need to setup some environment variables
    +```
    +export BROKERLIST=<your broker comma-delimated list of host:ports>
    +export ZOOKEEPER=<your zookeeper comma-delimated list of host:ports>
    +export KAFKA_HOME=<kafka home dir>
    +export METRON_HOME=<your metron home>
    +export HDP_HOME=<your HDP home>
    +```
    +
    +If you have Kerberos enabled, setup the security protocol
    +```
    +$ cat /tmp/consumergroup.config
    +security.protocol=SASL_PLAINTEXT
    +```
    +
    +Now run the following command for a running topology's consumer group. In this example we are using enrichments.
    +```
    +${KAFKA_HOME}/bin/kafka-consumer-groups.sh \
    +    --command-config=/tmp/consumergroup.config \
    +    --describe \
    +    --group enrichments \
    +    --bootstrap-server $BROKERLIST \
    +    --new-consumer
    +```
    +
    +This will return a table with the following output depicting offsets for all partitions and consumers associated with the specified
    +consumer group:
    +```
    +GROUP                          TOPIC              PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             OWNER
    +enrichments                    enrichments        9          29746066        29746067        1               consumer-2_/xxx.xxx.xxx.xxx
    +enrichments                    enrichments        3          29754325        29754326        1               consumer-1_/xxx.xxx.xxx.xxx
    +enrichments                    enrichments        43         29754331        29754332        1               consumer-6_/xxx.xxx.xxx.xxx
    +...
    +```
    +
    +_Note_: You won't see any output until a topology is actually running because the consumer groups only exist while consumers in the
    +spouts are up and running.
    +
    +The primary column we're concerned with paying attention to is the LAG column, which is the current delta calculation between the
    +current and end offset for the partition. This tells us how close we are to keeping up with incoming data. And, as we found through
    +multiple trials, whether there are any problems with specific consumers getting stuck.
    +
    +Taking this one step further, it's probably more useful if we can watch the offsets and lags change over time. In order to do this
    +we'll add a "watch" command and set the refresh rate to 10 seconds.
    +
    +```
    +watch -n 10 -d ${KAFKA_HOME}/bin/kafka-consumer-groups.sh \
    +    --command-config=/tmp/consumergroup.config \
    +    --describe \
    +    --group enrichments \
    +    --bootstrap-server $BROKERLIST \
    +    --new-consumer
    +```
    +
    +Every 10 seconds the command will re-run and the screen will be refreshed with new information. The most useful bit is that the
    +watch command will highlight the differences from the current output and the last output screens.
    +
    +We can also monitor our Storm topologies by using the Storm UI - see http://www.malinga.me/reading-and-understanding-the-storm-ui-storm-ui-explained/
    +
    +And lastly, you can leverage some GUI tooling to make creating and modifying your Kafka topics a bit easier -
    +see https://github.com/yahoo/kafka-manager
    +
    +## General Knobs and Levers
    +
    +Kafka
    +    - # partitions
    +Storm
    +    Kafka
    +        - polling frequency and timeouts
    +    - # workers
    +    - ackers
    +    - max spout pending
    +    - spout parallelism
    +    - bolt parallelism
    +    - # executors
    +Metron
    +    - bolt cache size - handles how many messages can be cached. This cache is used while waiting for all parts of the message to be rejoined.
    +
    +## Topologies
    +
    +### Parsers
    +
    +The parsers and PCAP use a builder utility, as opposed to enrichments and indexing, which use Flux.
    +
    +We set the number of partitions for our inbound Kafka topics to 48.
    +
    +```
    +$ cat ~metron/.storm/storm-bro.config
    +
    +{
    +    ...
    +    "topology.max.spout.pending" : 2000
    +    ...
    +}
    +```
    +
    +These are the spout recommended defaults from Storm and are currently the defaults provided in the Kafka spout itself.
    +In fact, if you find the recommended defaults work fine for you, then this file might not be necessary at all.
    +```
    +$ cat ~/.storm/spout-bro.config
    +{
    +    ...
    +    "spout.pollTimeoutMs" : 200,
    +    "spout.maxUncommittedOffsets" : 10000000,
    +    "spout.offsetCommitPeriodMs" : 30000
    +}
    +```
    +
    +We ran our bro parser topology with the following options
    +
    +```
    +/usr/metron/0.4.0/bin/start_parser_topology.sh -k $BROKERLIST -z $ZOOKEEPER -s bro -ksp SASL_PLAINTEXT
    +    -ot enrichments
    +    -e ~metron/.storm/storm-bro.config \
    +    -esc ~/.storm/spout-bro.config \
    +    -sp 24 \
    +    -snt 24 \
    +    -nw 1 \
    +    -pnt 24 \
    +    -pp 24 \
    +```
    +
    +From the usage docs, here are the options we've used. The full reference can be found here - https://github.com/apache/metron/blob/master/metron-platform/metron-parsers/README.md
    +```
    +-e,--extra_topology_options <JSON_FILE>        Extra options in the form
    +                                               of a JSON file with a map
    +                                               for content.
    +-esc,--extra_kafka_spout_config <JSON_FILE>    Extra spout config options
    +                                               in the form of a JSON file
    +                                               with a map for content.
    +                                               Possible keys are:
    +                                               retryDelayMaxMs,retryDelay
    +                                               Multiplier,retryInitialDel
    +                                               ayMs,stateUpdateIntervalMs
    +                                               ,bufferSizeBytes,fetchMaxW
    +                                               ait,fetchSizeBytes,maxOffs
    +                                               etBehind,metricsTimeBucket
    +                                               SizeInSecs,socketTimeoutMs
    +-sp,--spout_p <SPOUT_PARALLELISM_HINT>         Spout Parallelism Hint
    +-snt,--spout_num_tasks <NUM_TASKS>             Spout Num Tasks
    +-nw,--num_workers <NUM_WORKERS>                Number of Workers
    +-pnt,--parser_num_tasks <NUM_TASKS>            Parser Num Tasks
    +-pp,--parser_p <PARALLELISM_HINT>              Parser Parallelism Hint
    +```
    +
    +### Enrichment
    +
    +Kafka - partitions setup
    +    bro topic set to 48 partitions (referenced in the parser settings above)
    --- End diff --
    
    worth indicating why 48, or generalising to a function, or reference to number above


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] metron pull request #614: METRON-992: Create performance tuning guide

Posted by mmiklavc <gi...@git.apache.org>.

Github user mmiklavc commented on a diff in the pull request:

    https://github.com/apache/metron/pull/614#discussion_r120979059
  
    --- Diff: metron-platform/Performance-tuning-guide.md ---
    @@ -0,0 +1,326 @@
    +# Metron Performance Tunining Guide
    +
    +## Overview
    +
    +This document provides guidance from our experiences tuning the Apache Metron Storm topologies for maximum performance. You'll find
    +suggestions for optimum configurations under a 1 gbps load along with some guidance around the tooling we used to monitor and assess
    +our throughput.
    +
    +In the simplest terms, Metron is a streaming architecture created on top of Kafka and three main types of Storm topologies: parsers,
    +enrichment, and indexing. Each parser has it's own topology and there is also a highly performant, specialized spout-only topology
    +for streaming PCAP data to HDFS. We found that the architecture can be tuned almost exclusively through using a few primary Storm and
    +Kafka parameters along with a few Metron-specific options. You can think of the data flow as being similar to water flowing through a
    +pipe, and the majority of these options assist in tweaking the various pipe widths in the system.
    +
    +## General Suggestions
    +
    +Note that there is currently no method for specifying the number of tasks from the number of executors in Flux topologies (enrichment,
    + indexing). By default, the number of tasks will equal the number of executors. Logically, setting the number of tasks equal to the number
    +of executors is sensible. Storm enforces # executors <= # tasks. The reason you might set the number of tasks higher than the number of
    +executors is for future performance tuning and rebalancing without the need to bring down your topologies. The number of tasks is fixed
    +at topology startup time whereas the number of executors can be increased up to a maximum value equal to the number of tasks.
    +
    +We found that the default values for poll.timeout.ms, offset.commit.period.ms, and max.uncommitted.offsets worked well in nearly all cases.
    +As a general rule, it was optimal to set spout parallelism equal to the number of partitions used in your Kafka topic. Any greater
    +parallelism will leave you with idle consumers since Kafka limits the max number of consumers to the number of partitions. This is
    +important because Kafka has certain ordering guarantees for message delivery per partition that would not be possible if more than
    +one consumer in a given consumer group were able to read from that partition.
    +
    +## Tooling
    +
    +Before we get to the actual tooling used to monitor performance, it helps to describe what we might actually want to monitor and potential
    +pain points. Prior to switching over to the new Storm Kafka client, which leverages the new Kafka consumer API under the hood, offsets
    +were stored in Zookeeper. While the broker hosts are still stored in Zookeeper, this is no longer true for the offsets which are now
    +stored in Kafka itself. This is a configurable option, and you may switch back to Zookeeper if you choose, but Metron is currently using
    +the new defaults. This is useful to know as you're investigating both correctness as well as throughput performance.
    +
    +First we need to setup some environment variables
    +```
    +export BROKERLIST=<your broker comma-delimated list of host:ports>
    +export ZOOKEEPER=<your zookeeper comma-delimated list of host:ports>
    +export KAFKA_HOME=<kafka home dir>
    +export METRON_HOME=<your metron home>
    +export HDP_HOME=<your HDP home>
    +```
    +
    +If you have Kerberos enabled, setup the security protocol
    +```
    +$ cat /tmp/consumergroup.config
    +security.protocol=SASL_PLAINTEXT
    +```
    +
    +Now run the following command for a running topology's consumer group. In this example we are using enrichments.
    +```
    +${KAFKA_HOME}/bin/kafka-consumer-groups.sh \
    +    --command-config=/tmp/consumergroup.config \
    +    --describe \
    +    --group enrichments \
    +    --bootstrap-server $BROKERLIST \
    +    --new-consumer
    +```
    +
    +This will return a table with the following output depicting offsets for all partitions and consumers associated with the specified
    +consumer group:
    +```
    +GROUP                          TOPIC              PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             OWNER
    +enrichments                    enrichments        9          29746066        29746067        1               consumer-2_/xxx.xxx.xxx.xxx
    +enrichments                    enrichments        3          29754325        29754326        1               consumer-1_/xxx.xxx.xxx.xxx
    +enrichments                    enrichments        43         29754331        29754332        1               consumer-6_/xxx.xxx.xxx.xxx
    +...
    +```
    +
    +_Note_: You won't see any output until a topology is actually running because the consumer groups only exist while consumers in the
    +spouts are up and running.
    +
    +The primary column we're concerned with paying attention to is the LAG column, which is the current delta calculation between the
    +current and end offset for the partition. This tells us how close we are to keeping up with incoming data. And, as we found through
    +multiple trials, whether there are any problems with specific consumers getting stuck.
    +
    +Taking this one step further, it's probably more useful if we can watch the offsets and lags change over time. In order to do this
    +we'll add a "watch" command and set the refresh rate to 10 seconds.
    +
    +```
    +watch -n 10 -d ${KAFKA_HOME}/bin/kafka-consumer-groups.sh \
    +    --command-config=/tmp/consumergroup.config \
    +    --describe \
    +    --group enrichments \
    +    --bootstrap-server $BROKERLIST \
    +    --new-consumer
    +```
    +
    +Every 10 seconds the command will re-run and the screen will be refreshed with new information. The most useful bit is that the
    +watch command will highlight the differences from the current output and the last output screens.
    +
    +We can also monitor our Storm topologies by using the Storm UI - see http://www.malinga.me/reading-and-understanding-the-storm-ui-storm-ui-explained/
    +
    +And lastly, you can leverage some GUI tooling to make creating and modifying your Kafka topics a bit easier -
    +see https://github.com/yahoo/kafka-manager
    +
    +## General Knobs and Levers
    +
    +Kafka
    +    - # partitions
    +Storm
    +    Kafka
    +        - polling frequency and timeouts
    +    - # workers
    +    - ackers
    +    - max spout pending
    +    - spout parallelism
    +    - bolt parallelism
    +    - # executors
    +Metron
    +    - bolt cache size - handles how many messages can be cached. This cache is used while waiting for all parts of the message to be rejoined.
    +
    +## Topologies
    +
    +### Parsers
    +
    +The parsers and PCAP use a builder utility, as opposed to enrichments and indexing, which use Flux.
    +
    +We set the number of partitions for our inbound Kafka topics to 48.
    +
    +```
    +$ cat ~metron/.storm/storm-bro.config
    +
    +{
    +    ...
    +    "topology.max.spout.pending" : 2000
    +    ...
    +}
    +```
    +
    +These are the spout recommended defaults from Storm and are currently the defaults provided in the Kafka spout itself.
    +In fact, if you find the recommended defaults work fine for you, then this file might not be necessary at all.
    +```
    +$ cat ~/.storm/spout-bro.config
    +{
    +    ...
    +    "spout.pollTimeoutMs" : 200,
    +    "spout.maxUncommittedOffsets" : 10000000,
    +    "spout.offsetCommitPeriodMs" : 30000
    +}
    +```
    +
    +We ran our bro parser topology with the following options
    +
    +```
    +/usr/metron/0.4.0/bin/start_parser_topology.sh -k $BROKERLIST -z $ZOOKEEPER -s bro -ksp SASL_PLAINTEXT
    +    -ot enrichments
    +    -e ~metron/.storm/storm-bro.config \
    +    -esc ~/.storm/spout-bro.config \
    +    -sp 24 \
    +    -snt 24 \
    +    -nw 1 \
    +    -pnt 24 \
    +    -pp 24 \
    +```
    +
    +From the usage docs, here are the options we've used. The full reference can be found here - https://github.com/apache/metron/blob/master/metron-platform/metron-parsers/README.md
    +```
    +-e,--extra_topology_options <JSON_FILE>        Extra options in the form
    +                                               of a JSON file with a map
    +                                               for content.
    +-esc,--extra_kafka_spout_config <JSON_FILE>    Extra spout config options
    +                                               in the form of a JSON file
    +                                               with a map for content.
    +                                               Possible keys are:
    +                                               retryDelayMaxMs,retryDelay
    +                                               Multiplier,retryInitialDel
    +                                               ayMs,stateUpdateIntervalMs
    +                                               ,bufferSizeBytes,fetchMaxW
    +                                               ait,fetchSizeBytes,maxOffs
    +                                               etBehind,metricsTimeBucket
    +                                               SizeInSecs,socketTimeoutMs
    +-sp,--spout_p <SPOUT_PARALLELISM_HINT>         Spout Parallelism Hint
    +-snt,--spout_num_tasks <NUM_TASKS>             Spout Num Tasks
    +-nw,--num_workers <NUM_WORKERS>                Number of Workers
    +-pnt,--parser_num_tasks <NUM_TASKS>            Parser Num Tasks
    +-pp,--parser_p <PARALLELISM_HINT>              Parser Parallelism Hint
    +```
    +
    +### Enrichment
    +
    +Kafka - partitions setup
    +    bro topic set to 48 partitions (referenced in the parser settings above)
    --- End diff --
    
    Agreed - the Kafka optimizations began with accounting for the number of disks and nodes as a starting point for experimenting further. I'll add some color around that as well.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] metron issue #614: METRON-992: Create performance tuning guide

Posted by cestella <gi...@git.apache.org>.

Github user cestella commented on the issue:

    https://github.com/apache/metron/pull/614
  
    +1 by inspection; this is right on the money and a good first pass.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] metron issue #614: METRON-992: Create performance tuning guide

Posted by mmiklavc <gi...@git.apache.org>.

Github user mmiklavc commented on the issue:

    https://github.com/apache/metron/pull/614
  
    Any other additions, comments, questions, concerns on this? I will close this out tomorrow unless I hear otherwise.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] metron pull request #614: METRON-992: Create performance tuning guide

Posted by mattf-horton <gi...@git.apache.org>.

Github user mattf-horton commented on a diff in the pull request:

https://github.com/apache/metron/pull/614#discussion_r121000712

Please spell out "number of", rather than using '#' as an abbreviation, since the sharp sign conflicts with markdown syntax. Better yet, does it make sense to actually give the config parameter name? Or a statement of where and what one changes in the Ambari UI? Thanks.

[GitHub] metron pull request #614: METRON-992: Create performance tuning guide

Posted by justinleet <gi...@git.apache.org>.

Github user justinleet commented on a diff in the pull request:

    https://github.com/apache/metron/pull/614#discussion_r120964479
  
    --- Diff: metron-platform/Performance-tuning-guide.md ---
    @@ -0,0 +1,326 @@
    +# Metron Performance Tunining Guide
    +
    +## Overview
    +
    +This document provides guidance from our experiences tuning the Apache Metron Storm topologies for maximum performance. You'll find
    +suggestions for optimum configurations under a 1 gbps load along with some guidance around the tooling we used to monitor and assess
    +our throughput.
    +
    +In the simplest terms, Metron is a streaming architecture created on top of Kafka and three main types of Storm topologies: parsers,
    +enrichment, and indexing. Each parser has it's own topology and there is also a highly performant, specialized spout-only topology
    +for streaming PCAP data to HDFS. We found that the architecture can be tuned almost exclusively through using a few primary Storm and
    +Kafka parameters along with a few Metron-specific options. You can think of the data flow as being similar to water flowing through a
    +pipe, and the majority of these options assist in tweaking the various pipe widths in the system.
    +
    +## General Suggestions
    +
    +Note that there is currently no method for specifying the number of tasks from the number of executors in Flux topologies (enrichment,
    + indexing). By default, the number of tasks will equal the number of executors. Logically, setting the number of tasks equal to the number
    +of executors is sensible. Storm enforces # executors <= # tasks. The reason you might set the number of tasks higher than the number of
    +executors is for future performance tuning and rebalancing without the need to bring down your topologies. The number of tasks is fixed
    +at topology startup time whereas the number of executors can be increased up to a maximum value equal to the number of tasks.
    +
    +We found that the default values for poll.timeout.ms, offset.commit.period.ms, and max.uncommitted.offsets worked well in nearly all cases.
    +As a general rule, it was optimal to set spout parallelism equal to the number of partitions used in your Kafka topic. Any greater
    +parallelism will leave you with idle consumers since Kafka limits the max number of consumers to the number of partitions. This is
    +important because Kafka has certain ordering guarantees for message delivery per partition that would not be possible if more than
    +one consumer in a given consumer group were able to read from that partition.
    +
    +## Tooling
    +
    +Before we get to the actual tooling used to monitor performance, it helps to describe what we might actually want to monitor and potential
    +pain points. Prior to switching over to the new Storm Kafka client, which leverages the new Kafka consumer API under the hood, offsets
    +were stored in Zookeeper. While the broker hosts are still stored in Zookeeper, this is no longer true for the offsets which are now
    +stored in Kafka itself. This is a configurable option, and you may switch back to Zookeeper if you choose, but Metron is currently using
    +the new defaults. This is useful to know as you're investigating both correctness as well as throughput performance.
    +
    +First we need to setup some environment variables
    +```
    +export BROKERLIST=<your broker comma-delimated list of host:ports>
    +export ZOOKEEPER=<your zookeeper comma-delimated list of host:ports>
    +export KAFKA_HOME=<kafka home dir>
    +export METRON_HOME=<your metron home>
    +export HDP_HOME=<your HDP home>
    +```
    +
    +If you have Kerberos enabled, setup the security protocol
    +```
    +$ cat /tmp/consumergroup.config
    +security.protocol=SASL_PLAINTEXT
    +```
    +
    +Now run the following command for a running topology's consumer group. In this example we are using enrichments.
    +```
    +${KAFKA_HOME}/bin/kafka-consumer-groups.sh \
    +    --command-config=/tmp/consumergroup.config \
    +    --describe \
    +    --group enrichments \
    +    --bootstrap-server $BROKERLIST \
    +    --new-consumer
    +```
    +
    +This will return a table with the following output depicting offsets for all partitions and consumers associated with the specified
    +consumer group:
    +```
    +GROUP                          TOPIC              PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             OWNER
    +enrichments                    enrichments        9          29746066        29746067        1               consumer-2_/xxx.xxx.xxx.xxx
    +enrichments                    enrichments        3          29754325        29754326        1               consumer-1_/xxx.xxx.xxx.xxx
    +enrichments                    enrichments        43         29754331        29754332        1               consumer-6_/xxx.xxx.xxx.xxx
    +...
    +```
    +
    +_Note_: You won't see any output until a topology is actually running because the consumer groups only exist while consumers in the
    +spouts are up and running.
    +
    +The primary column we're concerned with paying attention to is the LAG column, which is the current delta calculation between the
    +current and end offset for the partition. This tells us how close we are to keeping up with incoming data. And, as we found through
    +multiple trials, whether there are any problems with specific consumers getting stuck.
    +
    +Taking this one step further, it's probably more useful if we can watch the offsets and lags change over time. In order to do this
    +we'll add a "watch" command and set the refresh rate to 10 seconds.
    +
    +```
    +watch -n 10 -d ${KAFKA_HOME}/bin/kafka-consumer-groups.sh \
    +    --command-config=/tmp/consumergroup.config \
    +    --describe \
    +    --group enrichments \
    +    --bootstrap-server $BROKERLIST \
    +    --new-consumer
    +```
    +
    +Every 10 seconds the command will re-run and the screen will be refreshed with new information. The most useful bit is that the
    +watch command will highlight the differences from the current output and the last output screens.
    +
    +We can also monitor our Storm topologies by using the Storm UI - see http://www.malinga.me/reading-and-understanding-the-storm-ui-storm-ui-explained/
    +
    +And lastly, you can leverage some GUI tooling to make creating and modifying your Kafka topics a bit easier -
    +see https://github.com/yahoo/kafka-manager
    +
    +## General Knobs and Levers
    --- End diff --
    
    This displays pretty poorly right now.  Can you make this a bit, add newlines, or something similar?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---