Posted to issues@metron.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2016/04/26 15:39:13 UTC

[jira] [Commented] (METRON-119) Move the PCAP topology from HBase

    [ https://issues.apache.org/jira/browse/METRON-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15258103#comment-15258103 ] 

ASF GitHub Bot commented on METRON-119:
---------------------------------------

GitHub user cestella opened a pull request:

    https://github.com/apache/incubator-metron/pull/89

    METRON-119 Move PCAP infrastructure from HBase

    As it stands, the existing approach to handling PCAP data has trouble with high-volume packet capture. With the advent of a DPDK plugin for capturing packet data, we will hit throughput limitations on consumption if we continue to push packet data into HBase at line speed.
    
    Furthermore, storing PCAP data in HBase limits the range of filter queries we can perform (i.e., only those expressible within the row key).  As of now, we require all fields to be present (source IP/port, destination IP/port, and protocol) rather than allowing wildcards.
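    To make that key limitation concrete, here is a hedged illustration (the actual Metron key layout may differ; the class name and key shape below are assumptions, not the real implementation) of why a composite HBase row key forces fully specified queries:
    
    ```java
    import org.apache.hadoop.hbase.util.Bytes;
    
    // Illustrative only: with a composite row key shaped like
    // srcIp:srcPort:dstIp:dstPort:protocol, HBase scans are efficient
    // only over key prefixes, so every field must be supplied.
    public class PcapKeyExample {
      static byte[] key(String srcIp, int srcPort, String dstIp, int dstPort, int protocol) {
        return Bytes.toBytes(srcIp + ":" + srcPort + ":" + dstIp + ":" + dstPort + ":" + protocol);
      }
    
      public static void main(String[] args) {
        // A fully specified query maps to a narrow scan over this key...
        System.out.println(Bytes.toString(key("10.0.0.1", 1234, "10.0.0.2", 80, 6)));
        // ...but wildcarding an interior field (e.g. "any srcPort") leaves
        // only the srcIp prefix usable, degenerating into a wide scan.
      }
    }
    ```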
    
    To address these issues, we should create a higher-performance topology which attaches the appropriate header to the raw packet and timestamp read from Kafka (as placed onto Kafka by the packet capture sensor) and appends the packet to a sequence file in HDFS.  The sequence file will be rolled based on the number of packets or on time (e.g., one hour's worth of packets per sequence file).
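    As a sketch of that write path (class, field, and path names here are assumptions for illustration, not Metron's actual implementation), the topology's HDFS writer could look roughly like this:
    
    ```java
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    
    // Minimal sketch: append (timestamp, raw packet) pairs to an HDFS
    // sequence file and roll the file on a packet-count or age threshold.
    public class RollingPcapWriter {
      private final Configuration conf = new Configuration();
      private final long maxPackets;  // roll after this many packets
      private final long maxAgeMs;    // or after this much time (e.g. 1 hr)
      private SequenceFile.Writer writer;
      private long count = 0;
      private long openedAtMs = 0;
    
      public RollingPcapWriter(long maxPackets, long maxAgeMs) {
        this.maxPackets = maxPackets;
        this.maxAgeMs = maxAgeMs;
      }
    
      public synchronized void write(long tsNanos, byte[] rawPacket) throws Exception {
        long now = System.currentTimeMillis();
        if (writer == null || count >= maxPackets || now - openedAtMs >= maxAgeMs) {
          roll(now);
        }
        // Key: packet timestamp from Kafka; value: raw packet bytes.
        writer.append(new LongWritable(tsNanos), new BytesWritable(rawPacket));
        count++;
      }
    
      private void roll(long now) throws Exception {
        if (writer != null) {
          writer.close();
        }
        writer = SequenceFile.createWriter(conf,
            SequenceFile.Writer.file(new Path("/apps/metron/pcap/pcap_" + now)),
            SequenceFile.Writer.keyClass(LongWritable.class),
            SequenceFile.Writer.valueClass(BytesWritable.class));
        count = 0;
        openedAtMs = now;
      }
    }
    ```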
    
    On the query side, we should adjust the middle-tier service layer to start an MR job over the appropriate set of sequence files to filter for the matching packets.  NOTE: the UI modifications to make this reasonable for the end user will need to be done in a follow-on JIRA.
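    For that query side, a hedged mapper sketch (class names and the filter helper are assumptions; the real job would wire the filter from the service layer's query) might look like:
    
    ```java
    import java.io.IOException;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapreduce.Mapper;
    
    // Scan the sequence files and emit only packets whose parsed
    // src/dst IP, port, and protocol match the query.
    public class PcapFilterMapper
        extends Mapper<LongWritable, BytesWritable, LongWritable, BytesWritable> {
    
      @Override
      protected void map(LongWritable ts, BytesWritable packet, Context context)
          throws IOException, InterruptedException {
        if (matches(packet.copyBytes())) {
          context.write(ts, packet);  // matching packets pass through
        }
        // Non-matching packets are simply dropped.
      }
    
      private boolean matches(byte[] rawPacket) {
        // Placeholder: real code would parse the IP/TCP headers from the
        // raw bytes and compare against the requested filter fields.
        return true;
      }
    }
    ```
    
    The matched packets would then be collected and wrapped with a global PCAP header so the result opens directly in Wireshark, as described in the test steps below.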
    
    To test this PR, I would suggest the following "happy path":
    
    1. Install the pycapa library & utility via the instructions [here](https://github.com/apache/incubator-metron/tree/master/metron-sensors/pycapa)
    2. (if using the single-node Vagrant) Kill the enrichment and sensor topologies via `for i in bro enrichment yaf snort;do storm kill $i;done`
    3. Start the pcap topology via `/usr/metron/0.1BETA/bin/start_pcap_topology.sh`
    4. Start the pycapa packet capture producer on eth1 via `/usr/bin/pycapa --producer --topic pcap -i eth1 -k node1:6667`
    5. Watch the topology in the [Storm UI](http://node1:8744/index.html) and, once the number of packets ingested exceeds 1k, kill the packet capture utility started in step 4.
    6. Ensure that at least 2 files exist on HDFS by running `hadoop fs -ls /apps/metron/pcap`
    7. Choose a file (denoted by $FILE) and dump a few of its entries using the `pcap_inspector` utility via `/usr/metron/0.1BETA/bin/pcap_inspector.sh -i $FILE -n 5`
    8. Choose one of the lines and note the source IP/port and destination IP/port.
    9. Go to the Kibana app at [http://node1:5000](http://node1:5000) on the single-node Vagrant (YMMV on EC2) and input those source/destination fields as the query in the Kibana PCAP panel.
    10. Wait while the MR job completes; the results are returned as a valid PCAP payload suitable for opening in Wireshark.
    11. Open the payload in Wireshark to ensure it is valid.
    
    If the payload is not valid PCAP, please look at the [job history](http://node1:19888/jobhistory) and note the reason for the job failure, if any.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cestella/incubator-metron pcap_extraction_topology

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-metron/pull/89.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #89
    
----
commit fee3d0d327fccba464a62c75029cdc733a8d2a56
Author: cstella <ce...@gmail.com>
Date:   2016-02-26T14:41:37Z

    Adding pcap infrastructure

commit d6da175e7b7072c585650d5604c7dd886177962c
Author: cstella <ce...@gmail.com>
Date:   2016-02-26T15:47:08Z

    Updating kafka component to be more featureful.

commit 0a9dba8771939c28f65efd585fa501a4b0a4125b
Author: cstella <ce...@gmail.com>
Date:   2016-02-26T22:09:56Z

    Updating topology and integration test.

commit 374b391c29621aac26fd9ef3ad54872f78a6e960
Author: cstella <ce...@gmail.com>
Date:   2016-03-01T15:00:58Z

    Updating integration test.

commit 88d3d1572ca4b812c2438c8219434f2f19f1467d
Author: cstella <ce...@gmail.com>
Date:   2016-03-02T20:34:33Z

    Fixed weird situation with HDFS, made the callback handle multiple partitions, added licenses

commit d485bfa7415fbc8a5c34c3cf327468ce37b07847
Author: cstella <ce...@gmail.com>
Date:   2016-03-03T01:18:50Z

    Updating topology.

commit 8a9706bf012c8041e3a8ea08e8924073cde887f1
Author: cstella <ce...@gmail.com>
Date:   2016-03-03T02:22:11Z

    Merging can be fun, but this one was not.  Merging in master with some overlapping files from my feature branch that made their way into master via another feature.

commit d99cb74892ac2624d77895368874de76edd274d8
Author: cstella <ce...@gmail.com>
Date:   2016-03-14T15:35:55Z

    Merging from master.

commit 3f8daa693decc815c4c0328be9dc6994ae8a4310
Author: cstella <ce...@gmail.com>
Date:   2016-03-14T17:59:56Z

    Updating component runner and integration test.

commit 86771b087d4ef38f87333be5027c4935fa79173e
Author: cstella <ce...@gmail.com>
Date:   2016-03-16T21:00:38Z

    Integrating a proper integration test and service layer call.

commit 3cd17f1823b92661426bf21ea618c50cbb1ae2bf
Author: cstella <ce...@gmail.com>
Date:   2016-03-17T19:36:31Z

    Updating integration test.

commit 52fb7b28163267d4e321a5becc1c4a8e73eff3ea
Author: cstella <ce...@gmail.com>
Date:   2016-03-18T13:05:25Z

    Updating integration test.

commit 6f1e24f96f3fa96319337fda6385babee4ed2abb
Author: cstella <ce...@gmail.com>
Date:   2016-03-18T15:06:10Z

    Updating classpath issues.

commit ae8a5c1f55de5daa467bae7d32977175efc5b4bb
Author: cstella <ce...@gmail.com>
Date:   2016-04-05T19:18:01Z

    Merged master into feature branch.

commit 542ee9e19b9ef2c371f95a8143cad307f6a44347
Author: cstella <ce...@gmail.com>
Date:   2016-04-07T13:35:21Z

    merged master in.

commit 3705c4719b73613c1d8f559672bbbbb31b14ff02
Author: cstella <ce...@gmail.com>
Date:   2016-04-07T17:37:03Z

    Reverting some very bad things that I did.

commit c7f837704f17510ed3881066fd9b50a3ed889f2b
Author: cstella <ce...@gmail.com>
Date:   2016-04-07T21:42:47Z

    Fixing spout config and integration test

commit b25cdaad2cf59f6448fbca368f2c5b0103750735
Author: cstella <ce...@gmail.com>
Date:   2016-04-08T14:35:01Z

    Making this work with pycappa as well.

commit 182c151901de23b6d98435762276cd2802e685ba
Author: cstella <ce...@gmail.com>
Date:   2016-04-08T15:09:36Z

    Updating integration test to work with timestamp in the key as well as timestamp pulled from the data.

commit cc02302f8c4c55b380f3fbbf018ff21e74570819
Author: cstella <ce...@gmail.com>
Date:   2016-04-08T15:34:30Z

    Moved around some stuff and realized I was not using unsigned comparisons.

commit e0d47a5aa94500b0954ae12449a270a5a2022830
Author: cstella <ce...@gmail.com>
Date:   2016-04-11T13:52:41Z

    Headerizing in the converter.

commit 69f49959c470f1b73eb6d579661bcdc257c7010b
Author: cstella <ce...@gmail.com>
Date:   2016-04-11T13:56:42Z

    Still have some weird serialization error, but will fix shortly.

commit f30595d151b823d23e1c8682343aafab6c45a30d
Author: cstella <ce...@gmail.com>
Date:   2016-04-11T20:10:14Z

    Updating converters to implement serializable.

commit 09004e1f4566d4aad4ef349d6ddb013e1991c4b2
Author: cstella <ce...@gmail.com>
Date:   2016-04-19T12:45:10Z

    Merge branch 'master' into pcap_extraction_topology

commit f52e57968b94591f0750659c3546403cd8d56e79
Author: cstella <ce...@gmail.com>
Date:   2016-04-19T21:01:34Z

    Updating next gen pcap to include a notion of endianness that is configurable.

commit bce86caf5047d9fbb42995b90d6e1d1842ee3cb2
Author: cstella <ce...@gmail.com>
Date:   2016-04-19T21:16:22Z

    Added licenses.

commit f8dc3460c6678ba5c0e83e0d7cb21dce854810bc
Author: cstella <ce...@gmail.com>
Date:   2016-04-19T21:30:41Z

    updated licenses and added a global_shade_version because the one in Metron-Common was very old.

commit cb1288697de8da0cbfd6fc3b253ac3cbb40f698e
Author: cstella <ce...@gmail.com>
Date:   2016-04-20T12:54:56Z

    Merge branch 'master' into pcap_extraction_topology

commit dfc3558496740d2429e755a7a23ca18943601e9f
Author: cstella <ce...@gmail.com>
Date:   2016-04-20T16:08:04Z

    Moving stuff out of common.

commit f6e2567f21ef698531568593383ac732c7670a18
Author: cstella <ce...@gmail.com>
Date:   2016-04-20T20:17:39Z

    We don't need to be configurable for the endianness..I can figure that out from the JVM.

----


> Move the PCAP topology from HBase
> ---------------------------------
>
>                 Key: METRON-119
>                 URL: https://issues.apache.org/jira/browse/METRON-119
>             Project: Metron
>          Issue Type: Improvement
>            Reporter: Casey Stella
>            Assignee: Casey Stella
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As it stands, the existing approach to handling PCAP data has trouble with high-volume packet capture. With the advent of a DPDK plugin for capturing packet data, we will hit throughput limitations on consumption if we continue to push packet data into HBase at line speed.
> Furthermore, storing PCAP data in HBase limits the range of filter queries we can perform (i.e., only those expressible within the row key). As of now, we require all fields to be present (source IP/port, destination IP/port, and protocol) rather than allowing wildcards.
> To address these issues, we should create a higher-performance topology which attaches the appropriate header to the raw packet and timestamp read from Kafka (as placed onto Kafka by the packet capture sensor) and appends the packet to a sequence file in HDFS. The sequence file will be rolled based on the number of packets or on time (e.g., one hour's worth of packets per sequence file).
> On the query side, we should adjust the middle-tier service layer to start an MR job over the appropriate set of sequence files to filter for the matching packets. NOTE: the UI modifications to make this reasonable for the end user will need to be done in a follow-on JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)