Posted to issues@metron.apache.org by jagdeepsingh2 <gi...@git.apache.org> on 2018/10/24 05:38:55 UTC

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

GitHub user jagdeepsingh2 opened a pull request:

    https://github.com/apache/metron/pull/1245

    METRON-1795: Initial Commit for Regular Expressions Parser

    ## Contributor Comments
    Contributing a new general purpose regular expressions based parser.
    
    
    ## Pull Request Checklist
    
    Thank you for submitting a contribution to Apache Metron.  
    Please refer to our [Development Guidelines](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61332235) for the complete guide to follow for contributions.  
    Please refer also to our [Build Verification Guidelines](https://cwiki.apache.org/confluence/display/METRON/Verifying+Builds?show-miniview) for complete smoke testing guides.  
    
    
    In order to streamline the review of the contribution we ask you follow these guidelines and ask you to double check the following:
    
    ### For all changes:
    - [ ] Is there a JIRA ticket associated with this PR? If not one needs to be created at [Metron Jira](https://issues.apache.org/jira/browse/METRON/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
    **Yes. Jira created for this PR. https://issues.apache.org/jira/browse/METRON-1795**
    - [ ] Does your PR title start with METRON-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
    **Yes.**
    - [ ] Has your PR been rebased against the latest commit within the target branch (typically master)?
    **Yes**
    
    
    ### For code changes:
    - [ ] Have you included steps to reproduce the behavior or problem that is being changed or addressed?
    **N/A as this PR is for a new feature.**
    - [ ] Have you included steps or a guide to how the change may be verified and tested manually?
    **Yes. The included JUnit tests can be used to test the new parser.**
    - [ ] Have you ensured that the full suite of tests and checks have been executed in the root metron folder via:
      ```
      mvn -q clean integration-test install && dev-utilities/build-utils/verify_licenses.sh 
      ```
    **Yes.**
    - [ ] Have you written or updated unit tests and or integration tests to verify your changes?
    **I have included the unit tests.**
    - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
    **N/A**
    - [ ] Have you verified the basic functionality of the build by building and running locally with Vagrant full-dev environment or the equivalent?
    **Yes**
    ### For documentation related changes:
    - [ ] Have you ensured that format looks appropriate for the output in which it is rendered by building and verifying the site-book? If not then run the following commands and the verify changes via `site-book/target/site/index.html`:
    
      ```
      cd site-book
      mvn site
      ```
    **Yes.**
    
    #### Note:
    Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.
    It is also recommended that [travis-ci](https://travis-ci.org) is set up for your personal repository such that your branches are built there before submitting a pull request.
    
    Note: This is a follow-up to an earlier PR for METRON-1795, which was created and subsequently closed because of a corrupted git commit history. The following comments were posted on the earlier PR; I am reposting them here with my responses.
    
    @nickwallen commented 27 days ago
    Thanks for the contribution @jagdeepsingh2. To take this any further we need at a minimum the following items.
    
    **An explanation of what itch this scratches (Why is this needed over Grok parser?)**
    This question was answered in the associated JIRA ticket (https://issues.apache.org/jira/browse/METRON-1795). In a nutshell:
    - Allow for more advanced parsing scenarios (specifically, dealing with multiple regex lines for devices that contain several log formats within them)
    - Give users and developers of Metron additional options for parsing
    - With the new parser chaining and regex routing feature available in Metron, this gives some additional flexibility to logically separate a flow by:
      - Regex routing to segregate logs at a device level and handle envelope unwrapping
      - This general purpose regex parser to parse an entire device type that contains multiple log formats within the single device (for example, RHEL logs)
    
    Also, per the GrokParser documentation (https://cwiki.apache.org/confluence/display/METRON/Parsing+Topology), the Grok parser is intended for low-volume scenarios only, whereas we have also tested this parser (RegularExpressionsParser) in very high-volume scenarios.
    
    **Documented Instructions on how to use your parser. Include a README.md in your code contribution.**
    I have updated the README.md file in the metron-parsers project.
    
    **A test plan including in your PR description showing us how to spin-up and test your parser**
    I have included the JUnit tests for this parser, added Javadoc, and updated the README.md file in the metron-parsers project. The documentation, used together with the unit tests, is enough to spin up and test this parser.
    
    **A description of how you have personally tested this**
    We have unit and integration tested this parser for lots of different devices. This parser has also been successfully running in our production environment for more than six months now.
    
    
    mmiklavc commented 22 days ago
    **@jagdeepsingh2 Some emphasis on the configuration options for this parser would be particularly useful. 
    Please refer to https://github.com/apache/metron/tree/master/metron-platform/metron-parsers for some good examples of how we document existing Metron parsers.**
    Thanks, I have added the documentation in the current PR now.
    
    
    jagdeepsingh2 commented 19 days ago • 
    @mmiklavc Yeah, I performed a rebase yesterday as I had to pull the latest changes from upstream. What is the best way out? Should I discard this PR and create a fresh and clean PR?
    
    mmiklavc commented 18 days ago
    @jagdeepsingh2 - you could try this - https://stackoverflow.com/questions/134882/undoing-a-git-rebase, but at this point it might be better to just open a new PR bc pushing up to github is going to cause some additional drama as well. You'll want to keep the default checklist that's populated in the description when you open the PR. Please note the comments from @nickwallen and myself regarding what should also be included in your description.
    
    In general, once you've pushed a branch to the public it's better to just git merge, otherwise you can get into trouble like this. We flatten PR's once they're committed to master anyhow.
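
The contributor's answer above describes one parser handling a device that emits several log formats: a record type regex picks the format, and a list of per-type regexes is then tried in order. A minimal Java sketch of that idea follows. It is not the code in this PR; the class name, the sample regexes, and the field names are assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of record-type routing; not the PR's implementation.
class RecordTypeRoutingSketch {

  // Finds named groups such as (?<userName>...) inside a regex string.
  private static final Pattern NAMED_GROUP =
      Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>");

  // Assumed record type regex: the process name identifies the log format.
  private static final Pattern RECORD_TYPE =
      Pattern.compile("(?<=\\s)\\b(kernel|sshd)\\b(?=\\[|:)");

  // Assumed per-record-type regexes, tried in order until one matches.
  private static final Map<String, List<String>> REGEXES = new LinkedHashMap<>();
  static {
    REGEXES.put("sshd", Arrays.asList(
        ".*Failed password for (?<userName>\\S+) from (?<ipSrcAddr>\\S+) port (?<ipSrcPort>\\d+).*",
        ".*Accepted publickey for (?<userName>\\S+) from (?<ipSrcAddr>\\S+) port (?<ipSrcPort>\\d+).*"));
    REGEXES.put("kernel", Arrays.asList(".*kernel: (?<eventInfo>.*)"));
  }

  static Map<String, String> parse(String message) {
    Map<String, String> fields = new LinkedHashMap<>();
    Matcher typeMatcher = RECORD_TYPE.matcher(message);
    if (!typeMatcher.find()) {
      return fields; // unknown record type: nothing is extracted
    }
    String recordType = typeMatcher.group().toLowerCase();
    List<String> candidates = REGEXES.getOrDefault(recordType, Collections.emptyList());
    for (String regex : candidates) {
      // A real parser would precompile these patterns once at configuration time.
      Matcher m = Pattern.compile(regex).matcher(message);
      if (m.matches()) {
        Matcher names = NAMED_GROUP.matcher(regex);
        while (names.find()) {
          String name = names.group(1);
          if (m.group(name) != null) {
            fields.put(name, m.group(name).trim());
          }
        }
        break; // the first matching regex wins
      }
    }
    return fields;
  }

  public static void main(String[] args) {
    System.out.println(parse("<38>Jun 20 15:01:17 deviceName sshd[11672]: "
        + "Accepted publickey for prod from 22.22.22.22 port 55555 ssh2"));
  }
}
```

Keeping the per-type regexes in an ordered list means a more specific pattern can be listed ahead of a more general one for the same record type.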

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jagdeepsingh2/metron master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/metron/pull/1245.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1245
    
----
commit fefb21c5b74d8021107986e2936017042ae54d0e
Author: jagdeep <ja...@...>
Date:   2018-10-24T04:24:47Z

    METRON-1795: Initial Commit for Regular Expressions Parser

----


---

Posted by nickwallen <gi...@git.apache.org>.
Github user nickwallen commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r239611104
  
    --- Diff: metron-platform/metron-parsers/src/test/java/org/apache/metron/parsers/regex/RegularExpressionsParserTest.java ---
    @@ -0,0 +1,152 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more contributor license
    + * agreements. See the NOTICE file distributed with this work for additional information regarding
    + * copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance with the License. You may obtain a
    + * copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software distributed under the License
    + * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
    + * or implied. See the License for the specific language governing permissions and limitations under
    + * the License.
    + */
    +package org.apache.metron.parsers.regex;
    +
    +import org.json.simple.JSONObject;
    +import org.json.simple.parser.JSONParser;
    +import org.junit.Before;
    +import org.junit.Test;
    +
    +import java.nio.file.Files;
    +import java.nio.file.Paths;
    +import java.util.HashMap;
    +import java.util.List;
    +import java.util.Map;
    +
    +import static org.junit.Assert.assertTrue;
    +
    +public class RegularExpressionsParserTest {
    +
    +  private RegularExpressionsParser regularExpressionsParser;
    +  private JSONObject parserConfig;
    +
    +  @Before
    +  public void setUp() throws Exception {
    +    regularExpressionsParser = new RegularExpressionsParser();
    +  }
    +
    +  @Test
    +  public void testSSHDParse() throws Exception {
    +    String message =
    +        "<38>Jun 20 15:01:17 deviceName sshd[11672]: Accepted publickey for prod from 22.22.22.22 port 55555 ssh2";
    +
    +    parserConfig = getJsonConfig(
    +        Paths.get("src/test/resources/config/RegularExpressionsParserConfig.json").toString());
    --- End diff --
    
    I [opened this JIRA](https://issues.apache.org/jira/browse/METRON-1926) to fix the parsing infrastructure.  The error message produced should have made it clear that the message failed because it was missing a timestamp, but it does not.
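
To illustrate the kind of error reporting the comment above asks for, here is a minimal Java sketch of a validation step that names the missing field instead of failing silently. It is not Metron's parsing infrastructure; the class name, the required field names (`original_string`, `timestamp`), and the exception type are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the Metron validation code.
class RequiredFieldCheckSketch {

  // Assumed set of required fields for a parsed message.
  private static final List<String> REQUIRED = Arrays.asList("original_string", "timestamp");

  /** Returns the names of required fields that are absent or null. */
  static List<String> missingFields(Map<String, Object> message) {
    List<String> missing = new ArrayList<>();
    for (String field : REQUIRED) {
      if (message.get(field) == null) {
        missing.add(field);
      }
    }
    return missing;
  }

  /** Throws an exception whose message names every missing field. */
  static void validate(Map<String, Object> message) {
    List<String> missing = missingFields(message);
    if (!missing.isEmpty()) {
      throw new IllegalStateException(
          "Message failed validation; missing required field(s): "
              + String.join(", ", missing));
    }
  }

  public static void main(String[] args) {
    Map<String, Object> parsed = new HashMap<>();
    parsed.put("original_string", "<38>Jun 20 15:01:17 deviceName sshd[11672]: ...");
    try {
      validate(parsed);
    } catch (IllegalStateException e) {
      // The message states which field is missing.
      System.out.println(e.getMessage());
    }
  }
}
```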


---

Posted by nickwallen <gi...@git.apache.org>.
Github user nickwallen commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r237866676
  
    --- Diff: metron-platform/metron-parsers/src/test/java/org/apache/metron/parsers/regex/RegularExpressionsParserTest.java ---
    @@ -0,0 +1,152 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more contributor license
    + * agreements. See the NOTICE file distributed with this work for additional information regarding
    + * copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance with the License. You may obtain a
    + * copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software distributed under the License
    + * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
    + * or implied. See the License for the specific language governing permissions and limitations under
    + * the License.
    + */
    +package org.apache.metron.parsers.regex;
    +
    +import org.json.simple.JSONObject;
    +import org.json.simple.parser.JSONParser;
    +import org.junit.Before;
    +import org.junit.Test;
    +
    +import java.nio.file.Files;
    +import java.nio.file.Paths;
    +import java.util.HashMap;
    +import java.util.List;
    +import java.util.Map;
    +
    +import static org.junit.Assert.assertTrue;
    +
    +public class RegularExpressionsParserTest {
    +
    +  private RegularExpressionsParser regularExpressionsParser;
    +  private JSONObject parserConfig;
    +
    +  @Before
    +  public void setUp() throws Exception {
    +    regularExpressionsParser = new RegularExpressionsParser();
    +  }
    +
    +  @Test
    +  public void testSSHDParse() throws Exception {
    +    String message =
    +        "<38>Jun 20 15:01:17 deviceName sshd[11672]: Accepted publickey for prod from 22.22.22.22 port 55555 ssh2";
    +
    +    parserConfig = getJsonConfig(
    +        Paths.get("src/test/resources/config/RegularExpressionsParserConfig.json").toString());
    --- End diff --
    
    Thanks @jagdeepsingh2 .  I will try and debug a little further myself too.  I want to make sure there are no incompatibilities between your parser and the newer changes introduced by the `ParserRunner`.  Glad there isn't something obviously stupid that I am doing. :)


---

Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r239860797
  
    --- Diff: metron-platform/metron-parsers/src/test/java/org/apache/metron/parsers/regex/RegularExpressionsParserTest.java ---
    @@ -0,0 +1,152 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more contributor license
    + * agreements. See the NOTICE file distributed with this work for additional information regarding
    + * copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance with the License. You may obtain a
    + * copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software distributed under the License
    + * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
    + * or implied. See the License for the specific language governing permissions and limitations under
    + * the License.
    + */
    +package org.apache.metron.parsers.regex;
    +
    +import org.json.simple.JSONObject;
    +import org.json.simple.parser.JSONParser;
    +import org.junit.Before;
    +import org.junit.Test;
    +
    +import java.nio.file.Files;
    +import java.nio.file.Paths;
    +import java.util.HashMap;
    +import java.util.List;
    +import java.util.Map;
    +
    +import static org.junit.Assert.assertTrue;
    +
    +public class RegularExpressionsParserTest {
    +
    +  private RegularExpressionsParser regularExpressionsParser;
    +  private JSONObject parserConfig;
    +
    +  @Before
    +  public void setUp() throws Exception {
    +    regularExpressionsParser = new RegularExpressionsParser();
    +  }
    +
    +  @Test
    +  public void testSSHDParse() throws Exception {
    +    String message =
    +        "<38>Jun 20 15:01:17 deviceName sshd[11672]: Accepted publickey for prod from 22.22.22.22 port 55555 ssh2";
    +
    +    parserConfig = getJsonConfig(
    +        Paths.get("src/test/resources/config/RegularExpressionsParserConfig.json").toString());
    --- End diff --
    
    Ok, this pr is actually simpler:  https://github.com/apache/metron/pull/1175


---

Posted by jagdeepsingh2 <gi...@git.apache.org>.
Github user jagdeepsingh2 commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r234872602
  
    --- Diff: metron-platform/metron-parsers/src/test/java/org/apache/metron/parsers/regex/RegularExpressionsParserTest.java ---
    @@ -0,0 +1,118 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.metron.parsers.regex;
    +
    +import static org.junit.Assert.assertTrue;
    +
    +import java.nio.file.Files;
    +import java.nio.file.Paths;
    +import java.util.HashMap;
    +import java.util.List;
    +import java.util.Map;
    +import org.json.simple.JSONObject;
    +import org.json.simple.parser.JSONParser;
    +import org.junit.Assert;
    +import org.junit.Before;
    +import org.junit.Test;
    +import org.apache.metron.parsers.regex.RegularExpressionsParser;
    +
    +public class RegularExpressionsParserTest {
    +    private RegularExpressionsParser regularExpressionsParser;
    +    private JSONObject parserConfig;
    +
    +    @Test
    --- End diff --
    
    I have added more unit tests. An empty header regex is a perfectly valid scenario, and I have added a unit test to cover it. A missing recordTypeRegex or an invalid regex is not valid; such a config is detected during the topology initialization phase. I have added unit tests for these scenarios as well.
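
The rules stated in the comment above (an empty header regex is allowed, while a missing `recordTypeRegex` or an invalid regex is rejected at initialization) can be sketched in Java roughly as follows. The config keys `recordTypeRegex` and `messageHeaderRegex` come from the parser's Javadoc; the class name, exception type, and messages are assumptions, and the PR's own `validateConfig()` may differ.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical sketch of fail-fast config validation; not the PR's validateConfig().
class ParserConfigValidationSketch {

  static void validate(Map<String, Object> config) {
    // recordTypeRegex is mandatory and must compile.
    Object recordTypeRegex = config.get("recordTypeRegex");
    if (!(recordTypeRegex instanceof String) || ((String) recordTypeRegex).trim().isEmpty()) {
      throw new IllegalStateException("recordTypeRegex must be a non-empty string");
    }
    compileOrFail("recordTypeRegex", (String) recordTypeRegex);

    // messageHeaderRegex is optional: absent or blank is accepted as-is.
    Object header = config.get("messageHeaderRegex");
    if (header instanceof String && !((String) header).trim().isEmpty()) {
      compileOrFail("messageHeaderRegex", (String) header);
    }
  }

  private static void compileOrFail(String key, String regex) {
    try {
      Pattern.compile(regex);
    } catch (PatternSyntaxException e) {
      throw new IllegalStateException(
          "Invalid regex for " + key + ": " + e.getDescription(), e);
    }
  }

  public static void main(String[] args) {
    Map<String, Object> config = new HashMap<>();
    config.put("recordTypeRegex", "(?<process>(?<=\\s)\\b(kernel|syslog)\\b(?=\\[|:))");
    config.put("messageHeaderRegex", "");
    validate(config); // passes: the header regex may be empty
    System.out.println("config accepted");
  }
}
```

Running the check when the parser is configured surfaces a bad regex at startup, before any message is processed.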


---

Posted by jagdeepsingh2 <gi...@git.apache.org>.
Github user jagdeepsingh2 commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r234871911
  
    --- Diff: metron-platform/metron-parsers/src/main/java/org/apache/metron/parsers/regex/RegularExpressionsParser.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more contributor license
    + * agreements. See the NOTICE file distributed with this work for additional information regarding
    + * copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance with the License. You may obtain a
    + * copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software distributed under the License
    + * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
    + * or implied. See the License for the specific language governing permissions and limitations under
    + * the License.
    + */
    +
    +package org.apache.metron.parsers.regex;
    +
    +import java.nio.charset.Charset;
    +import java.text.ParseException;
    +import java.util.ArrayList;
    +import java.util.Arrays;
    +import java.util.HashMap;
    +import java.util.HashSet;
    +import java.util.LinkedHashMap;
    +import java.util.List;
    +import java.util.Map;
    +import java.util.Optional;
    +import java.util.Set;
    +import java.util.TreeSet;
    +import java.util.regex.Matcher;
    +import java.util.regex.Pattern;
    +import java.util.stream.Collectors;
    +import org.apache.commons.lang3.StringUtils;
    +import org.apache.metron.common.Constants;
    +import org.apache.metron.parsers.BasicParser;
    +import org.apache.metron.common.Constants.ParserConfigConstants;
    +import org.json.simple.JSONObject;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +//@formatter:off
    +/**
    + * General purpose class to parse unstructured text message into a json object. This class parses
    + * the message as per supplied parser config as part of sensor config. Sensor parser config example:
    + *
    + * <pre>
    + * <code>
    + * "convertCamelCaseToUnderScore": true,
    + * "recordTypeRegex": "(?&lt;process&gt;(?&lt;=\\s)\\b(kernel|syslog)\\b(?=\\[|:))",
    + * "messageHeaderRegex": "(?&lt;syslogpriority&gt;(?&lt;=^&lt;)\\d{1,4}(?=&gt;)).*?(?&lt;timestamp&gt;(?&lt;=&gt;)[A-Za-z]{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(?&lt;syslogHost&gt;(?&lt;=\\s).*?(?=\\s))",
    + * "fields": [
    + * {
    + * "recordType": "kernel",
    + * "regex": ".*(?&lt;eventInfo&gt;(?&lt;=\\]|\\w\\:).*?(?=$))"
    + * },
    + * {
    + * "recordType": "syslog",
    + * "regex": ".*(?&lt;processid&gt;(?&lt;=PID\\s=\\s).*?(?=\\sLine)).*(?&lt;filePath&gt;(?&lt;=64\\s)\/([A-Za-z0-9_-]+\/)+(?=\\w))(?&lt;fileName&gt;.*?(?=\")).*(?&lt;eventInfo&gt;(?&lt;=\").*?(?=$))"
    + * }
    + * ]
    + * </code>
    + * </pre>
    + *
    + * Note: messageHeaderRegex can also be specified as a list, e.g.
    + *
    + * <pre>
    + * <code>
    + * "messageHeaderRegex": [
    + * "regular expression 1",
    + * "regular expression 2"
    + * ]
    + * </code>
    + * </pre>
    + *
    + * Where <strong>regular expression 1</strong> and <strong>regular expression 2</strong> are valid
    + * regular expressions that may contain named groups, which are extracted into fields. The list is
    + * evaluated in order until a matching regular expression is found.<br>
    + * <br>
    + *
    + * <strong>Configuration fields explanation</strong>
    + *
    + * <pre>
    + * recordTypeRegex : used to specify a regular expression to distinctly identify a record type.
    + * messageHeaderRegex :  used to specify a regular expression to extract fields from a message part which is common across all the messages.
    + * e.g. RHEL logs look like
    + * <code>
    + * <7>Jun 26 16:18:01 hostName kernel: SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
    + * </code>
    + * <br>
    + * </pre>
    + *
    + * Here message structure (<7>Jun 26 16:18:01 hostName kernel) is common across all messages.
    + * Hence messageHeaderRegex could be used to extract fields from this part.
    + *
    + * fields : JSON list of objects containing recordType and regex. regex can itself be a list, e.g.
    + *
    + * <pre>
    + * <code>
    + * "regex":  [ "record type specific regular expression 1",
    + *             "record type specific regular expression 2"]
    + *
    + * </code>
    + * </pre>
    + *
    + * <strong>Limitation</strong> <br>
    + * Named groups in Java regular expressions are restricted in which characters their names may
    + * contain. A capturing group can be assigned a "name" (a named-capturing group) and then be
    + * back-referenced later by that name. Group names are composed of the following characters, and
    + * the first character must be a letter.
    + *
    + * <pre>
    + * <code>
    + * The uppercase letters 'A' through 'Z' ('\u0041' through '\u005a'),
    + * The lowercase letters 'a' through 'z' ('\u0061' through '\u007a'),
    + * The digits '0' through '9' ('\u0030' through '\u0039'),
    + * </code>
    + * </pre>
    + *
    + * This means that an underscore (_) cannot be used in a group name. For example, this is an
    + * invalid regular expression: <code>.*(?&lt;event_info&gt;(?&lt;=\\]|\\w\\:).*?(?=$))</code>
    + *
    + * However, this limitation can be easily overcome by adding a parser configuration setting.
    + *
    + * <code>
    + *  "convertCamelCaseToUnderScore": true,
    + * </code>
    + * If the above property is added to the parserConfig object of the sensor parser configuration,
    + * this parser automatically converts all camel case property names to underscore-separated names.
    + * For example, the following conversions happen automatically:
    + *
    + * <code>
    + * ipSrcAddr -> ip_src_addr
    + * ipDstAddr -> ip_dst_addr
    + * ipSrcPort -> ip_src_port
    + * </code>
    + * etc.
    + */
    +//@formatter:on
    +public class RegularExpressionsParser extends BasicParser {
    +
    +  private static final Logger LOG = LoggerFactory.getLogger(RegularExpressionsParser.class);
    +
    +  private static final Charset UTF_8 = Charset.forName("UTF-8");
    +
    +  private List<Map<String, Object>> fields;
    +  private Map<String, Object> parserConfig;
    +  private final Pattern namedGroupPattern = Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>");
    +  Pattern capitalLettersPattern = Pattern.compile("^.*[A-Z]+.*$");
    +  private Pattern recordTypePattern;
    +  private final Set<String> recordTypePatternNamedGroups = new HashSet<>();
    +  private final Map<String, Map<Pattern, Set<String>>> recordTypePatternMap = new HashMap<>();
    +  private final Map<Pattern, Set<String>> syslogPatternsMap = new HashMap<>();
    +
    +  /**
    +   * Parses an unstructured text message into a json object based upon the regular expression
    +   * configuration supplied.
    +   *
    +   * @param rawMessage incoming unstructured raw text.
    +   *
    +   * @return List of parsed JSON objects. In this case the list contains a single element.
    +   */
    +  @Override
    +  public List<JSONObject> parse(byte[] rawMessage) {
    +    String originalMessage = null;
    +    try {
    +      originalMessage = new String(rawMessage, UTF_8).trim();
    +      LOG.debug(" raw message. {}", originalMessage);
    +      if (originalMessage.isEmpty()) {
    +        LOG.warn("Message is empty.");
    +        return Arrays.asList(new JSONObject());
    +      }
    +    } catch (final Exception e) {
    +      LOG.error("[Metron] Could not read raw message. {}", originalMessage, e);
    +      throw new RuntimeException(e.getMessage(), e);
    +    }
    +
    +    try {
    +      final JSONObject parsedJson = new JSONObject();
    +      if (syslogPatternsMap.size() > 0) {
    +        parsedJson.putAll(extractHeaderFields(originalMessage));
    +      }
    +      parsedJson.putAll(parse(originalMessage));
    +      parsedJson.put(Constants.Fields.ORIGINAL.getName(), originalMessage);
    +      applyFieldTransformations(parsedJson);
    +      return Arrays.asList(parsedJson);
    +    } catch (final ParseException e) {
    +      LOG.error("Error occurred in parsing. original message : " + originalMessage, e);
    +      throw new RuntimeException(e.getMessage(), e);
    +    }
    +  }
    +
    +  private void applyFieldTransformations(JSONObject parsedJson) {
    +    if (getParserConfig()
    +        .get(ParserConfigConstants.CONVERT_CAMELCASE_TO_UNDERSCORE.getName()) != null
    +        && (Boolean) getParserConfig()
    +            .get(ParserConfigConstants.CONVERT_CAMELCASE_TO_UNDERSCORE.getName())) {
    +      convertCamelCaseToUnderScore(parsedJson);
    +    }
    +
    +  }
    +
    +  // @formatter:off
    +  /**
    +   * This method is called during the parser initialization. It parses the parser
    +   * configuration and configures the parser accordingly. It then initializes
    +   * instance variables.
    +   *
    +   * @param parserConfig ParserConfig(Map<String, Object>) supplied to the sensor.
    +   * @see org.apache.metron.parsers.interfaces.Configurable#configure(java.util.Map)<br>
    +   *      <br>
    +   */
    +  // @formatter:on
    +  @Override
    +  public void configure(Map<String, Object> parserConfig) {
    +    setParserConfig(parserConfig);
    +    setFields(
    +        (List<Map<String, Object>>) getParserConfig().get(ParserConfigConstants.FIELDS.getName()));
    +
    +    setRecordTypePattern(
    +        (String) getParserConfig().get(ParserConfigConstants.RECORD_TYPE_REGEX.getName()));
    +    recordTypePatternNamedGroups.addAll(getNamedGroups(
    +        (String) getParserConfig().get(ParserConfigConstants.RECORD_TYPE_REGEX.getName())));
    +    final List<Map<String, Object>> fields =
    +        (List<Map<String, Object>>) getParserConfig().get(ParserConfigConstants.FIELDS.getName());
    +
    +    configureRecordTypePatterns(fields);
    +
    +    configureMessageHeaderPattern();
    +
    +    validateConfig();
    +  }
    +
    +  private void configureMessageHeaderPattern() {
    +    if (getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName()) != null) {
    +      if (getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName()) instanceof List) {
    +        final List<String> syslogPatternList =
    +            (List<String>) getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName());
    +        for (final String syslogPatternStr : syslogPatternList) {
    +          syslogPatternsMap.put(Pattern.compile(syslogPatternStr),
    +              getNamedGroups(syslogPatternStr));
    +        }
    +      } else if (getParserConfig()
    +          .get(ParserConfigConstants.MESSAGE_HEADER.getName()) instanceof String) {
    +        final String syslogPatternStr =
    +            (String) getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName());
    +        if (StringUtils.isNotBlank(syslogPatternStr)) {
    +          syslogPatternsMap.put(Pattern.compile(syslogPatternStr),
    +              getNamedGroups(syslogPatternStr));
    +        }
    +      }
    +    }
    +  }
    +
    +  private void configureRecordTypePatterns(List<Map<String, Object>> fields) {
    +
    +    for (final Map<String, Object> field : fields) {
    +      if (field.get(ParserConfigConstants.RECORD_TYPE.getName()) != null
    +          && field.get(ParserConfigConstants.REGEX.getName()) != null) {
    +        final String recordType =
    +            ((String) field.get(ParserConfigConstants.RECORD_TYPE.getName())).toLowerCase();
    +        recordTypePatternMap.put(recordType, new LinkedHashMap<Pattern, Set<String>>());
    +        if (field.get(ParserConfigConstants.REGEX.getName()) instanceof List) {
    +          final List<String> regexList =
    +              (List<String>) field.get(ParserConfigConstants.REGEX.getName());
    +          regexList.forEach(s -> {
    +            recordTypePatternMap.get(recordType).put(Pattern.compile(s), getNamedGroups(s));
    +          });
    +        } else if (field.get(ParserConfigConstants.REGEX.getName()) instanceof String) {
    +          recordTypePatternMap.get(recordType).put(
    +              Pattern.compile((String) field.get(ParserConfigConstants.REGEX.getName())),
    +              getNamedGroups((String) field.get(ParserConfigConstants.REGEX.getName())));
    +        }
    +      }
    +    }
    +  }
    +
    +  private void setRecordTypePattern(String recordTypeRegex) {
    +    if (recordTypeRegex != null) {
    +      recordTypePattern = Pattern.compile(recordTypeRegex);
    +    }
    +  }
    +
    +  private JSONObject parse(String originalMessage) throws ParseException {
    +    final JSONObject parsedJson = new JSONObject();
    +    final Optional<String> recordIdentifier = getField(recordTypePattern, originalMessage);
    +    if (recordIdentifier.isPresent()) {
    +      extractNamedGroups(parsedJson, recordIdentifier.get(), originalMessage);
    +    }
    +    /*
    +     * Extract fields(named groups) from record type regular expression
    +     */
    +    final Matcher matcher = recordTypePattern.matcher(originalMessage);
    +    if (matcher.find()) {
    +      for (final String namedGroup : recordTypePatternNamedGroups) {
    +        if (matcher.group(namedGroup) != null) {
    +          parsedJson.put(namedGroup, matcher.group(namedGroup).trim());
    +        }
    +      }
    +    }
    +    return parsedJson;
    +  }
    +
    +  private void extractNamedGroups(Map<String, Object> json, String recordType,
    +      String originalMessage) {
    +    final Map<Pattern, Set<String>> patternMap = recordTypePatternMap.get(recordType.toLowerCase());
    +    if (patternMap != null) {
    +      for (final Map.Entry<Pattern, Set<String>> entry : patternMap.entrySet()) {
    +        final Pattern pattern = entry.getKey();
    +        final Set<String> namedGroups = entry.getValue();
    +        if (pattern != null && namedGroups != null && namedGroups.size() > 0) {
    +          final Matcher m = pattern.matcher(originalMessage);
    +          if (m.matches()) {
    +            LOG.debug("RecordType : {} Trying regex : {} for message : {} ", recordType,
    +                pattern.toString(), originalMessage);
    +            for (final String namedGroup : namedGroups) {
    +              if (m.group(namedGroup) != null) {
    +                json.put(namedGroup, m.group(namedGroup).trim());
    +              }
    +            }
    +            break;
    +          }
    +        }
    +      }
    +    } else {
    +      LOG.warn("No pattern found for record type : {}", recordType);
    +    }
    +  }
    +
    +  public Optional<String> getField(Pattern pattern, String originalMessage) {
    +    final Matcher matcher = pattern.matcher(originalMessage);
    +    while (matcher.find()) {
    +      return Optional.of(matcher.group());
    +    }
    +    return Optional.empty();
    +  }
    +
    +  private Set<String> getNamedGroups(String regex) {
    +    final Set<String> namedGroups = new TreeSet<>();
    +    final Matcher matcher = namedGroupPattern.matcher(regex);
    +    while (matcher.find()) {
    +      namedGroups.add(matcher.group(1));
    +    }
    +    return namedGroups;
    +  }
    +
    +  private Map<String, Object> extractHeaderFields(String originalMessage) {
    +    final Map<String, Object> syslogJson = new JSONObject();
    +    for (final Map.Entry<Pattern, Set<String>> syslogPatternEntry : syslogPatternsMap.entrySet()) {
    +      final Matcher m = syslogPatternEntry.getKey().matcher(originalMessage);
    +      if (m.find()) {
    +        for (final String namedGroup : syslogPatternEntry.getValue()) {
    +          if (StringUtils.isNotBlank(m.group(namedGroup))) {
    +            syslogJson.put(namedGroup, m.group(namedGroup).trim());
    +          }
    +        }
    +        break;
    +      }
    +    }
    +    return syslogJson;
    +  }
    +
    +  @Override
    +  public void init() {
    +    LOG.info("RegularExpressions parser initialised.");
    +  }
    +
    +  public void validateConfig() {
    +    if (getFields() == null) {
    +      LOG.error("Invalid config :  fields is missing in parserConfig");
    +      throw new IllegalStateException("Invalid config :fields is missing in parserConfig");
    +    }
    +    if (recordTypePattern == null) {
    +      LOG.error("Invalid config :recordTypeRegex is missing in parserConfig");
    +      throw new IllegalStateException("Invalid config :recordTypeRegex is missing in parserConfig");
    +    }
    +  }
    +
    +  private void convertCamelCaseToUnderScore(Map<String, Object> json) {
    +    final Map<String, String> oldKeyNewKeyMap = new HashMap<>();
    +    for (final Map.Entry<String, Object> entry : json.entrySet()) {
    +      if (capitalLettersPattern.matcher(entry.getKey()).matches()) {
    +        oldKeyNewKeyMap.put(entry.getKey(), convert(entry.getKey()));
    +      }
    +    }
    +    oldKeyNewKeyMap.forEach((oldKey, newKey) -> json.put(newKey, json.remove(oldKey)));
    +  }
    +
    +  public String convert(String oldKey) {
    +    final List<Character> chars = new ArrayList<>();
    --- End diff --
    
    No specific reason. List<Character> was just catering to a specific case. In any case, this convert method no longer exists in the refactored code.
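
    For reference, the same conversion can be written with a StringBuilder instead of a List<Character>. This is only a minimal sketch and not the refactored code in the PR: the class name `CamelCaseSketch` and the method name `toUnderscore` are made up for illustration, and it assumes keys contain only ASCII letters and digits (the characters Java permits in named group names). Unlike the quoted snippet, it deliberately does not emit a leading underscore when the first character is upper case.

    ```java
    class CamelCaseSketch {

      // Hypothetical helper: puts '_' before each upper-case letter (except at
      // index 0) and lower-cases that letter; all other characters are copied.
      static String toUnderscore(String key) {
        final StringBuilder sb = new StringBuilder(key.length() + 4);
        for (int i = 0; i < key.length(); i++) {
          final char c = key.charAt(i);
          if (Character.isUpperCase(c)) {
            if (i > 0) {
              sb.append('_');
            }
            sb.append(Character.toLowerCase(c));
          } else {
            sb.append(c);
          }
        }
        return sb.toString();
      }

      public static void main(String[] args) {
        // Keys taken from the README example in this PR.
        System.out.println(toUnderscore("ipSrcAddr"));
        System.out.println(toUnderscore("ipDstAddr"));
        System.out.println(toUnderscore("ipSrcPort"));
      }
    }
    ```

    Using Character.toLowerCase avoids the `c + 32` arithmetic, which only holds for ASCII letters.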


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by jagdeepsingh2 <gi...@git.apache.org>.
Github user jagdeepsingh2 commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r237699908
  
    --- Diff: metron-platform/metron-parsers/README.md ---
    @@ -52,6 +52,62 @@ There are two general types types of parsers:
            This is using the default value for `wrapEntityName` if that property is not set.
         * `wrapEntityName` : Sets the name to use when wrapping JSON using `wrapInEntityArray`.  The `jsonpQuery` should reference this name.
         * A field called `timestamp` is expected to exist and, if it does not, then current time is inserted.  
    +  * Regular Expressions Parser
    +      * `recordTypeRegex` : A regular expression to uniquely identify a record type.
    +      * `messageHeaderRegex` : A regular expression used to extract fields from a message part which is common across all the messages.
    +      * `convertCamelCaseToUnderScore` : If this property is set to true, this parser will automatically convert all camel case property names to underscore-separated names.
    +          For example, the following conversions will automatically happen:
    +
    +          ```
    +          ipSrcAddr -> ip_src_addr
    +          ipDstAddr -> ip_dst_addr
    +          ipSrcPort -> ip_src_port
    +          ```
    +          Note: this property may be necessary because Java does not support underscores in named group names. If your property naming convention requires underscores in property names, use this property.
    +          
    +      * `fields` : A JSON list of maps containing a record type to regular expression mapping.
    +      
    +      A complete configuration example would look like:
    +      
    +      ```json
    +      "convertCamelCaseToUnderScore": true,
    +      "recordTypeRegex": "kernel|syslog",
    +      "messageHeaderRegex": "(?<syslogPriority>(?<=^<)\\d{1,4}(?=>)).*?(?<timestamp>(?<=>)[A-Za-z]{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(?<syslogHost>(?<=\\s).*?(?=\\s))",
    +      "fields": [
    +        {
    +          "recordType": "kernel",
    +          "regex": ".*(?<eventInfo>(?<=\\]|\\w\\:).*?(?=$))"
    +        },
    +        {
    +          "recordType": "syslog",
    +          "regex": ".*(?<processid>(?<=PID\\s=\\s).*?(?=\\sLine)).*(?<filePath>(?<=64\\s)\/([A-Za-z0-9_-]+\/)+(?=\\w))(?<fileName>.*?(?=\")).*(?<eventInfo>(?<=\").*?(?=$))"
    +        }
    +      ]
    +      ```
    +      **Note**: messageHeaderRegex and regex (within fields) can also be specified as lists, e.g.
    --- End diff --
    
    Following is an example where regex is a list:
    ```
    {
      "recordType": "STARTSAVECONFIG",
      "regex": [
        ".*(?<deviceName>(?<=\\s).*?(?=\\s\\d{1,7}-\\w{1,10}-\\d{1,7})).*?(?<eventInfo>(?<=\\s\\d{1,7}\\s:\\s).*?(?=$)).*$",
        ".*(?<deviceName>(?<=\\s).*?(?=\\s\\d{1,7}-\\w{1,10}-\\d{1,7})).*?(?<eventInfo>(?<=\\s:\\s).*?(?=$)).*$"
      ]
    }
    ```
    A list should be chosen when there are multiple forms of a particular record type. 
    
    If there is only one form of a record type (for example, in the case of Cisco ASA), there is no need for a list. The **regex** field can then be given as a plain string, since only a single regular expression is required per **recordType**. For example:
    
    ```
    {
    "recordType": "APPFW APPFW_FIELDFORMAT",
     "regex": ".*(?<deviceName>(?<=\\s).*?(?=\\s\\d{1,7}-\\w{1,10}-\\d{1,7})).*?(?<ipSrcAddr>(?<=\\s\\d{1,7}\\s:\\s{1,2}).*?(?=\\s)).*?(?<ipSrcPort>(?<=\\s)\\d+(?=\\-)).*?(?<path>(?<=\\-\\w{1,10}\\s).*?(?=\\s)).*?(?<status>(?<=\\s).*?(?=\\s)).*?(?<requestUri>(?<=\\s).*?(?=\\s)).*?(?<eventInfo>(?<=\\s).*?(?=\\s\\<)).*?(?<responseResultString>(?<=\\<).*?(?=\\>)).*$"
    }
    ```
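
    To illustrate how such a list is evaluated, here is a minimal, self-contained sketch of the "first matching regex wins" behaviour, modelled on extractNamedGroups and getNamedGroups in this PR. It is not the parser itself: the class name `RegexListSketch`, the method names, and the toy regexes in `main` are assumptions made for the example.

    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class RegexListSketch {

      // Same idea as namedGroupPattern in the PR: finds "(?<name>" in a regex
      // string. Lookbehinds such as "(?<=" are skipped because '=' is not a letter.
      private static final Pattern NAMED_GROUP =
          Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>");

      static List<String> namedGroups(String regex) {
        final List<String> names = new ArrayList<>();
        final Matcher m = NAMED_GROUP.matcher(regex);
        while (m.find()) {
          names.add(m.group(1));
        }
        return names;
      }

      // Tries each regex in list order; the first one that matches the whole
      // message supplies the fields, and the remaining regexes are not tried.
      static Map<String, String> extract(List<String> regexes, String message) {
        final Map<String, String> fields = new LinkedHashMap<>();
        for (final String regex : regexes) {
          final Matcher m = Pattern.compile(regex).matcher(message);
          if (m.matches()) {
            for (final String name : namedGroups(regex)) {
              if (m.group(name) != null) {
                fields.put(name, m.group(name).trim());
              }
            }
            break;
          }
        }
        return fields;
      }

      public static void main(String[] args) {
        // Two hypothetical forms of the same record type, most specific first.
        final List<String> regexes = Arrays.asList(
            "id=(?<recordId>\\d+) user=(?<userName>\\w+)",
            "id=(?<recordId>\\d+)");
        System.out.println(extract(regexes, "id=7 user=bob"));
        System.out.println(extract(regexes, "id=42"));
      }
    }
    ```

    Because evaluation stops at the first match, the more specific regular expression should come first in the list.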


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by jagdeepsingh2 <gi...@git.apache.org>.
Github user jagdeepsingh2 commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r234873094
  
    --- Diff: metron-platform/metron-parsers/src/main/java/org/apache/metron/parsers/regex/RegularExpressionsParser.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more contributor license
    + * agreements. See the NOTICE file distributed with this work for additional information regarding
    + * copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance with the License. You may obtain a
    + * copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software distributed under the License
    + * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
    + * or implied. See the License for the specific language governing permissions and limitations under
    + * the License.
    + */
    +
    +package org.apache.metron.parsers.regex;
    +
    +import java.nio.charset.Charset;
    +import java.text.ParseException;
    +import java.util.ArrayList;
    +import java.util.Arrays;
    +import java.util.HashMap;
    +import java.util.HashSet;
    +import java.util.LinkedHashMap;
    +import java.util.List;
    +import java.util.Map;
    +import java.util.Optional;
    +import java.util.Set;
    +import java.util.TreeSet;
    +import java.util.regex.Matcher;
    +import java.util.regex.Pattern;
    +import java.util.stream.Collectors;
    +import org.apache.commons.lang3.StringUtils;
    +import org.apache.metron.common.Constants;
    +import org.apache.metron.parsers.BasicParser;
    +import org.apache.metron.common.Constants.ParserConfigConstants;
    +import org.json.simple.JSONObject;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +//@formatter:off
    +/**
    + * General purpose class to parse unstructured text message into a json object. This class parses
    + * the message as per supplied parser config as part of sensor config. Sensor parser config example:
    + *
    + * <pre>
    + * <code>
    + * "convertCamelCaseToUnderScore": true,
    + * "recordTypeRegex": "(?&lt;process&gt;(?&lt;=\\s)\\b(kernel|syslog)\\b(?=\\[|:))",
    + * "messageHeaderRegex": "(?&lt;syslogpriority&gt;(?&lt;=^&lt;)\\d{1,4}(?=&gt;)).*?(?&lt;timestamp>(?&lt;=&gt;)[A-Za-z]{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(?&lt;syslogHost&gt;(?&lt;=\\s).*?(?=\\s))",
    + * "fields": [
    + * {
    + * "recordType": "kernel",
    + * "regex": ".*(?&lt;eventInfo&gt;(?&lt;=\\]|\\w\\:).*?(?=$))"
    + * },
    + * {
    + * "recordType": "syslog",
    + * "regex": ".*(?&lt;processid&gt;(?&lt;=PID\\s=\\s).*?(?=\\sLine)).*(?&lt;filePath&gt;(?&lt;=64\\s)\/([A-Za-z0-9_-]+\/)+(?=\\w))(?&lt;fileName&gt;.*?(?=\")).*(?&lt;eventInfo&gt;(?&lt;=\").*?(?=$))"
    + * }
    + * ]
    + * </code>
    + * </pre>
    + *
    + * Note: messageHeaderRegex could be specified as lists also e.g.
    + *
    + * <pre>
    + * <code>
    + * "messageHeaderRegex": [
    + * "regular expression 1",
    + * "regular expression 2"
    + * ]
    + * </code>
    + * </pre>
    + *
    + * Where <strong>regular expression 1</strong> are valid regular expressions and may have named
    + * groups, which would be extracted into fields. This list will be evaluated in order until a
    + * matching regular expression is found.<br>
    + * <br>
    + *
    + * <strong>Configuration fields explanation</strong>
    + *
    + * <pre>
    + * recordTypeRegex : used to specify a regular expression to distinctly identify a record type.
    + * messageHeaderRegex :  used to specify a regular expression to extract fields from a message part which is common across all the messages.
    + * e.g. rhel logs looks like
    + * <code>
    + * <7>Jun 26 16:18:01 hostName kernel: SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
    + * </code>
    + * <br>
    + * </pre>
    + *
    + * Here message structure (<7>Jun 26 16:18:01 hostName kernel) is common across all messages.
    + * Hence messageHeaderRegex could be used to extract fields from this part.
    + *
    + * fields : json list of objects containing recordType and regex. regex could be a further list e.g.
    + *
    + * <pre>
    + * <code>
    + * "regex":  [ "record type specific regular expression 1",
    + *             "record type specific regular expression 2"]
    + *
    + * </code>
    + * </pre>
    + *
    + * <strong>Limitation</strong> <br>
    + * Currently the named groups in java regular expressions have a limitation. Only following
    + * characters could be used to name a named group. A capturing group can also be assigned a "name",
    + * a named-capturing group, and then be back-referenced later by the "name". Group names are
    + * composed of the following characters. The first character must be a letter.
    + *
    + * <pre>
    + * <code>
    + * The uppercase letters 'A' through 'Z' ('\u0041' through '\u005a'),
    + * The lowercase letters 'a' through 'z' ('\u0061' through '\u007a'),
    + * The digits '0' through '9' ('\u0030' through '\u0039'),
    + * </code>
    + * </pre>
    + *
    + * This means that an _ (underscore), cannot be used as part of a named group name. E.g. this is an
    + * invalid regular expression <code>.*(?&lt;event_info&gt;(?&lt;=\\]|\\w\\:).*?(?=$))</code>
    + *
    + * However, this limitation can be easily overcome by adding a parser configuration setting.
    + *
    + * <code>
    + *  "convertCamelCaseToUnderScore": true,
    + * </code>
    + * If the above property is added to the sensor parser configuration, in the parserConfig object, this parser will automatically convert all camel case property names to underscore-separated names.
    + * For example, the following conversions will automatically happen:
    + *
    + * <code>
    + * ipSrcAddr -> ip_src_addr
    + * ipDstAddr -> ip_dst_addr
    + * ipSrcPort -> ip_src_port
    + * </code>
    + * etc.
    + */
    +//@formatter:on
    +public class RegularExpressionsParser extends BasicParser {
    +
    +  private static Logger LOG = LoggerFactory.getLogger(RegularExpressionsParser.class);
    +
    +  private static final Charset UTF_8 = Charset.forName("UTF-8");
    +
    +  private List<Map<String, Object>> fields;
    +  private Map<String, Object> parserConfig;
    +  private final Pattern namedGroupPattern = Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>");
    +  Pattern capitalLettersPattern = Pattern.compile("^.*[A-Z]+.*$");
    +  private Pattern recordTypePattern;
    +  private final Set<String> recordTypePatternNamedGroups = new HashSet<>();
    +  private final Map<String, Map<Pattern, Set<String>>> recordTypePatternMap = new HashMap<>();
    +  private final Map<Pattern, Set<String>> syslogPatternsMap = new HashMap<>();
    +
    +  /**
    +   * Parses an unstructured text message into a json object based upon the regular expression
    +   * configuration supplied.
    +   *
    +   * @param rawMessage incoming unstructured raw text.
    +   *
    +   * @return List of json parsed json objects. In this case list will have a single element only.
    +   */
    +  @Override
    +  public List<JSONObject> parse(byte[] rawMessage) {
    +    String originalMessage = null;
    +    try {
    +      originalMessage = new String(rawMessage, UTF_8).trim();
    +      LOG.debug(" raw message. {}", originalMessage);
    +      if (originalMessage.isEmpty()) {
    +        LOG.warn("Message is empty.");
    +        return Arrays.asList(new JSONObject());
    +      }
    +    } catch (final Exception e) {
    +      LOG.error("[Metron] Could not read raw message. {} " + originalMessage, e);
    +      throw new RuntimeException(e.getMessage(), e);
    +    }
    +
    +    try {
    +      final JSONObject parsedJson = new JSONObject();
    +      if (syslogPatternsMap.size() > 0) {
    +        parsedJson.putAll(extractHeaderFields(originalMessage));
    +      }
    +      parsedJson.putAll(parse(originalMessage));
    +      parsedJson.put(Constants.Fields.ORIGINAL.getName(), originalMessage);
    +      applyFieldTransformations(parsedJson);
    +      return Arrays.asList(parsedJson);
    +    } catch (final ParseException e) {
    +      LOG.error("Error occurred in parsing. original message : " + originalMessage, e);
    +      throw new RuntimeException(e.getMessage(), e);
    +    }
    +  }
    +
    +  private void applyFieldTransformations(JSONObject parsedJson) {
    +    if (getParserConfig()
    +        .get(ParserConfigConstants.CONVERT_CAMELCASE_TO_UNDERSCORE.getName()) != null
    +        && (Boolean) getParserConfig()
    +            .get(ParserConfigConstants.CONVERT_CAMELCASE_TO_UNDERSCORE.getName())) {
    +      convertCamelCaseToUnderScore(parsedJson);
    +    }
    +
    +  }
    +
    +  // @formatter:off
    +  /**
    +   * This method is called during the parser initialization. It parses the parser
    +   * configuration and configures the parser accordingly. It then initializes
    +   * instance variables.
    +   *
    +   * @param parserConfig ParserConfig(Map<String, Object>) supplied to the sensor.
    +   * @see org.apache.metron.parsers.interfaces.Configurable#configure(java.util.Map)<br>
    +   *      <br>
    +   */
    +  // @formatter:on
    +  @Override
    +  public void configure(Map<String, Object> parserConfig) {
    +    setParserConfig(parserConfig);
    +    setFields(
    +        (List<Map<String, Object>>) getParserConfig().get(ParserConfigConstants.FIELDS.getName()));
    +
    +    setRecordTypePattern(
    +        (String) getParserConfig().get(ParserConfigConstants.RECORD_TYPE_REGEX.getName()));
    +    recordTypePatternNamedGroups.addAll(getNamedGroups(
    +        (String) getParserConfig().get(ParserConfigConstants.RECORD_TYPE_REGEX.getName())));
    +    final List<Map<String, Object>> fields =
    +        (List<Map<String, Object>>) getParserConfig().get(ParserConfigConstants.FIELDS.getName());
    +
    +    configureRecordTypePatterns(fields);
    +
    +    configureMessageHeaderPattern();
    +
    +    validateConfig();
    +  }
    +
    +  private void configureMessageHeaderPattern() {
    +    if (getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName()) != null) {
    +      if (getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName()) instanceof List) {
    +        final List<String> syslogPatternList =
    +            (List<String>) getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName());
    +        for (final String syslogPatternStr : syslogPatternList) {
    +          syslogPatternsMap.put(Pattern.compile(syslogPatternStr),
    +              getNamedGroups(syslogPatternStr));
    +        }
    +      } else if (getParserConfig()
    +          .get(ParserConfigConstants.MESSAGE_HEADER.getName()) instanceof String) {
    +        final String syslogPatternStr =
    +            (String) getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName());
    +        if (StringUtils.isNotBlank(syslogPatternStr)) {
    +          syslogPatternsMap.put(Pattern.compile(syslogPatternStr),
    +              getNamedGroups(syslogPatternStr));
    +        }
    +      }
    +    }
    +  }
    +
    +  private void configureRecordTypePatterns(List<Map<String, Object>> fields) {
    +
    +    for (final Map<String, Object> field : fields) {
    +      if (field.get(ParserConfigConstants.RECORD_TYPE.getName()) != null
    +          && field.get(ParserConfigConstants.REGEX.getName()) != null) {
    +        final String recordType =
    +            ((String) field.get(ParserConfigConstants.RECORD_TYPE.getName())).toLowerCase();
    +        recordTypePatternMap.put(recordType, new LinkedHashMap<Pattern, Set<String>>());
    +        if (field.get(ParserConfigConstants.REGEX.getName()) instanceof List) {
    +          final List<String> regexList =
    +              (List<String>) field.get(ParserConfigConstants.REGEX.getName());
    +          regexList.forEach(s -> {
    +            recordTypePatternMap.get(recordType).put(Pattern.compile(s), getNamedGroups(s));
    +          });
    +        } else if (field.get(ParserConfigConstants.REGEX.getName()) instanceof String) {
    +          recordTypePatternMap.get(recordType).put(
    +              Pattern.compile((String) field.get(ParserConfigConstants.REGEX.getName())),
    +              getNamedGroups((String) field.get(ParserConfigConstants.REGEX.getName())));
    +        }
    +      }
    +    }
    +  }
    +
    +  private void setRecordTypePattern(String recordTypeRegex) {
    +    if (recordTypeRegex != null) {
    +      recordTypePattern = Pattern.compile(recordTypeRegex);
    +    }
    +  }
    +
    +  private JSONObject parse(String originalMessage) throws ParseException {
    +    final JSONObject parsedJson = new JSONObject();
    +    final Optional<String> recordIdentifier = getField(recordTypePattern, originalMessage);
    +    if (recordIdentifier.isPresent()) {
    +      extractNamedGroups(parsedJson, recordIdentifier.get(), originalMessage);
    +    }
    +    /*
    +     * Extract fields(named groups) from record type regular expression
    +     */
    +    final Matcher matcher = recordTypePattern.matcher(originalMessage);
    +    if (matcher.find()) {
    +      for (final String namedGroup : recordTypePatternNamedGroups) {
    +        if (matcher.group(namedGroup) != null) {
    +          parsedJson.put(namedGroup, matcher.group(namedGroup).trim());
    +        }
    +      }
    +    }
    +    return parsedJson;
    +  }
    +
    +  private void extractNamedGroups(Map<String, Object> json, String recordType,
    +      String originalMessage) {
    +    final Map<Pattern, Set<String>> patternMap = recordTypePatternMap.get(recordType.toLowerCase());
    +    if (patternMap != null) {
    +      for (final Map.Entry<Pattern, Set<String>> entry : patternMap.entrySet()) {
    +        final Pattern pattern = entry.getKey();
    +        final Set<String> namedGroups = entry.getValue();
    +        if (pattern != null && namedGroups != null && namedGroups.size() > 0) {
    +          final Matcher m = pattern.matcher(originalMessage);
    +          if (m.matches()) {
    +            LOG.debug("RecordType : {} Trying regex : {} for message : {} ", recordType,
    +                pattern.toString(), originalMessage);
    +            for (final String namedGroup : namedGroups) {
    +              if (m.group(namedGroup) != null) {
    +                json.put(namedGroup, m.group(namedGroup).trim());
    +              }
    +            }
    +            break;
    +          }
    +        }
    +      }
    +    } else {
    +      LOG.warn("No pattern found for record type : {}", recordType);
    +    }
    +  }
    +
    +  public Optional<String> getField(Pattern pattern, String originalMessage) {
    +    final Matcher matcher = pattern.matcher(originalMessage);
    +    while (matcher.find()) {
    +      return Optional.of(matcher.group());
    +    }
    +    return Optional.empty();
    +  }
    +
    +  private Set<String> getNamedGroups(String regex) {
    +    final Set<String> namedGroups = new TreeSet<>();
    +    final Matcher matcher = namedGroupPattern.matcher(regex);
    +    while (matcher.find()) {
    +      namedGroups.add(matcher.group(1));
    +    }
    +    return namedGroups;
    +  }
    +
    +  private Map<String, Object> extractHeaderFields(String originalMessage) {
    +    final Map<String, Object> syslogJson = new JSONObject();
    +    for (final Map.Entry<Pattern, Set<String>> syslogPatternEntry : syslogPatternsMap.entrySet()) {
    +      final Matcher m = syslogPatternEntry.getKey().matcher(originalMessage);
    +      if (m.find()) {
    +        for (final String namedGroup : syslogPatternEntry.getValue()) {
    +          if (StringUtils.isNotBlank(m.group(namedGroup))) {
    +            syslogJson.put(namedGroup, m.group(namedGroup).trim());
    +          }
    +        }
    +        break;
    +      }
    +    }
    +    return syslogJson;
    +  }
    +
    +  @Override
    +  public void init() {
    +    LOG.info("RegularExpressions parser initialised.");
    +  }
    +
    +  public void validateConfig() {
    +    if (getFields() == null) {
    +      LOG.error("Invalid config :  fields is missing in parserConfig");
    +      throw new IllegalStateException("Invalid config :fields is missing in parserConfig");
    +    }
    +    if (recordTypePattern == null) {
    +      LOG.error("Invalid config :recordTypeRegex is missing in parserConfig");
    +      throw new IllegalStateException("Invalid config :recordTypeRegex is missing in parserConfig");
    +    }
    +  }
    +
    +  private void convertCamelCaseToUnderScore(Map<String, Object> json) {
    +    final Map<String, String> oldKeyNewKeyMap = new HashMap<>();
    +    for (final Map.Entry<String, Object> entry : json.entrySet()) {
    +      if (capitalLettersPattern.matcher(entry.getKey()).matches()) {
    +        oldKeyNewKeyMap.put(entry.getKey(), convert(entry.getKey()));
    +      }
    +    }
    +    oldKeyNewKeyMap.forEach((oldKey, newKey) -> json.put(newKey, json.remove(oldKey)));
    +  }
    +
    +  public String convert(String oldKey) {
    +    final List<Character> chars = new ArrayList<>();
    +    for (final char c : oldKey.toCharArray()) {
    +      if (isCapital(c)) {
    +        chars.add('_');
    +        chars.add((char) (c + 32));
    --- End diff --
    
    This code has been removed as part of refactoring. Thanks.


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by justinleet <gi...@git.apache.org>.
Github user justinleet commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r232352458
  
    --- Diff: metron-platform/metron-parsers/src/main/java/org/apache/metron/parsers/regex/RegularExpressionsParser.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more contributor license
    + * agreements. See the NOTICE file distributed with this work for additional information regarding
    + * copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance with the License. You may obtain a
    + * copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software distributed under the License
    + * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
    + * or implied. See the License for the specific language governing permissions and limitations under
    + * the License.
    + */
    +
    +package org.apache.metron.parsers.regex;
    +
    +import java.nio.charset.Charset;
    +import java.text.ParseException;
    +import java.util.ArrayList;
    +import java.util.Arrays;
    +import java.util.HashMap;
    +import java.util.HashSet;
    +import java.util.LinkedHashMap;
    +import java.util.List;
    +import java.util.Map;
    +import java.util.Optional;
    +import java.util.Set;
    +import java.util.TreeSet;
    +import java.util.regex.Matcher;
    +import java.util.regex.Pattern;
    +import java.util.stream.Collectors;
    +import org.apache.commons.lang3.StringUtils;
    +import org.apache.metron.common.Constants;
    +import org.apache.metron.parsers.BasicParser;
    +import org.apache.metron.common.Constants.ParserConfigConstants;
    +import org.json.simple.JSONObject;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +//@formatter:off
    +/**
    + * General purpose class to parse unstructured text message into a json object. This class parses
    + * the message as per supplied parser config as part of sensor config. Sensor parser config example:
    + *
    + * <pre>
    + * <code>
    + * "convertCamelCaseToUnderScore": true,
    + * "recordTypeRegex": "(?&lt;process&gt;(?&lt;=\\s)\\b(kernel|syslog)\\b(?=\\[|:))",
    + * "messageHeaderRegex": "(?&lt;syslogpriority&gt;(?&lt;=^&lt;)\\d{1,4}(?=&gt;)).*?(?&lt;timestamp>(?&lt;=&gt;)[A-Za-z]{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(?&lt;syslogHost&gt;(?&lt;=\\s).*?(?=\\s))",
    + * "fields": [
    + * {
    + * "recordType": "kernel",
    + * "regex": ".*(?&lt;eventInfo&gt;(?&lt;=\\]|\\w\\:).*?(?=$))"
    + * },
    + * {
    + * "recordType": "syslog",
    + * "regex": ".*(?&lt;processid&gt;(?&lt;=PID\\s=\\s).*?(?=\\sLine)).*(?&lt;filePath&gt;(?&lt;=64\\s)\/([A-Za-z0-9_-]+\/)+(?=\\w))(?&lt;fileName&gt;.*?(?=\")).*(?&lt;eventInfo&gt;(?&lt;=\").*?(?=$))"
    + * }
    + * ]
    + * </code>
    + * </pre>
    + *
    + * Note: messageHeaderRegex could be specified as lists also e.g.
    + *
    + * <pre>
    + * <code>
    + * "messageHeaderRegex": [
    + * "regular expression 1",
    + * "regular expression 2"
    + * ]
    + * </code>
    + * </pre>
    + *
    + * Where <strong>regular expression 1</strong> are valid regular expressions and may have named
    + * groups, which would be extracted into fields. This list will be evaluated in order until a
    + * matching regular expression is found.<br>
    + * <br>
    + *
    + * <strong>Configuration fields explanation</strong>
    + *
    + * <pre>
    + * recordTypeRegex : used to specify a regular expression to distinctly identify a record type.
    + * messageHeaderRegex :  used to specify a regular expression to extract fields from a message part which is common across all the messages.
    + * e.g. rhel logs looks like
    + * <code>
    + * <7>Jun 26 16:18:01 hostName kernel: SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
    + * </code>
    + * <br>
    + * </pre>
    + *
    + * Here message structure (<7>Jun 26 16:18:01 hostName kernel) is common across all messages.
    + * Hence messageHeaderRegex could be used to extract fields from this part.
    + *
    + * fields : json list of objects containing recordType and regex. regex could be a further list e.g.
    + *
    + * <pre>
    + * <code>
    + * "regex":  [ "record type specific regular expression 1",
    + *             "record type specific regular expression 2"]
    + *
    + * </code>
    + * </pre>
    + *
    + * <strong>Limitation</strong> <br>
    + * Currently the named groups in java regular expressions have a limitation. Only following
    + * characters could be used to name a named group. A capturing group can also be assigned a "name",
    + * a named-capturing group, and then be back-referenced later by the "name". Group names are
    + * composed of the following characters. The first character must be a letter.
    + *
    + * <pre>
    + * <code>
    + * The uppercase letters 'A' through 'Z' ('\u0041' through '\u005a'),
    + * The lowercase letters 'a' through 'z' ('\u0061' through '\u007a'),
    + * The digits '0' through '9' ('\u0030' through '\u0039'),
    + * </code>
    + * </pre>
    + *
    + * This means that an _ (underscore), cannot be used as part of a named group name. E.g. this is an
    + * invalid regular expression <code>.*(?&lt;event_info&gt;(?&lt;=\\]|\\w\\:).*?(?=$))</code>
    + *
    + * However, this limitation can be easily overcome by adding a parser configuration setting.
    + *
    + * <code>
    + *  "convertCamelCaseToUnderScore": true,
    + * </code>
    + * If the above property is added to the sensor parser configuration, in the parserConfig object, this parser will automatically convert all camel case property names to underscore-separated names.
    + * For example, the following conversions will automatically happen:
    + *
    + * <code>
    + * ipSrcAddr -> ip_src_addr
    + * ipDstAddr -> ip_dst_addr
    + * ipSrcPort -> ip_src_port
    + * </code>
    + * etc.
    + */
    +//@formatter:on
    +public class RegularExpressionsParser extends BasicParser {
    +
    +  private static Logger LOG = LoggerFactory.getLogger(RegularExpressionsParser.class);
    +
    +  private static final Charset UTF_8 = Charset.forName("UTF-8");
    +
    +  private List<Map<String, Object>> fields;
    +  private Map<String, Object> parserConfig;
    +  private final Pattern namedGroupPattern = Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>");
    +  Pattern capitalLettersPattern = Pattern.compile("^.*[A-Z]+.*$");
    +  private Pattern recordTypePattern;
    +  private final Set<String> recordTypePatternNamedGroups = new HashSet<>();
    +  private final Map<String, Map<Pattern, Set<String>>> recordTypePatternMap = new HashMap<>();
    +  private final Map<Pattern, Set<String>> syslogPatternsMap = new HashMap<>();
    +
    +  /**
    +   * Parses an unstructured text message into a json object based upon the regular expression
    +   * configuration supplied.
    +   *
    +   * @param rawMessage incoming unstructured raw text.
    +   *
    +   * @return List of json parsed json objects. In this case list will have a single element only.
    +   */
    +  @Override
    +  public List<JSONObject> parse(byte[] rawMessage) {
    +    String originalMessage = null;
    +    try {
    +      originalMessage = new String(rawMessage, UTF_8).trim();
    +      LOG.debug(" raw message. {}", originalMessage);
    +      if (originalMessage.isEmpty()) {
    +        LOG.warn("Message is empty.");
    +        return Arrays.asList(new JSONObject());
    +      }
    +    } catch (final Exception e) {
    +      LOG.error("[Metron] Could not read raw message. {} " + originalMessage, e);
    +      throw new RuntimeException(e.getMessage(), e);
    +    }
    +
    +    try {
    +      final JSONObject parsedJson = new JSONObject();
    +      if (syslogPatternsMap.size() > 0) {
    +        parsedJson.putAll(extractHeaderFields(originalMessage));
    +      }
    +      parsedJson.putAll(parse(originalMessage));
    +      parsedJson.put(Constants.Fields.ORIGINAL.getName(), originalMessage);
    +      applyFieldTransformations(parsedJson);
    +      return Arrays.asList(parsedJson);
    +    } catch (final ParseException e) {
    +      LOG.error("Error occured in parsing. original message : " + originalMessage, e);
    +      throw new RuntimeException(e.getMessage(), e);
    +    }
    +  }
    +
    +  private void applyFieldTransformations(JSONObject parsedJson) {
    +    if (getParserConfig()
    +        .get(ParserConfigConstants.CONVERT_CAMELCASE_TO_UNDERSCORE.getName()) != null
    +        && (Boolean) getParserConfig()
    +            .get(ParserConfigConstants.CONVERT_CAMELCASE_TO_UNDERSCORE.getName())) {
    +      convertCamelCaseToUnderScore(parsedJson);
    +    }
    +
    +  }
    +
    +  // @formatter:off
    +  /**
    +   * This method is called during the parser initialization. It parses the parser
    +   * configuration and configures the parser accordingly. It then initializes
    +   * instance variables.
    +   *
    +   * @param parserConfig ParserConfig(Map<String, Object>) supplied to the sensor.
    +   * @see org.apache.metron.parsers.interfaces.Configurable#configure(java.util.Map)<br>
    +   *      <br>
    +   */
    +  // @formatter:on
    +  @Override
    +  public void configure(Map<String, Object> parserConfig) {
    +    setParserConfig(parserConfig);
    +    setFields(
    +        (List<Map<String, Object>>) getParserConfig().get(ParserConfigConstants.FIELDS.getName()));
    +
    +    setRecordTypePattern(
    +        (String) getParserConfig().get(ParserConfigConstants.RECORD_TYPE_REGEX.getName()));
    +    recordTypePatternNamedGroups.addAll(getNamedGroups(
    +        (String) getParserConfig().get(ParserConfigConstants.RECORD_TYPE_REGEX.getName())));
    +    final List<Map<String, Object>> fields =
    +        (List<Map<String, Object>>) getParserConfig().get(ParserConfigConstants.FIELDS.getName());
    +
    +    configureRecordTypePatterns(fields);
    +
    +    configureMessageHeaderPattern();
    +
    +    validateConfig();
    +  }
    +
    +  private void configureMessageHeaderPattern() {
    +    if (getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName()) != null) {
    +      if (getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName()) instanceof List) {
    +        final List<String> syslogPatternList =
    +            (List<String>) getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName());
    +        for (final String syslogPatternStr : syslogPatternList) {
    +          syslogPatternsMap.put(Pattern.compile(syslogPatternStr),
    +              getNamedGroups(syslogPatternStr));
    +        }
    +      } else if (getParserConfig()
    +          .get(ParserConfigConstants.MESSAGE_HEADER.getName()) instanceof String) {
    +        final String syslogPatternStr =
    +            (String) getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName());
    +        if (StringUtils.isNotBlank(syslogPatternStr)) {
    +          syslogPatternsMap.put(Pattern.compile(syslogPatternStr),
    +              getNamedGroups(syslogPatternStr));
    +        }
    +      }
    +    }
    +  }
    +
    +  private void configureRecordTypePatterns(List<Map<String, Object>> fields) {
    +
    +    for (final Map<String, Object> field : fields) {
    +      if (field.get(ParserConfigConstants.RECORD_TYPE.getName()) != null
    +          && field.get(ParserConfigConstants.REGEX.getName()) != null) {
    +        final String recordType =
    +            ((String) field.get(ParserConfigConstants.RECORD_TYPE.getName())).toLowerCase();
    +        recordTypePatternMap.put(recordType, new LinkedHashMap<Pattern, Set<String>>());
    +        if (field.get(ParserConfigConstants.REGEX.getName()) instanceof List) {
    +          final List<String> regexList =
    +              (List<String>) field.get(ParserConfigConstants.REGEX.getName());
    +          regexList.forEach(s -> {
    +            recordTypePatternMap.get(recordType).put(Pattern.compile(s), getNamedGroups(s));
    +          });
    +        } else if (field.get(ParserConfigConstants.REGEX.getName()) instanceof String) {
    +          recordTypePatternMap.get(recordType).put(
    +              Pattern.compile((String) field.get(ParserConfigConstants.REGEX.getName())),
    +              getNamedGroups((String) field.get(ParserConfigConstants.REGEX.getName())));
    +        }
    +      }
    +    }
    +  }
    +
    +  private void setRecordTypePattern(String recordTypeRegex) {
    +    if (recordTypeRegex != null) {
    +      recordTypePattern = Pattern.compile(recordTypeRegex);
    +    }
    +  }
    +
    +  private JSONObject parse(String originalMessage) throws ParseException {
    --- End diff --
    
    This method never actually throws `ParseException`, which means the try/catch at line 185 is never used.
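
    For illustration, a minimal sketch of what the cleanup might look like. `NoCheckedExceptionSketch`, `parseFields` and the simplified record-type pattern are hypothetical stand-ins rather than the PR's code, and a plain `Map` replaces `JSONObject` so the example runs on its own. Once the `throws` clause is dropped, javac rejects a `catch (ParseException e)` around the call, because a checked exception cannot be caught where the try block cannot throw it. The clause and the catch therefore have to go together.

    ```java
    import java.nio.charset.StandardCharsets;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class NoCheckedExceptionSketch {

      // Hypothetical record-type pattern, far simpler than the one in the PR.
      private static final Pattern RECORD_TYPE = Pattern.compile("(?<process>kernel|syslog)");

      // Stand-in for the private parse(String): nothing in here throws a checked
      // exception, so the method declares no throws clause.
      public static Map<String, Object> parseFields(String originalMessage) {
        Map<String, Object> fields = new LinkedHashMap<>();
        Matcher matcher = RECORD_TYPE.matcher(originalMessage);
        if (matcher.find()) {
          fields.put("process", matcher.group("process"));
        }
        return fields;
      }

      // Stand-in for parse(byte[]): with no checked exception to handle, the
      // try/catch around the inner call is no longer needed.
      public static Map<String, Object> parse(byte[] rawMessage) {
        String originalMessage = new String(rawMessage, StandardCharsets.UTF_8).trim();
        Map<String, Object> parsed = new LinkedHashMap<>(parseFields(originalMessage));
        parsed.put("original_string", originalMessage);
        return parsed;
      }

      public static void main(String[] args) {
        byte[] raw = "<7>Jun 26 16:18:01 hostName kernel: SELinux: initialized"
            .getBytes(StandardCharsets.UTF_8);
        System.out.println(parse(raw));
      }
    }
    ```

    Unchecked failures (a bad cast, a null config entry) would still propagate as runtime exceptions, so the caller loses nothing by dropping the catch.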


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by jagdeepsingh2 <gi...@git.apache.org>.
Github user jagdeepsingh2 commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r234872742
  
    --- Diff: metron-platform/metron-common/src/main/java/org/apache/metron/common/Constants.java ---
    @@ -127,5 +127,48 @@ public String getType() {
         }
       }
     
    +   public static enum ParserConfigConstants {
    +    //@formatter:off
    +    RECORD_TYPE("recordType"),
    +    RECORD_TYPE_REGEX("recordTypeRegex"),
    +    REGEX("regex"),
    +    FIELDS("fields"),
    +    MESSAGE_HEADER("messageHeaderRegex"),
    +    ORIGINAL("original_string"),
    +    TIMESTAMP("timestamp"),
    +    CONVERT_CAMELCASE_TO_UNDERSCORE("convertCamelCaseToUnderScore");
    +    //@formatter:on
    +    private final String name;
    +    private static Map<String, ParserConfigConstants> nameToField;
    +
    +    static {
    +      nameToField = new HashMap<>();
    +      for (final ParserConfigConstants f : ParserConfigConstants.values()) {
    +        nameToField.put(f.getName(), f);
    +      }
    +    }
    +
    +
    +    ParserConfigConstants(String name) {
    +      this.name = name;
    +    }
    +
    +    public String getName() {
    +      return name;
    +    }
    +
    +    static {
    --- End diff --
    
    Removed the duplicate.


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by jagdeepsingh2 <gi...@git.apache.org>.
Github user jagdeepsingh2 commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r237716600
  
    --- Diff: metron-platform/metron-parsers/README.md ---
    @@ -52,6 +52,62 @@ There are two general types types of parsers:
            This is using the default value for `wrapEntityName` if that property is not set.
         * `wrapEntityName` : Sets the name to use when wrapping JSON using `wrapInEntityArray`.  The `jsonpQuery` should reference this name.
         * A field called `timestamp` is expected to exist and, if it does not, then current time is inserted.  
    +  * Regular Expressions Parser
    +      * `recordTypeRegex` : A regular expression to uniquely identify a record type.
    +      * `messageHeaderRegex` : A regular expression used to extract fields from a message part which is common across all the messages.
    +      * `convertCamelCaseToUnderScore` : If this property is set to true, this parser will automatically convert all the camel case property names to underscore seperated. 
    +          For example, following convertions will automatically happen:
    +
    +          ```
    +          ipSrcAddr -> ip_src_addr
    +          ipDstAddr -> ip_dst_addr
    +          ipSrcPort -> ip_src_port
    +          ```
    +          Note this property may be necessary, because java does not support underscores in the named group names. So in case your property naming conventions requires underscores in property names, use this property.
    +          
    +      * `fields` : A json list of maps contaning a record type to regular expression mapping.
    +      
    +      A complete configuration example would look like:
    +      
    +      ```json
    +      "convertCamelCaseToUnderScore": true, 
    +      "recordTypeRegex": "kernel|syslog",
    +      "messageHeaderRegex": "(<syslogPriority>(<=^&lt;)\\d{1,4}(?=>)).*?(<timestamp>(<=>)[A-Za-z] {3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(<syslogHost>(<=\\s).*?(?=\\s))",
    +      "fields": [
    +        {
    +          "recordType": "kernel",
    +          "regex": ".*(<eventInfo>(<=\\]|\\w\\:).*?(?=$))"
    +        },
    +        {
    +          "recordType": "syslog",
    +          "regex": ".*(<processid>(<=PID\\s=\\s).*?(?=\\sLine)).*(<filePath>(<=64\\s)\/([A-Za-z0-9_-]+\/)+(?=\\w))        (<fileName>.*?(?=\")).*(<eventInfo>(<=\").*?(?=$))"
    --- End diff --
    
    I would not say that every syslog message is expected to contain these fields. Rather, from **this form** of syslog message we expect to extract these fields (processid, fileName, filePath and eventInfo).
    
    This configuration has been extracted from our use case. Our security experts found this form of syslog message to be important from a security perspective. There could be other forms of syslog messages that we don't care about.
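
    To illustrate the point about message forms, here is a minimal sketch. The class `MessageFormSketch` and its regex are hypothetical and are not the configuration from this PR. It assumes, as the parser's `extractNamedGroups` does, that a record-type regex is applied with `Matcher.matches()`, so fields come out only when the whole message has the expected form. Any other form yields no fields.

    ```java
    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class MessageFormSketch {

      // Hypothetical regex for one specific message form (an sshd login line).
      // Group names are camelCase because Java group names cannot contain underscores.
      private static final Pattern SSHD_ACCEPTED = Pattern.compile(
          ".*Accepted publickey for (?<dstUserId>\\S+) from (?<ipSrcAddr>\\S+)"
              + " port (?<ipSrcPort>\\d+) (?<appProtocol>\\S+)$");

      private static final String[] GROUPS =
          {"dstUserId", "ipSrcAddr", "ipSrcPort", "appProtocol"};

      // Returns the named groups if the message has the expected form,
      // otherwise an empty map.
      public static Map<String, String> extract(String message) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = SSHD_ACCEPTED.matcher(message);
        if (m.matches()) {
          for (String group : GROUPS) {
            fields.put(group, m.group(group));
          }
        }
        return fields;
      }

      public static void main(String[] args) {
        // A message of the expected form: all four fields are extracted.
        System.out.println(extract(
            "<38>Jun 20 15:01:17 deviceName sshd[11672]: Accepted publickey for prod "
                + "from 22.22.22.22 port 55555 ssh2"));
        // A message of a different form: nothing is extracted.
        System.out.println(extract(
            "<7>Jun 26 16:18:01 hostName kernel: SELinux: initialized"));
      }
    }
    ```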


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by nickwallen <gi...@git.apache.org>.
Github user nickwallen commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r237875330
  
    --- Diff: metron-platform/metron-parsers/src/test/java/org/apache/metron/parsers/regex/RegularExpressionsParserTest.java ---
    @@ -0,0 +1,152 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more contributor license
    + * agreements. See the NOTICE file distributed with this work for additional information regarding
    + * copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance with the License. You may obtain a
    + * copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software distributed under the License
    + * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
    + * or implied. See the License for the specific language governing permissions and limitations under
    + * the License.
    + */
    +package org.apache.metron.parsers.regex;
    +
    +import org.json.simple.JSONObject;
    +import org.json.simple.parser.JSONParser;
    +import org.junit.Before;
    +import org.junit.Test;
    +
    +import java.nio.file.Files;
    +import java.nio.file.Paths;
    +import java.util.HashMap;
    +import java.util.List;
    +import java.util.Map;
    +
    +import static org.junit.Assert.assertTrue;
    +
    +public class RegularExpressionsParserTest {
    +
    +  private RegularExpressionsParser regularExpressionsParser;
    +  private JSONObject parserConfig;
    +
    +  @Before
    +  public void setUp() throws Exception {
    +    regularExpressionsParser = new RegularExpressionsParser();
    +  }
    +
    +  @Test
    +  public void testSSHDParse() throws Exception {
    +    String message =
    +        "<38>Jun 20 15:01:17 deviceName sshd[11672]: Accepted publickey for prod from 22.22.22.22 port 55555 ssh2";
    +
    +    parserConfig = getJsonConfig(
    +        Paths.get("src/test/resources/config/RegularExpressionsParserConfig.json").toString());
    --- End diff --
    
    If the parser does fail to parse a given message, we need to make sure that the error kicked out to the error topic carries a helpful message, stack trace, etc. Otherwise, it will be impossible for a user to determine why the parser failed to parse the message.
    
    While adding the timestamp is probably a good addition, I don't know that it really solves the problem here. Right now, I don't really know if the problem is in your parser or in the parser infrastructure, but it is something that I want to make sure we track down.


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by nickwallen <gi...@git.apache.org>.
Github user nickwallen commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r237585161
  
    --- Diff: metron-platform/metron-parsers/src/test/java/org/apache/metron/parsers/regex/RegularExpressionsParserTest.java ---
    @@ -0,0 +1,152 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more contributor license
    + * agreements. See the NOTICE file distributed with this work for additional information regarding
    + * copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance with the License. You may obtain a
    + * copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software distributed under the License
    + * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
    + * or implied. See the License for the specific language governing permissions and limitations under
    + * the License.
    + */
    +package org.apache.metron.parsers.regex;
    +
    +import org.json.simple.JSONObject;
    +import org.json.simple.parser.JSONParser;
    +import org.junit.Before;
    +import org.junit.Test;
    +
    +import java.nio.file.Files;
    +import java.nio.file.Paths;
    +import java.util.HashMap;
    +import java.util.List;
    +import java.util.Map;
    +
    +import static org.junit.Assert.assertTrue;
    +
    +public class RegularExpressionsParserTest {
    +
    +  private RegularExpressionsParser regularExpressionsParser;
    +  private JSONObject parserConfig;
    +
    +  @Before
    +  public void setUp() throws Exception {
    +    regularExpressionsParser = new RegularExpressionsParser();
    +  }
    +
    +  @Test
    +  public void testSSHDParse() throws Exception {
    +    String message =
    +        "<38>Jun 20 15:01:17 deviceName sshd[11672]: Accepted publickey for prod from 22.22.22.22 port 55555 ssh2";
    +
    +    parserConfig = getJsonConfig(
    +        Paths.get("src/test/resources/config/RegularExpressionsParserConfig.json").toString());
    +    regularExpressionsParser.configure(parserConfig);
    +    JSONObject parsed = parse(message);
    +    // Expected
    +    Map<String, Object> expectedJson = new HashMap<>();
    +    expectedJson.put("device_name", "deviceName");
    +    expectedJson.put("dst_process_name", "sshd");
    +    expectedJson.put("dst_process_id", "11672");
    +    expectedJson.put("dst_user_id", "prod");
    +    expectedJson.put("ip_src_addr", "22.22.22.22");
    +    expectedJson.put("ip_src_port", "55555");
    +    expectedJson.put("app_protocol", "ssh2");
    +    assertTrue(validate(expectedJson, parsed));
    +
    +  }
    +
    +  @Test
    +  public void testNoMessageHeaderRegex() throws Exception {
    +    String message =
    +        "<38>Jun 20 15:01:17 deviceName sshd[11672]: Accepted publickey for prod from 22.22.22.22 port 55555 ssh2";
    +    parserConfig = getJsonConfig(
    +        Paths.get("src/test/resources/config/RegularExpressionsNoMessageHeaderParserConfig.json")
    +            .toString());
    +    regularExpressionsParser.configure(parserConfig);
    +    JSONObject parsed = parse(message);
    +    // Expected
    +    Map<String, Object> expectedJson = new HashMap<>();
    +    expectedJson.put("dst_process_name", "sshd");
    +    expectedJson.put("dst_process_id", "11672");
    +    expectedJson.put("dst_user_id", "prod");
    +    expectedJson.put("ip_src_addr", "22.22.22.22");
    +    expectedJson.put("ip_src_port", "55555");
    +    expectedJson.put("app_protocol", "ssh2");
    +    assertTrue(validate(expectedJson, parsed));
    --- End diff --
    
    I don't get why we need this `validate` method, which seems rather complex. Can't we just let JUnit do this?
    
    Instead of building your expected message and then calling `validate`, you would just do this...
    ```
    assertEquals("55555", parsed.get("ip_src_port"));
    ```
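
    Extended to every expected field, that could look roughly like the sketch below. To keep it runnable without JUnit on the classpath, `assertEquals` here is a small local stand-in for `org.junit.Assert.assertEquals`, and `sampleParsed()` is a hypothetical substitute for the `parse(message)` result in the real test. The benefit is that a failure names the exact field and value that differ, rather than a bare `assertTrue` failure.

    ```java
    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.Objects;

    public class PerFieldAssertions {

      // Local stand-in for org.junit.Assert.assertEquals (an assumption made so the
      // sketch is self-contained); the real test would use JUnit's own method.
      public static void assertEquals(Object expected, Object actual) {
        if (!Objects.equals(expected, actual)) {
          throw new AssertionError("expected:<" + expected + "> but was:<" + actual + ">");
        }
      }

      // Hypothetical parsed output; in the real test this comes from parse(message).
      public static Map<String, Object> sampleParsed() {
        Map<String, Object> parsed = new LinkedHashMap<>();
        parsed.put("dst_process_name", "sshd");
        parsed.put("dst_process_id", "11672");
        parsed.put("dst_user_id", "prod");
        parsed.put("ip_src_addr", "22.22.22.22");
        parsed.put("ip_src_port", "55555");
        parsed.put("app_protocol", "ssh2");
        return parsed;
      }

      public static void main(String[] args) {
        Map<String, Object> parsed = sampleParsed();
        // One assertion per field, so a failure points at the offending field.
        assertEquals("sshd", parsed.get("dst_process_name"));
        assertEquals("11672", parsed.get("dst_process_id"));
        assertEquals("prod", parsed.get("dst_user_id"));
        assertEquals("22.22.22.22", parsed.get("ip_src_addr"));
        assertEquals("55555", parsed.get("ip_src_port"));
        assertEquals("ssh2", parsed.get("app_protocol"));
        System.out.println("all field assertions passed");
      }
    }
    ```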



---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by justinleet <gi...@git.apache.org>.
Github user justinleet commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r232353483
  
    --- Diff: metron-platform/metron-parsers/src/main/java/org/apache/metron/parsers/regex/RegularExpressionsParser.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more contributor license
    + * agreements. See the NOTICE file distributed with this work for additional information regarding
    + * copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance with the License. You may obtain a
    + * copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software distributed under the License
    + * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
    + * or implied. See the License for the specific language governing permissions and limitations under
    + * the License.
    + */
    +
    +package org.apache.metron.parsers.regex;
    +
    +import java.nio.charset.Charset;
    +import java.text.ParseException;
    +import java.util.ArrayList;
    +import java.util.Arrays;
    +import java.util.HashMap;
    +import java.util.HashSet;
    +import java.util.LinkedHashMap;
    +import java.util.List;
    +import java.util.Map;
    +import java.util.Optional;
    +import java.util.Set;
    +import java.util.TreeSet;
    +import java.util.regex.Matcher;
    +import java.util.regex.Pattern;
    +import java.util.stream.Collectors;
    +import org.apache.commons.lang3.StringUtils;
    +import org.apache.metron.common.Constants;
    +import org.apache.metron.parsers.BasicParser;
    +import org.apache.metron.common.Constants.ParserConfigConstants;
    +import org.json.simple.JSONObject;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +//@formatter:off
    +/**
    + * General purpose class to parse unstructured text message into a json object. This class parses
    + * the message as per supplied parser config as part of sensor config. Sensor parser config example:
    + *
    + * <pre>
    + * <code>
    + * "convertCamelCaseToUnderScore": true,
    + * "recordTypeRegex": "(?&lt;process&gt;(?&lt;=\\s)\\b(kernel|syslog)\\b(?=\\[|:))",
    + * "messageHeaderRegex": "(?&lt;syslogpriority&gt;(?&lt;=^&lt;)\\d{1,4}(?=&gt;)).*?(?&lt;timestamp>(?&lt;=&gt;)[A-Za-z]{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(?&lt;syslogHost&gt;(?&lt;=\\s).*?(?=\\s))",
    + * "fields": [
    + * {
    + * "recordType": "kernel",
    + * "regex": ".*(?&lt;eventInfo&gt;(?&lt;=\\]|\\w\\:).*?(?=$))"
    + * },
    + * {
    + * "recordType": "syslog",
    + * "regex": ".*(?&lt;processid&gt;(?&lt;=PID\\s=\\s).*?(?=\\sLine)).*(?&lt;filePath&gt;(?&lt;=64\\s)\/([A-Za-z0-9_-]+\/)+(?=\\w))(?&lt;fileName&gt;.*?(?=\")).*(?&lt;eventInfo&gt;(?&lt;=\").*?(?=$))"
    + * }
    + * ]
    + * </code>
    + * </pre>
    + *
    + * Note: messageHeaderRegex could be specified as lists also e.g.
    + *
    + * <pre>
    + * <code>
    + * "messageHeaderRegex": [
    + * "regular expression 1",
    + * "regular expression 2"
    + * ]
    + * </code>
    + * </pre>
    + *
    + * Where <strong>regular expression 1</strong> are valid regular expressions and may have named
    + * groups, which would be extracted into fields. This list will be evaluated in order until a
    + * matching regular expression is found.<br>
    + * <br>
    + *
    + * <strong>Configuration fields explanation</strong>
    + *
    + * <pre>
    + * recordTypeRegex : used to specify a regular expression to distinctly identify a record type.
    + * messageHeaderRegex :  used to specify a regular expression to extract fields from a message part which is common across all the messages.
    + * e.g. rhel logs looks like
    + * <code>
    + * <7>Jun 26 16:18:01 hostName kernel: SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
    + * </code>
    + * <br>
    + * </pre>
    + *
    + * Here message structure (<7>Jun 26 16:18:01 hostName kernel) is common across all messages.
    + * Hence messageHeaderRegex could be used to extract fields from this part.
    + *
    + * fields : json list of objects containing recordType and regex. regex could be a further list e.g.
    + *
    + * <pre>
    + * <code>
    + * "regex":  [ "record type specific regular expression 1",
    + *             "record type specific regular expression 2"]
    + *
    + * </code>
    + * </pre>
    + *
    + * <strong>Limitation</strong> <br>
    + * Currently the named groups in java regular expressions have a limitation. Only following
    + * characters could be used to name a named group. A capturing group can also be assigned a "name",
    + * a named-capturing group, and then be back-referenced later by the "name". Group names are
    + * composed of the following characters. The first character must be a letter.
    + *
    + * <pre>
    + * <code>
    + * The uppercase letters 'A' through 'Z' ('\u0041' through '\u005a'),
    + * The lowercase letters 'a' through 'z' ('\u0061' through '\u007a'),
    + * The digits '0' through '9' ('\u0030' through '\u0039'),
    + * </code>
    + * </pre>
    + *
    + * This means that an _ (underscore), cannot be used as part of a named group name. E.g. this is an
    + * invalid regular expression <code>.*(?&lt;event_info&gt;(?&lt;=\\]|\\w\\:).*?(?=$))</code>
    + *
    + * However, this limitation can be easily overcome by adding a parser configuration setting.
    + *
    + * <code>
    + *  "convertCamelCaseToUnderScore": true,
    + * <code>
    + * If above property is added to the sensor parser configuration, in parserConfig object, this parser will automatically convert all the camel case property names to underscore seperated.
    + * For example, following convertions will automatically happen:
    + *
    + * <code>
    + * ipSrcAddr -> ip_src_addr
    + * ipDstAddr -> ip_dst_addr
    + * ipSrcPort -> ip_src_port
    + * <code>
    + * etc.
    + */
    +//@formatter:on
    +public class RegularExpressionsParser extends BasicParser {
    +
    +  private static Logger LOG = LoggerFactory.getLogger(RegularExpressionsParser.class);
    +
    +  private static final Charset UTF_8 = Charset.forName("UTF-8");
    +
    +  private List<Map<String, Object>> fields;
    +  private Map<String, Object> parserConfig;
    +  private final Pattern namedGroupPattern = Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>");
    +  Pattern capitalLettersPattern = Pattern.compile("^.*[A-Z]+.*$");
    +  private Pattern recordTypePattern;
    +  private final Set<String> recordTypePatternNamedGroups = new HashSet<>();
    +  private final Map<String, Map<Pattern, Set<String>>> recordTypePatternMap = new HashMap<>();
    +  private final Map<Pattern, Set<String>> syslogPatternsMap = new HashMap<>();
    +
    +  /**
    +   * Parses an unstructured text message into a json object based upon the regular expression
    +   * configuration supplied.
    +   *
    +   * @param rawMessage incoming unstructured raw text.
    +   *
    +   * @return List of json parsed json objects. In this case list will have a single element only.
    +   */
    +  @Override
    +  public List<JSONObject> parse(byte[] rawMessage) {
    +    String originalMessage = null;
    +    try {
    +      originalMessage = new String(rawMessage, UTF_8).trim();
    +      LOG.debug(" raw message. {}", originalMessage);
    +      if (originalMessage.isEmpty()) {
    +        LOG.warn("Message is empty.");
    +        return Arrays.asList(new JSONObject());
    +      }
    +    } catch (final Exception e) {
    +      LOG.error("[Metron] Could not read raw message. {} " + originalMessage, e);
    +      throw new RuntimeException(e.getMessage(), e);
    +    }
    +
    +    try {
    +      final JSONObject parsedJson = new JSONObject();
    +      if (syslogPatternsMap.size() > 0) {
    +        parsedJson.putAll(extractHeaderFields(originalMessage));
    +      }
    +      parsedJson.putAll(parse(originalMessage));
    +      parsedJson.put(Constants.Fields.ORIGINAL.getName(), originalMessage);
    +      applyFieldTransformations(parsedJson);
    +      return Arrays.asList(parsedJson);
    +    } catch (final ParseException e) {
    +      LOG.error("Error occured in parsing. original message : " + originalMessage, e);
    +      throw new RuntimeException(e.getMessage(), e);
    +    }
    +  }
    +
    +  private void applyFieldTransformations(JSONObject parsedJson) {
    +    if (getParserConfig()
    +        .get(ParserConfigConstants.CONVERT_CAMELCASE_TO_UNDERSCORE.getName()) != null
    +        && (Boolean) getParserConfig()
    +            .get(ParserConfigConstants.CONVERT_CAMELCASE_TO_UNDERSCORE.getName())) {
    +      convertCamelCaseToUnderScore(parsedJson);
    +    }
    +
    +  }
    +
    +  // @formatter:off
    +  /**
    +   * This method is called during the parser initialization. It parses the parser
    +   * configuration and configures the parser accordingly. It then initializes
    +   * instance variables.
    +   *
    +   * @param parserConfig ParserConfig(Map<String, Object>) supplied to the sensor.
    +   * @see org.apache.metron.parsers.interfaces.Configurable#configure(java.util.Map)<br>
    +   *      <br>
    +   */
    +  // @formatter:on
    +  @Override
    +  public void configure(Map<String, Object> parserConfig) {
    +    setParserConfig(parserConfig);
    +    setFields(
    +        (List<Map<String, Object>>) getParserConfig().get(ParserConfigConstants.FIELDS.getName()));
    +
    +    setRecordTypePattern(
    +        (String) getParserConfig().get(ParserConfigConstants.RECORD_TYPE_REGEX.getName()));
    +    recordTypePatternNamedGroups.addAll(getNamedGroups(
    +        (String) getParserConfig().get(ParserConfigConstants.RECORD_TYPE_REGEX.getName())));
    +    final List<Map<String, Object>> fields =
    +        (List<Map<String, Object>>) getParserConfig().get(ParserConfigConstants.FIELDS.getName());
    +
    +    configureRecordTypePatterns(fields);
    +
    +    configureMessageHeaderPattern();
    +
    +    validateConfig();
    +  }
    +
    +  private void configureMessageHeaderPattern() {
    +    if (getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName()) != null) {
    +      if (getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName()) instanceof List) {
    +        final List<String> syslogPatternList =
    +            (List<String>) getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName());
    +        for (final String syslogPatternStr : syslogPatternList) {
    +          syslogPatternsMap.put(Pattern.compile(syslogPatternStr),
    +              getNamedGroups(syslogPatternStr));
    +        }
    +      } else if (getParserConfig()
    +          .get(ParserConfigConstants.MESSAGE_HEADER.getName()) instanceof String) {
    +        final String syslogPatternStr =
    +            (String) getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName());
    +        if (StringUtils.isNotBlank(syslogPatternStr)) {
    +          syslogPatternsMap.put(Pattern.compile(syslogPatternStr),
    +              getNamedGroups(syslogPatternStr));
    +        }
    +      }
    +    }
    +  }
    +
    +  private void configureRecordTypePatterns(List<Map<String, Object>> fields) {
    +
    +    for (final Map<String, Object> field : fields) {
    +      if (field.get(ParserConfigConstants.RECORD_TYPE.getName()) != null
    +          && field.get(ParserConfigConstants.REGEX.getName()) != null) {
    +        final String recordType =
    +            ((String) field.get(ParserConfigConstants.RECORD_TYPE.getName())).toLowerCase();
    +        recordTypePatternMap.put(recordType, new LinkedHashMap<Pattern, Set<String>>());
    +        if (field.get(ParserConfigConstants.REGEX.getName()) instanceof List) {
    +          final List<String> regexList =
    +              (List<String>) field.get(ParserConfigConstants.REGEX.getName());
    +          regexList.forEach(s -> {
    +            recordTypePatternMap.get(recordType).put(Pattern.compile(s), getNamedGroups(s));
    +          });
    +        } else if (field.get(ParserConfigConstants.REGEX.getName()) instanceof String) {
    +          recordTypePatternMap.get(recordType).put(
    +              Pattern.compile((String) field.get(ParserConfigConstants.REGEX.getName())),
    +              getNamedGroups((String) field.get(ParserConfigConstants.REGEX.getName())));
    +        }
    +      }
    +    }
    +  }
    +
    +  private void setRecordTypePattern(String recordTypeRegex) {
    +    if (recordTypeRegex != null) {
    +      recordTypePattern = Pattern.compile(recordTypeRegex);
    +    }
    +  }
    +
    +  private JSONObject parse(String originalMessage) throws ParseException {
    +    final JSONObject parsedJson = new JSONObject();
    +    final Optional<String> recordIdentifier = getField(recordTypePattern, originalMessage);
    +    if (recordIdentifier.isPresent()) {
    +      extractNamedGroups(parsedJson, recordIdentifier.get(), originalMessage);
    +    }
    +    /*
    +     * Extract fields(named groups) from record type regular expression
    +     */
    +    final Matcher matcher = recordTypePattern.matcher(originalMessage);
    +    if (matcher.find()) {
    +      for (final String namedGroup : recordTypePatternNamedGroups) {
    +        if (matcher.group(namedGroup) != null) {
    +          parsedJson.put(namedGroup, matcher.group(namedGroup).trim());
    +        }
    +      }
    +    }
    +    return parsedJson;
    +  }
    +
    +  private void extractNamedGroups(Map<String, Object> json, String recordType,
    +      String originalMessage) {
    +    final Map<Pattern, Set<String>> patternMap = recordTypePatternMap.get(recordType.toLowerCase());
    +    if (patternMap != null) {
    +      for (final Map.Entry<Pattern, Set<String>> entry : patternMap.entrySet()) {
    +        final Pattern pattern = entry.getKey();
    +        final Set<String> namedGroups = entry.getValue();
    +        if (pattern != null && namedGroups != null && namedGroups.size() > 0) {
    +          final Matcher m = pattern.matcher(originalMessage);
    +          if (m.matches()) {
    +            LOG.debug("RecordType : {} Trying regex : {} for message : {} ", recordType,
    +                pattern.toString(), originalMessage);
    +            for (final String namedGroup : namedGroups) {
    +              if (m.group(namedGroup) != null) {
    +                json.put(namedGroup, m.group(namedGroup).trim());
    +              }
    +            }
    +            break;
    +          }
    +        }
    +      }
    +    } else {
    +      LOG.warn("No pattern found for record type : {}", recordType);
    +    }
    +  }
    +
    +  public Optional<String> getField(Pattern pattern, String originalMessage) {
    +    final Matcher matcher = pattern.matcher(originalMessage);
    +    while (matcher.find()) {
    +      return Optional.of(matcher.group());
    +    }
    +    return Optional.empty();
    +  }
    +
    +  private Set<String> getNamedGroups(String regex) {
    +    final Set<String> namedGroups = new TreeSet<>();
    +    final Matcher matcher = namedGroupPattern.matcher(regex);
    +    while (matcher.find()) {
    +      namedGroups.add(matcher.group(1));
    +    }
    +    return namedGroups;
    +  }
    +
    +  private Map<String, Object> extractHeaderFields(String originalMessage) {
    +    final Map<String, Object> syslogJson = new JSONObject();
    +    for (final Map.Entry<Pattern, Set<String>> syslogPatternEntry : syslogPatternsMap.entrySet()) {
    +      final Matcher m = syslogPatternEntry.getKey().matcher(originalMessage);
    +      if (m.find()) {
    +        for (final String namedGroup : syslogPatternEntry.getValue()) {
    +          if (StringUtils.isNotBlank(m.group(namedGroup))) {
    +            syslogJson.put(namedGroup, m.group(namedGroup).trim());
    +          }
    +        }
    +        break;
    +      }
    +    }
    +    return syslogJson;
    +  }
    +
    +  @Override
    +  public void init() {
    +    LOG.info("RegularExpressions parser initialised.");
    +  }
    +
    +  public void validateConfig() {
    +    if (getFields() == null) {
    +      LOG.error("Invalid config :  fields is missing in parserConfig");
    +      throw new IllegalStateException("Invalid config :fields is missing in parserConfig");
    +    }
    +    if (recordTypePattern == null) {
    +      LOG.error("Invalid config :recordTypeRegex is missing in parserConfig");
    +      throw new IllegalStateException("Invalid config :recordTypeRegex is missing in parserConfig");
    +    }
    +  }
    +
    +  private void convertCamelCaseToUnderScore(Map<String, Object> json) {
    +    final Map<String, String> oldKeyNewKeyMap = new HashMap<>();
    +    for (final Map.Entry<String, Object> entry : json.entrySet()) {
    +      if (capitalLettersPattern.matcher(entry.getKey()).matches()) {
    +        oldKeyNewKeyMap.put(entry.getKey(), convert(entry.getKey()));
    +      }
    +    }
    +    oldKeyNewKeyMap.forEach((oldKey, newKey) -> json.put(newKey, json.remove(oldKey)));
    +  }
    +
    +  public String convert(String oldKey) {
    +    final List<Character> chars = new ArrayList<>();
    +    for (final char c : oldKey.toCharArray()) {
    +      if (isCapital(c)) {
    --- End diff --
    
    This could just use `Character.isUpperCase(c)`, to completely eliminate the need for the `isCapital` method.
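    
    A minimal sketch of what that could look like, assuming `convert` is meant to produce the `ipSrcAddr -> ip_src_addr` style mapping described in the class Javadoc. The class name `CamelCaseKeys` and the method name `toUnderscore` are hypothetical and are not part of the PR:
    
    ```java
    // Hypothetical stand-in for the PR's convert(String) method.
    final class CamelCaseKeys {
    
      private CamelCaseKeys() {}
    
      /** Converts a camelCase key such as ipSrcAddr to ip_src_addr. */
      static String toUnderscore(String key) {
        final StringBuilder out = new StringBuilder(key.length() + 4);
        for (final char c : key.toCharArray()) {
          if (Character.isUpperCase(c)) {
            // Insert a separator before each capital, except at the very start.
            if (out.length() > 0) {
              out.append('_');
            }
            out.append(Character.toLowerCase(c));
          } else {
            out.append(c);
          }
        }
        return out.toString();
      }
    }
    ```
    
    Building the result in a `StringBuilder` would also avoid the intermediate `List<Character>`, though that is a separate choice from dropping `isCapital`.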


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by jagdeepsingh2 <gi...@git.apache.org>.
Github user jagdeepsingh2 commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r237707637
  
    --- Diff: metron-platform/metron-parsers/src/test/java/org/apache/metron/parsers/regex/RegularExpressionsParserTest.java ---
    @@ -0,0 +1,152 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more contributor license
    + * agreements. See the NOTICE file distributed with this work for additional information regarding
    + * copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance with the License. You may obtain a
    + * copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software distributed under the License
    + * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
    + * or implied. See the License for the specific language governing permissions and limitations under
    + * the License.
    + */
    +package org.apache.metron.parsers.regex;
    +
    +import org.json.simple.JSONObject;
    +import org.json.simple.parser.JSONParser;
    +import org.junit.Before;
    +import org.junit.Test;
    +
    +import java.nio.file.Files;
    +import java.nio.file.Paths;
    +import java.util.HashMap;
    +import java.util.List;
    +import java.util.Map;
    +
    +import static org.junit.Assert.assertTrue;
    +
    +public class RegularExpressionsParserTest {
    +
    +  private RegularExpressionsParser regularExpressionsParser;
    +  private JSONObject parserConfig;
    +
    +  @Before
    +  public void setUp() throws Exception {
    +    regularExpressionsParser = new RegularExpressionsParser();
    +  }
    +
    +  @Test
    +  public void testSSHDParse() throws Exception {
    +    String message =
    +        "<38>Jun 20 15:01:17 deviceName sshd[11672]: Accepted publickey for prod from 22.22.22.22 port 55555 ssh2";
    +
    +    parserConfig = getJsonConfig(
    +        Paths.get("src/test/resources/config/RegularExpressionsParserConfig.json").toString());
    --- End diff --
    
    Actually, the parser did parse the message. If you look at raw_message, it actually contains the parsed message. Now, there is certainly something odd here: I am not sure why the REPL thinks the parser failed, or why it puts the successfully parsed message into the raw_message field. Since the parser itself never touches the raw_message field, I think something is wrong with the REPL. Below is the parsed message extracted from the REPL output. The REPL can only have obtained this output if the parser returned successfully from the **parse** method.
    
    ```
    {
        "dst_process_id": "11672",
        "dst_process_name": "sshd",
        "source.type": "regex",
        "device_name": "deviceName",
        "original_string": "<38>Jun 20 15:01:17 deviceName sshd[11672]: Accepted publickey for prod from 22.22.22.22 port 55555 ssh2",
        "event_info": "Accepted publickey",
        "ip_src_port": "55555",
        "dst_user_id": "prod",
        "app_protocol": "ssh2",
        "guid": "edaee82d-02fb-4ec9-9412-5912fa8d4a6f",
        "syslogpriority": "38",
        "timestamp_device_original": "Jun 20 15:01:17",
        "ip_src_addr": "22.22.22.22"
    }
    ```
    
    Regarding changing the configuration to use @Multiline, I will do that.


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by jagdeepsingh2 <gi...@git.apache.org>.
Github user jagdeepsingh2 commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r234872641
  
    --- Diff: metron-platform/metron-parsers/src/main/java/org/apache/metron/parsers/regex/RegularExpressionsParser.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more contributor license
    + * agreements. See the NOTICE file distributed with this work for additional information regarding
    + * copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance with the License. You may obtain a
    + * copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software distributed under the License
    + * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
    + * or implied. See the License for the specific language governing permissions and limitations under
    + * the License.
    + */
    +
    +package org.apache.metron.parsers.regex;
    +
    +import java.nio.charset.Charset;
    +import java.text.ParseException;
    +import java.util.ArrayList;
    +import java.util.Arrays;
    +import java.util.HashMap;
    +import java.util.HashSet;
    +import java.util.LinkedHashMap;
    +import java.util.List;
    +import java.util.Map;
    +import java.util.Optional;
    +import java.util.Set;
    +import java.util.TreeSet;
    +import java.util.regex.Matcher;
    +import java.util.regex.Pattern;
    +import java.util.stream.Collectors;
    +import org.apache.commons.lang3.StringUtils;
    +import org.apache.metron.common.Constants;
    +import org.apache.metron.parsers.BasicParser;
    +import org.apache.metron.common.Constants.ParserConfigConstants;
    +import org.json.simple.JSONObject;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +//@formatter:off
    +/**
    + * General purpose class to parse unstructured text message into a json object. This class parses
    + * the message as per supplied parser config as part of sensor config. Sensor parser config example:
    + *
    + * <pre>
    + * <code>
    + * "convertCamelCaseToUnderScore": true,
    + * "recordTypeRegex": "(?&lt;process&gt;(?&lt;=\\s)\\b(kernel|syslog)\\b(?=\\[|:))",
    + * "messageHeaderRegex": "(?&lt;syslogpriority&gt;(?&lt;=^&lt;)\\d{1,4}(?=&gt;)).*?(?&lt;timestamp>(?&lt;=&gt;)[A-Za-z]{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(?&lt;syslogHost&gt;(?&lt;=\\s).*?(?=\\s))",
    + * "fields": [
    + * {
    + * "recordType": "kernel",
    + * "regex": ".*(?&lt;eventInfo&gt;(?&lt;=\\]|\\w\\:).*?(?=$))"
    + * },
    + * {
    + * "recordType": "syslog",
    + * "regex": ".*(?&lt;processid&gt;(?&lt;=PID\\s=\\s).*?(?=\\sLine)).*(?&lt;filePath&gt;(?&lt;=64\\s)\/([A-Za-z0-9_-]+\/)+(?=\\w))(?&lt;fileName&gt;.*?(?=\")).*(?&lt;eventInfo&gt;(?&lt;=\").*?(?=$))"
    + * }
    + * ]
    + * </code>
    + * </pre>
    + *
    + * Note: messageHeaderRegex could be specified as lists also e.g.
    + *
    + * <pre>
    + * <code>
    + * "messageHeaderRegex": [
    + * "regular expression 1",
    + * "regular expression 2"
    + * ]
    + * </code>
    + * </pre>
    + *
    + * Where <strong>regular expression 1</strong> are valid regular expressions and may have named
    + * groups, which would be extracted into fields. This list will be evaluated in order until a
    + * matching regular expression is found.<br>
    + * <br>
    + *
    + * <strong>Configuration fields explanation</strong>
    + *
    + * <pre>
    + * recordTypeRegex : used to specify a regular expression to distinctly identify a record type.
    + * messageHeaderRegex :  used to specify a regular expression to extract fields from a message part which is common across all the messages.
    + * e.g. rhel logs looks like
    + * <code>
    + * <7>Jun 26 16:18:01 hostName kernel: SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
    + * </code>
    + * <br>
    + * </pre>
    + *
    + * Here message structure (<7>Jun 26 16:18:01 hostName kernel) is common across all messages.
    + * Hence messageHeaderRegex could be used to extract fields from this part.
    + *
    + * fields : json list of objects containing recordType and regex. regex could be a further list e.g.
    + *
    + * <pre>
    + * <code>
    + * "regex":  [ "record type specific regular expression 1",
    + *             "record type specific regular expression 2"]
    + *
    + * </code>
    + * </pre>
    + *
    + * <strong>Limitation</strong> <br>
    + * Currently the named groups in java regular expressions have a limitation. Only following
    + * characters could be used to name a named group. A capturing group can also be assigned a "name",
    + * a named-capturing group, and then be back-referenced later by the "name". Group names are
    + * composed of the following characters. The first character must be a letter.
    + *
    + * <pre>
    + * <code>
    + * The uppercase letters 'A' through 'Z' ('\u0041' through '\u005a'),
    + * The lowercase letters 'a' through 'z' ('\u0061' through '\u007a'),
    + * The digits '0' through '9' ('\u0030' through '\u0039'),
    + * </code>
    + * </pre>
    + *
    + * This means that an _ (underscore), cannot be used as part of a named group name. E.g. this is an
    + * invalid regular expression <code>.*(?&lt;event_info&gt;(?&lt;=\\]|\\w\\:).*?(?=$))</code>
    + *
    + * However, this limitation can be easily overcome by adding a parser configuration setting.
    + *
    + * <code>
    + *  "convertCamelCaseToUnderScore": true,
    + * <code>
    + * If above property is added to the sensor parser configuration, in parserConfig object, this parser will automatically convert all the camel case property names to underscore seperated.
    + * For example, following convertions will automatically happen:
    --- End diff --
    
    Corrected. Thanks


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by nickwallen <gi...@git.apache.org>.
Github user nickwallen commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r237876076
  
    --- Diff: metron-platform/metron-parsers/src/test/java/org/apache/metron/parsers/regex/RegularExpressionsParserTest.java ---
    @@ -0,0 +1,152 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more contributor license
    + * agreements. See the NOTICE file distributed with this work for additional information regarding
    + * copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance with the License. You may obtain a
    + * copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software distributed under the License
    + * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
    + * or implied. See the License for the specific language governing permissions and limitations under
    + * the License.
    + */
    +package org.apache.metron.parsers.regex;
    +
    +import org.json.simple.JSONObject;
    +import org.json.simple.parser.JSONParser;
    +import org.junit.Before;
    +import org.junit.Test;
    +
    +import java.nio.file.Files;
    +import java.nio.file.Paths;
    +import java.util.HashMap;
    +import java.util.List;
    +import java.util.Map;
    +
    +import static org.junit.Assert.assertTrue;
    +
    +public class RegularExpressionsParserTest {
    +
    +  private RegularExpressionsParser regularExpressionsParser;
    +  private JSONObject parserConfig;
    +
    +  @Before
    +  public void setUp() throws Exception {
    +    regularExpressionsParser = new RegularExpressionsParser();
    +  }
    +
    +  @Test
    +  public void testSSHDParse() throws Exception {
    +    String message =
    +        "<38>Jun 20 15:01:17 deviceName sshd[11672]: Accepted publickey for prod from 22.22.22.22 port 55555 ssh2";
    +
    +    parserConfig = getJsonConfig(
    +        Paths.get("src/test/resources/config/RegularExpressionsParserConfig.json").toString());
    --- End diff --
    
    >  I will update the parser to add a default current system timestamp.
    
    Should the timestamp come from system time or should it come from the syslog timestamp?  The latter seems more correct to me.
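    
    For illustration, here is a rough sketch of deriving the timestamp from the syslog header, assuming the `timestamp_device_original` value looks like `Jun 20 15:01:17` as in the test message. That format carries no year or time zone, so both have to be supplied by the caller. The class `SyslogTimestamps` and its method are hypothetical names, not existing Metron code:
    
    ```java
    import java.time.LocalDateTime;
    import java.time.ZoneId;
    import java.time.format.DateTimeFormatter;
    import java.time.format.DateTimeFormatterBuilder;
    import java.time.temporal.ChronoField;
    import java.util.Locale;
    
    // Hypothetical helper; the year and zone are assumptions the caller must make.
    final class SyslogTimestamps {
    
      private SyslogTimestamps() {}
    
      /** Converts a header timestamp such as "Jun 20 15:01:17" to epoch milliseconds. */
      static long toEpochMillis(String deviceTimestamp, int year, ZoneId zone) {
        final DateTimeFormatter formatter = new DateTimeFormatterBuilder()
            .appendPattern("MMM d HH:mm:ss")
            // The header has no year field, so default it to the supplied value.
            .parseDefaulting(ChronoField.YEAR, year)
            .toFormatter(Locale.ENGLISH);
        // Single-digit days are padded with an extra space ("Jun  5"); collapse it.
        final String normalized = deviceTimestamp.trim().replaceAll("\\s+", " ");
        return LocalDateTime.parse(normalized, formatter)
            .atZone(zone)
            .toInstant()
            .toEpochMilli();
      }
    }
    ```
    
    A caller would still have to decide which year to assume, for example around a new-year boundary, and what to fall back to when the header does not match.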
    
    



---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by mmiklavc <gi...@git.apache.org>.
Github user mmiklavc commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r235117335
  
    --- Diff: metron-platform/metron-common/src/main/java/org/apache/metron/common/Constants.java ---
    @@ -127,5 +127,40 @@ public String getType() {
         }
       }
     
    +   public enum ParserConfigConstants {
    --- End diff --
    
    Agreed on the previous change regarding use of `static`. In addition, I think we want to move this list of parser config constants. The constants you've added are very specific to this parser, and the `Constants` class is really intended for more global-scope items. You can probably just make this an inner enum in your `RegexParser` class, as it's very specific to that class. Alternatively, you might take a look at @merrimanr's PR for Stellar REST calls for an example of where you can put configuration if you need something more complex - https://github.com/apache/metron/pull/1250/files#diff-1f3a2a3b1b044494c022cca77223c182. Again, I think you're best off using an inner enum in this case.
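    
    Roughly, the inner enum could be shaped like the sketch below. The constant names are the ones used in the diff. The key strings are taken from the config example in the class Javadoc, and their pairing with the constants is an assumption, since the enum body is not shown here. `RegexParserSketch` and `fieldsOf` are placeholder names:
    
    ```java
    import java.util.Map;
    
    // Placeholder for the parser class; only the nested enum matters here.
    final class RegexParserSketch {
    
      /** Config keys scoped to this parser instead of the global Constants class. */
      enum ParserConfigConstants {
        // Key strings below are assumed from the Javadoc config example.
        RECORD_TYPE_REGEX("recordTypeRegex"),
        MESSAGE_HEADER("messageHeaderRegex"),
        CONVERT_CAMELCASE_TO_UNDERSCORE("convertCamelCaseToUnderScore"),
        FIELDS("fields"),
        RECORD_TYPE("recordType"),
        REGEX("regex");
    
        private final String name;
    
        ParserConfigConstants(String name) {
          this.name = name;
        }
    
        String getName() {
          return name;
        }
      }
    
      /** Example lookup: reads the "fields" entry from a parser config map. */
      static Object fieldsOf(Map<String, Object> parserConfig) {
        return parserConfig.get(ParserConfigConstants.FIELDS.getName());
      }
    }
    ```
    
    Call sites inside the parser keep the same `ParserConfigConstants.X.getName()` form, so only the import of `Constants.ParserConfigConstants` would need to go.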


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by nickwallen <gi...@git.apache.org>.
Github user nickwallen commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r239855809
  
    --- Diff: metron-platform/metron-parsers/src/test/java/org/apache/metron/parsers/regex/RegularExpressionsParserTest.java ---
    @@ -0,0 +1,152 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more contributor license
    + * agreements. See the NOTICE file distributed with this work for additional information regarding
    + * copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance with the License. You may obtain a
    + * copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software distributed under the License
    + * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
    + * or implied. See the License for the specific language governing permissions and limitations under
    + * the License.
    + */
    +package org.apache.metron.parsers.regex;
    +
    +import org.json.simple.JSONObject;
    +import org.json.simple.parser.JSONParser;
    +import org.junit.Before;
    +import org.junit.Test;
    +
    +import java.nio.file.Files;
    +import java.nio.file.Paths;
    +import java.util.HashMap;
    +import java.util.List;
    +import java.util.Map;
    +
    +import static org.junit.Assert.assertTrue;
    +
    +public class RegularExpressionsParserTest {
    +
    +  private RegularExpressionsParser regularExpressionsParser;
    +  private JSONObject parserConfig;
    +
    +  @Before
    +  public void setUp() throws Exception {
    +    regularExpressionsParser = new RegularExpressionsParser();
    +  }
    +
    +  @Test
    +  public void testSSHDParse() throws Exception {
    +    String message =
    +        "<38>Jun 20 15:01:17 deviceName sshd[11672]: Accepted publickey for prod from 22.22.22.22 port 55555 ssh2";
    +
    +    parserConfig = getJsonConfig(
    +        Paths.get("src/test/resources/config/RegularExpressionsParserConfig.json").toString());
    --- End diff --
    
    Yes, good point @ottobackwards .
    
    @jagdeepsingh2 - He is referring specifically to the class `Syslog3164ParserIntegrationTest` in that PR.  Should be fairly simple to put together with what you already have.


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by jagdeepsingh2 <gi...@git.apache.org>.
Github user jagdeepsingh2 commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r234871183
  
    --- Diff: metron-platform/metron-parsers/src/main/java/org/apache/metron/parsers/regex/RegularExpressionsParser.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more contributor license
    + * agreements. See the NOTICE file distributed with this work for additional information regarding
    + * copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance with the License. You may obtain a
    + * copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software distributed under the License
    + * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
    + * or implied. See the License for the specific language governing permissions and limitations under
    + * the License.
    + */
    +
    +package org.apache.metron.parsers.regex;
    +
    +import java.nio.charset.Charset;
    +import java.text.ParseException;
    +import java.util.ArrayList;
    +import java.util.Arrays;
    +import java.util.HashMap;
    +import java.util.HashSet;
    +import java.util.LinkedHashMap;
    +import java.util.List;
    +import java.util.Map;
    +import java.util.Optional;
    +import java.util.Set;
    +import java.util.TreeSet;
    +import java.util.regex.Matcher;
    +import java.util.regex.Pattern;
    +import java.util.stream.Collectors;
    +import org.apache.commons.lang3.StringUtils;
    +import org.apache.metron.common.Constants;
    +import org.apache.metron.parsers.BasicParser;
    +import org.apache.metron.common.Constants.ParserConfigConstants;
    +import org.json.simple.JSONObject;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +//@formatter:off
    +/**
    + * General purpose class to parse unstructured text message into a json object. This class parses
    + * the message as per supplied parser config as part of sensor config. Sensor parser config example:
    + *
    + * <pre>
    + * <code>
    + * "convertCamelCaseToUnderScore": true,
    + * "recordTypeRegex": "(?&lt;process&gt;(?&lt;=\\s)\\b(kernel|syslog)\\b(?=\\[|:))",
    + * "messageHeaderRegex": "(?&lt;syslogpriority&gt;(?&lt;=^&lt;)\\d{1,4}(?=&gt;)).*?(?&lt;timestamp>(?&lt;=&gt;)[A-Za-z]{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(?&lt;syslogHost&gt;(?&lt;=\\s).*?(?=\\s))",
    + * "fields": [
    + * {
    + * "recordType": "kernel",
    + * "regex": ".*(?&lt;eventInfo&gt;(?&lt;=\\]|\\w\\:).*?(?=$))"
    + * },
    + * {
    + * "recordType": "syslog",
    + * "regex": ".*(?&lt;processid&gt;(?&lt;=PID\\s=\\s).*?(?=\\sLine)).*(?&lt;filePath&gt;(?&lt;=64\\s)\/([A-Za-z0-9_-]+\/)+(?=\\w))(?&lt;fileName&gt;.*?(?=\")).*(?&lt;eventInfo&gt;(?&lt;=\").*?(?=$))"
    + * }
    + * ]
    + * </code>
    + * </pre>
    + *
    + * Note: messageHeaderRegex could be specified as lists also e.g.
    + *
    + * <pre>
    + * <code>
    + * "messageHeaderRegex": [
    + * "regular expression 1",
    + * "regular expression 2"
    + * ]
    + * </code>
    + * </pre>
    + *
    + * Where <strong>regular expression 1</strong> are valid regular expressions and may have named
    + * groups, which would be extracted into fields. This list will be evaluated in order until a
    + * matching regular expression is found.<br>
    + * <br>
    + *
    + * <strong>Configuration fields explanation</strong>
    + *
    + * <pre>
    + * recordTypeRegex : used to specify a regular expression to distinctly identify a record type.
    + * messageHeaderRegex :  used to specify a regular expression to extract fields from a message part which is common across all the messages.
    + * e.g. rhel logs looks like
    + * <code>
    + * <7>Jun 26 16:18:01 hostName kernel: SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
    + * </code>
    + * <br>
    + * </pre>
    + *
    + * Here message structure (<7>Jun 26 16:18:01 hostName kernel) is common across all messages.
    + * Hence messageHeaderRegex could be used to extract fields from this part.
    + *
    + * fields : json list of objects containing recordType and regex. regex could be a further list e.g.
    + *
    + * <pre>
    + * <code>
    + * "regex":  [ "record type specific regular expression 1",
    + *             "record type specific regular expression 2"]
    + *
    + * </code>
    + * </pre>
    + *
    + * <strong>Limitation</strong> <br>
    + * Currently the named groups in java regular expressions have a limitation. Only following
    + * characters could be used to name a named group. A capturing group can also be assigned a "name",
    + * a named-capturing group, and then be back-referenced later by the "name". Group names are
    + * composed of the following characters. The first character must be a letter.
    + *
    + * <pre>
    + * <code>
    + * The uppercase letters 'A' through 'Z' ('\u0041' through '\u005a'),
    + * The lowercase letters 'a' through 'z' ('\u0061' through '\u007a'),
    + * The digits '0' through '9' ('\u0030' through '\u0039'),
    + * </code>
    + * </pre>
    + *
    + * This means that an _ (underscore), cannot be used as part of a named group name. E.g. this is an
    + * invalid regular expression <code>.*(?&lt;event_info&gt;(?&lt;=\\]|\\w\\:).*?(?=$))</code>
    + *
    + * However, this limitation can be easily overcome by adding a parser configuration setting.
    + *
    + * <code>
    + *  "convertCamelCaseToUnderScore": true,
    + * <code>
    + * If above property is added to the sensor parser configuration, in parserConfig object, this parser will automatically convert all the camel case property names to underscore seperated.
    + * For example, following convertions will automatically happen:
    + *
    + * <code>
    + * ipSrcAddr -> ip_src_addr
    + * ipDstAddr -> ip_dst_addr
    + * ipSrcPort -> ip_src_port
    + * <code>
    + * etc.
    + */
    +//@formatter:on
    +public class RegularExpressionsParser extends BasicParser {
    +
    +  private static Logger LOG = LoggerFactory.getLogger(RegularExpressionsParser.class);
    +
    +  private static final Charset UTF_8 = Charset.forName("UTF-8");
    +
    +  private List<Map<String, Object>> fields;
    +  private Map<String, Object> parserConfig;
    +  private final Pattern namedGroupPattern = Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>");
    +  Pattern capitalLettersPattern = Pattern.compile("^.*[A-Z]+.*$");
    +  private Pattern recordTypePattern;
    +  private final Set<String> recordTypePatternNamedGroups = new HashSet<>();
    +  private final Map<String, Map<Pattern, Set<String>>> recordTypePatternMap = new HashMap<>();
    +  private final Map<Pattern, Set<String>> syslogPatternsMap = new HashMap<>();
    +
    +  /**
    +   * Parses an unstructured text message into a json object based upon the regular expression
    +   * configuration supplied.
    +   *
    +   * @param rawMessage incoming unstructured raw text.
    +   *
    +   * @return List of json parsed json objects. In this case list will have a single element only.
    +   */
    +  @Override
    +  public List<JSONObject> parse(byte[] rawMessage) {
    +    String originalMessage = null;
    +    try {
    +      originalMessage = new String(rawMessage, UTF_8).trim();
    +      LOG.debug(" raw message. {}", originalMessage);
    +      if (originalMessage.isEmpty()) {
    +        LOG.warn("Message is empty.");
    +        return Arrays.asList(new JSONObject());
    +      }
    +    } catch (final Exception e) {
    +      LOG.error("[Metron] Could not read raw message. {} " + originalMessage, e);
    +      throw new RuntimeException(e.getMessage(), e);
    +    }
    +
    +    try {
    +      final JSONObject parsedJson = new JSONObject();
    +      if (syslogPatternsMap.size() > 0) {
    +        parsedJson.putAll(extractHeaderFields(originalMessage));
    +      }
    +      parsedJson.putAll(parse(originalMessage));
    +      parsedJson.put(Constants.Fields.ORIGINAL.getName(), originalMessage);
    +      applyFieldTransformations(parsedJson);
    +      return Arrays.asList(parsedJson);
    +    } catch (final ParseException e) {
    +      LOG.error("Error occured in parsing. original message : " + originalMessage, e);
    +      throw new RuntimeException(e.getMessage(), e);
    +    }
    +  }
    +
    +  private void applyFieldTransformations(JSONObject parsedJson) {
    +    if (getParserConfig()
    +        .get(ParserConfigConstants.CONVERT_CAMELCASE_TO_UNDERSCORE.getName()) != null
    +        && (Boolean) getParserConfig()
    +            .get(ParserConfigConstants.CONVERT_CAMELCASE_TO_UNDERSCORE.getName())) {
    +      convertCamelCaseToUnderScore(parsedJson);
    +    }
    +
    +  }
    +
    +  // @formatter:off
    +  /**
    +   * This method is called during the parser initialization. It parses the parser
    +   * configuration and configures the parser accordingly. It then initializes
    +   * instance variables.
    +   *
    +   * @param parserConfig ParserConfig(Map<String, Object>) supplied to the sensor.
    +   * @see org.apache.metron.parsers.interfaces.Configurable#configure(java.util.Map)<br>
    +   *      <br>
    +   */
    +  // @formatter:on
    +  @Override
    +  public void configure(Map<String, Object> parserConfig) {
    +    setParserConfig(parserConfig);
    +    setFields(
    +        (List<Map<String, Object>>) getParserConfig().get(ParserConfigConstants.FIELDS.getName()));
    +
    +    setRecordTypePattern(
    +        (String) getParserConfig().get(ParserConfigConstants.RECORD_TYPE_REGEX.getName()));
    +    recordTypePatternNamedGroups.addAll(getNamedGroups(
    +        (String) getParserConfig().get(ParserConfigConstants.RECORD_TYPE_REGEX.getName())));
    +    final List<Map<String, Object>> fields =
    +        (List<Map<String, Object>>) getParserConfig().get(ParserConfigConstants.FIELDS.getName());
    +
    +    configureRecordTypePatterns(fields);
    +
    +    configureMessageHeaderPattern();
    +
    +    validateConfig();
    +  }
    +
    +  private void configureMessageHeaderPattern() {
    +    if (getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName()) != null) {
    +      if (getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName()) instanceof List) {
    +        final List<String> syslogPatternList =
    +            (List<String>) getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName());
    +        for (final String syslogPatternStr : syslogPatternList) {
    +          syslogPatternsMap.put(Pattern.compile(syslogPatternStr),
    +              getNamedGroups(syslogPatternStr));
    +        }
    +      } else if (getParserConfig()
    +          .get(ParserConfigConstants.MESSAGE_HEADER.getName()) instanceof String) {
    +        final String syslogPatternStr =
    +            (String) getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName());
    +        if (StringUtils.isNotBlank(syslogPatternStr)) {
    +          syslogPatternsMap.put(Pattern.compile(syslogPatternStr),
    +              getNamedGroups(syslogPatternStr));
    +        }
    +      }
    +    }
    +  }
    +
    +  private void configureRecordTypePatterns(List<Map<String, Object>> fields) {
    +
    +    for (final Map<String, Object> field : fields) {
    +      if (field.get(ParserConfigConstants.RECORD_TYPE.getName()) != null
    +          && field.get(ParserConfigConstants.REGEX.getName()) != null) {
    +        final String recordType =
    +            ((String) field.get(ParserConfigConstants.RECORD_TYPE.getName())).toLowerCase();
    +        recordTypePatternMap.put(recordType, new LinkedHashMap<Pattern, Set<String>>());
    +        if (field.get(ParserConfigConstants.REGEX.getName()) instanceof List) {
    +          final List<String> regexList =
    +              (List<String>) field.get(ParserConfigConstants.REGEX.getName());
    +          regexList.forEach(s -> {
    +            recordTypePatternMap.get(recordType).put(Pattern.compile(s), getNamedGroups(s));
    +          });
    +        } else if (field.get(ParserConfigConstants.REGEX.getName()) instanceof String) {
    +          recordTypePatternMap.get(recordType).put(
    +              Pattern.compile((String) field.get(ParserConfigConstants.REGEX.getName())),
    +              getNamedGroups((String) field.get(ParserConfigConstants.REGEX.getName())));
    +        }
    +      }
    +    }
    +  }
    +
    +  private void setRecordTypePattern(String recordTypeRegex) {
    +    if (recordTypeRegex != null) {
    +      recordTypePattern = Pattern.compile(recordTypeRegex);
    +    }
    +  }
    +
    +  private JSONObject parse(String originalMessage) throws ParseException {
    +    final JSONObject parsedJson = new JSONObject();
    +    final Optional<String> recordIdentifier = getField(recordTypePattern, originalMessage);
    +    if (recordIdentifier.isPresent()) {
    +      extractNamedGroups(parsedJson, recordIdentifier.get(), originalMessage);
    +    }
    +    /*
    +     * Extract fields(named groups) from record type regular expression
    +     */
    +    final Matcher matcher = recordTypePattern.matcher(originalMessage);
    +    if (matcher.find()) {
    +      for (final String namedGroup : recordTypePatternNamedGroups) {
    +        if (matcher.group(namedGroup) != null) {
    +          parsedJson.put(namedGroup, matcher.group(namedGroup).trim());
    +        }
    +      }
    +    }
    +    return parsedJson;
    +  }
    +
    +  private void extractNamedGroups(Map<String, Object> json, String recordType,
    +      String originalMessage) {
    +    final Map<Pattern, Set<String>> patternMap = recordTypePatternMap.get(recordType.toLowerCase());
    +    if (patternMap != null) {
    +      for (final Map.Entry<Pattern, Set<String>> entry : patternMap.entrySet()) {
    +        final Pattern pattern = entry.getKey();
    +        final Set<String> namedGroups = entry.getValue();
    +        if (pattern != null && namedGroups != null && namedGroups.size() > 0) {
    +          final Matcher m = pattern.matcher(originalMessage);
    +          if (m.matches()) {
    +            LOG.debug("RecordType : {} Trying regex : {} for message : {} ", recordType,
    +                pattern.toString(), originalMessage);
    +            for (final String namedGroup : namedGroups) {
    +              if (m.group(namedGroup) != null) {
    +                json.put(namedGroup, m.group(namedGroup).trim());
    +              }
    +            }
    +            break;
    +          }
    +        }
    +      }
    +    } else {
    +      LOG.warn("No pattern found for record type : {}", recordType);
    +    }
    +  }
    +
    +  public Optional<String> getField(Pattern pattern, String originalMessage) {
    +    final Matcher matcher = pattern.matcher(originalMessage);
    +    while (matcher.find()) {
    +      return Optional.of(matcher.group());
    +    }
    +    return Optional.empty();
    +  }
    +
    +  private Set<String> getNamedGroups(String regex) {
    +    final Set<String> namedGroups = new TreeSet<>();
    +    final Matcher matcher = namedGroupPattern.matcher(regex);
    +    while (matcher.find()) {
    +      namedGroups.add(matcher.group(1));
    +    }
    +    return namedGroups;
    +  }
    +
    +  private Map<String, Object> extractHeaderFields(String originalMessage) {
    +    final Map<String, Object> syslogJson = new JSONObject();
    +    for (final Map.Entry<Pattern, Set<String>> syslogPatternEntry : syslogPatternsMap.entrySet()) {
    +      final Matcher m = syslogPatternEntry.getKey().matcher(originalMessage);
    +      if (m.find()) {
    +        for (final String namedGroup : syslogPatternEntry.getValue()) {
    +          if (StringUtils.isNotBlank(m.group(namedGroup))) {
    +            syslogJson.put(namedGroup, m.group(namedGroup).trim());
    +          }
    +        }
    +        break;
    +      }
    +    }
    +    return syslogJson;
    +  }
    +
    +  @Override
    +  public void init() {
    +    LOG.info("RegularExpressions parser initialised.");
    +  }
    +
    +  public void validateConfig() {
    +    if (getFields() == null) {
    +      LOG.error("Invalid config :  fields is missing in parserConfig");
    +      throw new IllegalStateException("Invalid config :fields is missing in parserConfig");
    +    }
    +    if (recordTypePattern == null) {
    +      LOG.error("Invalid config :recordTypeRegex is missing in parserConfig");
    +      throw new IllegalStateException("Invalid config :recordTypeRegex is missing in parserConfig");
    +    }
    +  }
    +
    +  private void convertCamelCaseToUnderScore(Map<String, Object> json) {
    +    final Map<String, String> oldKeyNewKeyMap = new HashMap<>();
    +    for (final Map.Entry<String, Object> entry : json.entrySet()) {
    +      if (capitalLettersPattern.matcher(entry.getKey()).matches()) {
    +        oldKeyNewKeyMap.put(entry.getKey(), convert(entry.getKey()));
    +      }
    +    }
    +    oldKeyNewKeyMap.forEach((oldKey, newKey) -> json.put(newKey, json.remove(oldKey)));
    +  }
    +
    +  public String convert(String oldKey) {
    +    final List<Character> chars = new ArrayList<>();
    +    for (final char c : oldKey.toCharArray()) {
    +      if (isCapital(c)) {
    --- End diff --
    
    I have refactored the code a little bit. This whole convert method is not needed now. Thanks.
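
The refactored code is not quoted in this comment, so the following is only a minimal sketch of one way the camelCase-to-underscore conversion could be done with a single regular-expression replacement instead of a character-by-character `convert` method. The class and method names (`CamelCaseKeys`, `toUnderscore`, `convertKeys`) are hypothetical and are not taken from the PR.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not the PR's actual implementation.
final class CamelCaseKeys {

  // Insert '_' wherever a lowercase letter or digit is followed by an
  // uppercase letter, then lowercase the whole key.
  static String toUnderscore(String key) {
    return key.replaceAll("([a-z0-9])([A-Z])", "$1_$2").toLowerCase(Locale.ROOT);
  }

  // Build a new map with converted keys; values are carried over unchanged.
  static Map<String, Object> convertKeys(Map<String, Object> json) {
    Map<String, Object> converted = new LinkedHashMap<>();
    json.forEach((key, value) -> converted.put(toUnderscore(key), value));
    return converted;
  }

  public static void main(String[] args) {
    System.out.println(toUnderscore("ipSrcAddr"));      // ip_src_addr
    System.out.println(toUnderscore("dstProcessId"));   // dst_process_id
    System.out.println(toUnderscore("syslogpriority")); // syslogpriority
  }
}
```

Building a new map avoids modifying the original map while iterating over it, which is why the sketch does not rename keys in place.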


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by jagdeepsingh2 <gi...@git.apache.org>.
Github user jagdeepsingh2 commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r237698406
  
    --- Diff: metron-platform/metron-parsers/src/test/resources/config/RegularExpressionsInvalidParserConfig.json ---
    @@ -0,0 +1,208 @@
    +{
    +  "convertCamelCaseToUnderScore": true,
    +  "messageHeaderRegex": "(?<syslogpriority>(?<=^<)\\d{1,4}(?=>)).*?(?<timestampDeviceOriginal>(?<=>)[A-Za-z]{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(?<deviceName>(?<=\\s).*?(?=\\s))",
    +  "recordTypeRegex": "(?<dstProcessName>(?<=\\s)\\b(tch-replicant|audispd|syslog|ntpd|sendmail|pure-ftpd|usermod|useradd|anacron|unix_chkpwd|sudo|dovecot|postfix\\/smtpd|postfix\\/smtp|postfix\\/qmgr|klnagent|systemd|(?i)crond(?-i)|clamd|kesl|sshd|run-parts|automount|suexec|freshclam|kernel|vsftpd|ftpd|su)\\b(?=\\[|:))",
    +  "fields": [
    +    {
    +      "recordType": "syslog",
    +      "regex": ".*(?<dstProcessId>(?<=PID\\s=\\s).*?(?=\\sLine)).*(?<filePath>(?<=64\\s)\/([A-Za-z0-9_-]+\/)+(?=\\w))(?<fileName>.*?(?=\")).*(?<eventInfo>(?<=\").*?(?=$))"
    +    },
    +    {
    +      "recordType": "pure-ftpd",
    +      "regex": ".*(?<srcUserId>(?<=\\:\\s\\().*?(?=\\)\\s)).*?(?<messageLevel>(?<=\\s\\[).*?(?=\\]\\s)).*?(?<eventInfo>(?<=\\]\\s).*?(?=$))"
    +    },
    +    {
    +      "recordType": "systemd",
    +      "regex": [
    +        ".*(?<eventInfo>(?<=\\ssystemd\\:\\s).*?(?=\\d+)).*?(?<sessionName>(?<=\\sSession\\s).*?(?=\\sof)).*?(?<srcUserId>(?<=\\suser\\s).*?(?=\\.)).*$",
    +        ".*(?<eventInfo>(?<=\\ssystemd\\:\\s).*?(?=\\sof)).*?(?<srcUserId>(?<=\\sof\\s).*?(?=\\.)).*$"
    +      ]
    +    },
    +    {
    +      "recordType": "kesl",
    +      "regex": ".*(?<eventInfo>(?<=\\:).*?(?=$))"
    +    },
    +    {
    +      "recordType": "dovecot",
    +      "regex": [
    +        ".*(?<subprocess>(?<=\\sdovecot:\\s).*?(?=\\:)).*?(?<eventInfo>(?<=\\:).*?(?=\\:\\suser)).*?(?<srcUserId>(?<=user\\=\\<).*?(?=\\>)).*?(?<rip>(?<=rip\\=).*?(?=,)).*?(?<lip>(?<=lip\\=).*?(?=,)).*?(?<connectionType>(?<=,\\s).*?(?=,)).*?(?<sessionName>(?<=session\\=\\<).*?(?=\\>)).*$",
    +        ".*(?<subprocess>(?<=\\sdovecot:\\s).*?(?=\\:)).*?(?<eventInfo>(?<=\\:).*?(?=\\:\\srip)).*?(?<rip>(?<=rip\\=).*?(?=,)).*?(?<lip>(?<=lip\\=).*?(?=,)).*?(?<connectionType>(?<=,\\s).*?(?=$))",
    +        ".*(?<subprocess>(?<=\\sdovecot:\\s).*?(?=\\:)).*?(?<eventInfo>(?<=\\:).*?(?=$))"
    +      ]
    +    },
    +    {
    +      "recordType": "postfix/smtpd",
    +      "regex": [
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=\\:).*?(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\]:)).*?(?<eventInfo>(?<=\\:\\s)disconnect(?=\\sfrom)).*?(?<srcHost>(?<=from).*(?=\\[)).*?(?<ipSrcAddr>(?<=\\[).*(?=\\])).*$"
    +      ]
    +    },
    +    {
    +      "recordType": "postfix/smtp",
    +      "regex": [
    +        ".*(?<dstProcessId>(?<=smtp\\[).*?(?=\\]:)).*(?<toEmail>(?<=to=#\\<).*?(?=>,)).*(?<relay>(?<=relay=).*?(?=,)).*(?<delay>(?<=delay=).*?(?=,)).*(?<delays>(?<=delays=).*?(?=,)).*(?<dsn>(?<=dsn=).*?(?=,)).*(?<status>(?<=status=).*?(?=\\()).*?(?<dstHost>(?<=connect to).*?(?=\\[)).*?(?<ipDstAddr>(?<=\\[).*?(?=\\])).*?(?<ipDstPort>(?<=\\]:).*?(?=:\\s)).*?(?<eventInfo>(?<=:\\s).*?(?=$))",
    +        ".*(?<dstProcessId>(?<=smtp\\[).*?(?=\\]:)).*?(?<dstHost>(?<=connect to).*?(?=\\[)).*?(?<ipDstAddr>(?<=\\[).*?(?=\\])).*(?<ipDstPort>(?<=:).*?(?=\\s)).*(?<eventInfo>(?<=\\s).*?(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=\\:).*?(?=$))"
    +      ]
    +    },
    +    {
    +      "recordType": "crond",
    +      "regex": [
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<srcUserId>(?<=\\]:\\s\\().*?(?=\\)\\s)).*?(?<commandLine>(?<=CMD\\s\\().*?(?=\\))).*$",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<srcUserId>(?<=\\]:\\s\\().*?(?=\\)\\s)).*?(?<eventInfo>(?<=\\().*?(?=\\))).*$",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<srcUserId>(?<=\\]:\\s\\().*?(?=\\)\\s)).*?(?<commandLine>(?<=CMD\\s\\().*?(?=\\))).*$",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=\\:).*?(?=$))"
    +      ]
    +    },
    +    {
    +      "recordType": "clamd",
    +      "regex": [
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<subProcess>(?<=\\:\\s).*?(?=\\:)).*?(?<eventInfo>(?<=\\:).*?(?=$))",
    +        ".*(?<subProcess>(?<=\\:\\s).*?(?=\\:)).*?(?<eventInfo>(?<=\\:).*?(?=$))"
    +      ]
    +    },
    +    {
    +      "recordType": "run-parts",
    +      "regex": ".*(?<eventInfo>(?<=\\sparts).*?(?=$))"
    +    },
    +    {
    +      "recordType": "sshd",
    +      "regex": [
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<event_Info>(?<=\\]:\\s).*?(?=\\sfor)).*?(?<dstUserId>(?<=\\sfor\\s).*?(?=\\sfrom)).*?(?<ipSrcAddr>(?<=\\sfrom\\s).*?(?=\\sport)).*?(?<ipSrcPort>(?<=\\sport\\s).*?(?=\\s)).*?(?<appProtocol>(?<=port\\s\\d{1,5}\\s).*(?=:\\s)).*?(?<encryptionAlgorithm>(?<=:\\s).+?(?=\\s)).*(?<correlationId>(?<=\\s).+?(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=\\]:\\s).*?(?=\\sfor)).*?(?<dstUserId>(?<=\\sfor\\s).*?(?=\\sfrom)).*?(?<ipSrcAddr>(?<=\\sfrom\\s).*?(?=\\sport)).*?(?<ipSrcPort>(?<=\\sport\\s).*?(?=\\s)).*?(?<appProtocol>(?<=port\\s\\d{1,5}\\s).*?(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<ipDstAddr>(?<=Remote:).*?(?=\\-)).*?(?<ipDstPort>(?<=\\-).*?(?=;)).*?(?<appProtocol>(?<=Protocol:).*?(?=;)).*?(?<sshClient>(?<=Client:).*?(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<appProtocol>(?<=\\]:).*?(?=:)).*?(?<ipDstAddr>(?<=Remote:).*?(?=\\-)).*?(?<ipDstPort>(?<=\\-).*?(?=;)).*?(?<encryptionAlgorithm>(?<=Enc:\\s).*?(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<ipDstAddr>(?<=Remote:).*?(?=\\-)).*?(?<ipDstPort>(?<=\\-).*?(?=;)).*?(?<encryptionAlgorithm>(?<=Enc:\\s).*?(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=:).*?(?=for)).*?(?<dstUserId>(?<=for).*?(?=from)).*?(?<ipSrcAddr>(?<=from).*?(?=port)).*?(?<ipSrcPort>(?<=port).*?(?=\\s)).*?(?<appProtocol>(?<=\\s).*?(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\]))]:\\s.*?(?<eventInfo>subsystem.*?(?=by\\suser)).*?(?<srcUserId>(?<=user).*?(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<action>(?<=Received).*?(?=from)).*?(?<ipSrcAddr>(?<=from).*?(?=:)).*?(?<eventInfo>(?<=11:).*?(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=\\]:\\s)Server\\slistening(?=\\s)).*?(?<ipSrcAddr>(?<=\\son\\s).*?(?=port)).*?(?<ipSrcPort>(?<=port\\s)\\d{1,6}(?=\\.)).*$",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=\\]:\\s)Invalid user(?=\\s)).*?(?<dstUserId>(?<=\\s).*?(?=from)).*?(?<ipSrcAddr>(?<=from\\s).*(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=]:\\s)).*(?<subProcess>(?<=]:\\s).*\\)(?=:)).*(?<eventInfo>(?<=:\\s).*(?=;)).*(?<logname>(?<=logname=).*?(?=\\s)).*(?<dstUserId>(?<=uid=).*?(?=\\s)).*(?<effectiveUserId>(?<=euid=).*?(?=\\s)).*(?<sessionName>(?<=tty=).*?(?=\\s)).*(?<srcUserId>(?<=ruser=).*?(?=\\s)).*(?<ipSrcAddr>(?<=rhost=).*?(?=\\s)).*(?<userId>(?<=user=).*?(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=]:\\s)).*(?<eventInfo>(?<=:\\s).*(?=;)).*(?<logname>(?<=logname=).*?(?=\\s)).*(?<dstUserId>(?<=uid=).*?(?=\\s)).*(?<effectiveUserId>(?<=euid=).*?(?=\\s)).*(?<sessionName>(?<=tty=).*?(?=\\s)).*(?<srcUserId>(?<=ruser=).*?(?=\\s)).*(?<ipSrcAddr>(?<=rhost=).*?(?=\\s)).*(?<userId>(?<=user=).*?(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=\\]:\\s).*?(?=for)).*?(?<dstUserId>(?<=\\sfor).*?(?=\\[)).*?(?<subProcess>(?<=\\[).*?(?=\\])).*$",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=:\\s)Excess permission or bad ownership on file(?=\\s\\/)).*?(?<filePath>(?<=\\s).*(?=\\/)).*?(?<fileName>(?<=\\/).*(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=:).*?(?=;)).*$",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=:).*?(?=\\d)).*$",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=:).*?(?=$))"
    --- End diff --
    
    Yes, there are different forms of this raw message. All these expressions will be evaluated in order until a match is found. The most complex regular expression should appear first in the list and the least complex should be last.
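
The ordering rule described above can be illustrated with a small sketch. It assumes, as in the quoted `extractNamedGroups` method, that each pattern must match the whole message and that the first match wins. The class `OrderedRegexDemo`, its methods, and the two simplified sshd patterns are illustrative assumptions, not the expressions from the PR's configuration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the patterns are simplified for illustration.
final class OrderedRegexDemo {

  // Try each pattern in insertion order and return the named groups of the
  // first one that matches the whole message; empty map if none matches.
  static Map<String, String> firstMatch(Map<Pattern, List<String>> patterns, String message) {
    for (Map.Entry<Pattern, List<String>> entry : patterns.entrySet()) {
      Matcher m = entry.getKey().matcher(message);
      if (m.matches()) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String name : entry.getValue()) {
          String value = m.group(name);
          if (value != null) {
            fields.put(name, value.trim());
          }
        }
        return fields;
      }
    }
    return new LinkedHashMap<>();
  }

  // Most specific expression first, catch-all expression last.
  static Map<Pattern, List<String>> sshdPatterns() {
    Map<Pattern, List<String>> patterns = new LinkedHashMap<>();
    patterns.put(
        Pattern.compile(".*sshd\\[(?<pid>\\d+)\\]: (?<eventInfo>.*?) for (?<user>\\S+)"
            + " from (?<ip>\\S+) port (?<port>\\d+).*"),
        Arrays.asList("pid", "eventInfo", "user", "ip", "port"));
    patterns.put(
        Pattern.compile(".*sshd\\[(?<pid>\\d+)\\]: (?<eventInfo>.*)"),
        Arrays.asList("pid", "eventInfo"));
    return patterns;
  }

  public static void main(String[] args) {
    String message = "<38>Jun 20 15:01:17 deviceName sshd[11672]: "
        + "Accepted publickey for prod from 22.22.22.22 port 55555 ssh2";
    System.out.println(firstMatch(sshdPatterns(), message));
  }
}
```

If the catch-all pattern were listed first, it would match every sshd message and the whole remainder would land in `eventInfo`, so the more specific fields would never be extracted. That is the reason for placing the most complex expression first.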


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by justinleet <gi...@git.apache.org>.
Github user justinleet commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r232362818
  
    --- Diff: metron-platform/metron-parsers/src/main/java/org/apache/metron/parsers/regex/RegularExpressionsParser.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more contributor license
    + * agreements. See the NOTICE file distributed with this work for additional information regarding
    + * copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance with the License. You may obtain a
    + * copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software distributed under the License
    + * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
    + * or implied. See the License for the specific language governing permissions and limitations under
    + * the License.
    + */
    +
    +package org.apache.metron.parsers.regex;
    +
    +import java.nio.charset.Charset;
    +import java.text.ParseException;
    +import java.util.ArrayList;
    +import java.util.Arrays;
    +import java.util.HashMap;
    +import java.util.HashSet;
    +import java.util.LinkedHashMap;
    +import java.util.List;
    +import java.util.Map;
    +import java.util.Optional;
    +import java.util.Set;
    +import java.util.TreeSet;
    +import java.util.regex.Matcher;
    +import java.util.regex.Pattern;
    +import java.util.stream.Collectors;
    +import org.apache.commons.lang3.StringUtils;
    +import org.apache.metron.common.Constants;
    +import org.apache.metron.parsers.BasicParser;
    +import org.apache.metron.common.Constants.ParserConfigConstants;
    +import org.json.simple.JSONObject;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +//@formatter:off
    +/**
    + * General purpose class to parse unstructured text message into a json object. This class parses
    + * the message as per supplied parser config as part of sensor config. Sensor parser config example:
    + *
    + * <pre>
    + * <code>
    + * "convertCamelCaseToUnderScore": true,
    + * "recordTypeRegex": "(?&lt;process&gt;(?&lt;=\\s)\\b(kernel|syslog)\\b(?=\\[|:))",
    + * "messageHeaderRegex": "(?&lt;syslogpriority&gt;(?&lt;=^&lt;)\\d{1,4}(?=&gt;)).*?(?&lt;timestamp>(?&lt;=&gt;)[A-Za-z]{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(?&lt;syslogHost&gt;(?&lt;=\\s).*?(?=\\s))",
    + * "fields": [
    + * {
    + * "recordType": "kernel",
    + * "regex": ".*(?&lt;eventInfo&gt;(?&lt;=\\]|\\w\\:).*?(?=$))"
    + * },
    + * {
    + * "recordType": "syslog",
    + * "regex": ".*(?&lt;processid&gt;(?&lt;=PID\\s=\\s).*?(?=\\sLine)).*(?&lt;filePath&gt;(?&lt;=64\\s)\/([A-Za-z0-9_-]+\/)+(?=\\w))(?&lt;fileName&gt;.*?(?=\")).*(?&lt;eventInfo&gt;(?&lt;=\").*?(?=$))"
    + * }
    + * ]
    + * </code>
    + * </pre>
    + *
    + * Note: messageHeaderRegex could be specified as lists also e.g.
    + *
    + * <pre>
    + * <code>
    + * "messageHeaderRegex": [
    + * "regular expression 1",
    + * "regular expression 2"
    + * ]
    + * </code>
    + * </pre>
    + *
    + * Where <strong>regular expression 1</strong> are valid regular expressions and may have named
    + * groups, which would be extracted into fields. This list will be evaluated in order until a
    + * matching regular expression is found.<br>
    + * <br>
    + *
    + * <strong>Configuration fields explanation</strong>
    + *
    + * <pre>
    + * recordTypeRegex : used to specify a regular expression to distinctly identify a record type.
    + * messageHeaderRegex :  used to specify a regular expression to extract fields from a message part which is common across all the messages.
    + * e.g. rhel logs looks like
    + * <code>
    + * <7>Jun 26 16:18:01 hostName kernel: SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
    + * </code>
    + * <br>
    + * </pre>
    + *
    + * Here message structure (<7>Jun 26 16:18:01 hostName kernel) is common across all messages.
    + * Hence messageHeaderRegex could be used to extract fields from this part.
    + *
    + * fields : json list of objects containing recordType and regex. regex could be a further list e.g.
    + *
    + * <pre>
    + * <code>
    + * "regex":  [ "record type specific regular expression 1",
    + *             "record type specific regular expression 2"]
    + *
    + * </code>
    + * </pre>
    + *
    + * <strong>Limitation</strong> <br>
    + * Currently the named groups in java regular expressions have a limitation. Only following
    + * characters could be used to name a named group. A capturing group can also be assigned a "name",
    + * a named-capturing group, and then be back-referenced later by the "name". Group names are
    + * composed of the following characters. The first character must be a letter.
    + *
    + * <pre>
    + * <code>
    + * The uppercase letters 'A' through 'Z' ('\u0041' through '\u005a'),
    + * The lowercase letters 'a' through 'z' ('\u0061' through '\u007a'),
    + * The digits '0' through '9' ('\u0030' through '\u0039'),
    + * </code>
    + * </pre>
    + *
    + * This means that an _ (underscore), cannot be used as part of a named group name. E.g. this is an
    + * invalid regular expression <code>.*(?&lt;event_info&gt;(?&lt;=\\]|\\w\\:).*?(?=$))</code>
    + *
    + * However, this limitation can be easily overcome by adding a parser configuration setting.
    + *
    + * <code>
    + *  "convertCamelCaseToUnderScore": true,
    + * <code>
    + * If above property is added to the sensor parser configuration, in parserConfig object, this parser will automatically convert all the camel case property names to underscore seperated.
    + * For example, following convertions will automatically happen:
    --- End diff --
    
    convertions -> conversions
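
As a side note on the limitation described in the quoted Javadoc, the restriction on group names can be checked directly: `java.util.regex.Pattern` accepts only ASCII letters and digits in a named group, so a name containing an underscore fails to compile. The sketch below is a small check of that behaviour; the class name `GroupNameCheck` and the method `compiles` are hypothetical.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical check class, not part of the PR.
final class GroupNameCheck {

  // Returns true if the regular expression compiles, false if it is rejected.
  static boolean compiles(String regex) {
    try {
      Pattern.compile(regex);
      return true;
    } catch (PatternSyntaxException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(compiles("(?<eventInfo>.*)"));  // true
    System.out.println(compiles("(?<event_info>.*)")); // false
  }
}
```

This is why the configuration uses camelCase group names such as `eventInfo` and relies on `convertCamelCaseToUnderScore` to produce underscore-separated field names.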


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by nickwallen <gi...@git.apache.org>.
Github user nickwallen commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r237574064
  
    --- Diff: metron-platform/metron-parsers/src/test/java/org/apache/metron/parsers/regex/RegularExpressionsParserTest.java ---
    @@ -0,0 +1,152 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more contributor license
    + * agreements. See the NOTICE file distributed with this work for additional information regarding
    + * copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance with the License. You may obtain a
    + * copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software distributed under the License
    + * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
    + * or implied. See the License for the specific language governing permissions and limitations under
    + * the License.
    + */
    +package org.apache.metron.parsers.regex;
    +
    +import org.json.simple.JSONObject;
    +import org.json.simple.parser.JSONParser;
    +import org.junit.Before;
    +import org.junit.Test;
    +
    +import java.nio.file.Files;
    +import java.nio.file.Paths;
    +import java.util.HashMap;
    +import java.util.List;
    +import java.util.Map;
    +
    +import static org.junit.Assert.assertTrue;
    +
    +public class RegularExpressionsParserTest {
    +
    +  private RegularExpressionsParser regularExpressionsParser;
    +  private JSONObject parserConfig;
    +
    +  @Before
    +  public void setUp() throws Exception {
    +    regularExpressionsParser = new RegularExpressionsParser();
    +  }
    +
    +  @Test
    +  public void testSSHDParse() throws Exception {
    +    String message =
    +        "<38>Jun 20 15:01:17 deviceName sshd[11672]: Accepted publickey for prod from 22.22.22.22 port 55555 ssh2";
    +
    +    parserConfig = getJsonConfig(
    +        Paths.get("src/test/resources/config/RegularExpressionsParserConfig.json").toString());
    --- End diff --
    
    The configuration contained in `src/test/resources/config/RegularExpressionsParserConfig.json` is hard to grok because it covers so many record types. I would get rid of this JSON file completely, and in fact all of the JSON files that you added in `src/test/resources/config`.
    
    Instead, use the @Multiline annotation along with a more focused configuration for each test case. You don't need 30 different record types defined to test SSHD parsing. Each test case would be preceded by a @Multiline-annotated field containing the configuration for that test case.
    
    For example, your SSHD test might look like this.
    
    ```
      /**
       * {
       * 	"convertCamelCaseToUnderScore": true,
       * 	"messageHeaderRegex": "(?<syslogpriority>(?<=^<)\\d{1,4}(?=>)).*?(?<timestampDeviceOriginal>(?<=>)[A-Za-z]{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(?<deviceName>(?<=\\s).*?(?=\\s))",
       * 	"recordTypeRegex": "(?<dstProcessName>(?<=\\s)\\b(tch-replicant|audispd|syslog|ntpd|sendmail|pure-ftpd|usermod|useradd|anacron|unix_chkpwd|sudo|dovecot|postfix\\/smtpd|postfix\\/smtp|postfix\\/qmgr|klnagent|systemd|(?i)crond(?-i)|clamd|kesl|sshd|run-parts|automount|suexec|freshclam|kernel|vsftpd|ftpd|su)\\b(?=\\[|:))",
       * 	"fields": [
       *    {
       * 			"recordType": "sshd",
       * 			"regex": [
       * 				".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=\\]:\\s).*?(?=\\sfor)).*?(?<dstUserId>(?<=\\sfor\\s).*?(?=\\sfrom)).*?(?<ipSrcAddr>(?<=\\sfrom\\s).*?(?=\\sport)).*?(?<ipSrcPort>(?<=\\sport\\s).*?(?=\\s)).*?(?<appProtocol>(?<=port\\s\\d{1,5}\\s).*(?=:\\s)).*?(?<encryptionAlgorithm>(?<=:\\s).+?(?=\\s)).*(?<correlationId>(?<=\\s).+?(?=$))",
       * 				".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=\\]:\\s).*?(?=\\sfor)).*?(?<dstUserId>(?<=\\sfor\\s).*?(?=\\sfrom)).*?(?<ipSrcAddr>(?<=\\sfrom\\s).*?(?=\\sport)).*?(?<ipSrcPort>(?<=\\sport\\s).*?(?=\\s)).*?(?<appProtocol>(?<=port\\s\\d{1,5}\\s).*?(?=$))",
       * 				".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<ipDstAddr>(?<=Remote:).*?(?=\\-)).*?(?<ipDstPort>(?<=\\-).*?(?=;)).*?(?<appProtocol>(?<=Protocol:).*?(?=;)).*?(?<sshClient>(?<=Client:).*?(?=$))",
       * 				".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<appProtocol>(?<=\\]:).*?(?=:)).*?(?<ipDstAddr>(?<=Remote:).*?(?=\\-)).*?(?<ipDstPort>(?<=\\-).*?(?=;)).*?(?<encryptionAlgorithm>(?<=Enc:\\s).*?(?=$))",
       * 				".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<ipDstAddr>(?<=Remote:).*?(?=\\-)).*?(?<ipDstPort>(?<=\\-).*?(?=;)).*?(?<encryptionAlgorithm>(?<=Enc:\\s).*?(?=$))",
       * 				".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=:).*?(?=for)).*?(?<dstUserId>(?<=for).*?(?=from)).*?(?<ipSrcAddr>(?<=from).*?(?=port)).*?(?<ipSrcPort>(?<=port).*?(?=\\s)).*?(?<appProtocol>(?<=\\s).*?(?=$))",
       * 				".*(?<dstProcessId>(?<=\\[).*?(?=\\]))]:\\s.*?(?<eventInfo>subsystem.*?(?=by\\suser)).*?(?<srcUserId>(?<=user).*?(?=$))",
       * 				".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<action>(?<=Received).*?(?=from)).*?(?<ipSrcAddr>(?<=from).*?(?=:)).*?(?<eventInfo>(?<=11:).*?(?=$))",
       * 				".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=\\]:\\s)Server\\slistening(?=\\s)).*?(?<ipSrcAddr>(?<=\\son\\s).*?(?=port)).*?(?<ipSrcPort>(?<=port\\s)\\d{1,6}(?=\\.)).*$",
       * 				".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=\\]:\\s)Invalid user(?=\\s)).*?(?<dstUserId>(?<=\\s).*?(?=from)).*?(?<ipSrcAddr>(?<=from\\s).*(?=$))",
       * 				".*(?<dstProcessId>(?<=\\[).*?(?=]:\\s)).*(?<subProcess>(?<=]:\\s).*\\)(?=:)).*(?<eventInfo>(?<=:\\s).*(?=;)).*(?<logname>(?<=logname=).*?(?=\\s)).*(?<dstUserId>(?<=uid=).*?(?=\\s)).*(?<effectiveUserId>(?<=euid=).*?(?=\\s)).*(?<sessionName>(?<=tty=).*?(?=\\s)).*(?<srcUserId>(?<=ruser=).*?(?=\\s)).*(?<ipSrcAddr>(?<=rhost=).*?(?=\\s)).*(?<userId>(?<=user=).*?(?=$))",
       * 				".*(?<dstProcessId>(?<=\\[).*?(?=]:\\s)).*(?<eventInfo>(?<=:\\s).*(?=;)).*(?<logname>(?<=logname=).*?(?=\\s)).*(?<dstUserId>(?<=uid=).*?(?=\\s)).*(?<effectiveUserId>(?<=euid=).*?(?=\\s)).*(?<sessionName>(?<=tty=).*?(?=\\s)).*(?<srcUserId>(?<=ruser=).*?(?=\\s)).*(?<ipSrcAddr>(?<=rhost=).*?(?=\\s)).*(?<userId>(?<=user=).*?(?=$))",
       * 				".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=\\]:\\s).*?(?=for)).*?(?<dstUserId>(?<=\\sfor).*?(?=\\[)).*?(?<subProcess>(?<=\\[).*?(?=\\])).*$",
       * 				".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=:\\s)Excess permission or bad ownership on file(?=\\s\\/)).*?(?<filePath>(?<=\\s).*(?=\\/)).*?(?<fileName>(?<=\\/).*(?=$))",
       * 				".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=:).*?(?=;)).*$",
       * 				".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=:).*?(?=\\d)).*$",
       * 				".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=:).*?(?=$))"
       * 			]
       *    }
       *   ]
       * }
       */
      @Multiline
      private String testSSHDParse;
    
      @Test
      public void testSSHDParse() throws Exception {
        String message = "<38>Jun 20 15:01:17 deviceName sshd[11672]: Accepted publickey for prod from 22.22.22.22 port 55555 ssh2";
    
        regularExpressionsParser.configure(toJSON(testSSHDParse));
        JSONObject parsed = parse(message);
        // Expected
       ...
      }
    
    ```
    
    This will make it much easier to grok the test cases. We do this in other [parts of the code base](https://github.com/apache/metron/blob/89a2beda4f07911c8b3cd7dee8a2c3426838d161/metron-analytics/metron-profiler-storm/src/test/java/org/apache/metron/profiler/storm/integration/ProfilerIntegrationTest.java#L151).


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by nickwallen <gi...@git.apache.org>.
Github user nickwallen commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r237868794
  
    --- Diff: metron-platform/metron-parsers/src/test/java/org/apache/metron/parsers/regex/RegularExpressionsParserTest.java ---
    @@ -0,0 +1,152 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more contributor license
    + * agreements. See the NOTICE file distributed with this work for additional information regarding
    + * copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance with the License. You may obtain a
    + * copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software distributed under the License
    + * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
    + * or implied. See the License for the specific language governing permissions and limitations under
    + * the License.
    + */
    +package org.apache.metron.parsers.regex;
    +
    +import org.json.simple.JSONObject;
    +import org.json.simple.parser.JSONParser;
    +import org.junit.Before;
    +import org.junit.Test;
    +
    +import java.nio.file.Files;
    +import java.nio.file.Paths;
    +import java.util.HashMap;
    +import java.util.List;
    +import java.util.Map;
    +
    +import static org.junit.Assert.assertTrue;
    +
    +public class RegularExpressionsParserTest {
    +
    +  private RegularExpressionsParser regularExpressionsParser;
    +  private JSONObject parserConfig;
    +
    +  @Before
    +  public void setUp() throws Exception {
    +    regularExpressionsParser = new RegularExpressionsParser();
    +  }
    +
    +  @Test
    +  public void testSSHDParse() throws Exception {
    +    String message =
    +        "<38>Jun 20 15:01:17 deviceName sshd[11672]: Accepted publickey for prod from 22.22.22.22 port 55555 ssh2";
    +
    +    parserConfig = getJsonConfig(
    +        Paths.get("src/test/resources/config/RegularExpressionsParserConfig.json").toString());
    +    regularExpressionsParser.configure(parserConfig);
    +    JSONObject parsed = parse(message);
    +    // Expected
    +    Map<String, Object> expectedJson = new HashMap<>();
    +    expectedJson.put("device_name", "deviceName");
    +    expectedJson.put("dst_process_name", "sshd");
    +    expectedJson.put("dst_process_id", "11672");
    +    expectedJson.put("dst_user_id", "prod");
    +    expectedJson.put("ip_src_addr", "22.22.22.22");
    +    expectedJson.put("ip_src_port", "55555");
    +    expectedJson.put("app_protocol", "ssh2");
    +    assertTrue(validate(expectedJson, parsed));
    +
    +  }
    +
    +  @Test
    +  public void testNoMessageHeaderRegex() throws Exception {
    +    String message =
    +        "<38>Jun 20 15:01:17 deviceName sshd[11672]: Accepted publickey for prod from 22.22.22.22 port 55555 ssh2";
    +    parserConfig = getJsonConfig(
    +        Paths.get("src/test/resources/config/RegularExpressionsNoMessageHeaderParserConfig.json")
    +            .toString());
    +    regularExpressionsParser.configure(parserConfig);
    +    JSONObject parsed = parse(message);
    +    // Expected
    +    Map<String, Object> expectedJson = new HashMap<>();
    +    expectedJson.put("dst_process_name", "sshd");
    +    expectedJson.put("dst_process_id", "11672");
    +    expectedJson.put("dst_user_id", "prod");
    +    expectedJson.put("ip_src_addr", "22.22.22.22");
    +    expectedJson.put("ip_src_port", "55555");
    +    expectedJson.put("app_protocol", "ssh2");
    +    assertTrue(validate(expectedJson, parsed));
    --- End diff --
    
    > Junit best practices state that maximum one assertion per test case. 
    
    I have never heard that, nor ever, ever followed that. :)  I think every test in Metron has multiple assertions, which are necessary.  
    
    I think best practice may be to test one "thing" at a time, but you may require multiple assertions when testing that one "thing".
    
    I think it is much simpler the way I suggested, but we could probably spend the time on other more important things.
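    
    As a rough illustration of testing one "thing" with several field-level checks, here is a minimal sketch in plain Java (no JUnit dependency, so it stays self-contained). `FieldChecks` and `mismatches` are hypothetical names, not part of Metron or this PR; in a real test each reported line would correspond to its own `assertEquals` call.
    
    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.Objects;
    
    // Hypothetical helper, not part of Metron: checks every expected field
    // against the parsed output and reports each mismatch separately.
    final class FieldChecks {
    
      static List<String> mismatches(Map<String, Object> expected, Map<String, Object> actual) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, Object> e : expected.entrySet()) {
          Object found = actual.get(e.getKey());
          if (!Objects.equals(e.getValue(), found)) {
            problems.add(e.getKey() + ": expected <" + e.getValue() + "> but was <" + found + ">");
          }
        }
        return problems;
      }
    }
    ```
    
    A failing field then shows up by name in the test output, rather than as a single `assertTrue` failure with no detail.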


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by justinleet <gi...@git.apache.org>.
Github user justinleet commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r232689398
  
    --- Diff: metron-platform/metron-parsers/src/test/java/org/apache/metron/parsers/regex/RegularExpressionsParserTest.java ---
    @@ -0,0 +1,118 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.metron.parsers.regex;
    +
    +import static org.junit.Assert.assertTrue;
    +
    +import java.nio.file.Files;
    +import java.nio.file.Paths;
    +import java.util.HashMap;
    +import java.util.List;
    +import java.util.Map;
    +import org.json.simple.JSONObject;
    +import org.json.simple.parser.JSONParser;
    +import org.junit.Assert;
    +import org.junit.Before;
    +import org.junit.Test;
    +import org.apache.metron.parsers.regex.RegularExpressionsParser;
    +
    +public class RegularExpressionsParserTest {
    +    private RegularExpressionsParser regularExpressionsParser;
    +    private JSONObject parserConfig;
    +
    +    @Test
    --- End diff --
    
    Can we add some tests for some of the less happy path things? E.g. what if a regex is malformed, does that bubble up reasonably?
    
    Do we also need some tests for things like, "What if a header regex is empty?" and so on?
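    
    For reference, a malformed regex surfaces from `java.util.regex.Pattern.compile` as a `PatternSyntaxException`. The sketch below shows one way a test could probe that; `RegexConfigCheck` and `compileError` are hypothetical names, and treating an empty regex as an error is an assumption for illustration, not what the parser in this PR does.
    
    ```java
    import java.util.Optional;
    import java.util.regex.Pattern;
    import java.util.regex.PatternSyntaxException;
    
    // Hypothetical helper, not part of the PR: reports whether a configured
    // regex compiles, returning a short error description when it does not.
    final class RegexConfigCheck {
    
      static Optional<String> compileError(String regex) {
        // Assumption for this sketch: a null or blank regex counts as a config error.
        if (regex == null || regex.trim().isEmpty()) {
          return Optional.of("regex is empty");
        }
        try {
          Pattern.compile(regex);
          return Optional.empty();
        } catch (PatternSyntaxException e) {
          // getDescription() gives the short reason; getMessage() also includes pattern and index.
          return Optional.of(e.getDescription());
        }
      }
    }
    ```
    
    A test for the unhappy path could feed an unclosed group such as `(?<pid>\\d+` and check that an error comes back instead of a silently ignored pattern.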


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r239859491
  
    --- Diff: metron-platform/metron-parsers/src/test/java/org/apache/metron/parsers/regex/RegularExpressionsParserTest.java ---
    @@ -0,0 +1,152 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more contributor license
    + * agreements. See the NOTICE file distributed with this work for additional information regarding
    + * copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance with the License. You may obtain a
    + * copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software distributed under the License
    + * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
    + * or implied. See the License for the specific language governing permissions and limitations under
    + * the License.
    + */
    +package org.apache.metron.parsers.regex;
    +
    +import org.json.simple.JSONObject;
    +import org.json.simple.parser.JSONParser;
    +import org.junit.Before;
    +import org.junit.Test;
    +
    +import java.nio.file.Files;
    +import java.nio.file.Paths;
    +import java.util.HashMap;
    +import java.util.List;
    +import java.util.Map;
    +
    +import static org.junit.Assert.assertTrue;
    +
    +public class RegularExpressionsParserTest {
    +
    +  private RegularExpressionsParser regularExpressionsParser;
    +  private JSONObject parserConfig;
    +
    +  @Before
    +  public void setUp() throws Exception {
    +    regularExpressionsParser = new RegularExpressionsParser();
    +  }
    +
    +  @Test
    +  public void testSSHDParse() throws Exception {
    +    String message =
    +        "<38>Jun 20 15:01:17 deviceName sshd[11672]: Accepted publickey for prod from 22.22.22.22 port 55555 ssh2";
    +
    +    parserConfig = getJsonConfig(
    +        Paths.get("src/test/resources/config/RegularExpressionsParserConfig.json").toString());
    --- End diff --
    
    The integration tests have a standard setup. You have to do a few things; off the top of my head they are (again, check the PR and that parser for details):
    
    - write the IntegrationTest that derives from the base class
    - create a default sample configuration for your parser and put it in the configuration area
    - add the raw and parsed data to the integration testing module's data directory for comparison


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by nickwallen <gi...@git.apache.org>.
Github user nickwallen commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r237651771
  
    --- Diff: metron-platform/metron-parsers/src/test/resources/config/RegularExpressionsInvalidParserConfig.json ---
    @@ -0,0 +1,208 @@
    +{
    +  "convertCamelCaseToUnderScore": true,
    +  "messageHeaderRegex": "(?<syslogpriority>(?<=^<)\\d{1,4}(?=>)).*?(?<timestampDeviceOriginal>(?<=>)[A-Za-z]{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(?<deviceName>(?<=\\s).*?(?=\\s))",
    +  "recordTypeRegex": "(?<dstProcessName>(?<=\\s)\\b(tch-replicant|audispd|syslog|ntpd|sendmail|pure-ftpd|usermod|useradd|anacron|unix_chkpwd|sudo|dovecot|postfix\\/smtpd|postfix\\/smtp|postfix\\/qmgr|klnagent|systemd|(?i)crond(?-i)|clamd|kesl|sshd|run-parts|automount|suexec|freshclam|kernel|vsftpd|ftpd|su)\\b(?=\\[|:))",
    +  "fields": [
    +    {
    +      "recordType": "syslog",
    +      "regex": ".*(?<dstProcessId>(?<=PID\\s=\\s).*?(?=\\sLine)).*(?<filePath>(?<=64\\s)\/([A-Za-z0-9_-]+\/)+(?=\\w))(?<fileName>.*?(?=\")).*(?<eventInfo>(?<=\").*?(?=$))"
    +    },
    +    {
    +      "recordType": "pure-ftpd",
    +      "regex": ".*(?<srcUserId>(?<=\\:\\s\\().*?(?=\\)\\s)).*?(?<messageLevel>(?<=\\s\\[).*?(?=\\]\\s)).*?(?<eventInfo>(?<=\\]\\s).*?(?=$))"
    +    },
    +    {
    +      "recordType": "systemd",
    +      "regex": [
    +        ".*(?<eventInfo>(?<=\\ssystemd\\:\\s).*?(?=\\d+)).*?(?<sessionName>(?<=\\sSession\\s).*?(?=\\sof)).*?(?<srcUserId>(?<=\\suser\\s).*?(?=\\.)).*$",
    +        ".*(?<eventInfo>(?<=\\ssystemd\\:\\s).*?(?=\\sof)).*?(?<srcUserId>(?<=\\sof\\s).*?(?=\\.)).*$"
    +      ]
    +    },
    +    {
    +      "recordType": "kesl",
    +      "regex": ".*(?<eventInfo>(?<=\\:).*?(?=$))"
    +    },
    +    {
    +      "recordType": "dovecot",
    +      "regex": [
    +        ".*(?<subprocess>(?<=\\sdovecot:\\s).*?(?=\\:)).*?(?<eventInfo>(?<=\\:).*?(?=\\:\\suser)).*?(?<srcUserId>(?<=user\\=\\<).*?(?=\\>)).*?(?<rip>(?<=rip\\=).*?(?=,)).*?(?<lip>(?<=lip\\=).*?(?=,)).*?(?<connectionType>(?<=,\\s).*?(?=,)).*?(?<sessionName>(?<=session\\=\\<).*?(?=\\>)).*$",
    +        ".*(?<subprocess>(?<=\\sdovecot:\\s).*?(?=\\:)).*?(?<eventInfo>(?<=\\:).*?(?=\\:\\srip)).*?(?<rip>(?<=rip\\=).*?(?=,)).*?(?<lip>(?<=lip\\=).*?(?=,)).*?(?<connectionType>(?<=,\\s).*?(?=$))",
    +        ".*(?<subprocess>(?<=\\sdovecot:\\s).*?(?=\\:)).*?(?<eventInfo>(?<=\\:).*?(?=$))"
    +      ]
    +    },
    +    {
    +      "recordType": "postfix/smtpd",
    +      "regex": [
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=\\:).*?(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\]:)).*?(?<eventInfo>(?<=\\:\\s)disconnect(?=\\sfrom)).*?(?<srcHost>(?<=from).*(?=\\[)).*?(?<ipSrcAddr>(?<=\\[).*(?=\\])).*$"
    +      ]
    +    },
    +    {
    +      "recordType": "postfix/smtp",
    +      "regex": [
    +        ".*(?<dstProcessId>(?<=smtp\\[).*?(?=\\]:)).*(?<toEmail>(?<=to=#\\<).*?(?=>,)).*(?<relay>(?<=relay=).*?(?=,)).*(?<delay>(?<=delay=).*?(?=,)).*(?<delays>(?<=delays=).*?(?=,)).*(?<dsn>(?<=dsn=).*?(?=,)).*(?<status>(?<=status=).*?(?=\\()).*?(?<dstHost>(?<=connect to).*?(?=\\[)).*?(?<ipDstAddr>(?<=\\[).*?(?=\\])).*?(?<ipDstPort>(?<=\\]:).*?(?=:\\s)).*?(?<eventInfo>(?<=:\\s).*?(?=$))",
    +        ".*(?<dstProcessId>(?<=smtp\\[).*?(?=\\]:)).*?(?<dstHost>(?<=connect to).*?(?=\\[)).*?(?<ipDstAddr>(?<=\\[).*?(?=\\])).*(?<ipDstPort>(?<=:).*?(?=\\s)).*(?<eventInfo>(?<=\\s).*?(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=\\:).*?(?=$))"
    +      ]
    +    },
    +    {
    +      "recordType": "crond",
    +      "regex": [
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<srcUserId>(?<=\\]:\\s\\().*?(?=\\)\\s)).*?(?<commandLine>(?<=CMD\\s\\().*?(?=\\))).*$",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<srcUserId>(?<=\\]:\\s\\().*?(?=\\)\\s)).*?(?<eventInfo>(?<=\\().*?(?=\\))).*$",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<srcUserId>(?<=\\]:\\s\\().*?(?=\\)\\s)).*?(?<commandLine>(?<=CMD\\s\\().*?(?=\\))).*$",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=\\:).*?(?=$))"
    +      ]
    +    },
    +    {
    +      "recordType": "clamd",
    +      "regex": [
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<subProcess>(?<=\\:\\s).*?(?=\\:)).*?(?<eventInfo>(?<=\\:).*?(?=$))",
    +        ".*(?<subProcess>(?<=\\:\\s).*?(?=\\:)).*?(?<eventInfo>(?<=\\:).*?(?=$))"
    +      ]
    +    },
    +    {
    +      "recordType": "run-parts",
    +      "regex": ".*(?<eventInfo>(?<=\\sparts).*?(?=$))"
    +    },
    +    {
    +      "recordType": "sshd",
    +      "regex": [
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<event_Info>(?<=\\]:\\s).*?(?=\\sfor)).*?(?<dstUserId>(?<=\\sfor\\s).*?(?=\\sfrom)).*?(?<ipSrcAddr>(?<=\\sfrom\\s).*?(?=\\sport)).*?(?<ipSrcPort>(?<=\\sport\\s).*?(?=\\s)).*?(?<appProtocol>(?<=port\\s\\d{1,5}\\s).*(?=:\\s)).*?(?<encryptionAlgorithm>(?<=:\\s).+?(?=\\s)).*(?<correlationId>(?<=\\s).+?(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=\\]:\\s).*?(?=\\sfor)).*?(?<dstUserId>(?<=\\sfor\\s).*?(?=\\sfrom)).*?(?<ipSrcAddr>(?<=\\sfrom\\s).*?(?=\\sport)).*?(?<ipSrcPort>(?<=\\sport\\s).*?(?=\\s)).*?(?<appProtocol>(?<=port\\s\\d{1,5}\\s).*?(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<ipDstAddr>(?<=Remote:).*?(?=\\-)).*?(?<ipDstPort>(?<=\\-).*?(?=;)).*?(?<appProtocol>(?<=Protocol:).*?(?=;)).*?(?<sshClient>(?<=Client:).*?(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<appProtocol>(?<=\\]:).*?(?=:)).*?(?<ipDstAddr>(?<=Remote:).*?(?=\\-)).*?(?<ipDstPort>(?<=\\-).*?(?=;)).*?(?<encryptionAlgorithm>(?<=Enc:\\s).*?(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<ipDstAddr>(?<=Remote:).*?(?=\\-)).*?(?<ipDstPort>(?<=\\-).*?(?=;)).*?(?<encryptionAlgorithm>(?<=Enc:\\s).*?(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=:).*?(?=for)).*?(?<dstUserId>(?<=for).*?(?=from)).*?(?<ipSrcAddr>(?<=from).*?(?=port)).*?(?<ipSrcPort>(?<=port).*?(?=\\s)).*?(?<appProtocol>(?<=\\s).*?(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\]))]:\\s.*?(?<eventInfo>subsystem.*?(?=by\\suser)).*?(?<srcUserId>(?<=user).*?(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<action>(?<=Received).*?(?=from)).*?(?<ipSrcAddr>(?<=from).*?(?=:)).*?(?<eventInfo>(?<=11:).*?(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=\\]:\\s)Server\\slistening(?=\\s)).*?(?<ipSrcAddr>(?<=\\son\\s).*?(?=port)).*?(?<ipSrcPort>(?<=port\\s)\\d{1,6}(?=\\.)).*$",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=\\]:\\s)Invalid user(?=\\s)).*?(?<dstUserId>(?<=\\s).*?(?=from)).*?(?<ipSrcAddr>(?<=from\\s).*(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=]:\\s)).*(?<subProcess>(?<=]:\\s).*\\)(?=:)).*(?<eventInfo>(?<=:\\s).*(?=;)).*(?<logname>(?<=logname=).*?(?=\\s)).*(?<dstUserId>(?<=uid=).*?(?=\\s)).*(?<effectiveUserId>(?<=euid=).*?(?=\\s)).*(?<sessionName>(?<=tty=).*?(?=\\s)).*(?<srcUserId>(?<=ruser=).*?(?=\\s)).*(?<ipSrcAddr>(?<=rhost=).*?(?=\\s)).*(?<userId>(?<=user=).*?(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=]:\\s)).*(?<eventInfo>(?<=:\\s).*(?=;)).*(?<logname>(?<=logname=).*?(?=\\s)).*(?<dstUserId>(?<=uid=).*?(?=\\s)).*(?<effectiveUserId>(?<=euid=).*?(?=\\s)).*(?<sessionName>(?<=tty=).*?(?=\\s)).*(?<srcUserId>(?<=ruser=).*?(?=\\s)).*(?<ipSrcAddr>(?<=rhost=).*?(?=\\s)).*(?<userId>(?<=user=).*?(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=\\]:\\s).*?(?=for)).*?(?<dstUserId>(?<=\\sfor).*?(?=\\[)).*?(?<subProcess>(?<=\\[).*?(?=\\])).*$",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=:\\s)Excess permission or bad ownership on file(?=\\s\\/)).*?(?<filePath>(?<=\\s).*(?=\\/)).*?(?<fileName>(?<=\\/).*(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=:).*?(?=;)).*$",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=:).*?(?=\\d)).*$",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=:).*?(?=$))"
    --- End diff --
    
    Help me understand why you need 17 different regular expressions to parse SSHD records?  Is it just that you see it in 17 different forms?
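    
    For context, the class Javadoc in this PR says a list of regexes is evaluated in order until one matches. A minimal sketch of that first-match-wins behaviour follows; `OrderedRegexMatch` and `firstMatch` are hypothetical names, the use of `find()` rather than `matches()` is an assumption, and this is not the PR's implementation. Group names are pulled out of the regex source with the same kind of pattern the PR uses for `namedGroupPattern`.
    
    ```java
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    
    // Hypothetical sketch: try each regex in list order, return the named
    // groups captured by the first one that matches the message.
    final class OrderedRegexMatch {
    
      // Finds "(?<name>" occurrences; lookbehinds such as "(?<=" do not match.
      private static final Pattern NAMED_GROUP = Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>");
    
      static Map<String, String> firstMatch(List<String> regexes, String message) {
        for (String regex : regexes) {
          Matcher m = Pattern.compile(regex).matcher(message);
          if (!m.find()) {
            continue; // fall through to the next, usually more general, regex
          }
          Map<String, String> fields = new LinkedHashMap<>();
          Matcher names = NAMED_GROUP.matcher(regex);
          while (names.find()) {
            String value = m.group(names.group(1));
            if (value != null) {
              fields.put(names.group(1), value);
            }
          }
          return fields;
        }
        return new LinkedHashMap<>(); // nothing matched
      }
    }
    ```
    
    With this ordering, the most specific expressions go first and a catch-all goes last, which is one reason such a list can grow long. A real parser would compile the patterns once at configure time rather than per message.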


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by jagdeepsingh2 <gi...@git.apache.org>.
Github user jagdeepsingh2 commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r234870826
  
    --- Diff: metron-platform/metron-parsers/src/main/java/org/apache/metron/parsers/regex/RegularExpressionsParser.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more contributor license
    + * agreements. See the NOTICE file distributed with this work for additional information regarding
    + * copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance with the License. You may obtain a
    + * copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software distributed under the License
    + * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
    + * or implied. See the License for the specific language governing permissions and limitations under
    + * the License.
    + */
    +
    +package org.apache.metron.parsers.regex;
    +
    +import java.nio.charset.Charset;
    +import java.text.ParseException;
    +import java.util.ArrayList;
    +import java.util.Arrays;
    +import java.util.HashMap;
    +import java.util.HashSet;
    +import java.util.LinkedHashMap;
    +import java.util.List;
    +import java.util.Map;
    +import java.util.Optional;
    +import java.util.Set;
    +import java.util.TreeSet;
    +import java.util.regex.Matcher;
    +import java.util.regex.Pattern;
    +import java.util.stream.Collectors;
    +import org.apache.commons.lang3.StringUtils;
    +import org.apache.metron.common.Constants;
    +import org.apache.metron.parsers.BasicParser;
    +import org.apache.metron.common.Constants.ParserConfigConstants;
    +import org.json.simple.JSONObject;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +//@formatter:off
    +/**
    + * General purpose class to parse unstructured text message into a json object. This class parses
    + * the message as per supplied parser config as part of sensor config. Sensor parser config example:
    + *
    + * <pre>
    + * <code>
    + * "convertCamelCaseToUnderScore": true,
    + * "recordTypeRegex": "(?&lt;process&gt;(?&lt;=\\s)\\b(kernel|syslog)\\b(?=\\[|:))",
    + * "messageHeaderRegex": "(?&lt;syslogpriority&gt;(?&lt;=^&lt;)\\d{1,4}(?=&gt;)).*?(?&lt;timestamp>(?&lt;=&gt;)[A-Za-z]{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(?&lt;syslogHost&gt;(?&lt;=\\s).*?(?=\\s))",
    + * "fields": [
    + * {
    + * "recordType": "kernel",
    + * "regex": ".*(?&lt;eventInfo&gt;(?&lt;=\\]|\\w\\:).*?(?=$))"
    + * },
    + * {
    + * "recordType": "syslog",
    + * "regex": ".*(?&lt;processid&gt;(?&lt;=PID\\s=\\s).*?(?=\\sLine)).*(?&lt;filePath&gt;(?&lt;=64\\s)\/([A-Za-z0-9_-]+\/)+(?=\\w))(?&lt;fileName&gt;.*?(?=\")).*(?&lt;eventInfo&gt;(?&lt;=\").*?(?=$))"
    + * }
    + * ]
    + * </code>
    + * </pre>
    + *
    + * Note: messageHeaderRegex could be specified as lists also e.g.
    + *
    + * <pre>
    + * <code>
    + * "messageHeaderRegex": [
    + * "regular expression 1",
    + * "regular expression 2"
    + * ]
    + * </code>
    + * </pre>
    + *
    + * Where <strong>regular expression 1</strong> are valid regular expressions and may have named
    + * groups, which would be extracted into fields. This list will be evaluated in order until a
    + * matching regular expression is found.<br>
    + * <br>
    + *
    + * <strong>Configuration fields explanation</strong>
    + *
    + * <pre>
    + * recordTypeRegex : used to specify a regular expression to distinctly identify a record type.
    + * messageHeaderRegex :  used to specify a regular expression to extract fields from a message part which is common across all the messages.
    + * e.g. rhel logs looks like
    + * <code>
    + * <7>Jun 26 16:18:01 hostName kernel: SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
    + * </code>
    + * <br>
    + * </pre>
    + *
    + * Here message structure (<7>Jun 26 16:18:01 hostName kernel) is common across all messages.
    + * Hence messageHeaderRegex could be used to extract fields from this part.
    + *
    + * fields : json list of objects containing recordType and regex. regex could be a further list e.g.
    + *
    + * <pre>
    + * <code>
    + * "regex":  [ "record type specific regular expression 1",
    + *             "record type specific regular expression 2"]
    + *
    + * </code>
    + * </pre>
    + *
    + * <strong>Limitation</strong> <br>
    + * Currently the named groups in java regular expressions have a limitation. Only following
    + * characters could be used to name a named group. A capturing group can also be assigned a "name",
    + * a named-capturing group, and then be back-referenced later by the "name". Group names are
    + * composed of the following characters. The first character must be a letter.
    + *
    + * <pre>
    + * <code>
    + * The uppercase letters 'A' through 'Z' ('\u0041' through '\u005a'),
    + * The lowercase letters 'a' through 'z' ('\u0061' through '\u007a'),
    + * The digits '0' through '9' ('\u0030' through '\u0039'),
    + * </code>
    + * </pre>
    + *
    + * This means that an _ (underscore), cannot be used as part of a named group name. E.g. this is an
    + * invalid regular expression <code>.*(?&lt;event_info&gt;(?&lt;=\\]|\\w\\:).*?(?=$))</code>
    + *
    + * However, this limitation can be easily overcome by adding a parser configuration setting.
    + *
    + * <code>
    + *  "convertCamelCaseToUnderScore": true,
    + * </code>
    + * If the above property is added to the sensor parser configuration, in the parserConfig object, this parser will automatically convert all camel case property names to underscore-separated names.
    + * For example, the following conversions will happen automatically:
    + *
    + * <code>
    + * ipSrcAddr -> ip_src_addr
    + * ipDstAddr -> ip_dst_addr
    + * ipSrcPort -> ip_src_port
    + * </code>
    + * etc.
    + */
    +//@formatter:on
    +public class RegularExpressionsParser extends BasicParser {
    +
    +  private static Logger LOG = LoggerFactory.getLogger(RegularExpressionsParser.class);
    +
    +  private static final Charset UTF_8 = Charset.forName("UTF-8");
    +
    +  private List<Map<String, Object>> fields;
    +  private Map<String, Object> parserConfig;
    +  private final Pattern namedGroupPattern = Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>");
    +  Pattern capitalLettersPattern = Pattern.compile("^.*[A-Z]+.*$");
    +  private Pattern recordTypePattern;
    +  private final Set<String> recordTypePatternNamedGroups = new HashSet<>();
    +  private final Map<String, Map<Pattern, Set<String>>> recordTypePatternMap = new HashMap<>();
    +  private final Map<Pattern, Set<String>> syslogPatternsMap = new HashMap<>();
    +
    +  /**
    +   * Parses an unstructured text message into a json object based upon the regular expression
    +   * configuration supplied.
    +   *
    +   * @param rawMessage incoming unstructured raw text.
    +   *
    +   * @return List of json parsed json objects. In this case list will have a single element only.
    +   */
    +  @Override
    +  public List<JSONObject> parse(byte[] rawMessage) {
    +    String originalMessage = null;
    +    try {
    +      originalMessage = new String(rawMessage, UTF_8).trim();
    +      LOG.debug(" raw message. {}", originalMessage);
    +      if (originalMessage.isEmpty()) {
    +        LOG.warn("Message is empty.");
    +        return Arrays.asList(new JSONObject());
    +      }
    +    } catch (final Exception e) {
    +      LOG.error("[Metron] Could not read raw message. {} " + originalMessage, e);
    +      throw new RuntimeException(e.getMessage(), e);
    +    }
    +
    +    try {
    +      final JSONObject parsedJson = new JSONObject();
    +      if (syslogPatternsMap.size() > 0) {
    +        parsedJson.putAll(extractHeaderFields(originalMessage));
    +      }
    +      parsedJson.putAll(parse(originalMessage));
    +      parsedJson.put(Constants.Fields.ORIGINAL.getName(), originalMessage);
    +      applyFieldTransformations(parsedJson);
    +      return Arrays.asList(parsedJson);
    +    } catch (final ParseException e) {
    +      LOG.error("Error occured in parsing. original message : " + originalMessage, e);
    +      throw new RuntimeException(e.getMessage(), e);
    +    }
    +  }
    +
    +  private void applyFieldTransformations(JSONObject parsedJson) {
    +    if (getParserConfig()
    +        .get(ParserConfigConstants.CONVERT_CAMELCASE_TO_UNDERSCORE.getName()) != null
    +        && (Boolean) getParserConfig()
    +            .get(ParserConfigConstants.CONVERT_CAMELCASE_TO_UNDERSCORE.getName())) {
    +      convertCamelCaseToUnderScore(parsedJson);
    +    }
    +
    +  }
    +
    +  // @formatter:off
    +  /**
    +   * This method is called during the parser initialization. It parses the parser
    +   * configuration and configures the parser accordingly. It then initializes
    +   * instance variables.
    +   *
    +   * @param parserConfig ParserConfig(Map<String, Object>) supplied to the sensor.
    +   * @see org.apache.metron.parsers.interfaces.Configurable#configure(java.util.Map)<br>
    +   *      <br>
    +   */
    +  // @formatter:on
    +  @Override
    +  public void configure(Map<String, Object> parserConfig) {
    +    setParserConfig(parserConfig);
    +    setFields(
    +        (List<Map<String, Object>>) getParserConfig().get(ParserConfigConstants.FIELDS.getName()));
    +
    +    setRecordTypePattern(
    +        (String) getParserConfig().get(ParserConfigConstants.RECORD_TYPE_REGEX.getName()));
    +    recordTypePatternNamedGroups.addAll(getNamedGroups(
    +        (String) getParserConfig().get(ParserConfigConstants.RECORD_TYPE_REGEX.getName())));
    +    final List<Map<String, Object>> fields =
    +        (List<Map<String, Object>>) getParserConfig().get(ParserConfigConstants.FIELDS.getName());
    +
    +    configureRecordTypePatterns(fields);
    +
    +    configureMessageHeaderPattern();
    +
    +    validateConfig();
    +  }
    +
    +  private void configureMessageHeaderPattern() {
    +    if (getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName()) != null) {
    +      if (getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName()) instanceof List) {
    +        final List<String> syslogPatternList =
    +            (List<String>) getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName());
    +        for (final String syslogPatternStr : syslogPatternList) {
    +          syslogPatternsMap.put(Pattern.compile(syslogPatternStr),
    +              getNamedGroups(syslogPatternStr));
    +        }
    +      } else if (getParserConfig()
    +          .get(ParserConfigConstants.MESSAGE_HEADER.getName()) instanceof String) {
    +        final String syslogPatternStr =
    +            (String) getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName());
    +        if (StringUtils.isNotBlank(syslogPatternStr)) {
    +          syslogPatternsMap.put(Pattern.compile(syslogPatternStr),
    +              getNamedGroups(syslogPatternStr));
    +        }
    +      }
    +    }
    +  }
    +
    +  private void configureRecordTypePatterns(List<Map<String, Object>> fields) {
    +
    +    for (final Map<String, Object> field : fields) {
    +      if (field.get(ParserConfigConstants.RECORD_TYPE.getName()) != null
    +          && field.get(ParserConfigConstants.REGEX.getName()) != null) {
    +        final String recordType =
    +            ((String) field.get(ParserConfigConstants.RECORD_TYPE.getName())).toLowerCase();
    +        recordTypePatternMap.put(recordType, new LinkedHashMap<Pattern, Set<String>>());
    +        if (field.get(ParserConfigConstants.REGEX.getName()) instanceof List) {
    +          final List<String> regexList =
    +              (List<String>) field.get(ParserConfigConstants.REGEX.getName());
    +          regexList.forEach(s -> {
    +            recordTypePatternMap.get(recordType).put(Pattern.compile(s), getNamedGroups(s));
    +          });
    +        } else if (field.get(ParserConfigConstants.REGEX.getName()) instanceof String) {
    +          recordTypePatternMap.get(recordType).put(
    +              Pattern.compile((String) field.get(ParserConfigConstants.REGEX.getName())),
    +              getNamedGroups((String) field.get(ParserConfigConstants.REGEX.getName())));
    +        }
    +      }
    +    }
    +  }
    +
    +  private void setRecordTypePattern(String recordTypeRegex) {
    +    if (recordTypeRegex != null) {
    +      recordTypePattern = Pattern.compile(recordTypeRegex);
    +    }
    +  }
    +
    +  private JSONObject parse(String originalMessage) throws ParseException {
    --- End diff --
    
    Removed the throws clause.
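    
    Separately, the camel-case-to-underscore conversion described in the class Javadoc above can be sketched as follows. This is an illustrative guess at the behaviour (for example `ipSrcAddr` becoming `ip_src_addr`), not the PR's `convertCamelCaseToUnderScore` method, whose body is not shown in this diff excerpt; `CamelCaseNames` and `toUnderscore` are hypothetical names.
    
    ```java
    import java.util.Locale;
    
    // Hypothetical sketch of the field-name conversion described in the Javadoc.
    final class CamelCaseNames {
    
      static String toUnderscore(String name) {
        // Put '_' between a lowercase letter or digit and the uppercase letter
        // that follows it, then lowercase the whole name.
        return name.replaceAll("([a-z0-9])([A-Z])", "$1_$2").toLowerCase(Locale.ROOT);
      }
    }
    ```
    
    Names without capital letters, such as `syslogpriority`, pass through unchanged.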


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by jagdeepsingh2 <gi...@git.apache.org>.
Github user jagdeepsingh2 commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r237714079
  
    --- Diff: metron-platform/metron-parsers/src/test/java/org/apache/metron/parsers/regex/RegularExpressionsParserTest.java ---
    @@ -0,0 +1,152 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more contributor license
    + * agreements. See the NOTICE file distributed with this work for additional information regarding
    + * copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance with the License. You may obtain a
    + * copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software distributed under the License
    + * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
    + * or implied. See the License for the specific language governing permissions and limitations under
    + * the License.
    + */
    +package org.apache.metron.parsers.regex;
    +
    +import org.json.simple.JSONObject;
    +import org.json.simple.parser.JSONParser;
    +import org.junit.Before;
    +import org.junit.Test;
    +
    +import java.nio.file.Files;
    +import java.nio.file.Paths;
    +import java.util.HashMap;
    +import java.util.List;
    +import java.util.Map;
    +
    +import static org.junit.Assert.assertTrue;
    +
    +public class RegularExpressionsParserTest {
    +
    +  private RegularExpressionsParser regularExpressionsParser;
    +  private JSONObject parserConfig;
    +
    +  @Before
    +  public void setUp() throws Exception {
    +    regularExpressionsParser = new RegularExpressionsParser();
    +  }
    +
    +  @Test
    +  public void testSSHDParse() throws Exception {
    +    String message =
    +        "<38>Jun 20 15:01:17 deviceName sshd[11672]: Accepted publickey for prod from 22.22.22.22 port 55555 ssh2";
    +
    +    parserConfig = getJsonConfig(
    +        Paths.get("src/test/resources/config/RegularExpressionsParserConfig.json").toString());
    +    regularExpressionsParser.configure(parserConfig);
    +    JSONObject parsed = parse(message);
    +    // Expected
    +    Map<String, Object> expectedJson = new HashMap<>();
    +    expectedJson.put("device_name", "deviceName");
    +    expectedJson.put("dst_process_name", "sshd");
    +    expectedJson.put("dst_process_id", "11672");
    +    expectedJson.put("dst_user_id", "prod");
    +    expectedJson.put("ip_src_addr", "22.22.22.22");
    +    expectedJson.put("ip_src_port", "55555");
    +    expectedJson.put("app_protocol", "ssh2");
    +    assertTrue(validate(expectedJson, parsed));
    +
    +  }
    +
    +  @Test
    +  public void testNoMessageHeaderRegex() throws Exception {
    +    String message =
    +        "<38>Jun 20 15:01:17 deviceName sshd[11672]: Accepted publickey for prod from 22.22.22.22 port 55555 ssh2";
    +    parserConfig = getJsonConfig(
    +        Paths.get("src/test/resources/config/RegularExpressionsNoMessageHeaderParserConfig.json")
    +            .toString());
    +    regularExpressionsParser.configure(parserConfig);
    +    JSONObject parsed = parse(message);
    +    // Expected
    +    Map<String, Object> expectedJson = new HashMap<>();
    +    expectedJson.put("dst_process_name", "sshd");
    +    expectedJson.put("dst_process_id", "11672");
    +    expectedJson.put("dst_user_id", "prod");
    +    expectedJson.put("ip_src_addr", "22.22.22.22");
    +    expectedJson.put("ip_src_port", "55555");
    +    expectedJson.put("app_protocol", "ssh2");
    +    assertTrue(validate(expectedJson, parsed));
    --- End diff --
    
    I personally found JUnit's logging insufficient and wanted more information in the logs. Also, `expectedJson.put("ip_src_port", "55555");` was more concise than its counterpart.
    
    Another advantage of this method is that it reports all the failed scenarios in one run, whereas a failed JUnit assertion stops the test case at that point.
    
    Also, JUnit best practices recommend at most one assertion per test case. To follow that practice we would have to write a unit test per field, which does not feel right. The validate method lets us follow the JUnit best practices.
    
    Do you still want me to remove the validate method?
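    
    The "report every failed field in one run" idea above can be sketched as follows. This is only an illustration, not the `validate` method from this pull request: the class `FieldMismatchCollector` and its `mismatches` method are hypothetical names, and the sketch assumes that expected and parsed values are compared with `Objects.equals`.
    
    ```java
    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Objects;

    // Hypothetical helper: collects every field mismatch instead of stopping at the first one.
    class FieldMismatchCollector {

      // Returns one readable line per expected field whose parsed value differs.
      static List<String> mismatches(Map<String, Object> expected, Map<String, Object> actual) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, Object> e : expected.entrySet()) {
          Object actualValue = actual.get(e.getKey());
          if (!Objects.equals(e.getValue(), actualValue)) {
            problems.add(
                e.getKey() + ": expected <" + e.getValue() + "> but was <" + actualValue + ">");
          }
        }
        return problems;
      }

      public static void main(String[] args) {
        Map<String, Object> expected = new LinkedHashMap<>();
        expected.put("ip_src_addr", "22.22.22.22");
        expected.put("ip_src_port", "55555");

        Map<String, Object> parsed = new LinkedHashMap<>();
        parsed.put("ip_src_addr", "22.22.22.22");
        parsed.put("ip_src_port", "22");

        // In a JUnit 4 test this could feed a single assertion, e.g.
        // assertTrue(problems.toString(), problems.isEmpty());
        List<String> problems = mismatches(expected, parsed);
        System.out.println(problems);
      }
    }
    ```
    
    With a helper of this shape the test keeps a single assertion, and the assertion message lists every field that differed rather than only the first.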


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by justinleet <gi...@git.apache.org>.
Github user justinleet commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r232347905
  
    --- Diff: metron-platform/metron-common/src/main/java/org/apache/metron/common/Constants.java ---
    @@ -127,5 +127,48 @@ public String getType() {
         }
       }
     
    +   public static enum ParserConfigConstants {
    +    //@formatter:off
    +    RECORD_TYPE("recordType"),
    +    RECORD_TYPE_REGEX("recordTypeRegex"),
    +    REGEX("regex"),
    +    FIELDS("fields"),
    +    MESSAGE_HEADER("messageHeaderRegex"),
    +    ORIGINAL("original_string"),
    +    TIMESTAMP("timestamp"),
    +    CONVERT_CAMELCASE_TO_UNDERSCORE("convertCamelCaseToUnderScore");
    +    //@formatter:on
    +    private final String name;
    +    private static Map<String, ParserConfigConstants> nameToField;
    +
    +    static {
    +      nameToField = new HashMap<>();
    +      for (final ParserConfigConstants f : ParserConfigConstants.values()) {
    +        nameToField.put(f.getName(), f);
    +      }
    +    }
    +
    +
    +    ParserConfigConstants(String name) {
    +      this.name = name;
    +    }
    +
    +    public String getName() {
    +      return name;
    +    }
    +
    +    static {
    --- End diff --
    
    This static block duplicates the one at line 144.
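    
    For reference, a name-lookup enum of this kind needs only one static initializer. The sketch below is a reduced, hypothetical version (the enum `ConfigKey` and the method `fromName` are illustrative names, not part of this pull request) that populates the lookup map once.
    
    ```java
    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical, reduced version of a config-key enum with a single static initializer.
    enum ConfigKey {
      RECORD_TYPE("recordType"),
      REGEX("regex"),
      FIELDS("fields");

      // Lookup table from the configuration name to the enum constant.
      private static final Map<String, ConfigKey> NAME_TO_KEY = new HashMap<>();

      // One static block is enough; it runs once, after all constants are constructed.
      static {
        for (ConfigKey key : values()) {
          NAME_TO_KEY.put(key.getName(), key);
        }
      }

      private final String name;

      ConfigKey(String name) {
        this.name = name;
      }

      public String getName() {
        return name;
      }

      // Returns the matching constant, or null when the name is unknown.
      public static ConfigKey fromName(String name) {
        return NAME_TO_KEY.get(name);
      }
    }
    ```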


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by nickwallen <gi...@git.apache.org>.
Github user nickwallen commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r237870087
  
    --- Diff: metron-platform/metron-parsers/src/test/resources/config/RegularExpressionsInvalidParserConfig.json ---
    @@ -0,0 +1,208 @@
    +{
    +  "convertCamelCaseToUnderScore": true,
    +  "messageHeaderRegex": "(?<syslogpriority>(?<=^<)\\d{1,4}(?=>)).*?(?<timestampDeviceOriginal>(?<=>)[A-Za-z]{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(?<deviceName>(?<=\\s).*?(?=\\s))",
    +  "recordTypeRegex": "(?<dstProcessName>(?<=\\s)\\b(tch-replicant|audispd|syslog|ntpd|sendmail|pure-ftpd|usermod|useradd|anacron|unix_chkpwd|sudo|dovecot|postfix\\/smtpd|postfix\\/smtp|postfix\\/qmgr|klnagent|systemd|(?i)crond(?-i)|clamd|kesl|sshd|run-parts|automount|suexec|freshclam|kernel|vsftpd|ftpd|su)\\b(?=\\[|:))",
    +  "fields": [
    +    {
    +      "recordType": "syslog",
    +      "regex": ".*(?<dstProcessId>(?<=PID\\s=\\s).*?(?=\\sLine)).*(?<filePath>(?<=64\\s)\/([A-Za-z0-9_-]+\/)+(?=\\w))(?<fileName>.*?(?=\")).*(?<eventInfo>(?<=\").*?(?=$))"
    +    },
    +    {
    +      "recordType": "pure-ftpd",
    +      "regex": ".*(?<srcUserId>(?<=\\:\\s\\().*?(?=\\)\\s)).*?(?<messageLevel>(?<=\\s\\[).*?(?=\\]\\s)).*?(?<eventInfo>(?<=\\]\\s).*?(?=$))"
    +    },
    +    {
    +      "recordType": "systemd",
    +      "regex": [
    +        ".*(?<eventInfo>(?<=\\ssystemd\\:\\s).*?(?=\\d+)).*?(?<sessionName>(?<=\\sSession\\s).*?(?=\\sof)).*?(?<srcUserId>(?<=\\suser\\s).*?(?=\\.)).*$",
    +        ".*(?<eventInfo>(?<=\\ssystemd\\:\\s).*?(?=\\sof)).*?(?<srcUserId>(?<=\\sof\\s).*?(?=\\.)).*$"
    +      ]
    +    },
    +    {
    +      "recordType": "kesl",
    +      "regex": ".*(?<eventInfo>(?<=\\:).*?(?=$))"
    +    },
    +    {
    +      "recordType": "dovecot",
    +      "regex": [
    +        ".*(?<subprocess>(?<=\\sdovecot:\\s).*?(?=\\:)).*?(?<eventInfo>(?<=\\:).*?(?=\\:\\suser)).*?(?<srcUserId>(?<=user\\=\\<).*?(?=\\>)).*?(?<rip>(?<=rip\\=).*?(?=,)).*?(?<lip>(?<=lip\\=).*?(?=,)).*?(?<connectionType>(?<=,\\s).*?(?=,)).*?(?<sessionName>(?<=session\\=\\<).*?(?=\\>)).*$",
    +        ".*(?<subprocess>(?<=\\sdovecot:\\s).*?(?=\\:)).*?(?<eventInfo>(?<=\\:).*?(?=\\:\\srip)).*?(?<rip>(?<=rip\\=).*?(?=,)).*?(?<lip>(?<=lip\\=).*?(?=,)).*?(?<connectionType>(?<=,\\s).*?(?=$))",
    +        ".*(?<subprocess>(?<=\\sdovecot:\\s).*?(?=\\:)).*?(?<eventInfo>(?<=\\:).*?(?=$))"
    +      ]
    +    },
    +    {
    +      "recordType": "postfix/smtpd",
    +      "regex": [
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=\\:).*?(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\]:)).*?(?<eventInfo>(?<=\\:\\s)disconnect(?=\\sfrom)).*?(?<srcHost>(?<=from).*(?=\\[)).*?(?<ipSrcAddr>(?<=\\[).*(?=\\])).*$"
    +      ]
    +    },
    +    {
    +      "recordType": "postfix/smtp",
    +      "regex": [
    +        ".*(?<dstProcessId>(?<=smtp\\[).*?(?=\\]:)).*(?<toEmail>(?<=to=#\\<).*?(?=>,)).*(?<relay>(?<=relay=).*?(?=,)).*(?<delay>(?<=delay=).*?(?=,)).*(?<delays>(?<=delays=).*?(?=,)).*(?<dsn>(?<=dsn=).*?(?=,)).*(?<status>(?<=status=).*?(?=\\()).*?(?<dstHost>(?<=connect to).*?(?=\\[)).*?(?<ipDstAddr>(?<=\\[).*?(?=\\])).*?(?<ipDstPort>(?<=\\]:).*?(?=:\\s)).*?(?<eventInfo>(?<=:\\s).*?(?=$))",
    +        ".*(?<dstProcessId>(?<=smtp\\[).*?(?=\\]:)).*?(?<dstHost>(?<=connect to).*?(?=\\[)).*?(?<ipDstAddr>(?<=\\[).*?(?=\\])).*(?<ipDstPort>(?<=:).*?(?=\\s)).*(?<eventInfo>(?<=\\s).*?(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=\\:).*?(?=$))"
    +      ]
    +    },
    +    {
    +      "recordType": "crond",
    +      "regex": [
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<srcUserId>(?<=\\]:\\s\\().*?(?=\\)\\s)).*?(?<commandLine>(?<=CMD\\s\\().*?(?=\\))).*$",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<srcUserId>(?<=\\]:\\s\\().*?(?=\\)\\s)).*?(?<eventInfo>(?<=\\().*?(?=\\))).*$",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<srcUserId>(?<=\\]:\\s\\().*?(?=\\)\\s)).*?(?<commandLine>(?<=CMD\\s\\().*?(?=\\))).*$",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=\\:).*?(?=$))"
    +      ]
    +    },
    +    {
    +      "recordType": "clamd",
    +      "regex": [
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<subProcess>(?<=\\:\\s).*?(?=\\:)).*?(?<eventInfo>(?<=\\:).*?(?=$))",
    +        ".*(?<subProcess>(?<=\\:\\s).*?(?=\\:)).*?(?<eventInfo>(?<=\\:).*?(?=$))"
    +      ]
    +    },
    +    {
    +      "recordType": "run-parts",
    +      "regex": ".*(?<eventInfo>(?<=\\sparts).*?(?=$))"
    +    },
    +    {
    +      "recordType": "sshd",
    +      "regex": [
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<event_Info>(?<=\\]:\\s).*?(?=\\sfor)).*?(?<dstUserId>(?<=\\sfor\\s).*?(?=\\sfrom)).*?(?<ipSrcAddr>(?<=\\sfrom\\s).*?(?=\\sport)).*?(?<ipSrcPort>(?<=\\sport\\s).*?(?=\\s)).*?(?<appProtocol>(?<=port\\s\\d{1,5}\\s).*(?=:\\s)).*?(?<encryptionAlgorithm>(?<=:\\s).+?(?=\\s)).*(?<correlationId>(?<=\\s).+?(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=\\]:\\s).*?(?=\\sfor)).*?(?<dstUserId>(?<=\\sfor\\s).*?(?=\\sfrom)).*?(?<ipSrcAddr>(?<=\\sfrom\\s).*?(?=\\sport)).*?(?<ipSrcPort>(?<=\\sport\\s).*?(?=\\s)).*?(?<appProtocol>(?<=port\\s\\d{1,5}\\s).*?(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<ipDstAddr>(?<=Remote:).*?(?=\\-)).*?(?<ipDstPort>(?<=\\-).*?(?=;)).*?(?<appProtocol>(?<=Protocol:).*?(?=;)).*?(?<sshClient>(?<=Client:).*?(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<appProtocol>(?<=\\]:).*?(?=:)).*?(?<ipDstAddr>(?<=Remote:).*?(?=\\-)).*?(?<ipDstPort>(?<=\\-).*?(?=;)).*?(?<encryptionAlgorithm>(?<=Enc:\\s).*?(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<ipDstAddr>(?<=Remote:).*?(?=\\-)).*?(?<ipDstPort>(?<=\\-).*?(?=;)).*?(?<encryptionAlgorithm>(?<=Enc:\\s).*?(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=:).*?(?=for)).*?(?<dstUserId>(?<=for).*?(?=from)).*?(?<ipSrcAddr>(?<=from).*?(?=port)).*?(?<ipSrcPort>(?<=port).*?(?=\\s)).*?(?<appProtocol>(?<=\\s).*?(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\]))]:\\s.*?(?<eventInfo>subsystem.*?(?=by\\suser)).*?(?<srcUserId>(?<=user).*?(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<action>(?<=Received).*?(?=from)).*?(?<ipSrcAddr>(?<=from).*?(?=:)).*?(?<eventInfo>(?<=11:).*?(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=\\]:\\s)Server\\slistening(?=\\s)).*?(?<ipSrcAddr>(?<=\\son\\s).*?(?=port)).*?(?<ipSrcPort>(?<=port\\s)\\d{1,6}(?=\\.)).*$",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=\\]:\\s)Invalid user(?=\\s)).*?(?<dstUserId>(?<=\\s).*?(?=from)).*?(?<ipSrcAddr>(?<=from\\s).*(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=]:\\s)).*(?<subProcess>(?<=]:\\s).*\\)(?=:)).*(?<eventInfo>(?<=:\\s).*(?=;)).*(?<logname>(?<=logname=).*?(?=\\s)).*(?<dstUserId>(?<=uid=).*?(?=\\s)).*(?<effectiveUserId>(?<=euid=).*?(?=\\s)).*(?<sessionName>(?<=tty=).*?(?=\\s)).*(?<srcUserId>(?<=ruser=).*?(?=\\s)).*(?<ipSrcAddr>(?<=rhost=).*?(?=\\s)).*(?<userId>(?<=user=).*?(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=]:\\s)).*(?<eventInfo>(?<=:\\s).*(?=;)).*(?<logname>(?<=logname=).*?(?=\\s)).*(?<dstUserId>(?<=uid=).*?(?=\\s)).*(?<effectiveUserId>(?<=euid=).*?(?=\\s)).*(?<sessionName>(?<=tty=).*?(?=\\s)).*(?<srcUserId>(?<=ruser=).*?(?=\\s)).*(?<ipSrcAddr>(?<=rhost=).*?(?=\\s)).*(?<userId>(?<=user=).*?(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=\\]:\\s).*?(?=for)).*?(?<dstUserId>(?<=\\sfor).*?(?=\\[)).*?(?<subProcess>(?<=\\[).*?(?=\\])).*$",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=:\\s)Excess permission or bad ownership on file(?=\\s\\/)).*?(?<filePath>(?<=\\s).*(?=\\/)).*?(?<fileName>(?<=\\/).*(?=$))",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=:).*?(?=;)).*$",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=:).*?(?=\\d)).*$",
    +        ".*(?<dstProcessId>(?<=\\[).*?(?=\\])).*?(?<eventInfo>(?<=:).*?(?=$))"
    --- End diff --
    
    Sorry to repeat, but this really helped me understand things better. Can you add this to the README?


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by nickwallen <gi...@git.apache.org>.
Github user nickwallen commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r237649469
  
    --- Diff: metron-platform/metron-parsers/README.md ---
    @@ -52,6 +52,62 @@ There are two general types types of parsers:
            This is using the default value for `wrapEntityName` if that property is not set.
         * `wrapEntityName` : Sets the name to use when wrapping JSON using `wrapInEntityArray`.  The `jsonpQuery` should reference this name.
         * A field called `timestamp` is expected to exist and, if it does not, then current time is inserted.  
    +  * Regular Expressions Parser
    +      * `recordTypeRegex` : A regular expression to uniquely identify a record type.
    +      * `messageHeaderRegex` : A regular expression used to extract fields from a message part which is common across all the messages.
    +      * `convertCamelCaseToUnderScore` : If this property is set to true, this parser will automatically convert all the camel case property names to underscore seperated. 
    +          For example, following convertions will automatically happen:
    +
    +          ```
    +          ipSrcAddr -> ip_src_addr
    +          ipDstAddr -> ip_dst_addr
    +          ipSrcPort -> ip_src_port
    +          ```
    +          Note this property may be necessary, because java does not support underscores in the named group names. So in case your property naming conventions requires underscores in property names, use this property.
    +          
    +      * `fields` : A json list of maps contaning a record type to regular expression mapping.
    +      
    +      A complete configuration example would look like:
    +      
    +      ```json
    +      "convertCamelCaseToUnderScore": true, 
    +      "recordTypeRegex": "kernel|syslog",
    +      "messageHeaderRegex": "(<syslogPriority>(<=^&lt;)\\d{1,4}(?=>)).*?(<timestamp>(<=>)[A-Za-z] {3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(<syslogHost>(<=\\s).*?(?=\\s))",
    --- End diff --
    
    What is the expected outcome with this `messageHeaderRegex` example?  
    * I should expect this to be run on all record types (both kernel and syslog), right?
    * I should expect each output message to contain 3 fields: syslogPriority, timestamp, and syslogHost?
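    
    One way to check the expected outcome is to run a header pattern against a sample message directly. The sketch below is an assumption-laden illustration: it uses the header pattern from `RegularExpressionsInvalidParserConfig.json` and the sample sshd message from the unit test in this pull request, because the README example as quoted lacks the `?` in its group syntax. The class `HeaderRegexDemo` and the method `extractHeader` are hypothetical names; the group names therefore differ from those in the README example.
    
    ```java
    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Hypothetical demo: applies a message-header regex and collects its named groups.
    class HeaderRegexDemo {

      // Header pattern taken from the test configuration in this pull request.
      static final Pattern HEADER = Pattern.compile(
          "(?<syslogpriority>(?<=^<)\\d{1,4}(?=>)).*?"
              + "(?<timestampDeviceOriginal>(?<=>)[A-Za-z]{3}\\s{1,2}\\d{1,2}"
              + "\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?"
              + "(?<deviceName>(?<=\\s).*?(?=\\s))");

      static final String[] GROUPS = {"syslogpriority", "timestampDeviceOriginal", "deviceName"};

      // Returns the header fields found in the message, or an empty map when nothing matches.
      static Map<String, String> extractHeader(String message) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = HEADER.matcher(message);
        if (m.find()) {
          for (String group : GROUPS) {
            fields.put(group, m.group(group));
          }
        }
        return fields;
      }

      public static void main(String[] args) {
        String message = "<38>Jun 20 15:01:17 deviceName sshd[11672]: "
            + "Accepted publickey for prod from 22.22.22.22 port 55555 ssh2";
        System.out.println(extractHeader(message));
      }
    }
    ```
    
    Because the header pattern does not mention any record type, a sketch like this yields the same three header fields for any message that shares the syslog-style prefix, which is consistent with the header regex being applied to all record types.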


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by jagdeepsingh2 <gi...@git.apache.org>.
Github user jagdeepsingh2 commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r234870775
  
    --- Diff: metron-platform/metron-common/src/main/java/org/apache/metron/common/Constants.java ---
    @@ -127,5 +127,48 @@ public String getType() {
         }
       }
     
    +   public static enum ParserConfigConstants {
    --- End diff --
    
    Removed static from enum declaration.


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by justinleet <gi...@git.apache.org>.
Github user justinleet commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r232354644
  
    --- Diff: metron-platform/metron-parsers/src/main/java/org/apache/metron/parsers/regex/RegularExpressionsParser.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more contributor license
    + * agreements. See the NOTICE file distributed with this work for additional information regarding
    + * copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance with the License. You may obtain a
    + * copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software distributed under the License
    + * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
    + * or implied. See the License for the specific language governing permissions and limitations under
    + * the License.
    + */
    +
    +package org.apache.metron.parsers.regex;
    +
    +import java.nio.charset.Charset;
    +import java.text.ParseException;
    +import java.util.ArrayList;
    +import java.util.Arrays;
    +import java.util.HashMap;
    +import java.util.HashSet;
    +import java.util.LinkedHashMap;
    +import java.util.List;
    +import java.util.Map;
    +import java.util.Optional;
    +import java.util.Set;
    +import java.util.TreeSet;
    +import java.util.regex.Matcher;
    +import java.util.regex.Pattern;
    +import java.util.stream.Collectors;
    +import org.apache.commons.lang3.StringUtils;
    +import org.apache.metron.common.Constants;
    +import org.apache.metron.parsers.BasicParser;
    +import org.apache.metron.common.Constants.ParserConfigConstants;
    +import org.json.simple.JSONObject;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +//@formatter:off
    +/**
    + * General purpose class to parse unstructured text message into a json object. This class parses
    + * the message as per supplied parser config as part of sensor config. Sensor parser config example:
    + *
    + * <pre>
    + * <code>
    + * "convertCamelCaseToUnderScore": true,
    + * "recordTypeRegex": "(?&lt;process&gt;(?&lt;=\\s)\\b(kernel|syslog)\\b(?=\\[|:))",
    + * "messageHeaderRegex": "(?&lt;syslogpriority&gt;(?&lt;=^&lt;)\\d{1,4}(?=&gt;)).*?(?&lt;timestamp>(?&lt;=&gt;)[A-Za-z]{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(?&lt;syslogHost&gt;(?&lt;=\\s).*?(?=\\s))",
    + * "fields": [
    + * {
    + * "recordType": "kernel",
    + * "regex": ".*(?&lt;eventInfo&gt;(?&lt;=\\]|\\w\\:).*?(?=$))"
    + * },
    + * {
    + * "recordType": "syslog",
    + * "regex": ".*(?&lt;processid&gt;(?&lt;=PID\\s=\\s).*?(?=\\sLine)).*(?&lt;filePath&gt;(?&lt;=64\\s)\/([A-Za-z0-9_-]+\/)+(?=\\w))(?&lt;fileName&gt;.*?(?=\")).*(?&lt;eventInfo&gt;(?&lt;=\").*?(?=$))"
    + * }
    + * ]
    + * </code>
    + * </pre>
    + *
    + * Note: messageHeaderRegex could be specified as lists also e.g.
    + *
    + * <pre>
    + * <code>
    + * "messageHeaderRegex": [
    + * "regular expression 1",
    + * "regular expression 2"
    + * ]
    + * </code>
    + * </pre>
    + *
    + * Where <strong>regular expression 1</strong> are valid regular expressions and may have named
    + * groups, which would be extracted into fields. This list will be evaluated in order until a
    + * matching regular expression is found.<br>
    + * <br>
    + *
    + * <strong>Configuration fields explanation</strong>
    + *
    + * <pre>
    + * recordTypeRegex : used to specify a regular expression to distinctly identify a record type.
    + * messageHeaderRegex :  used to specify a regular expression to extract fields from a message part which is common across all the messages.
    + * e.g. rhel logs looks like
    + * <code>
    + * <7>Jun 26 16:18:01 hostName kernel: SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
    + * </code>
    + * <br>
    + * </pre>
    + *
    + * Here message structure (<7>Jun 26 16:18:01 hostName kernel) is common across all messages.
    + * Hence messageHeaderRegex could be used to extract fields from this part.
    + *
    + * fields : json list of objects containing recordType and regex. regex could be a further list e.g.
    + *
    + * <pre>
    + * <code>
    + * "regex":  [ "record type specific regular expression 1",
    + *             "record type specific regular expression 2"]
    + *
    + * </code>
    + * </pre>
    + *
    + * <strong>Limitation</strong> <br>
    + * Currently the named groups in java regular expressions have a limitation. Only following
    + * characters could be used to name a named group. A capturing group can also be assigned a "name",
    + * a named-capturing group, and then be back-referenced later by the "name". Group names are
    + * composed of the following characters. The first character must be a letter.
    + *
    + * <pre>
    + * <code>
    + * The uppercase letters 'A' through 'Z' ('\u0041' through '\u005a'),
    + * The lowercase letters 'a' through 'z' ('\u0061' through '\u007a'),
    + * The digits '0' through '9' ('\u0030' through '\u0039'),
    + * </code>
    + * </pre>
    + *
    + * This means that an _ (underscore), cannot be used as part of a named group name. E.g. this is an
    + * invalid regular expression <code>.*(?&lt;event_info&gt;(?&lt;=\\]|\\w\\:).*?(?=$))</code>
    + *
    + * However, this limitation can be easily overcome by adding a parser configuration setting.
    + *
    + * <code>
    + *  "convertCamelCaseToUnderScore": true,
    + * <code>
    + * If above property is added to the sensor parser configuration, in parserConfig object, this parser will automatically convert all the camel case property names to underscore seperated.
    + * For example, following convertions will automatically happen:
    + *
    + * <code>
    + * ipSrcAddr -> ip_src_addr
    + * ipDstAddr -> ip_dst_addr
    + * ipSrcPort -> ip_src_port
    + * <code>
    + * etc.
    + */
    +//@formatter:on
    +public class RegularExpressionsParser extends BasicParser {
    +
    +  private static Logger LOG = LoggerFactory.getLogger(RegularExpressionsParser.class);
    +
    +  private static final Charset UTF_8 = Charset.forName("UTF-8");
    +
    +  private List<Map<String, Object>> fields;
    +  private Map<String, Object> parserConfig;
    +  private final Pattern namedGroupPattern = Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>");
    +  Pattern capitalLettersPattern = Pattern.compile("^.*[A-Z]+.*$");
    +  private Pattern recordTypePattern;
    +  private final Set<String> recordTypePatternNamedGroups = new HashSet<>();
    +  private final Map<String, Map<Pattern, Set<String>>> recordTypePatternMap = new HashMap<>();
    +  private final Map<Pattern, Set<String>> syslogPatternsMap = new HashMap<>();
    +
    +  /**
    +   * Parses an unstructured text message into a json object based upon the regular expression
    +   * configuration supplied.
    +   *
    +   * @param rawMessage incoming unstructured raw text.
    +   *
    +   * @return List of json parsed json objects. In this case list will have a single element only.
    +   */
    +  @Override
    +  public List<JSONObject> parse(byte[] rawMessage) {
    +    String originalMessage = null;
    +    try {
    +      originalMessage = new String(rawMessage, UTF_8).trim();
    +      LOG.debug(" raw message. {}", originalMessage);
    +      if (originalMessage.isEmpty()) {
    +        LOG.warn("Message is empty.");
    +        return Arrays.asList(new JSONObject());
    +      }
    +    } catch (final Exception e) {
    +      LOG.error("[Metron] Could not read raw message. {} " + originalMessage, e);
    +      throw new RuntimeException(e.getMessage(), e);
    +    }
    +
    +    try {
    +      final JSONObject parsedJson = new JSONObject();
    +      if (syslogPatternsMap.size() > 0) {
    +        parsedJson.putAll(extractHeaderFields(originalMessage));
    +      }
    +      parsedJson.putAll(parse(originalMessage));
    +      parsedJson.put(Constants.Fields.ORIGINAL.getName(), originalMessage);
    +      applyFieldTransformations(parsedJson);
    +      return Arrays.asList(parsedJson);
    +    } catch (final ParseException e) {
    +      LOG.error("Error occured in parsing. original message : " + originalMessage, e);
    +      throw new RuntimeException(e.getMessage(), e);
    +    }
    +  }
    +
    +  private void applyFieldTransformations(JSONObject parsedJson) {
    +    if (getParserConfig()
    +        .get(ParserConfigConstants.CONVERT_CAMELCASE_TO_UNDERSCORE.getName()) != null
    +        && (Boolean) getParserConfig()
    +            .get(ParserConfigConstants.CONVERT_CAMELCASE_TO_UNDERSCORE.getName())) {
    +      convertCamelCaseToUnderScore(parsedJson);
    +    }
    +
    +  }
    +
    +  // @formatter:off
    +  /**
    +   * This method is called during the parser initialization. It parses the parser
    +   * configuration and configures the parser accordingly. It then initializes
    +   * instance variables.
    +   *
    +   * @param parserConfig ParserConfig(Map<String, Object>) supplied to the sensor.
    +   * @see org.apache.metron.parsers.interfaces.Configurable#configure(java.util.Map)<br>
    +   *      <br>
    +   */
    +  // @formatter:on
    +  @Override
    +  public void configure(Map<String, Object> parserConfig) {
    +    setParserConfig(parserConfig);
    +    setFields(
    +        (List<Map<String, Object>>) getParserConfig().get(ParserConfigConstants.FIELDS.getName()));
    +
    +    setRecordTypePattern(
    +        (String) getParserConfig().get(ParserConfigConstants.RECORD_TYPE_REGEX.getName()));
    +    recordTypePatternNamedGroups.addAll(getNamedGroups(
    +        (String) getParserConfig().get(ParserConfigConstants.RECORD_TYPE_REGEX.getName())));
    +    final List<Map<String, Object>> fields =
    +        (List<Map<String, Object>>) getParserConfig().get(ParserConfigConstants.FIELDS.getName());
    +
    +    configureRecordTypePatterns(fields);
    +
    +    configureMessageHeaderPattern();
    +
    +    validateConfig();
    +  }
    +
    +  private void configureMessageHeaderPattern() {
    +    if (getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName()) != null) {
    +      if (getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName()) instanceof List) {
    +        final List<String> syslogPatternList =
    +            (List<String>) getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName());
    +        for (final String syslogPatternStr : syslogPatternList) {
    +          syslogPatternsMap.put(Pattern.compile(syslogPatternStr),
    +              getNamedGroups(syslogPatternStr));
    +        }
    +      } else if (getParserConfig()
    +          .get(ParserConfigConstants.MESSAGE_HEADER.getName()) instanceof String) {
    +        final String syslogPatternStr =
    +            (String) getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName());
    +        if (StringUtils.isNotBlank(syslogPatternStr)) {
    +          syslogPatternsMap.put(Pattern.compile(syslogPatternStr),
    +              getNamedGroups(syslogPatternStr));
    +        }
    +      }
    +    }
    +  }
    +
    +  private void configureRecordTypePatterns(List<Map<String, Object>> fields) {
    +
    +    for (final Map<String, Object> field : fields) {
    +      if (field.get(ParserConfigConstants.RECORD_TYPE.getName()) != null
    +          && field.get(ParserConfigConstants.REGEX.getName()) != null) {
    +        final String recordType =
    +            ((String) field.get(ParserConfigConstants.RECORD_TYPE.getName())).toLowerCase();
    +        recordTypePatternMap.put(recordType, new LinkedHashMap<Pattern, Set<String>>());
    +        if (field.get(ParserConfigConstants.REGEX.getName()) instanceof List) {
    +          final List<String> regexList =
    +              (List<String>) field.get(ParserConfigConstants.REGEX.getName());
    +          regexList.forEach(s -> {
    +            recordTypePatternMap.get(recordType).put(Pattern.compile(s), getNamedGroups(s));
    +          });
    +        } else if (field.get(ParserConfigConstants.REGEX.getName()) instanceof String) {
    +          recordTypePatternMap.get(recordType).put(
    +              Pattern.compile((String) field.get(ParserConfigConstants.REGEX.getName())),
    +              getNamedGroups((String) field.get(ParserConfigConstants.REGEX.getName())));
    +        }
    +      }
    +    }
    +  }
    +
    +  private void setRecordTypePattern(String recordTypeRegex) {
    +    if (recordTypeRegex != null) {
    +      recordTypePattern = Pattern.compile(recordTypeRegex);
    +    }
    +  }
    +
    +  private JSONObject parse(String originalMessage) throws ParseException {
    +    final JSONObject parsedJson = new JSONObject();
    +    final Optional<String> recordIdentifier = getField(recordTypePattern, originalMessage);
    +    if (recordIdentifier.isPresent()) {
    +      extractNamedGroups(parsedJson, recordIdentifier.get(), originalMessage);
    +    }
    +    /*
    +     * Extract fields(named groups) from record type regular expression
    +     */
    +    final Matcher matcher = recordTypePattern.matcher(originalMessage);
    +    if (matcher.find()) {
    +      for (final String namedGroup : recordTypePatternNamedGroups) {
    +        if (matcher.group(namedGroup) != null) {
    +          parsedJson.put(namedGroup, matcher.group(namedGroup).trim());
    +        }
    +      }
    +    }
    +    return parsedJson;
    +  }
    +
    +  private void extractNamedGroups(Map<String, Object> json, String recordType,
    +      String originalMessage) {
    +    final Map<Pattern, Set<String>> patternMap = recordTypePatternMap.get(recordType.toLowerCase());
    +    if (patternMap != null) {
    +      for (final Map.Entry<Pattern, Set<String>> entry : patternMap.entrySet()) {
    +        final Pattern pattern = entry.getKey();
    +        final Set<String> namedGroups = entry.getValue();
    +        if (pattern != null && namedGroups != null && namedGroups.size() > 0) {
    +          final Matcher m = pattern.matcher(originalMessage);
    +          if (m.matches()) {
    +            LOG.debug("RecordType : {} Trying regex : {} for message : {} ", recordType,
    +                pattern.toString(), originalMessage);
    +            for (final String namedGroup : namedGroups) {
    +              if (m.group(namedGroup) != null) {
    +                json.put(namedGroup, m.group(namedGroup).trim());
    +              }
    +            }
    +            break;
    +          }
    +        }
    +      }
    +    } else {
    +      LOG.warn("No pattern found for record type : {}", recordType);
    +    }
    +  }
    +
    +  public Optional<String> getField(Pattern pattern, String originalMessage) {
    +    final Matcher matcher = pattern.matcher(originalMessage);
    +    while (matcher.find()) {
    +      return Optional.of(matcher.group());
    +    }
    +    return Optional.empty();
    +  }
    +
    +  private Set<String> getNamedGroups(String regex) {
    +    final Set<String> namedGroups = new TreeSet<>();
    +    final Matcher matcher = namedGroupPattern.matcher(regex);
    +    while (matcher.find()) {
    +      namedGroups.add(matcher.group(1));
    +    }
    +    return namedGroups;
    +  }
    +
    +  private Map<String, Object> extractHeaderFields(String originalMessage) {
    +    final Map<String, Object> syslogJson = new JSONObject();
    +    for (final Map.Entry<Pattern, Set<String>> syslogPatternEntry : syslogPatternsMap.entrySet()) {
    +      final Matcher m = syslogPatternEntry.getKey().matcher(originalMessage);
    +      if (m.find()) {
    +        for (final String namedGroup : syslogPatternEntry.getValue()) {
    +          if (StringUtils.isNotBlank(m.group(namedGroup))) {
    +            syslogJson.put(namedGroup, m.group(namedGroup).trim());
    +          }
    +        }
    +        break;
    +      }
    +    }
    +    return syslogJson;
    +  }
    +
    +  @Override
    +  public void init() {
    +    LOG.info("RegularExpressions parser initialised.");
    +  }
    +
    +  public void validateConfig() {
    +    if (getFields() == null) {
    +      LOG.error("Invalid config :  fields is missing in parserConfig");
    +      throw new IllegalStateException("Invalid config :fields is missing in parserConfig");
    +    }
    +    if (recordTypePattern == null) {
    +      LOG.error("Invalid config :recordTypeRegex is missing in parserConfig");
    +      throw new IllegalStateException("Invalid config :recordTypeRegex is missing in parserConfig");
    +    }
    +  }
    +
    +  private void convertCamelCaseToUnderScore(Map<String, Object> json) {
    +    final Map<String, String> oldKeyNewKeyMap = new HashMap<>();
    +    for (final Map.Entry<String, Object> entry : json.entrySet()) {
    +      if (capitalLettersPattern.matcher(entry.getKey()).matches()) {
    +        oldKeyNewKeyMap.put(entry.getKey(), convert(entry.getKey()));
    +      }
    +    }
    +    oldKeyNewKeyMap.forEach((oldKey, newKey) -> json.put(newKey, json.remove(oldKey)));
    +  }
    +
    +  public String convert(String oldKey) {
    +    final List<Character> chars = new ArrayList<>();
    --- End diff --
    
    Is there a reason this constructs `List<Character>` instead of just using a `StringBuffer`?
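    
    As a rough illustration of the suggestion, here is a minimal sketch of the conversion built on a string builder. It uses `StringBuilder` (the unsynchronized sibling of `StringBuffer`); the class and method names are hypothetical and not part of the PR, and it assumes keys are lower camel case such as `ipSrcAddr`.
    
    ```java
    final class CamelCaseSketch {

      /** Converts a lower camel case key such as "ipSrcAddr" to "ip_src_addr". */
      static String toUnderscore(String key) {
        final StringBuilder sb = new StringBuilder(key.length() + 4);
        for (int i = 0; i < key.length(); i++) {
          final char c = key.charAt(i);
          if (Character.isUpperCase(c)) {
            // An upper case letter starts a new word, except at the very beginning of the key.
            if (i > 0) {
              sb.append('_');
            }
            sb.append(Character.toLowerCase(c));
          } else {
            sb.append(c);
          }
        }
        return sb.toString();
      }

      public static void main(String[] args) {
        System.out.println(toUnderscore("ipSrcAddr"));
      }
    }
    ```
    
    Appending characters directly avoids boxing each `char` into a `Character` and the extra pass needed to join the list back into a string.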


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by jagdeepsingh2 <gi...@git.apache.org>.
Github user jagdeepsingh2 commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r240063655
  
    --- Diff: metron-platform/metron-parsers/src/test/java/org/apache/metron/parsers/regex/RegularExpressionsParserTest.java ---
    @@ -0,0 +1,152 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more contributor license
    + * agreements. See the NOTICE file distributed with this work for additional information regarding
    + * copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance with the License. You may obtain a
    + * copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software distributed under the License
    + * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
    + * or implied. See the License for the specific language governing permissions and limitations under
    + * the License.
    + */
    +package org.apache.metron.parsers.regex;
    +
    +import org.json.simple.JSONObject;
    +import org.json.simple.parser.JSONParser;
    +import org.junit.Before;
    +import org.junit.Test;
    +
    +import java.nio.file.Files;
    +import java.nio.file.Paths;
    +import java.util.HashMap;
    +import java.util.List;
    +import java.util.Map;
    +
    +import static org.junit.Assert.assertTrue;
    +
    +public class RegularExpressionsParserTest {
    +
    +  private RegularExpressionsParser regularExpressionsParser;
    +  private JSONObject parserConfig;
    +
    +  @Before
    +  public void setUp() throws Exception {
    +    regularExpressionsParser = new RegularExpressionsParser();
    +  }
    +
    +  @Test
    +  public void testSSHDParse() throws Exception {
    +    String message =
    +        "<38>Jun 20 15:01:17 deviceName sshd[11672]: Accepted publickey for prod from 22.22.22.22 port 55555 ssh2";
    +
    +    parserConfig = getJsonConfig(
    +        Paths.get("src/test/resources/config/RegularExpressionsParserConfig.json").toString());
    --- End diff --
    
    I have now added the timestamp field to the parser, along with more targeted configuration using @Multiline.
    
    I will try to add integration tests as well.


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r239847486
  
    --- Diff: metron-platform/metron-parsers/src/test/java/org/apache/metron/parsers/regex/RegularExpressionsParserTest.java ---
    @@ -0,0 +1,152 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more contributor license
    + * agreements. See the NOTICE file distributed with this work for additional information regarding
    + * copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance with the License. You may obtain a
    + * copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software distributed under the License
    + * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
    + * or implied. See the License for the specific language governing permissions and limitations under
    + * the License.
    + */
    +package org.apache.metron.parsers.regex;
    +
    +import org.json.simple.JSONObject;
    +import org.json.simple.parser.JSONParser;
    +import org.junit.Before;
    +import org.junit.Test;
    +
    +import java.nio.file.Files;
    +import java.nio.file.Paths;
    +import java.util.HashMap;
    +import java.util.List;
    +import java.util.Map;
    +
    +import static org.junit.Assert.assertTrue;
    +
    +public class RegularExpressionsParserTest {
    +
    +  private RegularExpressionsParser regularExpressionsParser;
    +  private JSONObject parserConfig;
    +
    +  @Before
    +  public void setUp() throws Exception {
    +    regularExpressionsParser = new RegularExpressionsParser();
    +  }
    +
    +  @Test
    +  public void testSSHDParse() throws Exception {
    +    String message =
    +        "<38>Jun 20 15:01:17 deviceName sshd[11672]: Accepted publickey for prod from 22.22.22.22 port 55555 ssh2";
    +
    +    parserConfig = getJsonConfig(
    +        Paths.get("src/test/resources/config/RegularExpressionsParserConfig.json").toString());
    --- End diff --
    
    When writing a new parser, it is important that you also implement the integration tests. An example of a parser submission that does this: https://github.com/apache/metron/pull/1279


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by jagdeepsingh2 <gi...@git.apache.org>.
Github user jagdeepsingh2 commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r234871662
  
    --- Diff: metron-platform/metron-parsers/src/main/java/org/apache/metron/parsers/regex/RegularExpressionsParser.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more contributor license
    + * agreements. See the NOTICE file distributed with this work for additional information regarding
    + * copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance with the License. You may obtain a
    + * copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software distributed under the License
    + * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
    + * or implied. See the License for the specific language governing permissions and limitations under
    + * the License.
    + */
    +
    +package org.apache.metron.parsers.regex;
    +
    +import java.nio.charset.Charset;
    +import java.text.ParseException;
    +import java.util.ArrayList;
    +import java.util.Arrays;
    +import java.util.HashMap;
    +import java.util.HashSet;
    +import java.util.LinkedHashMap;
    +import java.util.List;
    +import java.util.Map;
    +import java.util.Optional;
    +import java.util.Set;
    +import java.util.TreeSet;
    +import java.util.regex.Matcher;
    +import java.util.regex.Pattern;
    +import java.util.stream.Collectors;
    +import org.apache.commons.lang3.StringUtils;
    +import org.apache.metron.common.Constants;
    +import org.apache.metron.parsers.BasicParser;
    +import org.apache.metron.common.Constants.ParserConfigConstants;
    +import org.json.simple.JSONObject;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +//@formatter:off
    +/**
    + * General purpose class to parse unstructured text message into a json object. This class parses
    + * the message as per supplied parser config as part of sensor config. Sensor parser config example:
    + *
    + * <pre>
    + * <code>
    + * "convertCamelCaseToUnderScore": true,
    + * "recordTypeRegex": "(?&lt;process&gt;(?&lt;=\\s)\\b(kernel|syslog)\\b(?=\\[|:))",
    + * "messageHeaderRegex": "(?&lt;syslogpriority&gt;(?&lt;=^&lt;)\\d{1,4}(?=&gt;)).*?(?&lt;timestamp>(?&lt;=&gt;)[A-Za-z]{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(?&lt;syslogHost&gt;(?&lt;=\\s).*?(?=\\s))",
    + * "fields": [
    + * {
    + * "recordType": "kernel",
    + * "regex": ".*(?&lt;eventInfo&gt;(?&lt;=\\]|\\w\\:).*?(?=$))"
    + * },
    + * {
    + * "recordType": "syslog",
    + * "regex": ".*(?&lt;processid&gt;(?&lt;=PID\\s=\\s).*?(?=\\sLine)).*(?&lt;filePath&gt;(?&lt;=64\\s)\/([A-Za-z0-9_-]+\/)+(?=\\w))(?&lt;fileName&gt;.*?(?=\")).*(?&lt;eventInfo&gt;(?&lt;=\").*?(?=$))"
    + * }
    + * ]
    + * </code>
    + * </pre>
    + *
    + * Note: messageHeaderRegex could be specified as lists also e.g.
    + *
    + * <pre>
    + * <code>
    + * "messageHeaderRegex": [
    + * "regular expression 1",
    + * "regular expression 2"
    + * ]
    + * </code>
    + * </pre>
    + *
    + * Where <strong>regular expression 1</strong> are valid regular expressions and may have named
    + * groups, which would be extracted into fields. This list will be evaluated in order until a
    + * matching regular expression is found.<br>
    + * <br>
    + *
    + * <strong>Configuration fields explanation</strong>
    + *
    + * <pre>
    + * recordTypeRegex : used to specify a regular expression to distinctly identify a record type.
    + * messageHeaderRegex :  used to specify a regular expression to extract fields from a message part which is common across all the messages.
    + * e.g. rhel logs looks like
    + * <code>
    + * <7>Jun 26 16:18:01 hostName kernel: SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
    + * </code>
    + * <br>
    + * </pre>
    + *
    + * Here message structure (<7>Jun 26 16:18:01 hostName kernel) is common across all messages.
    + * Hence messageHeaderRegex could be used to extract fields from this part.
    + *
    + * fields : json list of objects containing recordType and regex. regex could be a further list e.g.
    + *
    + * <pre>
    + * <code>
    + * "regex":  [ "record type specific regular expression 1",
    + *             "record type specific regular expression 2"]
    + *
    + * </code>
    + * </pre>
    + *
    + * <strong>Limitation</strong> <br>
    + * Currently the named groups in java regular expressions have a limitation. Only following
    + * characters could be used to name a named group. A capturing group can also be assigned a "name",
    + * a named-capturing group, and then be back-referenced later by the "name". Group names are
    + * composed of the following characters. The first character must be a letter.
    + *
    + * <pre>
    + * <code>
    + * The uppercase letters 'A' through 'Z' ('\u0041' through '\u005a'),
    + * The lowercase letters 'a' through 'z' ('\u0061' through '\u007a'),
    + * The digits '0' through '9' ('\u0030' through '\u0039'),
    + * </code>
    + * </pre>
    + *
    + * This means that an _ (underscore), cannot be used as part of a named group name. E.g. this is an
    + * invalid regular expression <code>.*(?&lt;event_info&gt;(?&lt;=\\]|\\w\\:).*?(?=$))</code>
    + *
    + * However, this limitation can be easily overcome by adding a parser configuration setting.
    + *
    + * <code>
    + *  "convertCamelCaseToUnderScore": true,
    + * <code>
    + * If above property is added to the sensor parser configuration, in parserConfig object, this parser will automatically convert all the camel case property names to underscore seperated.
    + * For example, following convertions will automatically happen:
    + *
    + * <code>
    + * ipSrcAddr -> ip_src_addr
    + * ipDstAddr -> ip_dst_addr
    + * ipSrcPort -> ip_src_port
    + * <code>
    + * etc.
    + */
    +//@formatter:on
    +public class RegularExpressionsParser extends BasicParser {
    +
    +  private static Logger LOG = LoggerFactory.getLogger(RegularExpressionsParser.class);
    +
    +  private static final Charset UTF_8 = Charset.forName("UTF-8");
    +
    +  private List<Map<String, Object>> fields;
    +  private Map<String, Object> parserConfig;
    +  private final Pattern namedGroupPattern = Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>");
    +  Pattern capitalLettersPattern = Pattern.compile("^.*[A-Z]+.*$");
    +  private Pattern recordTypePattern;
    +  private final Set<String> recordTypePatternNamedGroups = new HashSet<>();
    +  private final Map<String, Map<Pattern, Set<String>>> recordTypePatternMap = new HashMap<>();
    +  private final Map<Pattern, Set<String>> syslogPatternsMap = new HashMap<>();
    +
    +  /**
    +   * Parses an unstructured text message into a json object based upon the regular expression
    +   * configuration supplied.
    +   *
    +   * @param rawMessage incoming unstructured raw text.
    +   *
    +   * @return List of json parsed json objects. In this case list will have a single element only.
    +   */
    +  @Override
    +  public List<JSONObject> parse(byte[] rawMessage) {
    +    String originalMessage = null;
    +    try {
    +      originalMessage = new String(rawMessage, UTF_8).trim();
    +      LOG.debug(" raw message. {}", originalMessage);
    +      if (originalMessage.isEmpty()) {
    +        LOG.warn("Message is empty.");
    +        return Arrays.asList(new JSONObject());
    +      }
    +    } catch (final Exception e) {
    +      LOG.error("[Metron] Could not read raw message. {} " + originalMessage, e);
    +      throw new RuntimeException(e.getMessage(), e);
    +    }
    +
    +    try {
    +      final JSONObject parsedJson = new JSONObject();
    +      if (syslogPatternsMap.size() > 0) {
    +        parsedJson.putAll(extractHeaderFields(originalMessage));
    +      }
    +      parsedJson.putAll(parse(originalMessage));
    +      parsedJson.put(Constants.Fields.ORIGINAL.getName(), originalMessage);
    +      applyFieldTransformations(parsedJson);
    +      return Arrays.asList(parsedJson);
    +    } catch (final ParseException e) {
    +      LOG.error("Error occured in parsing. original message : " + originalMessage, e);
    +      throw new RuntimeException(e.getMessage(), e);
    +    }
    +  }
    +
    +  private void applyFieldTransformations(JSONObject parsedJson) {
    +    if (getParserConfig()
    +        .get(ParserConfigConstants.CONVERT_CAMELCASE_TO_UNDERSCORE.getName()) != null
    +        && (Boolean) getParserConfig()
    +            .get(ParserConfigConstants.CONVERT_CAMELCASE_TO_UNDERSCORE.getName())) {
    +      convertCamelCaseToUnderScore(parsedJson);
    +    }
    +
    +  }
    +
    +  // @formatter:off
    +  /**
    +   * This method is called during the parser initialization. It parses the parser
    +   * configuration and configures the parser accordingly. It then initializes
    +   * instance variables.
    +   *
    +   * @param parserConfig ParserConfig(Map<String, Object>) supplied to the sensor.
    +   * @see org.apache.metron.parsers.interfaces.Configurable#configure(java.util.Map)<br>
    +   *      <br>
    +   */
    +  // @formatter:on
    +  @Override
    +  public void configure(Map<String, Object> parserConfig) {
    +    setParserConfig(parserConfig);
    +    setFields(
    +        (List<Map<String, Object>>) getParserConfig().get(ParserConfigConstants.FIELDS.getName()));
    +
    +    setRecordTypePattern(
    +        (String) getParserConfig().get(ParserConfigConstants.RECORD_TYPE_REGEX.getName()));
    +    recordTypePatternNamedGroups.addAll(getNamedGroups(
    +        (String) getParserConfig().get(ParserConfigConstants.RECORD_TYPE_REGEX.getName())));
    +    final List<Map<String, Object>> fields =
    +        (List<Map<String, Object>>) getParserConfig().get(ParserConfigConstants.FIELDS.getName());
    +
    +    configureRecordTypePatterns(fields);
    +
    +    configureMessageHeaderPattern();
    +
    +    validateConfig();
    +  }
    +
    +  private void configureMessageHeaderPattern() {
    +    if (getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName()) != null) {
    +      if (getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName()) instanceof List) {
    +        final List<String> syslogPatternList =
    +            (List<String>) getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName());
    +        for (final String syslogPatternStr : syslogPatternList) {
    +          syslogPatternsMap.put(Pattern.compile(syslogPatternStr),
    +              getNamedGroups(syslogPatternStr));
    +        }
    +      } else if (getParserConfig()
    +          .get(ParserConfigConstants.MESSAGE_HEADER.getName()) instanceof String) {
    +        final String syslogPatternStr =
    +            (String) getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName());
    +        if (StringUtils.isNotBlank(syslogPatternStr)) {
    +          syslogPatternsMap.put(Pattern.compile(syslogPatternStr),
    +              getNamedGroups(syslogPatternStr));
    +        }
    +      }
    +    }
    +  }
    +
    +  private void configureRecordTypePatterns(List<Map<String, Object>> fields) {
    +
    +    for (final Map<String, Object> field : fields) {
    +      if (field.get(ParserConfigConstants.RECORD_TYPE.getName()) != null
    +          && field.get(ParserConfigConstants.REGEX.getName()) != null) {
    +        final String recordType =
    +            ((String) field.get(ParserConfigConstants.RECORD_TYPE.getName())).toLowerCase();
    +        recordTypePatternMap.put(recordType, new LinkedHashMap<Pattern, Set<String>>());
    +        if (field.get(ParserConfigConstants.REGEX.getName()) instanceof List) {
    +          final List<String> regexList =
    +              (List<String>) field.get(ParserConfigConstants.REGEX.getName());
    +          regexList.forEach(s -> {
    +            recordTypePatternMap.get(recordType).put(Pattern.compile(s), getNamedGroups(s));
    +          });
    +        } else if (field.get(ParserConfigConstants.REGEX.getName()) instanceof String) {
    +          recordTypePatternMap.get(recordType).put(
    +              Pattern.compile((String) field.get(ParserConfigConstants.REGEX.getName())),
    +              getNamedGroups((String) field.get(ParserConfigConstants.REGEX.getName())));
    +        }
    +      }
    +    }
    +  }
    +
    +  private void setRecordTypePattern(String recordTypeRegex) {
    +    if (recordTypeRegex != null) {
    +      recordTypePattern = Pattern.compile(recordTypeRegex);
    +    }
    +  }
    +
    +  private JSONObject parse(String originalMessage) throws ParseException {
    +    final JSONObject parsedJson = new JSONObject();
    +    final Optional<String> recordIdentifier = getField(recordTypePattern, originalMessage);
    +    if (recordIdentifier.isPresent()) {
    +      extractNamedGroups(parsedJson, recordIdentifier.get(), originalMessage);
    +    }
    +    /*
    +     * Extract fields(named groups) from record type regular expression
    +     */
    +    final Matcher matcher = recordTypePattern.matcher(originalMessage);
    +    if (matcher.find()) {
    +      for (final String namedGroup : recordTypePatternNamedGroups) {
    +        if (matcher.group(namedGroup) != null) {
    +          parsedJson.put(namedGroup, matcher.group(namedGroup).trim());
    +        }
    +      }
    +    }
    +    return parsedJson;
    +  }
    +
    +  private void extractNamedGroups(Map<String, Object> json, String recordType,
    +      String originalMessage) {
    +    final Map<Pattern, Set<String>> patternMap = recordTypePatternMap.get(recordType.toLowerCase());
    +    if (patternMap != null) {
    +      for (final Map.Entry<Pattern, Set<String>> entry : patternMap.entrySet()) {
    +        final Pattern pattern = entry.getKey();
    +        final Set<String> namedGroups = entry.getValue();
    +        if (pattern != null && namedGroups != null && namedGroups.size() > 0) {
    +          final Matcher m = pattern.matcher(originalMessage);
    +          if (m.matches()) {
    +            LOG.debug("RecordType : {} Trying regex : {} for message : {} ", recordType,
    +                pattern.toString(), originalMessage);
    +            for (final String namedGroup : namedGroups) {
    +              if (m.group(namedGroup) != null) {
    +                json.put(namedGroup, m.group(namedGroup).trim());
    +              }
    +            }
    +            break;
    +          }
    +        }
    +      }
    +    } else {
    +      LOG.warn("No pattern found for record type : {}", recordType);
    +    }
    +  }
    +
    +  public Optional<String> getField(Pattern pattern, String originalMessage) {
    +    final Matcher matcher = pattern.matcher(originalMessage);
    +    while (matcher.find()) {
    +      return Optional.of(matcher.group());
    +    }
    +    return Optional.empty();
    +  }
    +
    +  private Set<String> getNamedGroups(String regex) {
    +    final Set<String> namedGroups = new TreeSet<>();
    +    final Matcher matcher = namedGroupPattern.matcher(regex);
    +    while (matcher.find()) {
    +      namedGroups.add(matcher.group(1));
    +    }
    +    return namedGroups;
    +  }
    +
    +  private Map<String, Object> extractHeaderFields(String originalMessage) {
    +    final Map<String, Object> syslogJson = new JSONObject();
    +    for (final Map.Entry<Pattern, Set<String>> syslogPatternEntry : syslogPatternsMap.entrySet()) {
    +      final Matcher m = syslogPatternEntry.getKey().matcher(originalMessage);
    +      if (m.find()) {
    +        for (final String namedGroup : syslogPatternEntry.getValue()) {
    +          if (StringUtils.isNotBlank(m.group(namedGroup))) {
    +            syslogJson.put(namedGroup, m.group(namedGroup).trim());
    +          }
    +        }
    +        break;
    +      }
    +    }
    +    return syslogJson;
    +  }
    +
    +  @Override
    +  public void init() {
    +    LOG.info("RegularExpressions parser initialised.");
    +  }
    +
    +  public void validateConfig() {
    +    if (getFields() == null) {
    +      LOG.error("Invalid config :  fields is missing in parserConfig");
    +      throw new IllegalStateException("Invalid config :fields is missing in parserConfig");
    +    }
    +    if (recordTypePattern == null) {
    +      LOG.error("Invalid config :recordTypeRegex is missing in parserConfig");
    +      throw new IllegalStateException("Invalid config :recordTypeRegex is missing in parserConfig");
    +    }
    +  }
    +
    +  private void convertCamelCaseToUnderScore(Map<String, Object> json) {
    --- End diff --
    
    Thanks for pointing it out. I have now replaced it with a library function from Guava.
    
    `CaseFormat.UPPER_CAMEL.to(CaseFormat.LOWER_UNDERSCORE, entry.getKey())`
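    
    For context, the renaming step around that call could look roughly like the sketch below. This is an assumption about the shape of the method, not the PR's final code: the class name is hypothetical, and the `converter` parameter stands in for whichever conversion function is used, so the example needs nothing beyond the JDK.
    
    ```java
    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.function.Function;

    final class KeyRenameSketch {

      /** Renames every key that the converter changes, leaving the values untouched. */
      static void renameKeys(Map<String, Object> json, Function<String, String> converter) {
        // First pass: collect the renames, so the map is not modified while iterating over it.
        final Map<String, String> renames = new LinkedHashMap<>();
        for (final String key : json.keySet()) {
          final String newKey = converter.apply(key);
          if (!newKey.equals(key)) {
            renames.put(key, newKey);
          }
        }
        // Second pass: move each value from its old key to its new key.
        renames.forEach((oldKey, newKey) -> json.put(newKey, json.remove(oldKey)));
      }
    }
    ```
    
    With Guava on the classpath, the converter could be passed as `key -> CaseFormat.UPPER_CAMEL.to(CaseFormat.LOWER_UNDERSCORE, key)`.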


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by nickwallen <gi...@git.apache.org>.
Github user nickwallen commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r237649967
  
    --- Diff: metron-platform/metron-parsers/README.md ---
    @@ -52,6 +52,62 @@ There are two general types types of parsers:
            This is using the default value for `wrapEntityName` if that property is not set.
         * `wrapEntityName` : Sets the name to use when wrapping JSON using `wrapInEntityArray`.  The `jsonpQuery` should reference this name.
         * A field called `timestamp` is expected to exist and, if it does not, then current time is inserted.  
    +  * Regular Expressions Parser
    +      * `recordTypeRegex` : A regular expression to uniquely identify a record type.
    +      * `messageHeaderRegex` : A regular expression used to extract fields from a message part which is common across all the messages.
    +      * `convertCamelCaseToUnderScore` : If this property is set to true, this parser will automatically convert all the camel case property names to underscore seperated. 
    +          For example, following convertions will automatically happen:
    +
    +          ```
    +          ipSrcAddr -> ip_src_addr
    +          ipDstAddr -> ip_dst_addr
    +          ipSrcPort -> ip_src_port
    +          ```
    +          Note this property may be necessary, because java does not support underscores in the named group names. So in case your property naming conventions requires underscores in property names, use this property.
    +          
    +      * `fields` : A json list of maps contaning a record type to regular expression mapping.
    +      
    +      A complete configuration example would look like:
    +      
    +      ```json
    +      "convertCamelCaseToUnderScore": true, 
    +      "recordTypeRegex": "kernel|syslog",
    +      "messageHeaderRegex": "(<syslogPriority>(<=^&lt;)\\d{1,4}(?=>)).*?(<timestamp>(<=>)[A-Za-z] {3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(<syslogHost>(<=\\s).*?(?=\\s))",
    +      "fields": [
    +        {
    +          "recordType": "kernel",
    +          "regex": ".*(<eventInfo>(<=\\]|\\w\\:).*?(?=$))"
    +        },
    +        {
    +          "recordType": "syslog",
    +          "regex": ".*(<processid>(<=PID\\s=\\s).*?(?=\\sLine)).*(<filePath>(<=64\\s)\/([A-Za-z0-9_-]+\/)+(?=\\w))        (<fileName>.*?(?=\")).*(<eventInfo>(<=\").*?(?=$))"
    --- End diff --
    
    What is the expected output here? Should I expect that for any 'syslog' message, there will be 3 fields added; processid, filePath, and fileName?
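    
    To make the question concrete, here is a minimal sketch of how I read the extraction logic in the diff: each named group that takes part in a full match becomes one field. The group-name pattern is the one from the diff; the class name and the sshd regex are hypothetical and only for illustration, and the sample message is the one from the unit test.
    
    ```java
    import java.util.Map;
    import java.util.Set;
    import java.util.TreeMap;
    import java.util.TreeSet;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    final class NamedGroupSketch {

      // Same idea as namedGroupPattern in the diff: find "(?<name>" occurrences in a regex string.
      private static final Pattern NAMED_GROUP = Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>");

      // Hypothetical record type regex, written only for this example.
      static final String SSHD_REGEX =
          ".*sshd\\[(?<processid>\\d+)\\]: Accepted (?<authMethod>\\w+) for (?<userName>\\S+)"
              + " from (?<ipSrcAddr>\\S+) port (?<ipSrcPort>\\d+).*";

      static Set<String> namedGroups(String regex) {
        final Set<String> names = new TreeSet<>();
        final Matcher m = NAMED_GROUP.matcher(regex);
        while (m.find()) {
          names.add(m.group(1));
        }
        return names;
      }

      /** Returns one entry per named group that took part in a full match; empty if no match. */
      static Map<String, String> extract(String regex, String message) {
        final Map<String, String> fields = new TreeMap<>();
        final Matcher m = Pattern.compile(regex).matcher(message);
        if (m.matches()) {
          for (final String name : namedGroups(regex)) {
            if (m.group(name) != null) {
              fields.put(name, m.group(name).trim());
            }
          }
        }
        return fields;
      }

      public static void main(String[] args) {
        final String message = "<38>Jun 20 15:01:17 deviceName sshd[11672]: Accepted publickey "
            + "for prod from 22.22.22.22 port 55555 ssh2";
        System.out.println(extract(SSHD_REGEX, message));
      }
    }
    ```
    
    Under these assumptions the sample message yields five fields (`authMethod`, `ipSrcAddr`, `ipSrcPort`, `processid`, `userName`), and a message that does not match the whole regex yields none.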


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by nickwallen <gi...@git.apache.org>.
Github user nickwallen commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r239608145
  
    --- Diff: metron-platform/metron-parsers/src/test/java/org/apache/metron/parsers/regex/RegularExpressionsParserTest.java ---
    @@ -0,0 +1,152 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more contributor license
    + * agreements. See the NOTICE file distributed with this work for additional information regarding
    + * copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance with the License. You may obtain a
    + * copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software distributed under the License
    + * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
    + * or implied. See the License for the specific language governing permissions and limitations under
    + * the License.
    + */
    +package org.apache.metron.parsers.regex;
    +
    +import org.json.simple.JSONObject;
    +import org.json.simple.parser.JSONParser;
    +import org.junit.Before;
    +import org.junit.Test;
    +
    +import java.nio.file.Files;
    +import java.nio.file.Paths;
    +import java.util.HashMap;
    +import java.util.List;
    +import java.util.Map;
    +
    +import static org.junit.Assert.assertTrue;
    +
    +public class RegularExpressionsParserTest {
    +
    +  private RegularExpressionsParser regularExpressionsParser;
    +  private JSONObject parserConfig;
    +
    +  @Before
    +  public void setUp() throws Exception {
    +    regularExpressionsParser = new RegularExpressionsParser();
    +  }
    +
    +  @Test
    +  public void testSSHDParse() throws Exception {
    +    String message =
    +        "<38>Jun 20 15:01:17 deviceName sshd[11672]: Accepted publickey for prod from 22.22.22.22 port 55555 ssh2";
    +
    +    parserConfig = getJsonConfig(
    +        Paths.get("src/test/resources/config/RegularExpressionsParserConfig.json").toString());
    --- End diff --
    
    Hi @jagdeepsingh2 - I was able to get this up and running in a debugger.  Your parser will not parse messages successfully after the changes made in #1213. You are likely using this on an older version of Metron.
    
    The parser must produce a JSONObject that contains both a `timestamp` and `original_string` field based on the [validation performed here.](https://github.com/apache/metron/blob/2ee6cc7e0b448d8d27f56f873e2c15a603c53917/metron-platform/metron-parsers/src/main/java/org/apache/metron/parsers/BasicParser.java#L34-L46)
     
    If you add the timestamp like you mentioned it should work.
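    
    A minimal sketch of that requirement, using a plain `Map` in place of `JSONObject` so it runs without Metron on the classpath. The class and method names are hypothetical, and falling back to the current time is only an assumption about how a missing timestamp might be handled, not what the PR does.
    
    ```java
    import java.util.Map;

    final class RequiredFieldsSketch {

      static final String TIMESTAMP = "timestamp";
      static final String ORIGINAL_STRING = "original_string";

      /** True only when both required fields are present in the parsed message. */
      static boolean hasRequiredFields(Map<String, Object> message) {
        return message.containsKey(ORIGINAL_STRING) && message.containsKey(TIMESTAMP);
      }

      /** Hypothetical fallback: use the current time when no timestamp was extracted. */
      static void ensureTimestamp(Map<String, Object> message) {
        message.putIfAbsent(TIMESTAMP, System.currentTimeMillis());
      }
    }
    ```
    
    A check like `hasRequiredFields` in the unit test would catch this kind of regression before the message reaches validation.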


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by jagdeepsingh2 <gi...@git.apache.org>.
Github user jagdeepsingh2 commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r237715210
  
    --- Diff: metron-platform/metron-parsers/README.md ---
    @@ -52,6 +52,62 @@ There are two general types types of parsers:
            This is using the default value for `wrapEntityName` if that property is not set.
         * `wrapEntityName` : Sets the name to use when wrapping JSON using `wrapInEntityArray`.  The `jsonpQuery` should reference this name.
         * A field called `timestamp` is expected to exist and, if it does not, then current time is inserted.  
    +  * Regular Expressions Parser
    +      * `recordTypeRegex` : A regular expression to uniquely identify a record type.
    +      * `messageHeaderRegex` : A regular expression used to extract fields from a message part which is common across all the messages.
    +      * `convertCamelCaseToUnderScore` : If this property is set to true, this parser will automatically convert all the camel case property names to underscore seperated. 
    +          For example, following convertions will automatically happen:
    +
    +          ```
    +          ipSrcAddr -> ip_src_addr
    +          ipDstAddr -> ip_dst_addr
    +          ipSrcPort -> ip_src_port
    +          ```
    +          Note this property may be necessary, because java does not support underscores in the named group names. So in case your property naming conventions requires underscores in property names, use this property.
    +          
    +      * `fields` : A json list of maps contaning a record type to regular expression mapping.
    +      
    +      A complete configuration example would look like:
    +      
    +      ```json
    +      "convertCamelCaseToUnderScore": true, 
    +      "recordTypeRegex": "kernel|syslog",
    +      "messageHeaderRegex": "(<syslogPriority>(<=^&lt;)\\d{1,4}(?=>)).*?(<timestamp>(<=>)[A-Za-z] {3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(<syslogHost>(<=\\s).*?(?=\\s))",
    +      "fields": [
    +        {
    +          "recordType": "kernel",
    +          "regex": ".*(<eventInfo>(<=\\]|\\w\\:).*?(?=$))"
    +        },
    +        {
    +          "recordType": "syslog",
    +          "regex": ".*(<processid>(<=PID\\s=\\s).*?(?=\\sLine)).*(<filePath>(<=64\\s)\/([A-Za-z0-9_-]+\/)+(?=\\w))        (<fileName>.*?(?=\")).*(<eventInfo>(<=\").*?(?=$))"
    +        }
    +      ]
    +      ```
    +      **Note**: messageHeaderRegex and regex (withing fields) could be specified as lists also e.g.
    +      ```json
    +          "messageHeaderRegex": [
    +          "regular expression 1",
    +          "regular expression 2"
    +          ]
    +      ```
    +      Where **regular expression 1** are valid regular expressions and may have named
    +      groups, which would be extracted into fields. This list will be evaluated in order until a
    +      matching regular expression is found.
    +      
    +      **recordTypeRegex** can be a more advanced regular expression containing named goups. For example
    --- End diff --
    
    Although a named group in recordType is completely optional, you may still want to use one for the following reasons:
    
    1. The **recordType** regular expression is matched anyway, so we are already paying the price for that match; we can extract certain fields as a by-product of it.
    2. Most likely the recordType field is common across all the messages. Extracting it in the **recordType** (or **messageHeaderRegex**) therefore reduces the overall complexity of the regular expressions in the **regex** field.
    
    Again, how you craft your parser configuration is a personal choice. These are just the options given to the user.
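
As a rough illustration of point 1, the sketch below reuses the `recordTypeRegex` from the Javadoc example in this PR and reads the `process` named group from the same match that identifies the record type. The class and method names are hypothetical and are not part of the parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class RecordTypeNamedGroupSketch {

  // recordTypeRegex copied from the Javadoc example; "process" is its named group.
  static final Pattern RECORD_TYPE =
      Pattern.compile("(?<process>(?<=\\s)\\b(kernel|syslog)\\b(?=\\[|:))");

  // Returns the text captured by the "process" group, or null if no record type is found.
  static String extractProcess(String message) {
    Matcher m = RECORD_TYPE.matcher(message);
    return m.find() ? m.group("process") : null;
  }

  public static void main(String[] args) {
    String message =
        "<7>Jun 26 16:18:01 hostName kernel: SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs";
    System.out.println(extractProcess(message));
  }
}
```

The single `find()` call both identifies the record type and yields the field value, so no second regular expression is needed for it.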


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by nickwallen <gi...@git.apache.org>.
Github user nickwallen commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r237650548
  
    --- Diff: metron-platform/metron-parsers/README.md ---
    @@ -52,6 +52,62 @@ There are two general types types of parsers:
            This is using the default value for `wrapEntityName` if that property is not set.
         * `wrapEntityName` : Sets the name to use when wrapping JSON using `wrapInEntityArray`.  The `jsonpQuery` should reference this name.
         * A field called `timestamp` is expected to exist and, if it does not, then current time is inserted.  
    +  * Regular Expressions Parser
    +      * `recordTypeRegex` : A regular expression to uniquely identify a record type.
    +      * `messageHeaderRegex` : A regular expression used to extract fields from a message part which is common across all the messages.
    +      * `convertCamelCaseToUnderScore` : If this property is set to true, this parser will automatically convert all the camel case property names to underscore seperated. 
    +          For example, following convertions will automatically happen:
    +
    +          ```
    +          ipSrcAddr -> ip_src_addr
    +          ipDstAddr -> ip_dst_addr
    +          ipSrcPort -> ip_src_port
    +          ```
    +          Note this property may be necessary, because java does not support underscores in the named group names. So in case your property naming conventions requires underscores in property names, use this property.
    +          
    +      * `fields` : A json list of maps contaning a record type to regular expression mapping.
    +      
    +      A complete configuration example would look like:
    +      
    +      ```json
    +      "convertCamelCaseToUnderScore": true, 
    +      "recordTypeRegex": "kernel|syslog",
    +      "messageHeaderRegex": "(<syslogPriority>(<=^&lt;)\\d{1,4}(?=>)).*?(<timestamp>(<=>)[A-Za-z] {3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(<syslogHost>(<=\\s).*?(?=\\s))",
    +      "fields": [
    +        {
    +          "recordType": "kernel",
    +          "regex": ".*(<eventInfo>(<=\\]|\\w\\:).*?(?=$))"
    +        },
    +        {
    +          "recordType": "syslog",
    +          "regex": ".*(<processid>(<=PID\\s=\\s).*?(?=\\sLine)).*(<filePath>(<=64\\s)\/([A-Za-z0-9_-]+\/)+(?=\\w))        (<fileName>.*?(?=\")).*(<eventInfo>(<=\").*?(?=$))"
    +        }
    +      ]
    +      ```
    +      **Note**: messageHeaderRegex and regex (withing fields) could be specified as lists also e.g.
    --- End diff --
    
    Can you show me what the examples above would look like as lists?
    
    Why would I choose to use a list versus not use a list?
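
For context, the quoted README text says a list "will be evaluated in order until a matching regular expression is found". A minimal sketch of that behaviour follows, assuming two made-up header patterns (one for messages with a syslog priority, one fallback) and a hypothetical `host` group; none of these names come from the PR.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of "try each regex in order, stop at the first match".
class OrderedRegexListSketch {

  // Assumed patterns for illustration; a real configuration would supply its own.
  static final List<Pattern> HEADER_PATTERNS = Arrays.asList(
      Pattern.compile("^<\\d{1,4}>[A-Za-z]{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}\\s(?<host>\\S+)"),
      Pattern.compile("^(?<host>\\S+)\\s"));

  // Returns the "host" group from the first pattern that matches, if any.
  static Optional<String> firstHost(List<Pattern> patterns, String message) {
    for (Pattern p : patterns) {
      Matcher m = p.matcher(message);
      if (m.find()) {
        return Optional.ofNullable(m.group("host"));
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    System.out.println(firstHost(HEADER_PATTERNS,
        "<38>Jun 20 15:01:17 deviceName sshd[11672]: Accepted publickey"));
  }
}
```

One plausible reason to prefer a list, under this assumption, is a sensor whose messages arrive with more than one header layout.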


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by jagdeepsingh2 <gi...@git.apache.org>.
Github user jagdeepsingh2 commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r237708788
  
    --- Diff: metron-platform/metron-parsers/src/test/java/org/apache/metron/parsers/regex/RegularExpressionsParserTest.java ---
    @@ -0,0 +1,152 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more contributor license
    + * agreements. See the NOTICE file distributed with this work for additional information regarding
    + * copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance with the License. You may obtain a
    + * copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software distributed under the License
    + * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
    + * or implied. See the License for the specific language governing permissions and limitations under
    + * the License.
    + */
    +package org.apache.metron.parsers.regex;
    +
    +import org.json.simple.JSONObject;
    +import org.json.simple.parser.JSONParser;
    +import org.junit.Before;
    +import org.junit.Test;
    +
    +import java.nio.file.Files;
    +import java.nio.file.Paths;
    +import java.util.HashMap;
    +import java.util.List;
    +import java.util.Map;
    +
    +import static org.junit.Assert.assertTrue;
    +
    +public class RegularExpressionsParserTest {
    +
    +  private RegularExpressionsParser regularExpressionsParser;
    +  private JSONObject parserConfig;
    +
    +  @Before
    +  public void setUp() throws Exception {
    +    regularExpressionsParser = new RegularExpressionsParser();
    +  }
    +
    +  @Test
    +  public void testSSHDParse() throws Exception {
    +    String message =
    +        "<38>Jun 20 15:01:17 deviceName sshd[11672]: Accepted publickey for prod from 22.22.22.22 port 55555 ssh2";
    +
    +    parserConfig = getJsonConfig(
    +        Paths.get("src/test/resources/config/RegularExpressionsParserConfig.json").toString());
    +    regularExpressionsParser.configure(parserConfig);
    +    JSONObject parsed = parse(message);
    +    // Expected
    +    Map<String, Object> expectedJson = new HashMap<>();
    +    expectedJson.put("device_name", "deviceName");
    +    expectedJson.put("dst_process_name", "sshd");
    +    expectedJson.put("dst_process_id", "11672");
    +    expectedJson.put("dst_user_id", "prod");
    +    expectedJson.put("ip_src_addr", "22.22.22.22");
    +    expectedJson.put("ip_src_port", "55555");
    +    expectedJson.put("app_protocol", "ssh2");
    --- End diff --
    
    Sure, will do that.


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by nickwallen <gi...@git.apache.org>.
Github user nickwallen commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r237869285
  
    --- Diff: metron-platform/metron-parsers/README.md ---
    @@ -52,6 +52,62 @@ There are two general types types of parsers:
            This is using the default value for `wrapEntityName` if that property is not set.
         * `wrapEntityName` : Sets the name to use when wrapping JSON using `wrapInEntityArray`.  The `jsonpQuery` should reference this name.
         * A field called `timestamp` is expected to exist and, if it does not, then current time is inserted.  
    +  * Regular Expressions Parser
    +      * `recordTypeRegex` : A regular expression to uniquely identify a record type.
    +      * `messageHeaderRegex` : A regular expression used to extract fields from a message part which is common across all the messages.
    +      * `convertCamelCaseToUnderScore` : If this property is set to true, this parser will automatically convert all the camel case property names to underscore seperated. 
    +          For example, following convertions will automatically happen:
    +
    +          ```
    +          ipSrcAddr -> ip_src_addr
    +          ipDstAddr -> ip_dst_addr
    +          ipSrcPort -> ip_src_port
    +          ```
    +          Note this property may be necessary, because java does not support underscores in the named group names. So in case your property naming conventions requires underscores in property names, use this property.
    +          
    +      * `fields` : A json list of maps contaning a record type to regular expression mapping.
    +      
    +      A complete configuration example would look like:
    +      
    +      ```json
    +      "convertCamelCaseToUnderScore": true, 
    +      "recordTypeRegex": "kernel|syslog",
    +      "messageHeaderRegex": "(<syslogPriority>(<=^&lt;)\\d{1,4}(?=>)).*?(<timestamp>(<=>)[A-Za-z] {3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(<syslogHost>(<=\\s).*?(?=\\s))",
    +      "fields": [
    +        {
    +          "recordType": "kernel",
    +          "regex": ".*(<eventInfo>(<=\\]|\\w\\:).*?(?=$))"
    +        },
    +        {
    +          "recordType": "syslog",
    +          "regex": ".*(<processid>(<=PID\\s=\\s).*?(?=\\sLine)).*(<filePath>(<=64\\s)\/([A-Za-z0-9_-]+\/)+(?=\\w))        (<fileName>.*?(?=\")).*(<eventInfo>(<=\").*?(?=$))"
    +        }
    +      ]
    +      ```
    +      **Note**: messageHeaderRegex and regex (withing fields) could be specified as lists also e.g.
    +      ```json
    +          "messageHeaderRegex": [
    +          "regular expression 1",
    +          "regular expression 2"
    +          ]
    +      ```
    +      Where **regular expression 1** are valid regular expressions and may have named
    +      groups, which would be extracted into fields. This list will be evaluated in order until a
    +      matching regular expression is found.
    +      
    +      **recordTypeRegex** can be a more advanced regular expression containing named goups. For example
    --- End diff --
    
    Good description.  Can you add this advice to the documentation?


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by jagdeepsingh2 <gi...@git.apache.org>.
Github user jagdeepsingh2 commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r237718325
  
    --- Diff: metron-platform/metron-parsers/src/test/java/org/apache/metron/parsers/regex/RegularExpressionsParserTest.java ---
    @@ -0,0 +1,152 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more contributor license
    + * agreements. See the NOTICE file distributed with this work for additional information regarding
    + * copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance with the License. You may obtain a
    + * copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software distributed under the License
    + * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
    + * or implied. See the License for the specific language governing permissions and limitations under
    + * the License.
    + */
    +package org.apache.metron.parsers.regex;
    +
    +import org.json.simple.JSONObject;
    +import org.json.simple.parser.JSONParser;
    +import org.junit.Before;
    +import org.junit.Test;
    +
    +import java.nio.file.Files;
    +import java.nio.file.Paths;
    +import java.util.HashMap;
    +import java.util.List;
    +import java.util.Map;
    +
    +import static org.junit.Assert.assertTrue;
    +
    +public class RegularExpressionsParserTest {
    +
    +  private RegularExpressionsParser regularExpressionsParser;
    +  private JSONObject parserConfig;
    +
    +  @Before
    +  public void setUp() throws Exception {
    +    regularExpressionsParser = new RegularExpressionsParser();
    +  }
    +
    +  @Test
    +  public void testSSHDParse() throws Exception {
    +    String message =
    +        "<38>Jun 20 15:01:17 deviceName sshd[11672]: Accepted publickey for prod from 22.22.22.22 port 55555 ssh2";
    +
    +    parserConfig = getJsonConfig(
    +        Paths.get("src/test/resources/config/RegularExpressionsParserConfig.json").toString());
    --- End diff --
    
    It could be failing because this parser does not add a "timestamp" field to the parsed JSON. In our use case we add the timestamp using Stellar. I will update the parser to add the current system time as a default timestamp.
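
A minimal sketch of such a default, assuming the parsed message is available as a `Map<String, Object>` and that epoch milliseconds are an acceptable value; the class and method names are hypothetical and this is not the change that was eventually committed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, for illustration only.
class DefaultTimestampSketch {

  // Adds the current system time only when no "timestamp" value is present.
  static void addDefaultTimestamp(Map<String, Object> parsed) {
    parsed.putIfAbsent("timestamp", System.currentTimeMillis());
  }

  public static void main(String[] args) {
    Map<String, Object> parsed = new HashMap<>();
    parsed.put("ip_src_addr", "22.22.22.22");
    addDefaultTimestamp(parsed);
    System.out.println(parsed.containsKey("timestamp"));
  }
}
```

A timestamp extracted by a regular expression, or set earlier by any other means, is left untouched.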


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by jagdeepsingh2 <gi...@git.apache.org>.
Github user jagdeepsingh2 commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r240064005
  
    --- Diff: metron-platform/metron-parsers/README.md ---
    @@ -52,6 +52,62 @@ There are two general types types of parsers:
            This is using the default value for `wrapEntityName` if that property is not set.
         * `wrapEntityName` : Sets the name to use when wrapping JSON using `wrapInEntityArray`.  The `jsonpQuery` should reference this name.
         * A field called `timestamp` is expected to exist and, if it does not, then current time is inserted.  
    +  * Regular Expressions Parser
    +      * `recordTypeRegex` : A regular expression to uniquely identify a record type.
    +      * `messageHeaderRegex` : A regular expression used to extract fields from a message part which is common across all the messages.
    +      * `convertCamelCaseToUnderScore` : If this property is set to true, this parser will automatically convert all the camel case property names to underscore seperated. 
    +          For example, following convertions will automatically happen:
    +
    +          ```
    +          ipSrcAddr -> ip_src_addr
    +          ipDstAddr -> ip_dst_addr
    +          ipSrcPort -> ip_src_port
    +          ```
    +          Note this property may be necessary, because java does not support underscores in the named group names. So in case your property naming conventions requires underscores in property names, use this property.
    +          
    +      * `fields` : A json list of maps contaning a record type to regular expression mapping.
    +      
    +      A complete configuration example would look like:
    +      
    +      ```json
    +      "convertCamelCaseToUnderScore": true, 
    +      "recordTypeRegex": "kernel|syslog",
    +      "messageHeaderRegex": "(<syslogPriority>(<=^&lt;)\\d{1,4}(?=>)).*?(<timestamp>(<=>)[A-Za-z] {3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(<syslogHost>(<=\\s).*?(?=\\s))",
    --- End diff --
    
    I have added this explanation to the README. Thanks for the suggestion.


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by justinleet <gi...@git.apache.org>.
Github user justinleet commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r232356461
  
    --- Diff: metron-platform/metron-parsers/src/main/java/org/apache/metron/parsers/regex/RegularExpressionsParser.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more contributor license
    + * agreements. See the NOTICE file distributed with this work for additional information regarding
    + * copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance with the License. You may obtain a
    + * copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software distributed under the License
    + * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
    + * or implied. See the License for the specific language governing permissions and limitations under
    + * the License.
    + */
    +
    +package org.apache.metron.parsers.regex;
    +
    +import java.nio.charset.Charset;
    +import java.text.ParseException;
    +import java.util.ArrayList;
    +import java.util.Arrays;
    +import java.util.HashMap;
    +import java.util.HashSet;
    +import java.util.LinkedHashMap;
    +import java.util.List;
    +import java.util.Map;
    +import java.util.Optional;
    +import java.util.Set;
    +import java.util.TreeSet;
    +import java.util.regex.Matcher;
    +import java.util.regex.Pattern;
    +import java.util.stream.Collectors;
    +import org.apache.commons.lang3.StringUtils;
    +import org.apache.metron.common.Constants;
    +import org.apache.metron.parsers.BasicParser;
    +import org.apache.metron.common.Constants.ParserConfigConstants;
    +import org.json.simple.JSONObject;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +//@formatter:off
    +/**
    + * General purpose class to parse unstructured text message into a json object. This class parses
    + * the message as per supplied parser config as part of sensor config. Sensor parser config example:
    + *
    + * <pre>
    + * <code>
    + * "convertCamelCaseToUnderScore": true,
    + * "recordTypeRegex": "(?&lt;process&gt;(?&lt;=\\s)\\b(kernel|syslog)\\b(?=\\[|:))",
    + * "messageHeaderRegex": "(?&lt;syslogpriority&gt;(?&lt;=^&lt;)\\d{1,4}(?=&gt;)).*?(?&lt;timestamp>(?&lt;=&gt;)[A-Za-z]{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(?&lt;syslogHost&gt;(?&lt;=\\s).*?(?=\\s))",
    + * "fields": [
    + * {
    + * "recordType": "kernel",
    + * "regex": ".*(?&lt;eventInfo&gt;(?&lt;=\\]|\\w\\:).*?(?=$))"
    + * },
    + * {
    + * "recordType": "syslog",
    + * "regex": ".*(?&lt;processid&gt;(?&lt;=PID\\s=\\s).*?(?=\\sLine)).*(?&lt;filePath&gt;(?&lt;=64\\s)\/([A-Za-z0-9_-]+\/)+(?=\\w))(?&lt;fileName&gt;.*?(?=\")).*(?&lt;eventInfo&gt;(?&lt;=\").*?(?=$))"
    + * }
    + * ]
    + * </code>
    + * </pre>
    + *
    + * Note: messageHeaderRegex could be specified as lists also e.g.
    + *
    + * <pre>
    + * <code>
    + * "messageHeaderRegex": [
    + * "regular expression 1",
    + * "regular expression 2"
    + * ]
    + * </code>
    + * </pre>
    + *
    + * Where <strong>regular expression 1</strong> are valid regular expressions and may have named
    + * groups, which would be extracted into fields. This list will be evaluated in order until a
    + * matching regular expression is found.<br>
    + * <br>
    + *
    + * <strong>Configuration fields explanation</strong>
    + *
    + * <pre>
    + * recordTypeRegex : used to specify a regular expression to distinctly identify a record type.
    + * messageHeaderRegex :  used to specify a regular expression to extract fields from a message part which is common across all the messages.
    + * e.g. rhel logs looks like
    + * <code>
    + * <7>Jun 26 16:18:01 hostName kernel: SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
    + * </code>
    + * <br>
    + * </pre>
    + *
    + * Here message structure (<7>Jun 26 16:18:01 hostName kernel) is common across all messages.
    + * Hence messageHeaderRegex could be used to extract fields from this part.
    + *
    + * fields : json list of objects containing recordType and regex. regex could be a further list e.g.
    + *
    + * <pre>
    + * <code>
    + * "regex":  [ "record type specific regular expression 1",
    + *             "record type specific regular expression 2"]
    + *
    + * </code>
    + * </pre>
    + *
    + * <strong>Limitation</strong> <br>
    + * Currently the named groups in java regular expressions have a limitation. Only following
    + * characters could be used to name a named group. A capturing group can also be assigned a "name",
    + * a named-capturing group, and then be back-referenced later by the "name". Group names are
    + * composed of the following characters. The first character must be a letter.
    + *
    + * <pre>
    + * <code>
    + * The uppercase letters 'A' through 'Z' ('\u0041' through '\u005a'),
    + * The lowercase letters 'a' through 'z' ('\u0061' through '\u007a'),
    + * The digits '0' through '9' ('\u0030' through '\u0039'),
    + * </code>
    + * </pre>
    + *
    + * This means that an _ (underscore), cannot be used as part of a named group name. E.g. this is an
    + * invalid regular expression <code>.*(?&lt;event_info&gt;(?&lt;=\\]|\\w\\:).*?(?=$))</code>
    + *
    + * However, this limitation can be easily overcome by adding a parser configuration setting.
    + *
    + * <code>
    + *  "convertCamelCaseToUnderScore": true,
    + * <code>
    + * If above property is added to the sensor parser configuration, in parserConfig object, this parser will automatically convert all the camel case property names to underscore seperated.
    + * For example, following convertions will automatically happen:
    + *
    + * <code>
    + * ipSrcAddr -> ip_src_addr
    + * ipDstAddr -> ip_dst_addr
    + * ipSrcPort -> ip_src_port
    + * <code>
    + * etc.
    + */
    +//@formatter:on
    +public class RegularExpressionsParser extends BasicParser {
    +
    +  private static Logger LOG = LoggerFactory.getLogger(RegularExpressionsParser.class);
    +
    +  private static final Charset UTF_8 = Charset.forName("UTF-8");
    +
    +  private List<Map<String, Object>> fields;
    +  private Map<String, Object> parserConfig;
    +  private final Pattern namedGroupPattern = Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>");
    +  Pattern capitalLettersPattern = Pattern.compile("^.*[A-Z]+.*$");
    +  private Pattern recordTypePattern;
    +  private final Set<String> recordTypePatternNamedGroups = new HashSet<>();
    +  private final Map<String, Map<Pattern, Set<String>>> recordTypePatternMap = new HashMap<>();
    +  private final Map<Pattern, Set<String>> syslogPatternsMap = new HashMap<>();
    +
    +  /**
    +   * Parses an unstructured text message into a json object based upon the regular expression
    +   * configuration supplied.
    +   *
    +   * @param rawMessage incoming unstructured raw text.
    +   *
    +   * @return List of json parsed json objects. In this case list will have a single element only.
    +   */
    +  @Override
    +  public List<JSONObject> parse(byte[] rawMessage) {
    +    String originalMessage = null;
    +    try {
    +      originalMessage = new String(rawMessage, UTF_8).trim();
    +      LOG.debug(" raw message. {}", originalMessage);
    +      if (originalMessage.isEmpty()) {
    +        LOG.warn("Message is empty.");
    +        return Arrays.asList(new JSONObject());
    +      }
    +    } catch (final Exception e) {
    +      LOG.error("[Metron] Could not read raw message. {} " + originalMessage, e);
    +      throw new RuntimeException(e.getMessage(), e);
    +    }
    +
    +    try {
    +      final JSONObject parsedJson = new JSONObject();
    +      if (syslogPatternsMap.size() > 0) {
    +        parsedJson.putAll(extractHeaderFields(originalMessage));
    +      }
    +      parsedJson.putAll(parse(originalMessage));
    +      parsedJson.put(Constants.Fields.ORIGINAL.getName(), originalMessage);
    +      applyFieldTransformations(parsedJson);
    +      return Arrays.asList(parsedJson);
    +    } catch (final ParseException e) {
    +      LOG.error("Error occured in parsing. original message : " + originalMessage, e);
    +      throw new RuntimeException(e.getMessage(), e);
    +    }
    +  }
    +
    +  private void applyFieldTransformations(JSONObject parsedJson) {
    +    if (getParserConfig()
    +        .get(ParserConfigConstants.CONVERT_CAMELCASE_TO_UNDERSCORE.getName()) != null
    +        && (Boolean) getParserConfig()
    +            .get(ParserConfigConstants.CONVERT_CAMELCASE_TO_UNDERSCORE.getName())) {
    +      convertCamelCaseToUnderScore(parsedJson);
    +    }
    +
    +  }
    +
    +  // @formatter:off
    +  /**
    +   * This method is called during the parser initialization. It parses the parser
    +   * configuration and configures the parser accordingly. It then initializes
    +   * instance variables.
    +   *
    +   * @param parserConfig ParserConfig(Map<String, Object>) supplied to the sensor.
    +   * @see org.apache.metron.parsers.interfaces.Configurable#configure(java.util.Map)<br>
    +   *      <br>
    +   */
    +  // @formatter:on
    +  @Override
    +  public void configure(Map<String, Object> parserConfig) {
    +    setParserConfig(parserConfig);
    +    setFields(
    +        (List<Map<String, Object>>) getParserConfig().get(ParserConfigConstants.FIELDS.getName()));
    +
    +    setRecordTypePattern(
    +        (String) getParserConfig().get(ParserConfigConstants.RECORD_TYPE_REGEX.getName()));
    +    recordTypePatternNamedGroups.addAll(getNamedGroups(
    +        (String) getParserConfig().get(ParserConfigConstants.RECORD_TYPE_REGEX.getName())));
    +    final List<Map<String, Object>> fields =
    +        (List<Map<String, Object>>) getParserConfig().get(ParserConfigConstants.FIELDS.getName());
    +
    +    configureRecordTypePatterns(fields);
    +
    +    configureMessageHeaderPattern();
    +
    +    validateConfig();
    +  }
    +
    +  private void configureMessageHeaderPattern() {
    +    if (getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName()) != null) {
    +      if (getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName()) instanceof List) {
    +        final List<String> syslogPatternList =
    +            (List<String>) getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName());
    +        for (final String syslogPatternStr : syslogPatternList) {
    +          syslogPatternsMap.put(Pattern.compile(syslogPatternStr),
    +              getNamedGroups(syslogPatternStr));
    +        }
    +      } else if (getParserConfig()
    +          .get(ParserConfigConstants.MESSAGE_HEADER.getName()) instanceof String) {
    +        final String syslogPatternStr =
    +            (String) getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName());
    +        if (StringUtils.isNotBlank(syslogPatternStr)) {
    +          syslogPatternsMap.put(Pattern.compile(syslogPatternStr),
    +              getNamedGroups(syslogPatternStr));
    +        }
    +      }
    +    }
    +  }
    +
    +  private void configureRecordTypePatterns(List<Map<String, Object>> fields) {
    +
    +    for (final Map<String, Object> field : fields) {
    +      if (field.get(ParserConfigConstants.RECORD_TYPE.getName()) != null
    +          && field.get(ParserConfigConstants.REGEX.getName()) != null) {
    +        final String recordType =
    +            ((String) field.get(ParserConfigConstants.RECORD_TYPE.getName())).toLowerCase();
    +        recordTypePatternMap.put(recordType, new LinkedHashMap<Pattern, Set<String>>());
    +        if (field.get(ParserConfigConstants.REGEX.getName()) instanceof List) {
    +          final List<String> regexList =
    +              (List<String>) field.get(ParserConfigConstants.REGEX.getName());
    +          regexList.forEach(s -> {
    +            recordTypePatternMap.get(recordType).put(Pattern.compile(s), getNamedGroups(s));
    +          });
    +        } else if (field.get(ParserConfigConstants.REGEX.getName()) instanceof String) {
    +          recordTypePatternMap.get(recordType).put(
    +              Pattern.compile((String) field.get(ParserConfigConstants.REGEX.getName())),
    +              getNamedGroups((String) field.get(ParserConfigConstants.REGEX.getName())));
    +        }
    +      }
    +    }
    +  }
    +
    +  private void setRecordTypePattern(String recordTypeRegex) {
    +    if (recordTypeRegex != null) {
    +      recordTypePattern = Pattern.compile(recordTypeRegex);
    +    }
    +  }
    +
    +  private JSONObject parse(String originalMessage) throws ParseException {
    +    final JSONObject parsedJson = new JSONObject();
    +    final Optional<String> recordIdentifier = getField(recordTypePattern, originalMessage);
    +    if (recordIdentifier.isPresent()) {
    +      extractNamedGroups(parsedJson, recordIdentifier.get(), originalMessage);
    +    }
    +    /*
    +     * Extract fields(named groups) from record type regular expression
    +     */
    +    final Matcher matcher = recordTypePattern.matcher(originalMessage);
    +    if (matcher.find()) {
    +      for (final String namedGroup : recordTypePatternNamedGroups) {
    +        if (matcher.group(namedGroup) != null) {
    +          parsedJson.put(namedGroup, matcher.group(namedGroup).trim());
    +        }
    +      }
    +    }
    +    return parsedJson;
    +  }
    +
    +  private void extractNamedGroups(Map<String, Object> json, String recordType,
    +      String originalMessage) {
    +    final Map<Pattern, Set<String>> patternMap = recordTypePatternMap.get(recordType.toLowerCase());
    +    if (patternMap != null) {
    +      for (final Map.Entry<Pattern, Set<String>> entry : patternMap.entrySet()) {
    +        final Pattern pattern = entry.getKey();
    +        final Set<String> namedGroups = entry.getValue();
    +        if (pattern != null && namedGroups != null && namedGroups.size() > 0) {
    +          final Matcher m = pattern.matcher(originalMessage);
    +          if (m.matches()) {
    +            LOG.debug("RecordType : {} Trying regex : {} for message : {} ", recordType,
    +                pattern.toString(), originalMessage);
    +            for (final String namedGroup : namedGroups) {
    +              if (m.group(namedGroup) != null) {
    +                json.put(namedGroup, m.group(namedGroup).trim());
    +              }
    +            }
    +            break;
    +          }
    +        }
    +      }
    +    } else {
    +      LOG.warn("No pattern found for record type : {}", recordType);
    +    }
    +  }
    +
    +  public Optional<String> getField(Pattern pattern, String originalMessage) {
    +    final Matcher matcher = pattern.matcher(originalMessage);
    +    while (matcher.find()) {
    +      return Optional.of(matcher.group());
    +    }
    +    return Optional.empty();
    +  }
    +
    +  private Set<String> getNamedGroups(String regex) {
    +    final Set<String> namedGroups = new TreeSet<>();
    +    final Matcher matcher = namedGroupPattern.matcher(regex);
    +    while (matcher.find()) {
    +      namedGroups.add(matcher.group(1));
    +    }
    +    return namedGroups;
    +  }
    +
    +  private Map<String, Object> extractHeaderFields(String originalMessage) {
    +    final Map<String, Object> syslogJson = new JSONObject();
    +    for (final Map.Entry<Pattern, Set<String>> syslogPatternEntry : syslogPatternsMap.entrySet()) {
    +      final Matcher m = syslogPatternEntry.getKey().matcher(originalMessage);
    +      if (m.find()) {
    +        for (final String namedGroup : syslogPatternEntry.getValue()) {
    +          if (StringUtils.isNotBlank(m.group(namedGroup))) {
    +            syslogJson.put(namedGroup, m.group(namedGroup).trim());
    +          }
    +        }
    +        break;
    +      }
    +    }
    +    return syslogJson;
    +  }
    +
    +  @Override
    +  public void init() {
    +    LOG.info("RegularExpressions parser initialised.");
    +  }
    +
    +  public void validateConfig() {
    +    if (getFields() == null) {
    +      LOG.error("Invalid config :  fields is missing in parserConfig");
    +      throw new IllegalStateException("Invalid config :fields is missing in parserConfig");
    +    }
    +    if (recordTypePattern == null) {
    +      LOG.error("Invalid config :recordTypeRegex is missing in parserConfig");
    +      throw new IllegalStateException("Invalid config :recordTypeRegex is missing in parserConfig");
    +    }
    +  }
    +
    +  private void convertCamelCaseToUnderScore(Map<String, Object> json) {
    --- End diff --
    
    I'm somewhat surprised this method has to be written, but I can't seem to find a solid alternative. Does anyone know of anything that will do camelCase to snake_case?  It seems like the other way around is fairly doable.


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by justinleet <gi...@git.apache.org>.
Github user justinleet commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r232346004
  
    --- Diff: metron-platform/metron-common/src/main/java/org/apache/metron/common/Constants.java ---
    @@ -127,5 +127,48 @@ public String getType() {
         }
       }
     
    +   public static enum ParserConfigConstants {
    --- End diff --
    
    The `static` is unneeded on the `enum`


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by jagdeepsingh2 <gi...@git.apache.org>.
Github user jagdeepsingh2 commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r234872968
  
    --- Diff: metron-platform/metron-parsers/src/test/java/org/apache/metron/parsers/regex/RegularExpressionsParserTest.java ---
    @@ -0,0 +1,118 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.metron.parsers.regex;
    +
    +import static org.junit.Assert.assertTrue;
    +
    +import java.nio.file.Files;
    +import java.nio.file.Paths;
    +import java.util.HashMap;
    +import java.util.List;
    +import java.util.Map;
    +import org.json.simple.JSONObject;
    +import org.json.simple.parser.JSONParser;
    +import org.junit.Assert;
    +import org.junit.Before;
    +import org.junit.Test;
    +import org.apache.metron.parsers.regex.RegularExpressionsParser;
    +
    +public class RegularExpressionsParserTest {
    +    private RegularExpressionsParser regularExpressionsParser;
    --- End diff --
    
    I'm not sure what was wrong here. I have configured the Google-style Java code formatter in IntelliJ IDEA. If the issue was the missing line break after the class declaration, I have taken care of that.


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by nickwallen <gi...@git.apache.org>.
Github user nickwallen commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r237584581
  
    --- Diff: metron-platform/metron-parsers/src/test/java/org/apache/metron/parsers/regex/RegularExpressionsParserTest.java ---
    @@ -0,0 +1,152 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more contributor license
    + * agreements. See the NOTICE file distributed with this work for additional information regarding
    + * copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance with the License. You may obtain a
    + * copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software distributed under the License
    + * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
    + * or implied. See the License for the specific language governing permissions and limitations under
    + * the License.
    + */
    +package org.apache.metron.parsers.regex;
    +
    +import org.json.simple.JSONObject;
    +import org.json.simple.parser.JSONParser;
    +import org.junit.Before;
    +import org.junit.Test;
    +
    +import java.nio.file.Files;
    +import java.nio.file.Paths;
    +import java.util.HashMap;
    +import java.util.List;
    +import java.util.Map;
    +
    +import static org.junit.Assert.assertTrue;
    +
    +public class RegularExpressionsParserTest {
    +
    +  private RegularExpressionsParser regularExpressionsParser;
    +  private JSONObject parserConfig;
    +
    +  @Before
    +  public void setUp() throws Exception {
    +    regularExpressionsParser = new RegularExpressionsParser();
    +  }
    +
    +  @Test
    +  public void testSSHDParse() throws Exception {
    +    String message =
    +        "<38>Jun 20 15:01:17 deviceName sshd[11672]: Accepted publickey for prod from 22.22.22.22 port 55555 ssh2";
    +
    +    parserConfig = getJsonConfig(
    +        Paths.get("src/test/resources/config/RegularExpressionsParserConfig.json").toString());
    +    regularExpressionsParser.configure(parserConfig);
    +    JSONObject parsed = parse(message);
    +    // Expected
    +    Map<String, Object> expectedJson = new HashMap<>();
    +    expectedJson.put("device_name", "deviceName");
    +    expectedJson.put("dst_process_name", "sshd");
    +    expectedJson.put("dst_process_id", "11672");
    +    expectedJson.put("dst_user_id", "prod");
    +    expectedJson.put("ip_src_addr", "22.22.22.22");
    +    expectedJson.put("ip_src_port", "55555");
    +    expectedJson.put("app_protocol", "ssh2");
    --- End diff --
    
    Can you also ensure that "timestamp" and "original_string" are correctly added to each message?
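
One way the requested check could look is sketched below. It is not taken from the PR: the class `RequiredFieldsCheck` and the method `hasRequiredFields` are hypothetical, and the sketch only assumes that the parsed message is a map carrying the `timestamp` and `original_string` keys named in the comment above.

```java
import java.util.HashMap;
import java.util.Map;

public class RequiredFieldsCheck {

  // Returns true when the parsed message carries a non-null "timestamp"
  // and an "original_string" equal to the raw input that was parsed.
  public static boolean hasRequiredFields(Map<String, Object> parsed, String rawMessage) {
    return parsed.get("timestamp") != null
        && rawMessage.equals(parsed.get("original_string"));
  }

  public static void main(String[] args) {
    String raw = "<38>Jun 20 15:01:17 deviceName sshd[11672]: Accepted publickey";
    Map<String, Object> parsed = new HashMap<>();
    parsed.put("original_string", raw);
    System.out.println(hasRequiredFields(parsed, raw)); // timestamp missing
    parsed.put("timestamp", 1529506877000L);            // illustrative value only
    System.out.println(hasRequiredFields(parsed, raw));
  }
}
```

In a JUnit test the same condition would typically be wrapped in `assertTrue`, once per test message.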


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by jagdeepsingh2 <gi...@git.apache.org>.
Github user jagdeepsingh2 commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r234871255
  
    --- Diff: metron-platform/metron-parsers/src/main/java/org/apache/metron/parsers/regex/RegularExpressionsParser.java ---
    @@ -0,0 +1,427 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more contributor license
    + * agreements. See the NOTICE file distributed with this work for additional information regarding
    + * copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance with the License. You may obtain a
    + * copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software distributed under the License
    + * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
    + * or implied. See the License for the specific language governing permissions and limitations under
    + * the License.
    + */
    +
    +package org.apache.metron.parsers.regex;
    +
    +import java.nio.charset.Charset;
    +import java.text.ParseException;
    +import java.util.ArrayList;
    +import java.util.Arrays;
    +import java.util.HashMap;
    +import java.util.HashSet;
    +import java.util.LinkedHashMap;
    +import java.util.List;
    +import java.util.Map;
    +import java.util.Optional;
    +import java.util.Set;
    +import java.util.TreeSet;
    +import java.util.regex.Matcher;
    +import java.util.regex.Pattern;
    +import java.util.stream.Collectors;
    +import org.apache.commons.lang3.StringUtils;
    +import org.apache.metron.common.Constants;
    +import org.apache.metron.parsers.BasicParser;
    +import org.apache.metron.common.Constants.ParserConfigConstants;
    +import org.json.simple.JSONObject;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +//@formatter:off
    +/**
    + * General purpose class to parse unstructured text message into a json object. This class parses
    + * the message as per supplied parser config as part of sensor config. Sensor parser config example:
    + *
    + * <pre>
    + * <code>
    + * "convertCamelCaseToUnderScore": true,
    + * "recordTypeRegex": "(?&lt;process&gt;(?&lt;=\\s)\\b(kernel|syslog)\\b(?=\\[|:))",
    + * "messageHeaderRegex": "(?&lt;syslogpriority&gt;(?&lt;=^&lt;)\\d{1,4}(?=&gt;)).*?(?&lt;timestamp>(?&lt;=&gt;)[A-Za-z]{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(?&lt;syslogHost&gt;(?&lt;=\\s).*?(?=\\s))",
    + * "fields": [
    + * {
    + * "recordType": "kernel",
    + * "regex": ".*(?&lt;eventInfo&gt;(?&lt;=\\]|\\w\\:).*?(?=$))"
    + * },
    + * {
    + * "recordType": "syslog",
    + * "regex": ".*(?&lt;processid&gt;(?&lt;=PID\\s=\\s).*?(?=\\sLine)).*(?&lt;filePath&gt;(?&lt;=64\\s)\/([A-Za-z0-9_-]+\/)+(?=\\w))(?&lt;fileName&gt;.*?(?=\")).*(?&lt;eventInfo&gt;(?&lt;=\").*?(?=$))"
    + * }
    + * ]
    + * </code>
    + * </pre>
    + *
    + * Note: messageHeaderRegex could be specified as lists also e.g.
    + *
    + * <pre>
    + * <code>
    + * "messageHeaderRegex": [
    + * "regular expression 1",
    + * "regular expression 2"
    + * ]
    + * </code>
    + * </pre>
    + *
    + * Where <strong>regular expression 1</strong> are valid regular expressions and may have named
    + * groups, which would be extracted into fields. This list will be evaluated in order until a
    + * matching regular expression is found.<br>
    + * <br>
    + *
    + * <strong>Configuration fields explanation</strong>
    + *
    + * <pre>
    + * recordTypeRegex : used to specify a regular expression to distinctly identify a record type.
    + * messageHeaderRegex :  used to specify a regular expression to extract fields from a message part which is common across all the messages.
    + * e.g. rhel logs looks like
    + * <code>
    + * <7>Jun 26 16:18:01 hostName kernel: SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
    + * </code>
    + * <br>
    + * </pre>
    + *
    + * Here message structure (<7>Jun 26 16:18:01 hostName kernel) is common across all messages.
    + * Hence messageHeaderRegex could be used to extract fields from this part.
    + *
    + * fields : json list of objects containing recordType and regex. regex could be a further list e.g.
    + *
    + * <pre>
    + * <code>
    + * "regex":  [ "record type specific regular expression 1",
    + *             "record type specific regular expression 2"]
    + *
    + * </code>
    + * </pre>
    + *
    + * <strong>Limitation</strong> <br>
    + * Currently the named groups in java regular expressions have a limitation. Only following
    + * characters could be used to name a named group. A capturing group can also be assigned a "name",
    + * a named-capturing group, and then be back-referenced later by the "name". Group names are
    + * composed of the following characters. The first character must be a letter.
    + *
    + * <pre>
    + * <code>
    + * The uppercase letters 'A' through 'Z' ('\u0041' through '\u005a'),
    + * The lowercase letters 'a' through 'z' ('\u0061' through '\u007a'),
    + * The digits '0' through '9' ('\u0030' through '\u0039'),
    + * </code>
    + * </pre>
    + *
    + * This means that an _ (underscore), cannot be used as part of a named group name. E.g. this is an
    + * invalid regular expression <code>.*(?&lt;event_info&gt;(?&lt;=\\]|\\w\\:).*?(?=$))</code>
    + *
    + * However, this limitation can be easily overcome by adding a parser configuration setting.
    + *
    + * <code>
    + *  "convertCamelCaseToUnderScore": true,
    + * </code>
    + * If the above property is added to the parserConfig object of the sensor parser configuration, this parser will automatically convert all camel-case property names to underscore-separated names.
    + * For example, the following conversions will happen automatically:
    + *
    + * <code>
    + * ipSrcAddr -> ip_src_addr
    + * ipDstAddr -> ip_dst_addr
    + * ipSrcPort -> ip_src_port
    + * </code>
    + * etc.
    + */
    +//@formatter:on
    +public class RegularExpressionsParser extends BasicParser {
    +
    +  private static Logger LOG = LoggerFactory.getLogger(RegularExpressionsParser.class);
    +
    +  private static final Charset UTF_8 = Charset.forName("UTF-8");
    +
    +  private List<Map<String, Object>> fields;
    +  private Map<String, Object> parserConfig;
    +  private final Pattern namedGroupPattern = Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>");
    +  Pattern capitalLettersPattern = Pattern.compile("^.*[A-Z]+.*$");
    +  private Pattern recordTypePattern;
    +  private final Set<String> recordTypePatternNamedGroups = new HashSet<>();
    +  private final Map<String, Map<Pattern, Set<String>>> recordTypePatternMap = new HashMap<>();
    +  private final Map<Pattern, Set<String>> syslogPatternsMap = new HashMap<>();
    +
    +  /**
    +   * Parses an unstructured text message into a json object based upon the regular expression
    +   * configuration supplied.
    +   *
    +   * @param rawMessage incoming unstructured raw text.
    +   *
    +   * @return List of json parsed json objects. In this case list will have a single element only.
    +   */
    +  @Override
    +  public List<JSONObject> parse(byte[] rawMessage) {
    +    String originalMessage = null;
    +    try {
    +      originalMessage = new String(rawMessage, UTF_8).trim();
    +      LOG.debug(" raw message. {}", originalMessage);
    +      if (originalMessage.isEmpty()) {
    +        LOG.warn("Message is empty.");
    +        return Arrays.asList(new JSONObject());
    +      }
    +    } catch (final Exception e) {
    +      LOG.error("[Metron] Could not read raw message. {} " + originalMessage, e);
    +      throw new RuntimeException(e.getMessage(), e);
    +    }
    +
    +    try {
    +      final JSONObject parsedJson = new JSONObject();
    +      if (syslogPatternsMap.size() > 0) {
    +        parsedJson.putAll(extractHeaderFields(originalMessage));
    +      }
    +      parsedJson.putAll(parse(originalMessage));
    +      parsedJson.put(Constants.Fields.ORIGINAL.getName(), originalMessage);
    +      applyFieldTransformations(parsedJson);
    +      return Arrays.asList(parsedJson);
    +    } catch (final ParseException e) {
    +      LOG.error("Error occured in parsing. original message : " + originalMessage, e);
    +      throw new RuntimeException(e.getMessage(), e);
    +    }
    +  }
    +
    +  private void applyFieldTransformations(JSONObject parsedJson) {
    +    if (getParserConfig()
    +        .get(ParserConfigConstants.CONVERT_CAMELCASE_TO_UNDERSCORE.getName()) != null
    +        && (Boolean) getParserConfig()
    +            .get(ParserConfigConstants.CONVERT_CAMELCASE_TO_UNDERSCORE.getName())) {
    +      convertCamelCaseToUnderScore(parsedJson);
    +    }
    +
    +  }
    +
    +  // @formatter:off
    +  /**
    +   * This method is called during the parser initialization. It parses the parser
    +   * configuration and configures the parser accordingly. It then initializes
    +   * instance variables.
    +   *
    +   * @param parserConfig ParserConfig(Map<String, Object>) supplied to the sensor.
    +   * @see org.apache.metron.parsers.interfaces.Configurable#configure(java.util.Map)<br>
    +   *      <br>
    +   */
    +  // @formatter:on
    +  @Override
    +  public void configure(Map<String, Object> parserConfig) {
    +    setParserConfig(parserConfig);
    +    setFields(
    +        (List<Map<String, Object>>) getParserConfig().get(ParserConfigConstants.FIELDS.getName()));
    +
    +    setRecordTypePattern(
    +        (String) getParserConfig().get(ParserConfigConstants.RECORD_TYPE_REGEX.getName()));
    +    recordTypePatternNamedGroups.addAll(getNamedGroups(
    +        (String) getParserConfig().get(ParserConfigConstants.RECORD_TYPE_REGEX.getName())));
    +    final List<Map<String, Object>> fields =
    +        (List<Map<String, Object>>) getParserConfig().get(ParserConfigConstants.FIELDS.getName());
    +
    +    configureRecordTypePatterns(fields);
    +
    +    configureMessageHeaderPattern();
    +
    +    validateConfig();
    +  }
    +
    +  private void configureMessageHeaderPattern() {
    +    if (getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName()) != null) {
    +      if (getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName()) instanceof List) {
    +        final List<String> syslogPatternList =
    +            (List<String>) getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName());
    +        for (final String syslogPatternStr : syslogPatternList) {
    +          syslogPatternsMap.put(Pattern.compile(syslogPatternStr),
    +              getNamedGroups(syslogPatternStr));
    +        }
    +      } else if (getParserConfig()
    +          .get(ParserConfigConstants.MESSAGE_HEADER.getName()) instanceof String) {
    +        final String syslogPatternStr =
    +            (String) getParserConfig().get(ParserConfigConstants.MESSAGE_HEADER.getName());
    +        if (StringUtils.isNotBlank(syslogPatternStr)) {
    +          syslogPatternsMap.put(Pattern.compile(syslogPatternStr),
    +              getNamedGroups(syslogPatternStr));
    +        }
    +      }
    +    }
    +  }
    +
    +  private void configureRecordTypePatterns(List<Map<String, Object>> fields) {
    +
    +    for (final Map<String, Object> field : fields) {
    +      if (field.get(ParserConfigConstants.RECORD_TYPE.getName()) != null
    +          && field.get(ParserConfigConstants.REGEX.getName()) != null) {
    +        final String recordType =
    +            ((String) field.get(ParserConfigConstants.RECORD_TYPE.getName())).toLowerCase();
    +        recordTypePatternMap.put(recordType, new LinkedHashMap<Pattern, Set<String>>());
    +        if (field.get(ParserConfigConstants.REGEX.getName()) instanceof List) {
    +          final List<String> regexList =
    +              (List<String>) field.get(ParserConfigConstants.REGEX.getName());
    +          regexList.forEach(s -> {
    +            recordTypePatternMap.get(recordType).put(Pattern.compile(s), getNamedGroups(s));
    +          });
    +        } else if (field.get(ParserConfigConstants.REGEX.getName()) instanceof String) {
    +          recordTypePatternMap.get(recordType).put(
    +              Pattern.compile((String) field.get(ParserConfigConstants.REGEX.getName())),
    +              getNamedGroups((String) field.get(ParserConfigConstants.REGEX.getName())));
    +        }
    +      }
    +    }
    +  }
    +
    +  private void setRecordTypePattern(String recordTypeRegex) {
    +    if (recordTypeRegex != null) {
    +      recordTypePattern = Pattern.compile(recordTypeRegex);
    +    }
    +  }
    +
    +  private JSONObject parse(String originalMessage) throws ParseException {
    +    final JSONObject parsedJson = new JSONObject();
    +    final Optional<String> recordIdentifier = getField(recordTypePattern, originalMessage);
    +    if (recordIdentifier.isPresent()) {
    +      extractNamedGroups(parsedJson, recordIdentifier.get(), originalMessage);
    +    }
    +    /*
    +     * Extract fields(named groups) from record type regular expression
    +     */
    +    final Matcher matcher = recordTypePattern.matcher(originalMessage);
    +    if (matcher.find()) {
    +      for (final String namedGroup : recordTypePatternNamedGroups) {
    +        if (matcher.group(namedGroup) != null) {
    +          parsedJson.put(namedGroup, matcher.group(namedGroup).trim());
    +        }
    +      }
    +    }
    +    return parsedJson;
    +  }
    +
    +  private void extractNamedGroups(Map<String, Object> json, String recordType,
    +      String originalMessage) {
    +    final Map<Pattern, Set<String>> patternMap = recordTypePatternMap.get(recordType.toLowerCase());
    +    if (patternMap != null) {
    +      for (final Map.Entry<Pattern, Set<String>> entry : patternMap.entrySet()) {
    +        final Pattern pattern = entry.getKey();
    +        final Set<String> namedGroups = entry.getValue();
    +        if (pattern != null && namedGroups != null && namedGroups.size() > 0) {
    +          final Matcher m = pattern.matcher(originalMessage);
    +          if (m.matches()) {
    +            LOG.debug("RecordType : {} Trying regex : {} for message : {} ", recordType,
    +                pattern.toString(), originalMessage);
    +            for (final String namedGroup : namedGroups) {
    +              if (m.group(namedGroup) != null) {
    +                json.put(namedGroup, m.group(namedGroup).trim());
    +              }
    +            }
    +            break;
    +          }
    +        }
    +      }
    +    } else {
    +      LOG.warn("No pattern found for record type : {}", recordType);
    +    }
    +  }
    +
    +  public Optional<String> getField(Pattern pattern, String originalMessage) {
    +    final Matcher matcher = pattern.matcher(originalMessage);
    +    while (matcher.find()) {
    +      return Optional.of(matcher.group());
    +    }
    +    return Optional.empty();
    +  }
    +
    +  private Set<String> getNamedGroups(String regex) {
    +    final Set<String> namedGroups = new TreeSet<>();
    +    final Matcher matcher = namedGroupPattern.matcher(regex);
    +    while (matcher.find()) {
    +      namedGroups.add(matcher.group(1));
    +    }
    +    return namedGroups;
    +  }
    +
    +  private Map<String, Object> extractHeaderFields(String originalMessage) {
    +    final Map<String, Object> syslogJson = new JSONObject();
    +    for (final Map.Entry<Pattern, Set<String>> syslogPatternEntry : syslogPatternsMap.entrySet()) {
    +      final Matcher m = syslogPatternEntry.getKey().matcher(originalMessage);
    +      if (m.find()) {
    +        for (final String namedGroup : syslogPatternEntry.getValue()) {
    +          if (StringUtils.isNotBlank(m.group(namedGroup))) {
    +            syslogJson.put(namedGroup, m.group(namedGroup).trim());
    +          }
    +        }
    +        break;
    +      }
    +    }
    +    return syslogJson;
    +  }
    +
    +  @Override
    +  public void init() {
    +    LOG.info("RegularExpressions parser initialised.");
    +  }
    +
    +  public void validateConfig() {
    +    if (getFields() == null) {
    +      LOG.error("Invalid config :  fields is missing in parserConfig");
    +      throw new IllegalStateException("Invalid config :fields is missing in parserConfig");
    +    }
    +    if (recordTypePattern == null) {
    +      LOG.error("Invalid config :recordTypeRegex is missing in parserConfig");
    +      throw new IllegalStateException("Invalid config :recordTypeRegex is missing in parserConfig");
    +    }
    +  }
    +
    +  private void convertCamelCaseToUnderScore(Map<String, Object> json) {
    --- End diff --
    
    Thanks for pointing it out. I have refactored it to use a library function from Guava.
    
    `CaseFormat.UPPER_CAMEL.to(CaseFormat.LOWER_UNDERSCORE, entry.getKey())`
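
For comparison, the same conversion can be approximated with the JDK alone. The sketch below is not the PR's code: the class `CamelToSnake` and the method `convert` are hypothetical names, and the regex only handles simple lowerCamel names such as the ones listed in the README excerpt.

```java
import java.util.Locale;

public class CamelToSnake {

  // Inserts an underscore at every lowercase-or-digit to uppercase boundary,
  // then lower-cases the whole name.
  public static String convert(String name) {
    return name.replaceAll("([a-z0-9])([A-Z])", "$1_$2").toLowerCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    System.out.println(convert("ipSrcAddr"));  // ip_src_addr
    System.out.println(convert("deviceName")); // device_name
  }
}
```

A library call is generally preferable to a hand-written regex, since edge cases such as runs of capitals are easy to get wrong.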


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by jagdeepsingh2 <gi...@git.apache.org>.
Github user jagdeepsingh2 commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r235193333
  
    --- Diff: metron-platform/metron-common/src/main/java/org/apache/metron/common/Constants.java ---
    @@ -127,5 +127,40 @@ public String getType() {
         }
       }
     
    +   public enum ParserConfigConstants {
    --- End diff --
    
    As suggested, I moved ParserConfigConstants into the RegularExpressionsParser class as an inner enum.


---

[GitHub] metron issue #1245: METRON-1795: Initial Commit for Regular Expressions Pars...

Posted by mraliagha <gi...@git.apache.org>.
Github user mraliagha commented on the issue:

    https://github.com/apache/metron/pull/1245
  
    @mmiklavc Thank you for the review. Is there anything else that needs to be addressed for this PR, or can it be closed?


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by nickwallen <gi...@git.apache.org>.
Github user nickwallen commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r237869538
  
    --- Diff: metron-platform/metron-parsers/README.md ---
    @@ -52,6 +52,62 @@ There are two general types types of parsers:
            This is using the default value for `wrapEntityName` if that property is not set.
         * `wrapEntityName` : Sets the name to use when wrapping JSON using `wrapInEntityArray`.  The `jsonpQuery` should reference this name.
         * A field called `timestamp` is expected to exist and, if it does not, then current time is inserted.  
    +  * Regular Expressions Parser
    +      * `recordTypeRegex` : A regular expression to uniquely identify a record type.
    +      * `messageHeaderRegex` : A regular expression used to extract fields from a message part which is common across all the messages.
    +      * `convertCamelCaseToUnderScore` : If this property is set to true, this parser will automatically convert all camel-case property names to underscore-separated names.
    +          For example, the following conversions will happen automatically:
    +
    +          ```
    +          ipSrcAddr -> ip_src_addr
    +          ipDstAddr -> ip_dst_addr
    +          ipSrcPort -> ip_src_port
    +          ```
    +          Note that this property may be necessary because Java does not support underscores in named group names. If your property naming convention requires underscores in property names, use this property.
    +          
    +      * `fields` : A JSON list of maps containing a record type to regular expression mapping.
    +      
    +      A complete configuration example would look like:
    +      
    +      ```json
    +      "convertCamelCaseToUnderScore": true, 
    +      "recordTypeRegex": "kernel|syslog",
    +      "messageHeaderRegex": "(<syslogPriority>(<=^&lt;)\\d{1,4}(?=>)).*?(<timestamp>(<=>)[A-Za-z] {3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(<syslogHost>(<=\\s).*?(?=\\s))",
    --- End diff --
    
    Thanks for the explanation.  Can you add these details to the README?


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by jagdeepsingh2 <gi...@git.apache.org>.
Github user jagdeepsingh2 commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r237715657
  
    --- Diff: metron-platform/metron-parsers/README.md ---
    @@ -52,6 +52,62 @@ There are two general types types of parsers:
            This is using the default value for `wrapEntityName` if that property is not set.
         * `wrapEntityName` : Sets the name to use when wrapping JSON using `wrapInEntityArray`.  The `jsonpQuery` should reference this name.
         * A field called `timestamp` is expected to exist and, if it does not, then current time is inserted.  
    +  * Regular Expressions Parser
    +      * `recordTypeRegex` : A regular expression to uniquely identify a record type.
    +      * `messageHeaderRegex` : A regular expression used to extract fields from a message part which is common across all the messages.
    +      * `convertCamelCaseToUnderScore` : If this property is set to true, this parser will automatically convert all camel-case property names to underscore-separated names.
    +          For example, the following conversions will happen automatically:
    +
    +          ```
    +          ipSrcAddr -> ip_src_addr
    +          ipDstAddr -> ip_dst_addr
    +          ipSrcPort -> ip_src_port
    +          ```
    +          Note that this property may be necessary because Java does not support underscores in named group names. If your property naming convention requires underscores in property names, use this property.
    +          
    +      * `fields` : A JSON list of maps containing a record type to regular expression mapping.
    +      
    +      A complete configuration example would look like:
    +      
    +      ```json
    +      "convertCamelCaseToUnderScore": true, 
    +      "recordTypeRegex": "kernel|syslog",
    +      "messageHeaderRegex": "(<syslogPriority>(<=^&lt;)\\d{1,4}(?=>)).*?(<timestamp>(<=>)[A-Za-z] {3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(<syslogHost>(<=\\s).*?(?=\\s))",
    --- End diff --
    
    1. Yes, messageHeaderRegex is run on every message.
    2. Yes, every message is expected to contain these three fields in this case.
    So messageHeaderRegex acts as a sort of highest common factor (HCF) across all messages.
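
To make the two answers above concrete, the sketch below runs the `messageHeaderRegex` from the parser's Javadoc against the sample message used in the unit test and pulls out the three header fields. The class `HeaderRegexDemo` and the method `extractHeader` are hypothetical names for illustration; only the regular expression and the sample message come from the PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HeaderRegexDemo {

  // Header regex as given in the parser's Javadoc, with Java string escaping.
  static final Pattern HEADER = Pattern.compile(
      "(?<syslogpriority>(?<=^<)\\d{1,4}(?=>)).*?"
          + "(?<timestamp>(?<=>)[A-Za-z]{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?"
          + "(?<syslogHost>(?<=\\s).*?(?=\\s))");

  // Extracts the three named groups; returns an empty map when the header does not match.
  public static Map<String, String> extractHeader(String message) {
    Map<String, String> fields = new LinkedHashMap<>();
    Matcher m = HEADER.matcher(message);
    if (m.find()) {
      for (String name : new String[] {"syslogpriority", "timestamp", "syslogHost"}) {
        fields.put(name, m.group(name).trim());
      }
    }
    return fields;
  }

  public static void main(String[] args) {
    String message = "<38>Jun 20 15:01:17 deviceName sshd[11672]: "
        + "Accepted publickey for prod from 22.22.22.22 port 55555 ssh2";
    System.out.println(extractHeader(message));
  }
}
```

Because the header regex is applied with `find()` rather than `matches()`, it only needs to describe the common prefix, not the whole message.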


---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by nickwallen <gi...@git.apache.org>.
Github user nickwallen commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r237648142
  
    --- Diff: metron-platform/metron-parsers/README.md ---
    @@ -52,6 +52,62 @@ There are two general types types of parsers:
            This is using the default value for `wrapEntityName` if that property is not set.
         * `wrapEntityName` : Sets the name to use when wrapping JSON using `wrapInEntityArray`.  The `jsonpQuery` should reference this name.
         * A field called `timestamp` is expected to exist and, if it does not, then current time is inserted.  
    +  * Regular Expressions Parser
    +      * `recordTypeRegex` : A regular expression to uniquely identify a record type.
    +      * `messageHeaderRegex` : A regular expression used to extract fields from a message part which is common across all the messages.
    +      * `convertCamelCaseToUnderScore` : If this property is set to true, this parser will automatically convert all camel-case property names to underscore-separated names.
    +          For example, the following conversions will happen automatically:
    +
    +          ```
    +          ipSrcAddr -> ip_src_addr
    +          ipDstAddr -> ip_dst_addr
    +          ipSrcPort -> ip_src_port
    +          ```
    +          Note that this property may be necessary because Java does not support underscores in named group names. If your property naming convention requires underscores in property names, use this property.
    +          
    +      * `fields` : A JSON list of maps containing a record type to regular expression mapping.
    +      
    +      A complete configuration example would look like:
    +      
    +      ```json
    +      "convertCamelCaseToUnderScore": true, 
    +      "recordTypeRegex": "kernel|syslog",
    +      "messageHeaderRegex": "(<syslogPriority>(<=^&lt;)\\d{1,4}(?=>)).*?(<timestamp>(<=>)[A-Za-z] {3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(<syslogHost>(<=\\s).*?(?=\\s))",
    +      "fields": [
    +        {
    +          "recordType": "kernel",
    +          "regex": ".*(<eventInfo>(<=\\]|\\w\\:).*?(?=$))"
    +        },
    +        {
    +          "recordType": "syslog",
    +          "regex": ".*(<processid>(<=PID\\s=\\s).*?(?=\\sLine)).*(<filePath>(<=64\\s)\/([A-Za-z0-9_-]+\/)+(?=\\w))        (<fileName>.*?(?=\")).*(<eventInfo>(<=\").*?(?=$))"
    +        }
    +      ]
    +      ```
    +      **Note**: messageHeaderRegex and regex (withing fields) could be specified as lists also e.g.
    +      ```json
    +          "messageHeaderRegex": [
    +          "regular expression 1",
    +          "regular expression 2"
    +          ]
    +      ```
    +      Where **regular expression 1** are valid regular expressions and may have named
    +      groups, which would be extracted into fields. This list will be evaluated in order until a
    +      matching regular expression is found.
    +      
    +      **recordTypeRegex** can be a more advanced regular expression containing named goups. For example
    --- End diff --
    
    Why would I want to use named groups in the `recordTypeRegex`?  I thought the purpose was to return a record type?  If I want to add fields, wouldn't I just add a regex to the `fields` parameter?
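
For context on the question above, the two-step flow described in the quoted README (identify the record type first, then apply that type's entry from `fields`) could be sketched roughly as follows. This is an illustration under stated assumptions, not the parser's actual code: the class, the helpers `parse` and `camelToUnderscore`, the simplified per-type patterns, and the sample messages are hypothetical. Only `"kernel|syslog"` is taken from the quoted configuration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RecordTypeDispatchSketch {
    // Mirrors "recordTypeRegex": "kernel|syslog" from the quoted README.
    static final Pattern RECORD_TYPE = Pattern.compile("kernel|syslog");

    // Hypothetical, simplified stand-ins for the per-record-type "fields" entries.
    static final Map<String, Pattern> FIELD_REGEX = new HashMap<>();
    static final Map<String, List<String>> GROUP_NAMES = new HashMap<>();
    static {
        FIELD_REGEX.put("kernel", Pattern.compile("kernel:\\s(?<eventInfo>.*)$"));
        GROUP_NAMES.put("kernel", Arrays.asList("eventInfo"));
        FIELD_REGEX.put("syslog",
            Pattern.compile("PID\\s=\\s(?<processId>\\d+)\\sLine\\s=\\s\\d+\\s(?<eventInfo>.*)$"));
        GROUP_NAMES.put("syslog", Arrays.asList("processId", "eventInfo"));
    }

    // Java group names cannot contain underscores, so names are converted afterwards.
    static String camelToUnderscore(String name) {
        return name.replaceAll("([a-z0-9])([A-Z])", "$1_$2").toLowerCase();
    }

    // Step 1: find the record type. Step 2: apply that type's regex and collect its groups.
    static Map<String, String> parse(String message) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher type = RECORD_TYPE.matcher(message);
        if (!type.find()) {
            return fields; // unknown record type: nothing extracted
        }
        String recordType = type.group();
        Matcher m = FIELD_REGEX.get(recordType).matcher(message);
        if (m.find()) {
            for (String group : GROUP_NAMES.get(recordType)) {
                fields.put(camelToUnderscore(group), m.group(group));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parse("<38>Jun 20 15:01:17 deviceName kernel: usb 1-1 disconnected"));
        System.out.println(parse("<13>Jun 20 15:01:18 deviceName syslog: PID = 42 Line = 7 disk full"));
    }
}
```

In this sketch the record type regex only selects which entry of `fields` to use; it extracts nothing itself, which is the simpler reading the question points at.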


---

[GitHub] metron issue #1245: METRON-1795: Initial Commit for Regular Expressions Pars...

Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on the issue:

    https://github.com/apache/metron/pull/1245
  
    Given that we have the 5424 parser, and the 3164 parser already in PR, with chaining, perhaps this parser would be cleaner and easier to configure and understand if it were repositioned (with respect to syslog) as a chained parser that parses the MSG portion of either upstream parser.
    
    Then your examples could be a bit simpler.



---

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

Posted by jagdeepsingh2 <gi...@git.apache.org>.
Github user jagdeepsingh2 commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r239664781
  
    --- Diff: metron-platform/metron-parsers/README.md ---
    @@ -52,6 +52,62 @@ There are two general types types of parsers:
            This is using the default value for `wrapEntityName` if that property is not set.
         * `wrapEntityName` : Sets the name to use when wrapping JSON using `wrapInEntityArray`.  The `jsonpQuery` should reference this name.
         * A field called `timestamp` is expected to exist and, if it does not, then current time is inserted.  
    +  * Regular Expressions Parser
    +      * `recordTypeRegex` : A regular expression to uniquely identify a record type.
    +      * `messageHeaderRegex` : A regular expression used to extract fields from a message part which is common across all the messages.
    +      * `convertCamelCaseToUnderScore` : If this property is set to true, this parser will automatically convert all the camel case property names to underscore seperated. 
    +          For example, following convertions will automatically happen:
    +
    +          ```
    +          ipSrcAddr -> ip_src_addr
    +          ipDstAddr -> ip_dst_addr
    +          ipSrcPort -> ip_src_port
    +          ```
    +          Note this property may be necessary, because java does not support underscores in the named group names. So in case your property naming conventions requires underscores in property names, use this property.
    +          
    +      * `fields` : A json list of maps contaning a record type to regular expression mapping.
    +      
    +      A complete configuration example would look like:
    +      
    +      ```json
    +      "convertCamelCaseToUnderScore": true, 
    +      "recordTypeRegex": "kernel|syslog",
    +      "messageHeaderRegex": "(<syslogPriority>(<=^&lt;)\\d{1,4}(?=>)).*?(<timestamp>(<=>)[A-Za-z] {3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(<syslogHost>(<=\\s).*?(?=\\s))",
    +      "fields": [
    +        {
    +          "recordType": "kernel",
    +          "regex": ".*(<eventInfo>(<=\\]|\\w\\:).*?(?=$))"
    +        },
    +        {
    +          "recordType": "syslog",
    +          "regex": ".*(<processid>(<=PID\\s=\\s).*?(?=\\sLine)).*(<filePath>(<=64\\s)\/([A-Za-z0-9_-]+\/)+(?=\\w))        (<fileName>.*?(?=\")).*(<eventInfo>(<=\").*?(?=$))"
    +        }
    +      ]
    +      ```
    +      **Note**: messageHeaderRegex and regex (withing fields) could be specified as lists also e.g.
    +      ```json
    +          "messageHeaderRegex": [
    +          "regular expression 1",
    +          "regular expression 2"
    +          ]
    +      ```
    +      Where **regular expression 1** are valid regular expressions and may have named
    +      groups, which would be extracted into fields. This list will be evaluated in order until a
    +      matching regular expression is found.
    +      
    +      **recordTypeRegex** can be a more advanced regular expression containing named goups. For example
    --- End diff --
    
    Thanks. I will update the documentation.


---