You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@metron.apache.org by cestella <gi...@git.apache.org> on 2018/04/11 19:55:23 UTC

[GitHub] metron pull request #990: METRON-1520: Add caching for stellar field transfo...

GitHub user cestella opened a pull request:

    https://github.com/apache/metron/pull/990

    METRON-1520: Add caching for stellar field transformations

    ## Contributor Comments
    Given how important caching is in the enrichment topology, we should have caching for stellar field transformations in the parsers as well.
    
    Beyond a smoketest ensuring data flows into the indices, you may test this by creating a new parser with a field transformation and ensuring stellar field transformations function as expected.
    
    * Create a new parser by editing `$METRON_HOME/config/zookeeper/parsers/dummy.json` to have the following content:
    ```
    {
      "parserClassName":"org.apache.metron.parsers.json.JSONMapParser",
      "sensorTopic":"dummy",
      "fieldTransformations" : [
             { "transformation" : "STELLAR"
             ,"output" : [ "new_field"]
             ,"config" : {
               "new_field" : "JOIN( [ TO_UPPER(source.type), field ], ',' )"
                         }
             }
      ]
    }
    ```
    * Create the dummy kafka topic via `/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --zookeeper node1:2181 --create --topic dummy --partitions 1 --replication-factor 1`
    * Push the zookeeper configs via `$METRON_HOME/bin/zk_load_configs.sh --mode PUSH -i $METRON_HOME/config/zookeeper -z node1:2181`
    * Start the parser via `$METRON_HOME/bin/start_parser_topology.sh -k node1:6667 -z node1:2181 -s dummy`
    * Create some dummy data by creating a file at `~/dummy.dat` with the following:
    ```
    { "field" : "f1", "source.type" : "dummy" }
    { "field" : "f2", "source.type" : "dummy" }
    { "field" : "f3", "source.type" : "dummy" }
    { "field" : "f4", "source.type" : "dummy" }
    { "field" : "f5", "source.type" : "dummy" }
    { "field" : "f6", "source.type" : "dummy" }
    { "field" : "f7", "source.type" : "dummy" }
    ```
    * Send the dummy data in via `cat ~/dummy.dat | /usr/hdp/current/kafka-broker/bin/kafka-console-producer.sh --broker-list node1:6667 --topic dummy`
    * Ensure the data written to the indices has a field `new_field` that looks like `DUMMY,${field}`
    
    ## Pull Request Checklist
    
    Thank you for submitting a contribution to Apache Metron.  
    Please refer to our [Development Guidelines](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61332235) for the complete guide to follow for contributions.  
    Please refer also to our [Build Verification Guidelines](https://cwiki.apache.org/confluence/display/METRON/Verifying+Builds?show-miniview) for complete smoke testing guides.  
    
    
    In order to streamline the review of the contribution we ask you follow these guidelines and ask you to double check the following:
    
    ### For all changes:
    - [x] Is there a JIRA ticket associated with this PR? If not one needs to be created at [Metron Jira](https://issues.apache.org/jira/browse/METRON/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
    - [x] Does your PR title start with METRON-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
    - [x] Has your PR been rebased against the latest commit within the target branch (typically master)?
    
    
    ### For code changes:
    - [x] Have you included steps to reproduce the behavior or problem that is being changed or addressed?
    - [x] Have you included steps or a guide to how the change may be verified and tested manually?
    - [x] Have you ensured that the full suite of tests and checks have been executed in the root metron folder via:
      ```
      mvn -q clean integration-test install && dev-utilities/build-utils/verify_licenses.sh 
      ```
    
    - [x] Have you written or updated unit tests and or integration tests to verify your changes?
    - [x] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
    - [x] Have you verified the basic functionality of the build by building and running locally with Vagrant full-dev environment or the equivalent?
    
    ### For documentation related changes:
    - [x] Have you ensured that format looks appropriate for the output in which it is rendered by building and verifying the site-book? If not then run the following commands and the verify changes via `site-book/target/site/index.html`:
    
      ```
      cd site-book
      mvn site
      ```
    
    #### Note:
    Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.
    It is also recommended that [travis-ci](https://travis-ci.org) is set up for your personal repository such that your branches are built there before submitting a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cestella/incubator-metron parsercache

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/metron/pull/990.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #990
    
----
commit 93cc3bc80a01ec2b82a88d2812c17de09d122c6c
Author: cstella <ce...@...>
Date:   2018-04-09T21:45:27Z

    Updating stellar transformations to use caching.

commit c38d5a2a91f9eebf428b40db3c3165152f27a786
Author: cstella <ce...@...>
Date:   2018-04-10T13:46:19Z

    Updating test.

commit 1502e739a833dbe5de76e8a36b19727f846ebbd7
Author: cstella <ce...@...>
Date:   2018-04-10T19:30:41Z

    Updating readme.

commit f0a029d80b95f8d34141623f6ffcf83bdb8bb181
Author: cstella <ce...@...>
Date:   2018-04-10T20:17:05Z

    updating test

commit 14487f0ef3d1aa9690e74a1a04fd8a49eb9a89d6
Author: cstella <ce...@...>
Date:   2018-04-11T14:59:24Z

    Merge branch 'master' into parsercache

commit 936c501e758f06e8bfd0d390cf6c6f755f1f56e8
Author: cstella <ce...@...>
Date:   2018-04-11T19:49:21Z

    Fixing bug

----


---

[GitHub] metron pull request #990: METRON-1520: Add caching for stellar field transfo...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on a diff in the pull request:

    https://github.com/apache/metron/pull/990#discussion_r181853456
  
    --- Diff: metron-platform/Performance-tuning-guide.md ---
    @@ -43,6 +43,16 @@ parallelism will leave you with idle consumers since Kafka limits the max number
     important because Kafka has certain ordering guarantees for message delivery per partition that would not be possible if more than
     one consumer in a given consumer group were able to read from that partition.
     
    +## Parser Tuning Suggestions
    +
    +If you are using stellar field transformations in your parsers, by default, stellar expressions
    +are not cached.  Turning on caching via setting the `cacheConfig` [property](metron-parsers#parser_configuration)
    +in your parser configuration can have performance impact if your stellar expressions are
    +complex (e.g. `ENRICHMENT_GET` calls or other high latency calls).  The tradeoff, though, is
    +that non-deterministic stellar expressions will yield cached results which may be wrong,
    +for the period of time in which the data exists in the cache (the max time in the cache is
    +configurable).
    +
    --- End diff --
    
    Thanks for the feedback.  I've accepted your rewordings.


---

[GitHub] metron issue #990: METRON-1520: Add caching for stellar field transformation...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on the issue:

    https://github.com/apache/metron/pull/990
  
    I reran this in full-dev from scratch and it checks out.  Let me know if there are any more comments @nickwallen and @ottobackwards 


---

[GitHub] metron pull request #990: METRON-1520: Add caching for stellar field transfo...

Posted by nickwallen <gi...@git.apache.org>.
Github user nickwallen commented on a diff in the pull request:

    https://github.com/apache/metron/pull/990#discussion_r183103855
  
    --- Diff: metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/dsl/Context.java ---
    @@ -36,6 +36,7 @@
         , STELLAR_CONFIG
         , CONSOLE
         , SHELL_VARIABLES
    +    , CACHE
    --- End diff --
    
    If you're so inclined, it would be useful to add some javadoc for `CACHE`.  I know we haven't done it on the others, but we should.


---

[GitHub] metron issue #990: METRON-1520: Add caching for stellar field transformation...

Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on the issue:

    https://github.com/apache/metron/pull/990
  
    Does this assume, that if you have multiple transformations that all of them should be cached?
    Does it assume that every stellar function returns the same value with the same input?



---

[GitHub] metron issue #990: METRON-1520: Add caching for stellar field transformation...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on the issue:

    https://github.com/apache/metron/pull/990
  
    Ok, @nickwallen I think I reacted to your comments.  I'm going to spin this up one more time since it's churned a bit since I originally submitted.  Let me know if I missed anything.


---

[GitHub] metron pull request #990: METRON-1520: Add caching for stellar field transfo...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on a diff in the pull request:

    https://github.com/apache/metron/pull/990#discussion_r180885231
  
    --- Diff: metron-platform/metron-parsers/src/main/java/org/apache/metron/parsers/json/JSONMapParser.java ---
    @@ -99,6 +98,7 @@ public void configure(Map<String, Object> config) {
         String strategyStr = (String) config.getOrDefault(MAP_STRATEGY_CONFIG, MapStrategy.DROP.name());
         mapStrategy = MapStrategy.valueOf(strategyStr);
         if (config.containsKey(JSONP_QUERY)) {
    --- End diff --
    
    Sorry, I wasn't finished with the PRs submitted.  This is not strictly necessary except that I ran into the bug as part of testing this PR.  I submitted a separate PR for this fix in #991 and will remove this.


---

[GitHub] metron issue #990: METRON-1520: Add caching for stellar field transformation...

Posted by nickwallen <gi...@git.apache.org>.
Github user nickwallen commented on the issue:

    https://github.com/apache/metron/pull/990
  
    +1 Thanks.  Looks good.


---

[GitHub] metron pull request #990: METRON-1520: Add caching for stellar field transfo...

Posted by nickwallen <gi...@git.apache.org>.
Github user nickwallen commented on a diff in the pull request:

    https://github.com/apache/metron/pull/990#discussion_r183109844
  
    --- Diff: metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/dsl/Context.java ---
    @@ -36,6 +36,7 @@
         , STELLAR_CONFIG
         , CONSOLE
         , SHELL_VARIABLES
    +    , CACHE
    --- End diff --
    
    I can remember from experience trying to track down what some of these are used for in the past.  A little doc here would have been very helpful.


---

[GitHub] metron pull request #990: METRON-1520: Add caching for stellar field transfo...

Posted by nickwallen <gi...@git.apache.org>.
Github user nickwallen commented on a diff in the pull request:

    https://github.com/apache/metron/pull/990#discussion_r183109303
  
    --- Diff: metron-platform/metron-parsers/README.md ---
    @@ -174,6 +174,19 @@ then it is assumed to be a regex and will match any topic matching the pattern (
     * `spoutConfig` : A map representing a custom spout config (this is a map). This can be overridden on the command line.
     * `securityProtocol` : The security protocol to use for reading from kafka (this is a string).  This can be overridden on the command line and also specified in the spout config via the `security.protocol` key.  If both are specified, then they are merged and the CLI will take precedence.
     * `stormConfig` : The storm config to use (this is a map).  This can be overridden on the command line.  If both are specified, they are merged with CLI properties taking precedence.
    +* `cacheConfig` : Cache config for stellar field transformations.   This configures a least frequently used cache.  This is a map with the following keys.  If unconfigured, then no cache will be used.
    --- End diff --
    
    Is "unconfigured" == "default" behavior? I'd like to see mentioned specifically that the default behavior is no cache (at least that's what I am assuming).


---

[GitHub] metron pull request #990: METRON-1520: Add caching for stellar field transfo...

Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on a diff in the pull request:

    https://github.com/apache/metron/pull/990#discussion_r181188219
  
    --- Diff: metron-platform/Performance-tuning-guide.md ---
    @@ -43,6 +43,16 @@ parallelism will leave you with idle consumers since Kafka limits the max number
     important because Kafka has certain ordering guarantees for message delivery per partition that would not be possible if more than
     one consumer in a given consumer group were able to read from that partition.
     
    +## Parser Tuning Suggestions
    +
    +If you are using stellar field transformations in your parsers, by default, stellar expressions
    +are not cached.  Turning on caching via setting the `cacheConfig` [property](metron-parsers#parser_configuration)
    +in your parser configuration can have performance impact if your stellar expressions are
    +complex (e.g. `ENRICHMENT_GET` calls or other high latency calls).  The tradeoff, though, is
    +that non-deterministic stellar expressions will yield cached results which may be wrong,
    +for the period of time in which the data exists in the cache (the max time in the cache is
    +configurable).
    +
    --- End diff --
    
    I think something like this is more straightforward:
    
    "Sensors that use stellar field transformations by see a performance boost by turning on caching via setting the `cacheConfig` [property](metron-parsers#parser_configuration).
    This is beneficial if your transformations:
    - Are complex (e.g. `ENRICHEMENT_GET` calls or other high latency calls)
    - All Yield the same results for the same inputs ( caching is either off or applied to *all* transformations)
    
    If *any* of your transformations are non-deterministic, caching should not be used as it will result in the likelihood of incorrect results being returned."


---

[GitHub] metron issue #990: METRON-1520: Add caching for stellar field transformation...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on the issue:

    https://github.com/apache/metron/pull/990
  
    Ok, turned off by default and added some more detail to the performance tuning guide.


---

[GitHub] metron pull request #990: METRON-1520: Add caching for stellar field transfo...

Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on a diff in the pull request:

    https://github.com/apache/metron/pull/990#discussion_r181185837
  
    --- Diff: metron-platform/Performance-tuning-guide.md ---
    @@ -43,6 +43,16 @@ parallelism will leave you with idle consumers since Kafka limits the max number
     important because Kafka has certain ordering guarantees for message delivery per partition that would not be possible if more than
     one consumer in a given consumer group were able to read from that partition.
     
    --- End diff --
    
    s/Parser/Sensor/ 
    ?


---

[GitHub] metron issue #990: METRON-1520: Add caching for stellar field transformation...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on the issue:

    https://github.com/apache/metron/pull/990
  
    Yes, the caching semantics are the same as the enrichment topology; namely that we cache all stellar expressions.  
    
    You can, however, turn caching off in the parsers in the case that you have a nondeterministic stellar expression.  I'd like to be able to augment stellar function annotations to note expressions which are non-deterministic to avoid caching in those cases.
    
    Do you think we should turn it off by default?


---

[GitHub] metron pull request #990: METRON-1520: Add caching for stellar field transfo...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on a diff in the pull request:

    https://github.com/apache/metron/pull/990#discussion_r181853515
  
    --- Diff: metron-platform/Performance-tuning-guide.md ---
    @@ -43,6 +43,16 @@ parallelism will leave you with idle consumers since Kafka limits the max number
     important because Kafka has certain ordering guarantees for message delivery per partition that would not be possible if more than
     one consumer in a given consumer group were able to read from that partition.
     
    --- End diff --
    
    Sure


---

[GitHub] metron issue #990: METRON-1520: Add caching for stellar field transformation...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on the issue:

    https://github.com/apache/metron/pull/990
  
    @ottobackwards Sounds good; I'll do that.


---

[GitHub] metron issue #990: METRON-1520: Add caching for stellar field transformation...

Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on the issue:

    https://github.com/apache/metron/pull/990
  
    I think that the assumptions that it makes about the ability to cache the output, and apply to every transformation or every function may be a bit optimistic.
    
    I would have it off by default but put it in the tuning guide, with something written up about the circumstance where it will work and the circumstances where it will not.
    
    A setting the marks the return as cacheable  in the configuration would be the better option, on a per transformation basis.
    



---

[GitHub] metron pull request #990: METRON-1520: Add caching for stellar field transfo...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/metron/pull/990


---

[GitHub] metron pull request #990: METRON-1520: Add caching for stellar field transfo...

Posted by nickwallen <gi...@git.apache.org>.
Github user nickwallen commented on a diff in the pull request:

    https://github.com/apache/metron/pull/990#discussion_r183108791
  
    --- Diff: metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/CachingStellarProcessor.java ---
    @@ -0,0 +1,140 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.metron.stellar.common;
    +
    +import com.github.benmanes.caffeine.cache.Cache;
    +import com.github.benmanes.caffeine.cache.Caffeine;
    +import org.apache.metron.stellar.common.utils.ConversionUtils;
    +import org.apache.metron.stellar.dsl.Context;
    +import org.apache.metron.stellar.dsl.VariableResolver;
    +import org.apache.metron.stellar.dsl.functions.resolver.FunctionResolver;
    +
    +import java.util.HashMap;
    +import java.util.Map;
    +import java.util.Optional;
    +import java.util.Set;
    +import java.util.concurrent.TimeUnit;
    +
    +/**
    + * The Caching Stellar Processor is a stellar processor that optionally fronts stellar with an expression-by-expression
    + * LFU cache.
    + */
    +public class CachingStellarProcessor extends StellarProcessor {
    --- End diff --
    
    I'd like to see tests that cover both a "with cache" and a "no cache" use case.


---

[GitHub] metron pull request #990: METRON-1520: Add caching for stellar field transfo...

Posted by nickwallen <gi...@git.apache.org>.
Github user nickwallen commented on a diff in the pull request:

    https://github.com/apache/metron/pull/990#discussion_r183103619
  
    --- Diff: metron-stellar/stellar-common/src/main/java/org/apache/metron/stellar/common/CachingStellarProcessor.java ---
    @@ -0,0 +1,140 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.metron.stellar.common;
    +
    +import com.github.benmanes.caffeine.cache.Cache;
    +import com.github.benmanes.caffeine.cache.Caffeine;
    +import org.apache.metron.stellar.common.utils.ConversionUtils;
    +import org.apache.metron.stellar.dsl.Context;
    +import org.apache.metron.stellar.dsl.VariableResolver;
    +import org.apache.metron.stellar.dsl.functions.resolver.FunctionResolver;
    +
    +import java.util.HashMap;
    +import java.util.Map;
    +import java.util.Optional;
    +import java.util.Set;
    +import java.util.concurrent.TimeUnit;
    +
    +/**
    + * The Caching Stellar Processor is a stellar processor that optionally fronts stellar with an expression-by-expression
    + * LFU cache.
    + */
    +public class CachingStellarProcessor extends StellarProcessor {
    --- End diff --
    
    Where are the unit tests for the `CachingStellarProcessor`?


---

[GitHub] metron issue #990: METRON-1520: Add caching for stellar field transformation...

Posted by cestella <gi...@git.apache.org>.
Github user cestella commented on the issue:

    https://github.com/apache/metron/pull/990
  
    This PR provides an expression cache for stellar when using the stellar field transformations in parsers.  Specifically, we found that without caching we were substantially stymied when performance tuning enrichments.  Thus, other places where stellar is in the critical path might derive benefit from similar caching in Enrichments.
    A couple of points about the implementation:
    * We use the same underlying caching strategy as in enrichments (caffeine-based least frequently used cache).
    * This caching is implemented in stellar core as an alternative to the StellarProcessor.  This way it is reusable elsewhere that stellar is used eventually.
    * This caching is slightly different than the caching in enrichments in that:
      * Enrichment caching is done regardless of adapter (stellar enrichments uses the same cache as the hbase enrichments)
      * Enrichment caching for stellar is at the group level (e.g. the key is the group of stellar expressions and the input variables used in the group adn the value is the set of results for that group given the input).  This implementation is at the individual expression level.


---

[GitHub] metron pull request #990: METRON-1520: Add caching for stellar field transfo...

Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on a diff in the pull request:

    https://github.com/apache/metron/pull/990#discussion_r180880448
  
    --- Diff: metron-platform/metron-parsers/src/main/java/org/apache/metron/parsers/json/JSONMapParser.java ---
    @@ -99,6 +98,7 @@ public void configure(Map<String, Object> config) {
         String strategyStr = (String) config.getOrDefault(MAP_STRATEGY_CONFIG, MapStrategy.DROP.name());
         mapStrategy = MapStrategy.valueOf(strategyStr);
         if (config.containsKey(JSONP_QUERY)) {
    --- End diff --
    
    Why was this necessary? 


---