You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@metron.apache.org by nickwallen <gi...@git.apache.org> on 2018/11/16 20:54:24 UTC

[GitHub] metron pull request #1269: METRON-1677 UUIDv4 GUID is not Lucene friendly

GitHub user nickwallen opened a pull request:

    https://github.com/apache/metron/pull/1269

    METRON-1677 UUIDv4 GUID is not Lucene friendly

    With this change, when documents are written to Elasticsearch the document ID is no longer set as the Metron GUID, but instead left unset so that Elasticsearch can auto-generate it.  Doing this improves write performance into Elasticsearch.
    
    This will also be the case for any Lucene based Indexer, including Solr.  This work only covers Elasticsearch, but the same should be done for Solr as part of a separate effort.
    
    While the default table view looks the same, in the following screenshot I customized the table to show both the document ID and the GUID. 
    
    This change is dependent on the following open pull requests.
    - [ ] #1247
    - [ ] #1254 
    - [ ] #1259  
    
    ## Changes
    
    * The `ElasticsearchRetrieveLatestDao` was updated since the GUID is no longer the document ID.  This instead does a terms query on the GUID field instead of an ID query.
    
    * The `Document` class now contains an optional documentID field.  If the `Document` is retrieved from one of the DAOs this field will be populated.  When creating a new document, this field will be empty.
    
    * Many of the integrations tests had to be updated because the GUID and document ID are now different.
    
    * The Alert UI was updated so that it visually looks the same. By default, the Metron GUID is still shown as one of the first columns in the table.  
    
        * The table is actually showing the document's GUID field instead of the document ID as it was before.  The ID field remains, which contains the document ID generated by Elasticsearch.  The user can choose to add this to the table, if they like.
    
    
    ## Testing
    
    1. Spin-up Full Dev.
    
    1. Open up the Alerts UI and perform the following basic actions.
        * Search for alerts
        * Escalate an alert
        * Comment on an alert
        * Delete a comment from an alert
        * Create a meta-alert
        * Escalate a meta-alert
    
    1. Click on the configure wheel and add the 'id' field to the table view.  This will now display both the GUID and document ID in the table. They of course will be different.
    
    1.  Click on the 'guid' field in any row to filter the search results by the guid.
        ![screen shot 2018-11-14 at 2 44 21 pm](https://user-images.githubusercontent.com/2475409/48646597-53433300-e9b7-11e8-870b-f061af8cca47.png)
    
    1. Click on the 'id' field to filter the search results by the document ID.
        ![screen shot 2018-11-14 at 2 44 08 pm](https://user-images.githubusercontent.com/2475409/48646566-3a3a8200-e9b7-11e8-8370-f596346d4a62.png)
    
    1. Group by some fields to drill into the data.  In the tree view, click on the 'guid' column and ensure the data sorts correctly.  Do the same for the 'id' column that was added.
        ![screen shot 2018-11-14 at 2 46 43 pm](https://user-images.githubusercontent.com/2475409/48646519-1414e200-e9b7-11e8-96a2-50c568d909b2.png)
    
    ## Pull Request Checklist
    
    - [ ] Is there a JIRA ticket associated with this PR? If not one needs to be created at [Metron Jira](https://issues.apache.org/jira/browse/METRON/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
    - [ ] Does your PR title start with METRON-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
    - [ ] Has your PR been rebased against the latest commit within the target branch (typically master)?
    - [ ] Have you included steps to reproduce the behavior or problem that is being changed or addressed?
    - [ ] Have you included steps or a guide to how the change may be verified and tested manually?
    - [ ] Have you ensured that the full suite of tests and checks have been executed in the root metron folder via:
    - [ ] Have you written or updated unit tests and or integration tests to verify your changes?
    - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
    - [ ] Have you verified the basic functionality of the build by building and running locally with Vagrant full-dev environment or the equivalent?


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nickwallen/metron METRON-1677

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/metron/pull/1269.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1269
    
----
commit a7c7dc287b4f9c99c6780b934a0b6f433a03aa04
Author: cstella <ce...@...>
Date:   2018-10-09T00:06:52Z

    Casey Stella - elasticsearch rest client migration base work

commit 10410ea9718a2a1b1d287fb4f22a6c98efb1fdaa
Author: Michael Miklavcic <mi...@...>
Date:   2018-10-09T00:07:22Z

    Update shade plugin version

commit a33a16872118175ed35729df6ddde2959e49ae2f
Author: Michael Miklavcic <mi...@...>
Date:   2018-10-09T15:56:08Z

    Fix es update dao test

commit 52c3c96d7657205e65d3bd0c0e35923a851911da
Author: Michael Miklavcic <mi...@...>
Date:   2018-10-09T21:26:16Z

    Merge with master. Fix es search integration tests

commit 4742832869f9512e11cab4a32109c15a2b17a92e
Author: Michael Miklavcic <mi...@...>
Date:   2018-10-11T18:45:06Z

    Merge branch 'master' into es-rest-client

commit 43809968320e586fd70411776140f3aa13a60195
Author: Michael Miklavcic <mi...@...>
Date:   2018-10-11T23:59:25Z

    Get shade plugin working with the new ES client and the ClassIndexTransformer Shade plugin transformer.

commit af03f6f036e96c742db39733bf8ebc2cbf229129
Author: Michael Miklavcic <mi...@...>
Date:   2018-10-19T02:56:52Z

    Introduce config classes for managing ES client configuration. Translate properties for new client.

commit 1c8eac2a173e1e24afc9e11163e456a9d90c93db
Author: Michael Miklavcic <mi...@...>
Date:   2018-10-23T22:21:32Z

    Resolve merge conflicts with master

commit 1a47ded7a36f9d391227973c7da2921305373283
Author: Michael Miklavcic <mi...@...>
Date:   2018-10-23T22:23:04Z

    Remove extra deps in metron-elasticsearch around log4j.

commit 54870d68e6f43e859367879bff7537f97c11d0bd
Author: Michael Miklavcic <mi...@...>
Date:   2018-10-24T00:33:25Z

    Fixes for dep version issues.

commit 554de87ae160aae1ae85afb3dbb01a220a9d6838
Author: Nick Allen <ni...@...>
Date:   2018-10-25T18:54:41Z

    METRON-1845 Correct Test Data Load in Elasticsearch Integration Tests

commit 3b95d1ece278bf517642a0b0a8eee5a7ea021b9e
Author: Nick Allen <ni...@...>
Date:   2018-10-25T19:06:39Z

    Removed dead function

commit 7d9ee2563fdcc0e6b0a3d5106ea568577f21d303
Author: Nick Allen <ni...@...>
Date:   2018-10-26T18:55:03Z

    Part way through Solr changes that I am going to roll-back

commit e6993867ebd00a5c09ef1a4da772664a11d1f212
Author: Nick Allen <ni...@...>
Date:   2018-10-26T18:55:15Z

    Revert "Part way through Solr changes that I am going to roll-back"
    
    This reverts commit 7d9ee2563fdcc0e6b0a3d5106ea568577f21d303.

commit dd024b818b238e24430385914cef6ea12873bfc6
Author: Michael Miklavcic <mi...@...>
Date:   2018-10-26T19:21:37Z

    Merge branch 'master' into es-rest-client

commit 4a7a8370e61e427ac790d2e21bc43a55ac45636b
Author: Nick Allen <ni...@...>
Date:   2018-10-26T19:29:09Z

    Resolved integration test differences with Solr

commit ebb7ef57c9fb20fb957b86516734fed1c18cda84
Author: Michael Miklavcic <mi...@...>
Date:   2018-10-26T21:10:54Z

    Start addressing pr feedback.

commit 35811175824415c1e37418bb2efb347e6f7a801a
Author: Nick Allen <ni...@...>
Date:   2018-10-29T16:10:05Z

    METRON-1849 Added integration test for UpdateDao.update() and UpdateDao.batchUpdate()

commit b7ac957164418d733e6c8fb8aa896186bbc5176f
Author: Michael Miklavcic <mi...@...>
Date:   2018-10-30T23:30:08Z

    Addressing review feedback

commit c0faab2aaf04475ffc9e5558018b538fb62a3c7b
Author: Nick Allen <ni...@...>
Date:   2018-10-31T15:43:47Z

    METRON-1849 Progress on shared write logic

commit 021810a0edb116c3b9bcf12cb8a77bf4197f24fb
Author: Nick Allen <ni...@...>
Date:   2018-10-31T20:51:32Z

    METRON-1849 Progress on tests, still integration tests to fix-up

commit 056497034f7d0b5ebba7d8a66f1b6ff66ebfd119
Author: Nick Allen <ni...@...>
Date:   2018-11-01T13:57:06Z

    METRON-1845 Added license header

commit f1c8ce4eb5d404f2a1d99e51e3111586221e13a0
Author: Nick Allen <ni...@...>
Date:   2018-11-01T21:36:56Z

    Most tests working except for a few

commit 3e5c12169e8cea5da5349cf699b5519cdc82bca1
Author: Michael Miklavcic <mi...@...>
Date:   2018-11-02T04:56:55Z

    Refactor clientfactory. Updates to metron-common, metron-elasticsearch, and kerberos/security documentation

commit d4b02ce80a6c72cdf4595f9a47a4955f59b0a5e4
Author: Nick Allen <ni...@...>
Date:   2018-11-02T20:12:52Z

    Fixed all tests

commit fb5610b0aba13cdef7b6440e32e2c838bee3babb
Author: Nick Allen <ni...@...>
Date:   2018-11-02T20:22:09Z

    Merge remote-tracking branch 'mike/es-rest-client' into METRON-1849

commit c3b28363e87f75e97bdc9ab14d30156136080af7
Author: Nick Allen <ni...@...>
Date:   2018-11-02T20:27:42Z

    Fixed how client is created

commit 4539f43c43334696e33a861e7cbbb76a83c217b6
Author: Nick Allen <ni...@...>
Date:   2018-11-02T21:55:12Z

    Improved docs

commit 32467b375f6e6fa3ac268fc0e5f4a6a447a48f76
Author: Nick Allen <ni...@...>
Date:   2018-11-05T16:59:26Z

    Small cleanup

commit 28f5a9700cfc0d80cc6159e4614e23fe2b530c1e
Author: Nick Allen <ni...@...>
Date:   2018-11-05T23:39:43Z

    Removed the IndexedDocument class to simplify things

----


---

[GitHub] metron pull request #1269: METRON-1879 Allow Elasticsearch to Auto-Generate ...

Posted by nickwallen <gi...@git.apache.org>.
Github user nickwallen commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1269#discussion_r237852978
  
    --- Diff: metron-platform/metron-elasticsearch/src/main/java/org/apache/metron/elasticsearch/writer/ElasticsearchWriter.java ---
    @@ -56,90 +58,107 @@
        */
       private transient ElasticsearchClient client;
     
    +  /**
    +   * Responsible for writing documents.
    +   *
    +   * <p>Uses a {@link TupleBasedDocument} to maintain the relationship between
    +   * a {@link Tuple} and the document created from the contents of that tuple. If
    +   * a document cannot be written, the associated tuple needs to be failed.
    +   */
    +  private transient BulkDocumentWriter<TupleBasedDocument> documentWriter;
    +
       /**
        * A simple data formatter used to build the appropriate Elasticsearch index name.
        */
       private SimpleDateFormat dateFormat;
     
    -
       @Override
       public void init(Map stormConf, TopologyContext topologyContext, WriterConfiguration configurations) {
    -
         Map<String, Object> globalConfiguration = configurations.getGlobalConfig();
    -    client = ElasticsearchClientFactory.create(globalConfiguration);
         dateFormat = ElasticsearchUtils.getIndexFormat(globalConfiguration);
    +
    +    // only create the document writer, if one does not already exist. useful for testing.
    +    if(documentWriter == null) {
    +      client = ElasticsearchClientFactory.create(globalConfiguration);
    +      documentWriter = new ElasticsearchBulkDocumentWriter<>(client);
    +    }
       }
     
       @Override
    -  public BulkWriterResponse write(String sensorType, WriterConfiguration configurations, Iterable<Tuple> tuples, List<JSONObject> messages) throws Exception {
    +  public BulkWriterResponse write(String sensorType,
    +                                  WriterConfiguration configurations,
    +                                  Iterable<Tuple> tuplesIter,
    +                                  List<JSONObject> messages) {
     
         // fetch the field name converter for this sensor type
         FieldNameConverter fieldNameConverter = FieldNameConverters.create(sensorType, configurations);
    +    String indexPostfix = dateFormat.format(new Date());
    +    String indexName = ElasticsearchUtils.getIndexName(sensorType, indexPostfix, configurations);
    +
    +    // the number of tuples must match the number of messages
    +    List<Tuple> tuples = Lists.newArrayList(tuplesIter);
    +    int batchSize = tuples.size();
    +    if(messages.size() != batchSize) {
    +      throw new IllegalStateException(format("Expect same number of tuples and messages; |tuples|=%d, |messages|=%d",
    +              tuples.size(), messages.size()));
    +    }
     
    -    final String indexPostfix = dateFormat.format(new Date());
    -    BulkRequest bulkRequest = new BulkRequest();
    -    for(JSONObject message: messages) {
    +    // create a document from each message
    +    List<TupleBasedDocument> documents = new ArrayList<>();
    +    for(int i=0; i<tuples.size(); i++) {
    +      JSONObject message = messages.get(i);
    +      Tuple tuple = tuples.get(i);
     
    -      JSONObject esDoc = new JSONObject();
    +      // transform the message fields to the source fields of the indexed document
    +      JSONObject source = new JSONObject();
           for(Object k : message.keySet()){
    -        copyField(k.toString(), message, esDoc, fieldNameConverter);
    +        copyField(k.toString(), message, source, fieldNameConverter);
           }
     
    -      String indexName = ElasticsearchUtils.getIndexName(sensorType, indexPostfix, configurations);
    -      IndexRequest indexRequest = new IndexRequest(indexName, sensorType + "_doc");
    -      indexRequest.source(esDoc.toJSONString());
    -      String guid = (String)esDoc.get(Constants.GUID);
    -      if(guid != null) {
    -        indexRequest.id(guid);
    +      // define the document id
    +      String guid = String.class.cast(source.get(Constants.GUID));
    +      if(guid == null) {
    +        LOG.info("Missing '{}' field; document ID will be auto-generated.", Constants.GUID);
           }
     
    -      Object ts = esDoc.get("timestamp");
    -      if(ts != null) {
    -        indexRequest.timestamp(ts.toString());
    +      // define the document timestamp
    +      Long timestamp = Long.class.cast(source.get(Constants.Fields.TIMESTAMP.getName()));
    --- End diff --
    
    @anandsubbu pointed out to me that this previously worked when the timestamp is a String.  The code now requires a Long.  I need to address this point, although I am not sure how it should be right now.  I need to do a little research.  Thanks @anandsubbu !


---

[GitHub] metron issue #1269: METRON-1879 Allow Elasticsearch to Auto-Generate the Doc...

Posted by nickwallen <gi...@git.apache.org>.
Github user nickwallen commented on the issue:

    https://github.com/apache/metron/pull/1269
  
    Prior to this change we use the Metron-generated GUID as the documentID.  They were always the same.  There were various places in the code where the name `guid`, `docID`, or `id` were used that implicitly meant the Metron GUID.
    
    With this change, the Metron-generated guid will never be equal to the Elasticsearch-generated doc ID. You can view and use either the guid or the doc ID in the UI.  Both are available as separate fields in the index.
    
    This PR has pulled on those strings and tried to make clear what is a Metron GUID and what is a document ID.  In the Alerts UI you can see and/or use either the GUID or the document ID.  It is up to the user which one they care to see.  Although most users will not care what the document ID is.
    
    From my PR description:
    
    > This change is backwards compatible. The Alerts UI should continue to work no matter if some of the underlying indices were written with the Metron GUID as the document ID and some are written with the auto-generated document ID.
    
    There is no option to continue to use the GUID as the documentID.  I can't think of a use case worthy of the additional effort and testing needed to support that.
    
    It is backwards compatible in that the Alerts UI will work when searching over both "legacy" indices where guid = docID and "new" indices where guid != docID.  All places in the code where a docID is needed, that docID is actually first retrieved from Elasticsearch, rather than making an assumption about what it is.
    
    
    



---

[GitHub] metron issue #1269: METRON-1879 Allow Elasticsearch to Auto-Generate the Doc...

Posted by nickwallen <gi...@git.apache.org>.
Github user nickwallen commented on the issue:

    https://github.com/apache/metron/pull/1269
  
    I was doing some final testing and noticed that the 'guid' field for metaalerts is showing the document ID, rather than the guid.  I need to fix the presentation of metaalerts in the UI.
    
    ![screen shot 2018-11-19 at 10 56 19 am](https://user-images.githubusercontent.com/2475409/48718694-e6fc4580-ebe9-11e8-9eeb-2650767b1f55.png)



---

[GitHub] metron issue #1269: METRON-1879 Allow Elasticsearch to Auto-Generate the Doc...

Posted by mmiklavc <gi...@git.apache.org>.
Github user mmiklavc commented on the issue:

    https://github.com/apache/metron/pull/1269
  
    Refresh me on this - prior to this change, did we provide both a doc id and a guid that were identical? And now with this change you would expect to see guid != doc id (bc guid generated by Metron, doc id now generated by ES), except where otherwise configured for backwards compatibility?


---

[GitHub] metron issue #1269: METRON-1879 Allow Elasticsearch to Auto-Generate the Doc...

Posted by nickwallen <gi...@git.apache.org>.
Github user nickwallen commented on the issue:

    https://github.com/apache/metron/pull/1269
  
    This issue has been fixed.


---