Posted to issues@metron.apache.org by "Nick Allen (JIRA)" <ji...@apache.org> on 2018/04/24 13:35:00 UTC

[jira] [Comment Edited] (METRON-1538) Don't use GUIDS for Elastic document id, but autogenerated ID's for performance

    [ https://issues.apache.org/jira/browse/METRON-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16449843#comment-16449843 ] 

Nick Allen edited comment on METRON-1538 at 4/24/18 1:34 PM:
-------------------------------------------------------------

The idea of a uniqueness check for a GUID/UUID is foreign to me.  Elasticsearch likely performs this check because in many cases a user-provided ID will not be a GUID.  A user might choose another domain-specific value where the likelihood of collision is much higher.  In our specific scenario, we are using a GUID/UUID, so there is no need to check for uniqueness.

We could fairly easily make this configurable so that the end user can make the right decision for their environment.  The ElasticsearchWriter would accept a parameter which defines the name of the field to use as the document ID.  If the field is defined, the writer extracts that value from the message and uses it as the document ID.  If the field is left undefined, empty, or null, the ElasticsearchWriter sets no document ID, which allows ES to auto-generate one.  The indexed document would still contain Metron's GUID for cross-correlation.
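
A minimal sketch of the lookup proposed above, to make the intended behavior concrete.  The class DocumentIdResolver, the method resolveDocumentId, and the parameter docIdField are hypothetical names for illustration only; none of them exist in Metron.  Falling back to an auto-generated ID when the configured field is missing from a message is also an assumption, since the proposal does not cover that case.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of Metron: decides which document ID (if any)
// the writer should pass to Elasticsearch for a given message.
class DocumentIdResolver {

    // docIdField is the configured name of the message field to use as the
    // document ID. Null, empty, or blank means "let ES auto-generate the ID".
    static Optional<String> resolveDocumentId(Map<String, Object> message, String docIdField) {
        if (docIdField == null || docIdField.trim().isEmpty()) {
            // No field configured: do not set a document ID.
            return Optional.empty();
        }
        Object value = message.get(docIdField);
        if (value == null || value.toString().trim().isEmpty()) {
            // Assumption: a message without the configured field also falls
            // back to an auto-generated ID rather than failing.
            return Optional.empty();
        }
        return Optional.of(value.toString());
    }
}
```

The writer would then set an explicit ID on the index request only when the returned Optional is present.  The message itself is indexed unchanged, so the Metron GUID stays in the document either way.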

[Currently, the Metron GUID is always used as the document ID, if it exists.|https://github.com/apache/metron/blob/a8b555dcc9f548d7b91789a46d9435b4d8b17581/metron-platform/metron-elasticsearch/src/main/java/org/apache/metron/elasticsearch/writer/ElasticsearchWriter.java#L73-L76]

 



> Don't use GUIDS for Elastic document id, but autogenerated ID's for performance
> -------------------------------------------------------------------------------
>
>                 Key: METRON-1538
>                 URL: https://issues.apache.org/jira/browse/METRON-1538
>             Project: Metron
>          Issue Type: Improvement
>    Affects Versions: 0.4.3
>            Reporter: Ward Bekker
>            Priority: Major
>              Labels: performance
>
> Metron currently uses GUIDs for ES document IDs, which goes against the best practice:
> "When indexing a document that has an explicit id, Elasticsearch needs to check whether a document with the same id already exists within the same shard, which is a costly operation and gets even more costly as the index grows. By using auto-generated ids, Elasticsearch can skip this check, which makes indexing faster."
> https://www.elastic.co/guide/en/elasticsearch/reference/master/tune-for-indexing-speed.html#_use_auto_generated_ids


