Posted to issues@metron.apache.org by "Simon Elliston Ball (JIRA)" <ji...@apache.org> on 2018/04/24 12:52:00 UTC

[jira] [Commented] (METRON-1538) Don't use GUIDS for Elastic document id, but autogenerated ID's for performance

    [ https://issues.apache.org/jira/browse/METRON-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16449783#comment-16449783 ] 

Simon Elliston Ball commented on METRON-1538:
---------------------------------------------

This is not necessarily a good idea. We use guids for message amendments and meta alerting. The guids therefore need to be generated in the Metron pipeline; otherwise the guids in the ES and HDFS indices will not match, leading to a complete data mismatch between the two.

What we could do is have ES create an otherwise unused auto-generated id in addition to the guid. This would add storage overhead and hurt performance on meta alerts and search lookups, but that cost may be less significant than the ingest impact of non-auto-generated keys. Arguably this scenario risks data corruption in ES unless we perform the same uniqueness checks anyway, but that may be something that can be resolved, or accepted as a small front-end event duplication risk in short-term indices.
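
To make the two options concrete, here is a rough sketch of what the bulk request lines might look like in each case. It is illustrative only and is not Metron's actual writer code: the helper name, the index name, and the sample message fields are assumptions, and the only thing relied on is the general shape of an Elasticsearch bulk "index" action, where "_id" is optional.

```python
import json
import uuid


def bulk_action(doc, index, use_guid_as_id):
    """Build the two NDJSON lines for one document in a bulk request.

    Hypothetical helper for illustration; not part of Metron.
    """
    meta = {"_index": index}
    if use_guid_as_id:
        # Explicit id: ES has to check the shard for an existing
        # document with this id before indexing.
        meta["_id"] = doc["guid"]
    # Without "_id", ES generates one itself and can skip that check.
    # The guid stays in the document body either way, so it can still
    # be matched against the copy written to HDFS.
    return [json.dumps({"index": meta}), json.dumps(doc)]


# The guid is generated in the pipeline, before any writer sees the message.
# Field names and values here are made up for the example.
message = {"guid": str(uuid.uuid4()), "source.type": "bro", "ip_src_addr": "10.0.0.1"}

explicit = bulk_action(message, "bro_index_2018.04.24", use_guid_as_id=True)
auto = bulk_action(message, "bro_index_2018.04.24", use_guid_as_id=False)
```

In the second case a lookup or update by guid becomes a query on the guid field rather than a fetch by document id, which is where the meta alert and search cost mentioned above would come from, and nothing stops the same guid from being indexed twice.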

> Don't use GUIDS for Elastic document id, but autogenerated ID's for performance
> -------------------------------------------------------------------------------
>
>                 Key: METRON-1538
>                 URL: https://issues.apache.org/jira/browse/METRON-1538
>             Project: Metron
>          Issue Type: Improvement
>    Affects Versions: 0.4.3
>            Reporter: Ward Bekker
>            Priority: Major
>              Labels: performance
>
> Metron currently uses GUIDs for ES document ids, which goes against the best practice:
> "When indexing a document that has an explicit id, Elasticsearch needs to check whether a document with the same id already exists within the same shard, which is a costly operation and gets even more costly as the index grows. By using auto-generated ids, Elasticsearch can skip this check, which makes indexing faster."
> https://www.elastic.co/guide/en/elasticsearch/reference/master/tune-for-indexing-speed.html#_use_auto_generated_ids



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)