Posted to issues@metron.apache.org by "David M. Lyle (JIRA)" <ji...@apache.org> on 2016/11/03 13:44:58 UTC

[jira] [Commented] (METRON-517) Update elasticsearch bro templates for uri

    [ https://issues.apache.org/jira/browse/METRON-517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15632750#comment-15632750 ] 

David M. Lyle commented on METRON-517:
--------------------------------------

I like option #4 (setting the field as "analyzed"), plus a few things:

1) Allow a configurable max_length for indexing.
2) Document how search will be impacted (search strings should be lower-case, etc.)
3) We should (though not necessarily as part of this effort) provide a guide for searching the HDFS JSON files.

That said, option #3 (ignore_above) would also be fine, as long as we indicate, via an alert or an index field, that the indexed record was truncated.
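
To make the "configurable max_length plus truncation indicator" idea concrete, here is a rough Python sketch. It is not Metron code: the function name, the `max_length` parameter and the `<field>_truncated` flag are all hypothetical, and it assumes the limit is measured in UTF-8 bytes, as the Lucene error message suggests.

```python
LUCENE_MAX_TERM_BYTES = 32766  # limit quoted in the Lucene error message


def truncate_for_index(record, field="uri", max_length=LUCENE_MAX_TERM_BYTES):
    """Return a copy of `record` with `field` cut to at most `max_length` UTF-8 bytes.

    A hypothetical '<field>_truncated' flag records whether anything was cut,
    so a search hit can point the user back to the full record in HDFS.
    """
    out = dict(record)
    value = record.get(field)
    if not isinstance(value, str):
        return out  # nothing to truncate

    encoded = value.encode("utf-8")
    if len(encoded) <= max_length:
        out[field + "_truncated"] = False
        return out

    # errors="ignore" drops a multi-byte character that was split by the cut
    out[field] = encoded[:max_length].decode("utf-8", errors="ignore")
    out[field + "_truncated"] = True
    return out


if __name__ == "__main__":
    doc = {"uri": "/" + "a" * 38622}  # same size as the failing term in the issue
    indexed = truncate_for_index(doc)
    print(len(indexed["uri"].encode("utf-8")), indexed["uri_truncated"])
```

Only the copy sent to the index would be shortened; the untouched record would still go to HDFS, which is why a search guide for the JSON files matters.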

> Update elasticsearch bro templates for uri
> ------------------------------------------
>
>                 Key: METRON-517
>                 URL: https://issues.apache.org/jira/browse/METRON-517
>             Project: Metron
>          Issue Type: Bug
>            Reporter: Jon Zeolla
>            Assignee: Jon Zeolla
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The bro uri field in [HTTP::Info](https://www.bro.org/sphinx/scripts/base/protocols/http/main.bro.html#type-HTTP::Info) can exceed the Lucene-imposed limit of 32766 bytes per term (non-analyzed fields are treated as a single term, and we are setting it as not_analyzed here - https://github.com/apache/incubator-metron/blob/master/metron-deployment/roles/metron_elasticsearch_templates/files/es_templates/bro_index.template).  The resolution options that I've been able to find appear to be:
> 1. Set index to "[no](https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-index.html)", which will not add that field to the index, making it not queryable.
> 2. Change the type to [binary](https://www.elastic.co/guide/en/elasticsearch/reference/current/binary.html), which will not store it by default.
> 3. Use "[ignore_above](https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-above.html)" to set a limit, above which strings are not indexed.
> 4. Set the field as "analyzed".  
> Here is an example error message:
> ```
> [4]: index [bro_index_2016.10.25.21], type [bro_doc], id [AVf-iCuooLg3mHEm2PpH], message [java.lang.IllegalArgumentException: Document contains at least one immense term in field="uri" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[<redacted>]...', original message: bytes can be at most 32766 in length; got 38623]
> ```
> Relevant Lucene documentation:  https://lucene.apache.org/core/6_2_1/core/constant-values.html#org.apache.lucene.index.IndexWriter.MAX_TERM_LENGTH
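
For reference, option 3 from the description might look roughly like the mapping fragment below, built here in Python so the limit stays configurable. This is only a sketch: it assumes the Elasticsearch 2.x `string` / `not_analyzed` mapping syntax and is not the actual contents of bro_index.template; the `uri_mapping` helper is hypothetical. Note that ignore_above counts characters while the Lucene limit counts bytes, so the default here is 32766 // 4 = 8191, which stays under the limit even if every character takes four UTF-8 bytes.

```python
import json

LUCENE_MAX_TERM_BYTES = 32766
MAX_UTF8_BYTES_PER_CHAR = 4


def uri_mapping(ignore_above=LUCENE_MAX_TERM_BYTES // MAX_UTF8_BYTES_PER_CHAR):
    """Build a hypothetical mapping fragment for the bro `uri` field.

    Strings longer than `ignore_above` characters are left out of the index
    instead of causing the whole document to be rejected.
    """
    return {
        "uri": {
            "type": "string",          # Elasticsearch 2.x syntax (assumption)
            "index": "not_analyzed",   # keep exact-match behaviour
            "ignore_above": ignore_above,
        }
    }


if __name__ == "__main__":
    print(json.dumps(uri_mapping(), indent=2))
```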


