Posted to server-dev@james.apache.org by "Tellier Benoit (JIRA)" <ji...@apache.org> on 2017/06/27 06:12:00 UTC

[jira] [Commented] (MAILBOX-301) Lucene terms length exceeded on some emails

    [ https://issues.apache.org/jira/browse/MAILBOX-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064335#comment-16064335 ] 

Tellier Benoit commented on MAILBOX-301:
----------------------------------------

https://github.com/linagora/james-project/pull/864 proposes tests and fixes for this issue

> Lucene terms length exceeded on some emails
> -------------------------------------------
>
>                 Key: MAILBOX-301
>                 URL: https://issues.apache.org/jira/browse/MAILBOX-301
>             Project: James Mailbox
>          Issue Type: Bug
>          Components: elasticsearch
>    Affects Versions: master
>            Reporter: Tellier Benoit
>             Fix For: master
>
>
> Lucene supports a maximum term size of 32766 bytes (just under 32KB).
> This limit can be exceeded, causing indexing to fail.
> To avoid this, the team set up "ignore_above" filters to drop overly long terms, and set their value to the Lucene maximum.
> However, as stated in https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-above.html :
> {code:java}
> Note:
> The value for ignore_above is the character count, but Lucene counts bytes. If you use UTF-8 text with many non-ASCII characters, you may want to set the limit to 32766 / 3 = 10922 since UTF-8 characters may occupy at most 3 bytes.
> {code}
> Thus the ES limit is checked against the string length in characters, whereas Lucene enforces its limit on the length in bytes.
> We can therefore craft a character sequence whose UTF-8 encoding exceeds the Lucene limit without triggering the ES limit.
> A much lower value (like 4KB) seems more reasonable, as long terms may not be significant.
> Note:
>  - Implement tests:
>     - Demonstrating this bug
>     - Demonstrating only too long terms are ignored
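
The mismatch between character count and byte count described above can be illustrated with a minimal Java sketch. It is not the code from the pull request: the class and method names are hypothetical, and "ignore_above" is only approximated here as a comparison against the Java string length, which is an assumption about how ES applies it.

```java
import java.nio.charset.StandardCharsets;

public class TermLengthDemo {
    // Lucene's maximum term size in bytes, as quoted in the issue description.
    static final int LUCENE_MAX_TERM_BYTES = 32766;

    // Builds a string made of `count` copies of `unit`.
    static String repeat(String unit, int count) {
        StringBuilder builder = new StringBuilder(unit.length() * count);
        for (int i = 0; i < count; i++) {
            builder.append(unit);
        }
        return builder.toString();
    }

    // Size of the term once encoded as UTF-8, which is what Lucene limits.
    static int utf8Length(String term) {
        return term.getBytes(StandardCharsets.UTF_8).length;
    }

    // Assumption: models ignore_above as a check on the character count.
    static boolean passesIgnoreAbove(String term, int ignoreAbove) {
        return term.length() <= ignoreAbove;
    }

    // True when the UTF-8 encoded term stays within the Lucene limit.
    static boolean fitsInLucene(String term) {
        return utf8Length(term) <= LUCENE_MAX_TERM_BYTES;
    }

    public static void main(String[] args) {
        // U+20AC (euro sign) takes 3 bytes in UTF-8 but is a single Java char.
        String term = repeat("\u20ac", 10923);
        System.out.println("chars: " + term.length());
        System.out.println("UTF-8 bytes: " + utf8Length(term));
        System.out.println("passes ignore_above=32766: "
            + passesIgnoreAbove(term, LUCENE_MAX_TERM_BYTES));
        System.out.println("fits in Lucene: " + fitsInLucene(term));
    }
}
```

Under this model, a term of 10923 three-byte characters passes an "ignore_above" of 32766 yet encodes to more bytes than Lucene accepts. With a limit of 4096 characters, the worst case is 4096 x 3 = 12288 bytes, well under the Lucene maximum.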



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: server-dev-unsubscribe@james.apache.org
For additional commands, e-mail: server-dev-help@james.apache.org