Posted to java-user@lucene.apache.org by Trejkaz <tr...@trypticon.org> on 2016/08/05 03:20:17 UTC

Dubious error message?

Trying to add a document, someone saw:

    java.lang.IllegalArgumentException: Document contains at least one
immense term in field="bcc-address" (whose UTF8 encoding is longer
than the max length 32766), all of which were skipped.  Please correct
the analyzer to not produce such terms.  The prefix of the first
immense term is: '[00, --omitted--]...', original message: bytes can
be at most 32766 in length; got 115597
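
For context, the field in question isn't tokenised, so the whole value
goes in as a single term. A minimal sketch along these lines reproduces
the same exception (the field name and value are illustrative, not our
real code):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.RAMDirectory;

    public class ImmenseTermRepro {
        public static void main(String[] args) throws Exception {
            try (IndexWriter writer = new IndexWriter(new RAMDirectory(),
                    new IndexWriterConfig(new StandardAnalyzer()))) {
                // Build a value well past the 32766-byte limit
                // (ours was 115597 bytes).
                StringBuilder huge = new StringBuilder();
                for (int i = 0; i < 120000; i++) {
                    huge.append('x');
                }
                Document doc = new Document();
                // StringField is indexed verbatim as a single term, so the
                // whole 120K value becomes one "immense" term.
                doc.add(new StringField("bcc-address", huge.toString(),
                        Field.Store.NO));
                writer.addDocument(doc); // throws IllegalArgumentException
            }
        }
    }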

Question 1: It says the bytes are being skipped, but to me "skipped"
means it's just going to continue, yet I get this exception. Is that
intentional?

Question 2: Can we turn this check off?

Question 2.1: Why limit in the first place? Every time I have ever
seen someone introduce a limit, it has only been a matter of time
until someone hits it, no matter how improbable it seemed when it was
put in.

TX



Re: Dubious error message?

Posted by Trejkaz <tr...@trypticon.org>.
On Fri, Aug 5, 2016 at 2:51 PM, Erick Erickson <er...@gmail.com> wrote:
> Question 2: Not that I know of
>
> Question 2.1. It's actually pretty difficult to understand why a single _term_
> can be over 32K and still make sense. This is not to say that a
> single _text_ field can't be over 32K; it's just that each term within
> that field is (usually) much smaller than that.
>
> Do you have a real-world use-case where you have a 115K term
> that can _only_ be matched by searching for exactly that
> sequence of 115K characters? Not substrings. Not wildcards. A
> "string" type (as opposed to anything based on solr.Textfield).

This particular field stores unique addresses, and for precision
reasons we wanted to search for addresses without tokenising them:
if you tokenise them, bob@example.com could accidentally match
bob@example.com.au, even though they're two different people. Keeping
the whole address as one term also makes statistics faster to calculate.
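
Roughly, the exact-match behaviour we depend on looks like the sketch
below (not our actual code; the field name and addresses are just
illustrative):

    import org.apache.lucene.analysis.core.KeywordAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.RAMDirectory;

    public class ExactAddressMatch {
        public static void main(String[] args) throws Exception {
            RAMDirectory dir = new RAMDirectory();
            try (IndexWriter w = new IndexWriter(dir,
                    new IndexWriterConfig(new KeywordAnalyzer()))) {
                for (String addr : new String[]
                        {"bob@example.com", "bob@example.com.au"}) {
                    Document doc = new Document();
                    // StringField indexes the whole address as one
                    // untokenised term.
                    doc.add(new StringField("bcc-address", addr,
                            Field.Store.YES));
                    w.addDocument(doc);
                }
            }
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                TopDocs hits = searcher.search(
                        new TermQuery(new Term("bcc-address",
                                "bob@example.com")), 10);
                // Only the exact term matches: one hit, and
                // bob@example.com.au stays out of it.
                System.out.println(hits.scoreDocs.length);
            }
        }
    }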

Now, addresses in SMTP email are fairly short (limited to something
like 254 characters), but sometimes you get data that violates the
standard. We also store more than just that one kind of address, and
maybe one of the other sorts can legitimately be longer.

In this situation it isn't clear whether you can truncate the data,
because once you truncate, two addresses can be treated as equal even
though they aren't the same string. Then again, if the old version of
Lucene was already truncating silently, people might be fine with the
new version truncating too; but if they didn't know that, someone
would definitely object.
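
If we did decide truncation was acceptable, one way to opt into it
(rather than hitting the exception) might be an analyzer that keeps
the whole value as a single token but caps its length. A sketch,
assuming TruncateTokenFilter from lucene-analyzers-common and an
arbitrary cap (note the 32766 limit is in UTF-8 bytes, not chars, so
the cap would need headroom for non-ASCII); the field would also have
to go through analysis (e.g. a TextField) rather than being a
StringField, since StringField bypasses the analyzer entirely:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.KeywordTokenizer;
    import org.apache.lucene.analysis.miscellaneous.TruncateTokenFilter;

    public class TruncatingKeywordAnalyzer extends Analyzer {
        private final int maxChars;

        public TruncatingKeywordAnalyzer(int maxChars) {
            this.maxChars = maxChars;
        }

        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            // The whole field value becomes one token, then gets cut
            // down to maxChars characters.
            Tokenizer source = new KeywordTokenizer();
            return new TokenStreamComponents(source,
                    new TruncateTokenFilter(source, maxChars));
        }
    }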

So I'm not really saying that the term "makes sense" - I'm just saying
we encountered it in real-world data, and an error occurred. Someone
then complained about the error.

> As far as the error message is concerned, that does seem somewhat opaque.
> Care to raise a JIRA on it (and, if you're really ambitious, attach a patch)?

I'll see. :)

TX



Re: Dubious error message?

Posted by Erick Erickson <er...@gmail.com>.
Question 2: Not that I know of

Question 2.1. It's actually pretty difficult to understand why a single _term_
can be over 32K and still make sense. This is not to say that a
single _text_ field can't be over 32K; it's just that each term within
that field is (usually) much smaller than that.
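
To illustrate the distinction: a field _value_ far bigger than 32K
indexes fine as long as the analyzer breaks it into small terms; it's
only a single term over 32766 bytes that trips the check. A rough
sketch (the values are made up):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.RAMDirectory;
    import java.util.Collections;

    public class BigFieldSmallTerms {
        public static void main(String[] args) throws Exception {
            // ~250K characters of text, but every individual term is tiny.
            String big = String.join(" ", Collections.nCopies(50000, "word"));
            try (IndexWriter w = new IndexWriter(new RAMDirectory(),
                    new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                doc.add(new TextField("body", big, Field.Store.NO));
                // Fine: thousands of small terms, none near 32766 bytes.
                w.addDocument(doc);
            }
        }
    }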

Do you have a real-world use-case where you have a 115K term
that can _only_ be matched by searching for exactly that
sequence of 115K characters? Not substrings. Not wildcards. A
"string" type (as opposed to anything based on solr.Textfield).

As far as the error message is concerned, that does seem somewhat opaque.
Care to raise a JIRA on it (and, if you're really ambitious, attach a patch)?

Best,
Erick

On Thu, Aug 4, 2016 at 8:20 PM, Trejkaz <tr...@trypticon.org> wrote:
> Trying to add a document, someone saw:
>
>     java.lang.IllegalArgumentException: Document contains at least one
> immense term in field="bcc-address" (whose UTF8 encoding is longer
> than the max length 32766), all of which were skipped.  Please correct
> the analyzer to not produce such terms.  The prefix of the first
> immense term is: '[00, --omitted--]...', original message: bytes can
> be at most 32766 in length; got 115597
>
> Question 1: It says the bytes are being skipped, but to me "skipped"
> means it's just going to continue, yet I get this exception. Is that
> intentional?
>
> Question 2: Can we turn this check off?
>
> Question 2.1: Why limit in the first place? Every time I have ever
> seen someone introduce a limit, it has only been a matter of time
> until someone hits it, no matter how improbable it seemed when it was
> put in.
>
> TX
>
