Posted to solr-user@lucene.apache.org by THADC <ti...@gmail.com> on 2018/05/01 14:40:23 UTC

Error when indexing against a specific dynamic field type

Hello,

We are migrating from Solr 4.7 to 7.3. When I encounter a data item that
matches a custom dynamic field from our 4.7 schema:

<dynamicField name="*_tsing" type="alphaOnlySort" indexed="true" stored="true" multiValued="false"/>

, I get the following exception:

Exception writing document id FULL_36265 to the index; possible analysis
error: Document contains at least one immense term in
field="gridFacts_tsing" (whose UTF8 encoding is longer than the max length
32766), all of which were skipped.  Please correct the analyzer to not
produce such terms.  The prefix of the first immense term is: '[108, 111,
114, 101, 109, 32, 105, 112, 115, 117, 109, 32, 100, 111, 108, 111, 114, 32,
115, 105, 116, 32, 97, 109, 101, 116, 44, 32, 99, 111]...', original
message: bytes can be at most 32766 in length; got 68144.

Any ideas are greatly appreciated. Thank you.

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Error when indexing against a specific dynamic field type

Posted by Steve Rowe <sa...@gmail.com>.
The input in the error message starts “lorem ipsum”, so it contains spaces, but the alphaOnlySort field type (in Solr’s example schemas anyway) uses KeywordTokenizer, which tokenizes the entire input as a single token.

As Erick implied, you probably should not be doing that with this kind of data; perhaps the analyzer used by this dynamic field should change?

Alternatively, you could:

a) truncate long values so that a prefix makes it through the indexing process, e.g. by adding TruncateTokenFilterFactory[1] to alphaOnlySort’s analyzer, or by adding TruncateFieldUpdateProcessorFactory[2] to your update request processor chain; or

b) entirely eliminate overly long values, e.g. using LengthFilterFactory[3].

[1] https://lucene.apache.org/core/7_3_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/TruncateTokenFilterFactory.html
[2] https://lucene.apache.org/solr/7_3_0/solr-core/org/apache/solr/update/processor/TruncateFieldUpdateProcessorFactory.html
[3] https://lucene.apache.org/solr/guide/7_3/filter-descriptions.html#length-filter
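
As a rough sketch of option (a), TruncateTokenFilterFactory could be appended to the field type's analyzer. The fieldType below is assumed to match the alphaOnlySort definition from Solr's example schemas; the custom 4.7 schema in question may differ. The prefixLength of 1024 is an illustrative value, not a recommendation. Option (b) is shown commented out as an alternative to the truncating filter.

```xml
<!-- Assumed: alphaOnlySort as defined in Solr's example schemas. -->
<fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- KeywordTokenizer emits the whole field value as one token. -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <!-- Strip everything except a-z, so each remaining character is one UTF-8 byte. -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^a-z])" replacement="" replace="all"/>
    <!-- Option (a): keep only the first 1024 characters of the token. -->
    <filter class="solr.TruncateTokenFilterFactory" prefixLength="1024"/>
    <!-- Option (b), instead of truncating: drop tokens longer than 1024 characters.
         <filter class="solr.LengthFilterFactory" min="0" max="1024"/> -->
  </analyzer>
</fieldType>
```

Either filter affects only the indexed term; the stored value is left as submitted. Documents need to be reindexed after the schema change for it to take effect.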

--
Steve
www.lucidworks.com

> On May 1, 2018, at 11:28 AM, Erick Erickson <er...@gmail.com> wrote:
> 
> You're sending it a huge term. My guess is you're sending something
> like base64-encoded data or perhaps just a single unbroken string in
> your field.
> 
> Examine your document, it should jump out at you.
> 
> Best,
> Erick


Re: Error when indexing against a specific dynamic field type

Posted by Erick Erickson <er...@gmail.com>.
Steve's comment is much more germane. KeywordTokenizer,
used in alphaOnlySort last I knew, is not appropriate at all for input like this.
Do you really want single tokens that consist of the entire
document for sorting purposes? Wouldn't the first 1K be enough?

It looks like this limit was added in 4.0, so I'm guessing your analysis chain
is different between the two versions.

It doesn't really matter, though; this is not going to be changed.
You'll have to do something about your long fields or your
analysis chain, and/or revisit what you hope to accomplish
by using that field type on such a field. I'm almost certain
your use case is flawed.
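
One way to "do something about your long fields" without changing the documents at their source is the TruncateFieldUpdateProcessorFactory that Steve mentions. A minimal sketch for solrconfig.xml follows; the chain name, the fieldRegex, and the 1024-character limit are illustrative assumptions, not values taken from the poster's configuration.

```xml
<!-- Hypothetical chain name; if a default chain already exists,
     add the truncate processor to that chain instead. -->
<updateRequestProcessorChain name="truncate-tsing" default="true">
  <!-- Truncate string values in any field whose name ends in _tsing. -->
  <processor class="solr.TruncateFieldUpdateProcessorFactory">
    <str name="fieldRegex">.*_tsing</str>
    <int name="maxLength">1024</int>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <!-- RunUpdateProcessorFactory must come last so the document is actually indexed. -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

Unlike an analysis-time filter, this shortens the value before it is stored, so the stored field is truncated as well. Note that maxLength counts characters, not bytes.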

Best,
Erick





Re: Error when indexing against a specific dynamic field type

Posted by THADC <ti...@gmail.com>.
Erick, thanks for the response. I have a number of documents in our database
where Solr is throwing the same exception against *_tsing types.

However, when I index the same document with our Solr 4.7, it is
successfully indexed. So, I assume something is different between 4.7 and
7.3. I was assuming I could adjust the dynamic field somehow so that it
indexes these documents without errors when using 7.3.

I can't remove the offending documents. It's my customer's data.

Is there some adjustment I can make to the dynamic field?

Thanks again.




Re: Error when indexing against a specific dynamic field type

Posted by Erick Erickson <er...@gmail.com>.
You're sending it a huge term. My guess is you're sending something
like base64-encoded data or perhaps just a single unbroken string in
your field.

Examine your document; it should jump out at you.

Best,
Erick


Re: Error when indexing against a specific dynamic field type

Posted by Shawn Heisey <ap...@elyograg.org>.
On 5/1/2018 8:40 AM, THADC wrote:
> I get the following exception:
>
> Exception writing document id FULL_36265 to the index; possible analysis
> error: Document contains at least one immense term in
> field="gridFacts_tsing" (whose UTF8 encoding is longer than the max length
> 32766), all of which were skipped.  Please correct the analyzer to not
> produce such terms.  The prefix of the first immense term is: '[108, 111,
> 114, 101, 109, 32, 105, 112, 115, 117, 109, 32, 100, 111, 108, 111, 114, 32,
> 115, 105, 116, 32, 97, 109, 101, 116, 44, 32, 99, 111]...', original
> message: bytes can be at most 32766 in length; got 68144.
>
> Any ideas are greatly appreciated. Thank you.

The error is not ambiguous.  It tells you precisely what the problem
is.  A single term in a Lucene index cannot be longer than 32766 bytes
(about 32K), and that document has a term that's more than twice that
size (68144 bytes).

I'm guessing that the fieldType named alphaOnlySort is one of two
things:  Either the StrField class, or the TextField class with the
keyword tokenizer factory.

To fix this problem you will need to either reduce the size of the input
on the field, or use an analysis chain that splits the input into
smaller tokens.  The numbers in the error message are the UTF-8 bytes of
the term's prefix ("lorem ipsum dolor sit amet, co"), so the input looks
like prose, which probably should be tokenized, not treated as a single
term.
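
A tokenizing analysis chain along these lines would avoid the oversized term. This is only a sketch: the field type name text_tokenized is hypothetical, and the choice of StandardTokenizerFactory is an assumption about what suits the data.

```xml
<!-- Hypothetical field type: splits the value into word tokens
     instead of indexing it as one term. -->
<fieldType name="text_tokenized" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- StandardTokenizer caps token length (255 characters by default). -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Same dynamic field pattern as in the original post, pointed at the tokenized type. -->
<dynamicField name="*_tsing" type="text_tokenized" indexed="true" stored="true" multiValued="false"/>
```

Sorting on a field that yields many tokens per document is generally not meaningful, so this fits only if the field is used for searching rather than sorting. A reindex is needed after changing the field type.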

Thanks,
Shawn