Posted to solr-user@lucene.apache.org by "kunhu0404@gmail.com" <ku...@gmail.com> on 2018/08/28 12:03:17 UTC

Solr Indexing error

Hello All,

I need help with an error during Solr indexing. We are using Solr 6.6.3 and
Nutch crawler 1.14. While indexing data to Solr we see the error below:

possible analysis error: Document contains at least one immense term in
field="content" (whose UTF8 encoding is longer than the max length 32766),
all of which were skipped.  Please correct the analyzer to not produce such
terms.  The prefix of the first immense term is: '[84, 69, 82, 77, 83, 32,
79, 70, 32, 85, 83, 69, 10, 69, 102, 102, 101, 99, 116, 105, 118, 101, 32,
68, 97, 116, 101, 58, 32, 74]...', original message: bytes can be at most
32766 in length; got 40638. Perhaps the document has an indexed string field
(solr.StrField) which is too large.

Can anyone please help?

Re: Solr Indexing error

Posted by Shawn Heisey <ap...@elyograg.org>.

On 8/28/2018 6:03 AM, kunhu0404@gmail.com wrote:
> possible analysis error: Document contains at least one immense term in
> field="content" (whose UTF8 encoding is longer than the max length 32766),

It's telling you exactly what is wrong.

The field named "content" is probably using a field class with no
analysis, or using the Keyword Tokenizer, so the whole field gets treated
as a single term.  For at least one of your documents, that field's UTF-8
encoding is longer than 32766 bytes (the message reports 40638).  The limit
is in bytes, not characters, and a UTF-8 character can take more than one
byte.  Lucene has a limit on term length, and your input exceeded it.
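
The byte list in the error message is the UTF-8 prefix of the offending
term, so decoding it shows which text tripped the limit.  Here is a rough
Python sketch that does that and also shows why byte length and character
length can differ.  The helper name exceeds_term_limit is made up for
illustration; 32766 is the limit quoted in the error.

```python
# Prefix bytes copied from the error message in the original post.
PREFIX = [84, 69, 82, 77, 83, 32, 79, 70, 32, 85, 83, 69, 10, 69, 102, 102,
          101, 99, 116, 105, 118, 101, 32, 68, 97, 116, 101, 58, 32, 74]

# Limit quoted in the error message, measured in bytes of UTF-8.
MAX_TERM_BYTES = 32766


def decode_prefix(byte_values):
    """Turn the list of byte values from the error into readable text."""
    return bytes(byte_values).decode("utf-8", errors="replace")


def exceeds_term_limit(text, limit=MAX_TERM_BYTES):
    """Hypothetical helper: True if text is too long to be a single term."""
    return len(text.encode("utf-8")) > limit


prefix_text = decode_prefix(PREFIX)
print(repr(prefix_text))

# 20000 characters that each take 2 bytes in UTF-8.
sample = "\u00e9" * 20000
print(len(sample), len(sample.encode("utf-8")), exceeds_term_limit(sample))
```

The prefix decodes to "TERMS OF USE", a newline, and "Effective Date: J",
which suggests the whole page text is arriving as one term.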

If you change the field type for content to something that's analyzed 
(split into words, basically) then this problem would likely go away.
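
If the core uses a managed schema, one way to make that change is the
Schema API's replace-field command.  The sketch below only builds the
request and does not send it.  It rests on several assumptions that are not
from your message: Solr listening on localhost:8983, a collection named
"nutch" (hypothetical), and a "text_general" field type like the one in
Solr's sample configsets.

```python
import json
import urllib.request

# Assumptions: local Solr, hypothetical collection name "nutch".
SCHEMA_URL = "http://localhost:8983/solr/nutch/schema"


def build_replace_field(name, field_type):
    """Build a Schema API 'replace-field' command for the given field."""
    return {
        "replace-field": {
            "name": name,
            "type": field_type,  # assumed to be an analyzed (tokenized) type
            "indexed": True,
            "stored": True,
        }
    }


def build_request(url, command):
    """Prepare (but do not send) the POST request carrying the command."""
    body = json.dumps(command).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


command = build_replace_field("content", "text_general")
request = build_request(SCHEMA_URL, command)
print(json.dumps(command, indent=2))
# To apply it against a running Solr: urllib.request.urlopen(request)
```

If the core uses a hand-edited schema.xml instead, the Schema API cannot
modify it, and the field definition would be changed in that file.  Either
way, existing documents need to be reindexed after the field type changes.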

Thanks,
Shawn