Posted to solr-user@lucene.apache.org by Michael Ryan <mr...@moreover.com> on 2014/07/01 15:49:13 UTC

Best way to fix "Document contains at least one immense term"?

In LUCENE-5472, Lucene was changed to throw an error if a term is too long, rather than just logging a message. I have fields with terms that are too long, but I don't care - I just want to ignore them and move on.

The recommended solution in the docs is to use LengthFilterFactory, but this limits the terms by the number of characters, rather than the number of UTF-8 bytes. So you can't just do something clever like set max=32766, due to the possibility of multibyte characters.

So, is there a way of using LengthFilterFactory such that an error will never be thrown? I'm thinking I could use a max less than 32766 / 3, but I want to be absolutely sure there isn't some edge case that will break. I guess I could just set it to something sane like 1000. Or is there a more direct solution to this problem?
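
For concreteness, here's roughly what I'm picturing in schema.xml (the type name is just an example):

    <fieldType name="string_capped" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <!-- 10922 = floor(32766 / 3); a Java char should never encode to more
             than 3 UTF-8 bytes, so this ought to stay under the limit from
             LUCENE-5472 -->
        <filter class="solr.LengthFilterFactory" min="1" max="10922"/>
      </analyzer>
    </fieldType>

LengthFilterFactory drops tokens outside the min/max range rather than truncating them, which is fine for my purposes.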

-Michael

RE: Best way to fix "Document contains at least one immense term"?

Posted by Michael Ryan <mr...@moreover.com>.
In this particular case, the fields are just using KeywordTokenizerFactory. I have other fields that are tokenized, but they use tokenizers with a short maxTokenLength.
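
To give a rough picture (values are illustrative, not my exact schema):

    <!-- untokenized fields -->
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
    </analyzer>

    <!-- tokenized fields: maxTokenLength keeps individual terms short -->
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory" maxTokenLength="255"/>
    </analyzer>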

I'm not even all that concerned about my own data; I'm more curious whether there's a general solution to this problem. I imagine there are other people who just want Solr to make a best attempt at indexing the data, and if it has to throw away some fields/terms (preferably with a warning logged), that's fine - just don't reject the whole document.

-Michael

-----Original Message-----
From: Jack Krupansky [mailto:jack@basetechnology.com] 
Sent: Tuesday, July 01, 2014 5:23 PM
To: solr-user@lucene.apache.org
Subject: Re: Best way to fix "Document contains at least one immense term"?

You could develop an update processor to skip or trim long terms as you see fit. You can even code a script in JavaScript using the stateless script update processor.

Can you tell us more about the nature of your data? I mean, sometimes analyzer filters strip or fold accented characters anyway, so count of characters versus UTF-8 bytes may be a non-problem.

-- Jack Krupansky


Re: Best way to fix "Document contains at least one immense term"?

Posted by Jack Krupansky <ja...@basetechnology.com>.
You could develop an update processor to skip or trim long terms as you see fit. You can even code a script in JavaScript using the stateless script update processor.
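
In solrconfig.xml, the chain would look roughly like this (the script name here is just a placeholder for whatever you write):

    <updateRequestProcessorChain name="trim-long-values">
      <!-- the named script holds the logic that trims or drops over-long values -->
      <processor class="solr.StatelessScriptUpdateProcessorFactory">
        <str name="script">trim-long-values.js</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>

You'd then select the chain with the update.chain request parameter, or make it the default for your update handler.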

Can you tell us more about the nature of your data? I mean, sometimes 
analyzer filters strip or fold accented characters anyway, so count of 
characters versus UTF-8 bytes may be a non-problem.

-- Jack Krupansky
