Posted to dev@lucene.apache.org by Grant Ingersoll <gs...@apache.org> on 2007/12/20 15:41:18 UTC
DocumentsWriter.checkMaxTermLength issues
I am getting the following exception when running against trunk:
java.lang.IllegalArgumentException: at least one term (length 20079) exceeds max term length 16383; these terms were skipped
        at org.apache.lucene.index.IndexWriter.checkMaxTermLength(IndexWriter.java:1545)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1451)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1411)
        ....
I'm wondering if the IndexWriter should throw an explicit exception in
this case as opposed to a RuntimeException, as it seems to me really
long tokens should be handled more gracefully. It seems strange that
the message says the terms were skipped (which the code does in fact
do), but then there is a RuntimeException thrown which usually
indicates to me the issue is not recoverable. I am using the
StandardTokenizer, but I don't think that much matters.
Any thoughts on this?
-Grant
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
Re: DocumentsWriter.checkMaxTermLength issues
Posted by Yonik Seeley <yo...@apache.org>.
On Dec 20, 2007 9:41 AM, Grant Ingersoll <gs...@apache.org> wrote:
> I am getting the following exception when running against trunk:
> java.lang.IllegalArgumentException: at least one term (length 20079) exceeds max term length 16383; these terms were skipped
>         at org.apache.lucene.index.IndexWriter.checkMaxTermLength(IndexWriter.java:1545)
>         at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1451)
>         at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1411)
> ....
>
> I'm wondering if the IndexWriter should throw an explicit exception in
> this case as opposed to a RuntimeException, as it seems to me really
> long tokens should be handled more gracefully. It seems strange that
> the message says the terms were skipped (which the code does in fact
> do), but then there is a RuntimeException thrown which usually
> indicates to me the issue is not recoverable. I am using the
> StandardTokenizer, but I don't think that much matters.
>
> Any thoughts on this?
I think it's good to bring attention to it and not sweep it under the rug.
It indicates potential issues or problems with analysis or the data.
The user can use a LengthFilter to explicitly throw long tokens away.
-Yonik
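(The LengthFilter idea above can be sketched in plain Java. This is an illustrative stand-in for what Lucene's LengthFilter does, not the actual class: drop any token whose length falls outside a configured range before it ever reaches IndexWriter. The class and method names here are invented for the sketch.)

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the LengthFilter idea: discard tokens outside
// [min, max] length so over-long terms never reach the IndexWriter.
public class LengthFilterSketch {
    public static List<String> filter(List<String> tokens, int min, int max) {
        List<String> kept = new ArrayList<String>();
        for (String t : tokens) {
            if (t.length() >= min && t.length() <= max) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> in = List.of("ok", "a", "x".repeat(20079));
        // The 20079-char "term" and the 1-char token are silently
        // discarded, mirroring what LengthFilter would do.
        System.out.println(filter(in, 2, 16383)); // prints [ok]
    }
}
```

(In a real analyzer chain this filtering would sit right after the tokenizer, so later filters never see the huge tokens.)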
Re: DocumentsWriter.checkMaxTermLength issues
Posted by Grant Ingersoll <gs...@apache.org>.
On Dec 20, 2007, at 10:55 AM, Yonik Seeley wrote:
> On Dec 20, 2007 9:41 AM, Grant Ingersoll <gs...@apache.org> wrote:
>> I'm wondering if the IndexWriter should throw an explicit exception
>> in
>> this case as opposed to a RuntimeException,
>
> RuntimeExceptions can happen in analysis components during indexing
> anyway, so it seems like indexing code should deal with exceptions
> just to be safe. As long as exceptions happening during indexing
> don't mess up the indexing code, everything should be OK.
>
>> as it seems to me really
>> long tokens should be handled more gracefully. It seems strange that
>> the message says the terms were skipped (which the code does in fact
>> do), but then there is a RuntimeException thrown which usually
>> indicates to me the issue is not recoverable.
>
> It does seem like the document shouldn't be added at all if it caused
> an exception.
> Is that what happens if one of the analyzers causes an exception to
> be thrown?
>
> The other option is to simply ignore tokens above 16K... I'm not sure
> what's right here.
+1. The code already does ignore them, that is why the exception
seems so weird. DocsWriter gracefully handles the problem, but then
throws up after the fact. I would vote to just log it or let the user
decide somehow.
Re: DocumentsWriter.checkMaxTermLength issues
Posted by Michael McCandless <lu...@mikemccandless.com>.
Yonik Seeley wrote:
> On Dec 20, 2007 11:15 AM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>> Though ... we could simply immediately delete the document when any
>> exception occurs during its processing. So if we think whenever any
>> doc hits an exception, then it should be deleted, it's not so hard to
>> implement that policy...
>
> It does seem like you only want documents in the index that didn't
> generate exceptions... otherwise it doesn't seem like you would know
> exactly what got indexed.
I agree -- I'll work on this.
Mike
Re: DocumentsWriter.checkMaxTermLength issues
Posted by Michael McCandless <lu...@mikemccandless.com>.
OK I will take this approach... create TermTooLongException
(subclasses RuntimeException), listed in the javadocs but not the
throws clause of add/updateDocument. DW throws this if it encounters
any term >= 16383 chars in length.
Whenever that exception (or any other) is thrown from within DW, it
means that document will not be added to your index (well, perhaps
partially added and then deleted).
Probably won't get going on this one until early next year ... I'm
mostly offline from 12/22 - 1/1.
Mike
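(The exception Mike describes above could look roughly like this. A hypothetical sketch only, since the class didn't exist yet when this was written: unchecked, so it need not appear in the throws clause of add/updateDocument, but documented in the javadocs. The accessor name is invented.)

```java
// Hypothetical sketch of the proposed TermTooLongException: it extends
// RuntimeException so existing method signatures don't change, keeping
// back-compat, while still being documentable in the javadocs.
public class TermTooLongException extends RuntimeException {
    private final int length;

    public TermTooLongException(int length, int maxLength) {
        super("term of length " + length
              + " exceeds max term length " + maxLength);
        this.length = length;
    }

    // Lets callers that catch it report or log the offending length.
    public int getTermLength() {
        return length;
    }
}
```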
Yonik Seeley wrote:
> On Dec 20, 2007 2:25 PM, Grant Ingersoll <gs...@apache.org> wrote:
>> Makes sense. I wasn't sure if declaring new exceptions to be thrown
>> is violating back-compat. issues or not (even if they are runtime
>> exceptions)
>
> That's a good question... I know that declared RuntimeExceptions are
> contained in the bytecode (the method signature)... but I don't know
> if they need to match up exactly for things to work.
>
> To be safe I guess we should start out with it commented out (or just
> documented in the JavaDoc).
>
> -Yonik
>
>> On Dec 20, 2007, at 1:47 PM, Yonik Seeley wrote:
>>
>>> On Dec 20, 2007 1:36 PM, Grant Ingersoll <gs...@apache.org>
>>> wrote:
>>>> But, I can see the value in the throw the exception
>>>> case too, except I think the API should declare the exception is
>>>> being
>>>> thrown. It could throw an extension of IOException.
>>>
>>> To be robust, user indexing code needs to catch other types of
>>> exceptions that could be thrown from Analyzers anyway.
>>>
>>> I don't think this exception (if we choose to keep it as an
>>> exception)
>>> fits in the class of IOException, where something is normally really
>>> wrong.
>>>
>>> We could declare addDocument() to throw something inherited from
>>> RuntimeException though, right?
>>>
>>> -Yonik
Re: DocumentsWriter.checkMaxTermLength issues
Posted by Yonik Seeley <yo...@apache.org>.
On Dec 20, 2007 2:25 PM, Grant Ingersoll <gs...@apache.org> wrote:
> Makes sense. I wasn't sure if declaring new exceptions to be thrown
> is violating back-compat. issues or not (even if they are runtime
> exceptions)
That's a good question... I know that declared RuntimeExceptions are
contained in the bytecode (the method signature)... but I don't know
if they need to match up exactly for things to work.
To be safe I guess we should start out with it commented out (or just
documented in the JavaDoc).
-Yonik
> On Dec 20, 2007, at 1:47 PM, Yonik Seeley wrote:
>
> > On Dec 20, 2007 1:36 PM, Grant Ingersoll <gs...@apache.org> wrote:
> >> But, I can see the value in the throw the exception
> >> case too, except I think the API should declare the exception is
> >> being
> >> thrown. It could throw an extension of IOException.
> >
> > To be robust, user indexing code needs to catch other types of
> > exceptions that could be thrown from Analyzers anyway.
> >
> > I don't think this exception (if we choose to keep it as an exception)
> > fits in the class of IOException, where something is normally really
> > wrong.
> >
> > We could declare addDocument() to throw something inherited from
> > RuntimeException though, right?
> >
> > -Yonik
Re: DocumentsWriter.checkMaxTermLength issues
Posted by Grant Ingersoll <gs...@apache.org>.
Makes sense. I wasn't sure if declaring new exceptions to be thrown
is violating back-compat. issues or not (even if they are runtime
exceptions)
On Dec 20, 2007, at 1:47 PM, Yonik Seeley wrote:
> On Dec 20, 2007 1:36 PM, Grant Ingersoll <gs...@apache.org> wrote:
>> But, I can see the value in the throw the exception
>> case too, except I think the API should declare the exception is
>> being
>> thrown. It could throw an extension of IOException.
>
> To be robust, user indexing code needs to catch other types of
> exceptions that could be thrown from Analyzers anyway.
>
> I don't think this exception (if we choose to keep it as an exception)
> fits in the class of IOException, where something is normally really
> wrong.
>
> We could declare addDocument() to throw something inherited from
> RuntimeException though, right?
>
> -Yonik
Re: DocumentsWriter.checkMaxTermLength issues
Posted by Yonik Seeley <yo...@apache.org>.
On Dec 20, 2007 1:36 PM, Grant Ingersoll <gs...@apache.org> wrote:
> But, I can see the value in the throw the exception
> case too, except I think the API should declare the exception is being
> thrown. It could throw an extension of IOException.
To be robust, user indexing code needs to catch other types of
exceptions that could be thrown from Analyzers anyway.
I don't think this exception (if we choose to keep it as an exception)
fits in the class of IOException, where something is normally really
wrong.
We could declare addDocument() to throw something inherited from
RuntimeException though, right?
-Yonik
Re: DocumentsWriter.checkMaxTermLength issues
Posted by Grant Ingersoll <gs...@apache.org>.
On Dec 20, 2007, at 11:57 AM, Michael McCandless wrote:
>
> Yonik Seeley wrote:
>
>> On Dec 20, 2007 11:33 AM, Gabi Steinberg
>> <ga...@comcast.net> wrote:
>>> It might be a bit harsh to drop the document if it has a very long
>>> token
>>> in it.
>>
>> There are really two issues here.
>> For long tokens, one could either ignore them or generate an
>> exception.
>
> I can see the argument both ways. On the one hand, we want indexing
> to be robust/resilient, such that massive terms are quietly skipped
> (maybe w/ a log to infoStream if it's set).
This would be fine for me. In some sense, it is just like applying
the LengthFilter, which removes tokens silently, too, but works for
all analyzers. But, I can see the value in the throw the exception
case too, except I think the API should declare the exception is being
thrown. It could throw an extension of IOException.
>
>
> On the other hand, clearly there is something seriously wrong when
> your analyzer is producing a single 16+ KB term, and so it would be
> nice to be brittle/in-your-face so the user is forced to deal with/
> correct the situation.
>
> Also, it's really bad once these terms pollute your index. EG
> suddenly the TermInfos index can easily take tremendous amounts of
> RAM, slow down indexing/merging/searching, etc. This is why
> LUCENE-1052 was created. It's a lot better to catch this up front
> than to let it pollute your index.
>
> If we want to take the "in your face" solution, I think the cutoff
> should be less than 16 KB (16 KB is just the hard limit inside DW).
>
>> For all exceptions generated while indexing a document (that are
>> passed through to the user)
>> it seems like that document should not be in the index.
>
> I like this disposition because it means the index is in a known
> state. It's bad to have partial docs in the index: it can only lead
> to more confusion as people try to figure out why some terms work
> for retrieving the doc but others don't.
>
> Mike
Re: DocumentsWriter.checkMaxTermLength issues
Posted by Michael McCandless <lu...@mikemccandless.com>.
Doron Cohen wrote:
> On Dec 31, 2007 7:54 PM, Michael McCandless
> <lu...@mikemccandless.com>
> wrote:
>
>> I actually think indexing should try to be as robust as possible.
>> You
>> could test like crazy and never hit a massive term, go into
>> production
>> (say, ship your app to lots of your customer's computers) only to
>> suddenly see this exception. In general it could be a long time
>> before
>> you or your users "accidentally" see this.
>>
>> So I'm thinking we should have the default behavior, in IndexWriter,
>> be to skip immense terms?
>>
>> Then people can use TokenFilter to change this behavior if they want.
>>
>
> +1
OK I will take this approach.
> At first I saw this similar to IndexWriter.setMaxFieldLength(), but
> it was
> a wrong comparison, because #terms is a "real" indexing/search
> characteristic that many applications can benefit from being able
> to modify, whereas a huge token is in most cases a bug.
>
> Just to make sure on the scenario - the only change is to skip too
> long
> tokens, while any other exception is thrown (not ignored.)
Exactly. And, on any exception, we will immediately mark any
partially indexed doc as deleted.
> Also, for a skipped token I think the position increment of the
> following token should be incremented.
Good point; I'll make sure we do.
Mike
Re: DocumentsWriter.checkMaxTermLength issues
Posted by Doron Cohen <cd...@gmail.com>.
On Dec 31, 2007 7:54 PM, Michael McCandless <lu...@mikemccandless.com>
wrote:
> I actually think indexing should try to be as robust as possible. You
> could test like crazy and never hit a massive term, go into production
> (say, ship your app to lots of your customer's computers) only to
> suddenly see this exception. In general it could be a long time before
> you or your users "accidentally" see this.
>
> So I'm thinking we should have the default behavior, in IndexWriter,
> be to skip immense terms?
>
> Then people can use TokenFilter to change this behavior if they want.
>
+1
At first I saw this similar to IndexWriter.setMaxFieldLength(), but it was
a wrong comparison, because #terms is a "real" indexing/search
characteristic that many applications can benefit from being able
to modify, whereas a huge token is in most cases a bug.
Just to make sure on the scenario - the only change is to skip too long
tokens, while any other exception is thrown (not ignored.)
Also, for a skipped token I think the position increment of the
following token should be incremented.
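(Doron's position-increment point can be sketched as follows. Illustrative code only, not Lucene's actual TokenStream API: when a token is skipped, its position increment is carried over to the next surviving token, so the "hole" it leaves is preserved for phrase and span queries. The Token holder class and method names here are invented.)

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: drop over-long tokens but fold their position
// increments into the next emitted token, preserving positional gaps.
public class SkipLongTokens {
    public static class Token {
        public final String text;
        public int posIncr;
        public Token(String text, int posIncr) {
            this.text = text;
            this.posIncr = posIncr;
        }
    }

    public static List<Token> skip(List<Token> in, int maxLen) {
        List<Token> out = new ArrayList<Token>();
        int pending = 0; // increments accumulated from skipped tokens
        for (Token t : in) {
            if (t.text.length() > maxLen) {
                pending += t.posIncr; // remember the hole we left
            } else {
                t.posIncr += pending; // next kept token absorbs the gap
                pending = 0;
                out.add(t);
            }
        }
        return out;
    }
}
```

(For input a(1), LONGTOKEN(1), b(1) with maxLen below the long token's length, this emits a(1), b(2): b's increment of 2 records that a token was dropped between them.)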
Re: DocumentsWriter.checkMaxTermLength issues
Posted by Yonik Seeley <yo...@apache.org>.
On Dec 31, 2007 12:54 PM, Michael McCandless <lu...@mikemccandless.com> wrote:
> I actually think indexing should try to be as robust as possible. You
> could test like crazy and never hit a massive term, go into production
> (say, ship your app to lots of your customer's computers) only to
> suddenly see this exception. In general it could be a long time before
> you or your users "accidentally" see this.
>
> So I'm thinking we should have the default behavior, in IndexWriter,
> be to skip immense terms?
>
> Then people can use TokenFilter to change this behavior if they want.
+1
-Yonik
Re: DocumentsWriter.checkMaxTermLength issues
Posted by Michael McCandless <lu...@mikemccandless.com>.
Grant Ingersoll wrote:
>
> On Dec 31, 2007, at 12:54 PM, Michael McCandless wrote:
>
>> I actually think indexing should try to be as robust as possible.
>> You
>> could test like crazy and never hit a massive term, go into
>> production
>> (say, ship your app to lots of your customer's computers) only to
>> suddenly see this exception. In general it could be a long time
>> before
>> you or your users "accidentally" see this.
>>
>> So I'm thinking we should have the default behavior, in IndexWriter,
>> be to skip immense terms?
>>
>> Then people can use TokenFilter to change this behavior if they want.
>>
> +1. We could log it, right?
Yes, to IndexWriter's infoStream, if it's set. I'll do that...
Mike
Re: DocumentsWriter.checkMaxTermLength issues
Posted by Grant Ingersoll <gs...@apache.org>.
On Dec 31, 2007, at 12:54 PM, Michael McCandless wrote:
> I actually think indexing should try to be as robust as possible. You
> could test like crazy and never hit a massive term, go into production
> (say, ship your app to lots of your customer's computers) only to
> suddenly see this exception. In general it could be a long time
> before
> you or your users "accidentally" see this.
>
> So I'm thinking we should have the default behavior, in IndexWriter,
> be to skip immense terms?
>
> Then people can use TokenFilter to change this behavior if they want.
>
+1. We could log it, right?
Re: DocumentsWriter.checkMaxTermLength issues
Posted by Michael McCandless <lu...@mikemccandless.com>.
I actually think indexing should try to be as robust as possible. You
could test like crazy and never hit a massive term, go into production
(say, ship your app to lots of your customer's computers) only to
suddenly see this exception. In general it could be a long time before
you or your users "accidentally" see this.
So I'm thinking we should have the default behavior, in IndexWriter,
be to skip immense terms?
Then people can use TokenFilter to change this behavior if they want.
Mike
Yonik Seeley <yo...@apache.org> wrote:
> On Dec 31, 2007 12:25 PM, Grant Ingersoll <gs...@apache.org> wrote:
> > Sure, but I mean in the >16K (in other words, in the case where
> > DocsWriter fails, which presumably only DocsWriter knows about) case.
> > I want the option to ignore tokens larger than that instead of failing/
> > throwing an exception.
>
> I think the issue here is what the default behavior for IndexWriter should be.
>
> If configuration is required because something other than the default
> is desired, then one could use a TokenFilter to change the behavior
> rather than changing options on IndexWriter. Using a TokenFilter is
> much more flexible.
>
> > Imagine I am charged w/ indexing some data
> > that I don't know anything about (i.e. computer forensics), my goal
> > would be to index as much as possible in my first raw pass, so that I
> > can then begin to explore the dataset. Having it completely discard
> > the document is not a good thing, but throwing away some large binary
> > tokens would be acceptable (especially if I get warnings about said
> > tokens) and robust.
>
> -Yonik
Re: DocumentsWriter.checkMaxTermLength issues
Posted by Yonik Seeley <yo...@apache.org>.
On Dec 31, 2007 12:25 PM, Grant Ingersoll <gs...@apache.org> wrote:
> Sure, but I mean in the >16K (in other words, in the case where
> DocsWriter fails, which presumably only DocsWriter knows about) case.
> I want the option to ignore tokens larger than that instead of failing/
> throwing an exception.
I think the issue here is what the default behavior for IndexWriter should be.
If configuration is required because something other than the default
is desired, then one could use a TokenFilter to change the behavior
rather than changing options on IndexWriter. Using a TokenFilter is
much more flexible.
> Imagine I am charged w/ indexing some data
> that I don't know anything about (i.e. computer forensics), my goal
> would be to index as much as possible in my first raw pass, so that I
> can then begin to explore the dataset. Having it completely discard
> the document is not a good thing, but throwing away some large binary
> tokens would be acceptable (especially if I get warnings about said
> tokens) and robust.
-Yonik
Re: DocumentsWriter.checkMaxTermLength issues
Posted by Grant Ingersoll <gs...@apache.org>.
On Dec 31, 2007, at 12:11 PM, Yonik Seeley wrote:
> On Dec 31, 2007 11:59 AM, Grant Ingersoll <gs...@apache.org> wrote:
>>
>> On Dec 31, 2007, at 11:44 AM, Yonik Seeley wrote:
>>> I meant (1)... it leaves the core smaller.
>>> I don't see any reason to have logic to truncate or discard tokens
>>> in
>>> the core indexing code (except to handle tokens >16k as an error
>>> condition).
>>
>> I would agree here, with the exception that I want the option for it
>> to be treated as an error.
>
> That should also be possible via an analyzer component throwing an
> exception.
>
Sure, but I mean in the >16K (in other words, in the case where
DocsWriter fails, which presumably only DocsWriter knows about) case.
I want the option to ignore tokens larger than that instead of failing/
throwing an exception. Imagine I am charged w/ indexing some data
that I don't know anything about (i.e. computer forensics), my goal
would be to index as much as possible in my first raw pass, so that I
can then begin to explore the dataset. Having it completely discard
the document is not a good thing, but throwing away some large binary
tokens would be acceptable (especially if I get warnings about said
tokens) and robust.
-Grant
Re: DocumentsWriter.checkMaxTermLength issues
Posted by Yonik Seeley <yo...@apache.org>.
On Dec 31, 2007 11:59 AM, Grant Ingersoll <gs...@apache.org> wrote:
>
> On Dec 31, 2007, at 11:44 AM, Yonik Seeley wrote:
> > I meant (1)... it leaves the core smaller.
> > I don't see any reason to have logic to truncate or discard tokens in
> > the core indexing code (except to handle tokens >16k as an error
> > condition).
>
> I would agree here, with the exception that I want the option for it
> to be treated as an error.
That should also be possible via an analyzer component throwing an exception.
-Yonik
Re: DocumentsWriter.checkMaxTermLength issues
Posted by Grant Ingersoll <gs...@apache.org>.
On Dec 31, 2007, at 11:44 AM, Yonik Seeley wrote:
> On Dec 31, 2007 11:37 AM, Doron Cohen <cd...@gmail.com> wrote:
>>
>> On Dec 31, 2007 6:10 PM, Yonik Seeley <yo...@apache.org> wrote:
>>
>> I think I like the 3rd option - is this what you meant?
>
> I meant (1)... it leaves the core smaller.
> I don't see any reason to have logic to truncate or discard tokens in
> the core indexing code (except to handle tokens >16k as an error
> condition).
I would agree here, with the exception that I want the option for it
to be treated as an error. In some cases, I would be just as happy
for it to silently ignore the token, or to log it.
Re: DocumentsWriter.checkMaxTermLength issues
Posted by Yonik Seeley <yo...@apache.org>.
On Dec 31, 2007 11:37 AM, Doron Cohen <cd...@gmail.com> wrote:
>
> On Dec 31, 2007 6:10 PM, Yonik Seeley <yo...@apache.org> wrote:
>
> > On Dec 31, 2007 5:53 AM, Michael McCandless <lu...@mikemccandless.com>
> > wrote:
> > > Doron Cohen <cd...@gmail.com> wrote:
> > > > I like the approach of configuration of this behavior in Analysis
> > > > (and so IndexWriter can throw an exception on such errors).
> > > >
> > > > It seems that this should be a property of Analyzer vs.
> > > > just StandardAnalyzer, right?
> > > >
> > > > It can probably be a "policy" property, with two parameters:
> > > > 1) maxLength, 2) action: chop/split/ignore/raiseException when
> > > > generating too long tokens.
> > >
> > > Agreed, this should be generic/shared to all analyzers.
> > >
> > > But maybe for 2.3, we just truncate any too-long term to the max
> > > allowed size, and then after 2.3 we make this a settable "policy"?
> >
> > But we already have a nice component model for analyzers...
> > why not just encapsulate truncation/discarding in a TokenFilter?
>
>
> Makes sense, especially for the implementation aspect.
> I'm not sure what API you have in mind:
>
> (1) leave that for applications, to append such a
> TokenFilter to their Analyzer (== no change),
>
> (2) DocumentsWriter to create such a TokenFilter
> under the cover, to force behavior that is defined (where?), or
>
> (3) have an IndexingTokenFilter assigned to IndexWriter,
> make the default such filter trim/ignore/whatever as discussed
> and then applications can set a different IndexingTokenFilter for
> changing the default behavior?
>
> I think I like the 3rd option - is this what you meant?
I meant (1)... it leaves the core smaller.
I don't see any reason to have logic to truncate or discard tokens in
the core indexing code (except to handle tokens >16k as an error
condition).
Most of the time you want to catch those large tokens early on in the
chain anyway (put the filter right after the tokenizer). Doing it
later could cause exceptions or issues with other token filters that
might not be expecting huge tokens.
-Yonik
Re: DocumentsWriter.checkMaxTermLength issues
Posted by Doron Cohen <cd...@gmail.com>.
On Dec 31, 2007 6:10 PM, Yonik Seeley <yo...@apache.org> wrote:
> On Dec 31, 2007 5:53 AM, Michael McCandless <lu...@mikemccandless.com>
> wrote:
> > Doron Cohen <cd...@gmail.com> wrote:
> > > I like the approach of configuration of this behavior in Analysis
> > > (and so IndexWriter can throw an exception on such errors).
> > >
> > > It seems that this should be a property of Analyzer vs.
> > > just StandardAnalyzer, right?
> > >
> > > It can probably be a "policy" property, with two parameters:
> > > 1) maxLength, 2) action: chop/split/ignore/raiseException when
> > > generating too long tokens.
> >
> > Agreed, this should be generic/shared to all analyzers.
> >
> > But maybe for 2.3, we just truncate any too-long term to the max
> > allowed size, and then after 2.3 we make this a settable "policy"?
>
> But we already have a nice component model for analyzers...
> why not just encapsulate truncation/discarding in a TokenFilter?
Makes sense, especially for the implementation aspect.
I'm not sure what API you have in mind:
(1) leave that for applications, to append such a
TokenFilter to their Analyzer (== no change),
(2) DocumentsWriter to create such a TokenFilter
under the cover, to force behavior that is defined (where?), or
(3) have an IndexingTokenFilter assigned to IndexWriter,
make the default such filter trim/ignore/whatever as discussed
and then applications can set a different IndexingTokenFilter for
changing the default behavior?
I think I like the 3rd option - is this what you meant?
Doron
Re: DocumentsWriter.checkMaxTermLength issues
Posted by Yonik Seeley <yo...@apache.org>.
On Dec 31, 2007 5:53 AM, Michael McCandless <lu...@mikemccandless.com> wrote:
> Doron Cohen <cd...@gmail.com> wrote:
> > I like the approach of configuration of this behavior in Analysis
> > (and so IndexWriter can throw an exception on such errors).
> >
> > It seems that this should be a property of Analyzer vs.
> > just StandardAnalyzer, right?
> >
> > It can probably be a "policy" property, with two parameters:
> > 1) maxLength, 2) action: chop/split/ignore/raiseException when
> > generating too long tokens.
>
> Agreed, this should be generic/shared to all analyzers.
>
> But maybe for 2.3, we just truncate any too-long term to the max
> allowed size, and then after 2.3 we make this a settable "policy"?
But we already have a nice component model for analyzers...
why not just encapsulate truncation/discarding in a TokenFilter?
-Yonik
Re: DocumentsWriter.checkMaxTermLength issues
Posted by Michael McCandless <lu...@mikemccandless.com>.
Doron Cohen <cd...@gmail.com> wrote:
> I like the approach of configuration of this behavior in Analysis
> (and so IndexWriter can throw an exception on such errors).
>
> It seems that this should be a property of Analyzer vs.
> just StandardAnalyzer, right?
>
> It can probably be a "policy" property, with two parameters:
> 1) maxLength, 2) action: chop/split/ignore/raiseException when
> generating too long tokens.
Agreed, this should be generic/shared to all analyzers.
But maybe for 2.3, we just truncate any too-long term to the max
allowed size, and then after 2.3 we make this a settable "policy"?
> Doron
>
> On Dec 21, 2007 10:46 PM, Michael McCandless <lu...@mikemccandless.com>
> wrote:
>
> >
> > I think this is a good approach -- any objections?
> >
> > This way, IndexWriter is in-your-face (throws TermTooLongException on
> > seeing a massive term), but StandardAnalyzer is robust (silently
> > > skips or prefixes the too-long terms).
> >
> > Mike
> >
> > Gabi Steinberg wrote:
> >
> > > How about defaulting to a max token size of 16K in
> > > StandardTokenizer, so that it never causes an IndexWriter
> > > exception, with an option to reduce that size?
> > >
> > > The backward incompatibility is limited then - tokens exceeding 16K
> > > will NOT cause an IndexWriter exception. In 3.0 we can reduce
> > > that default to a useful size.
> > >
> > > The option to truncate the token can be useful, I think. It will
> > > index the max size prefix of the long tokens. You can still find
> > > them, pretty accurately - this becomes a prefix search, but is
> > > unlikely to return multiple values because it's a long prefix. It
> > > allows you to choose a relatively small max, such as 32 or 64,
> > > reducing the overhead caused by junk in the documents while
> > > minimizing the chance of not finding something.
> > >
> > > Gabi.
> > >
> > > Michael McCandless wrote:
> > >> Gabi Steinberg wrote:
> > >>> On balance, I think that dropping the document makes sense. I
> > >>> think Yonik is right in that ensuring that keys are useful - and
> > >>> indexable - is the tokenizer's job.
> > >>>
> > >>> StandardTokenizer, in my opinion, should behave similarly to a
> > >>> person looking at a document and deciding which tokens should be
> > >>> indexed. Few people would argue that a 16K block of binary data
> > >>> is useful for searching, but it's reasonable to suggest that the
> > >>> text around it is useful.
> > >>>
> > >>> I know that one can add the LengthFilter to avoid this problem,
> > >>> but this is not really intuitive; one does not expect the
> > >>> standard tokenizer to generate tokens that IndexWriter chokes on.
> > >>>
> > >>> My vote is to:
> > >>> - drop documents with tokens longer than 16K, as Mike and Yonik
> > >>> suggested
> > > - because an uninformed user would start with StandardTokenizer, I
> > >>> think it should limit token size to 128 bytes, and add options to
> > >>> change that size, choose between truncating or dropping longer
> > > tokens, and in no case produce tokens longer than what
> > >>> IndexWriter can digest.
> > >> I like this idea, though we probably can't do that until 3.0 so we
> > >> don't break backwards compatibility?
> > > ...
> > >
Re: DocumentsWriter.checkMaxTermLength issues
Posted by Doron Cohen <cd...@gmail.com>.
I like the approach of configuration of this behavior in Analysis
(and so IndexWriter can throw an exception on such errors).
It seems that this should be a property of Analyzer vs.
just StandardAnalyzer, right?
It can probably be a "policy" property, with two parameters:
1) maxLength, 2) action: chop/split/ignore/raiseException when
generating too long tokens.
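As a rough sketch of such a policy object (the action names come from the proposal above; the class shape and everything else are assumptions):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical "policy" property: a max length plus an action
// (CHOP/SPLIT/IGNORE/RAISE) applied whenever a token exceeds the limit.
public class TokenLengthPolicy {
    enum Action { CHOP, SPLIT, IGNORE, RAISE }

    final int maxLength;
    final Action action;

    TokenLengthPolicy(int maxLength, Action action) {
        this.maxLength = maxLength;
        this.action = action;
    }

    /** Returns the tokens to emit for one incoming token. */
    List<String> apply(String term) {
        List<String> out = new ArrayList<>();
        if (term.length() <= maxLength) {
            out.add(term);
            return out;
        }
        switch (action) {
            case CHOP:                       // keep only the prefix
                out.add(term.substring(0, maxLength));
                break;
            case SPLIT:                      // emit maxLength-sized chunks
                for (int i = 0; i < term.length(); i += maxLength) {
                    out.add(term.substring(i, Math.min(i + maxLength, term.length())));
                }
                break;
            case IGNORE:                     // drop the token silently
                break;
            case RAISE:
                throw new IllegalArgumentException(
                    "term length " + term.length() + " exceeds " + maxLength);
        }
        return out;
    }
}
```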
Doron
On Dec 21, 2007 10:46 PM, Michael McCandless <lu...@mikemccandless.com>
wrote:
>
> I think this is a good approach -- any objections?
>
> This way, IndexWriter is in-your-face (throws TermTooLongException on
> seeing a massive term), but StandardAnalyzer is robust (silently
> > skips or prefixes the too-long terms).
>
> Mike
>
> Gabi Steinberg wrote:
>
> > How about defaulting to a max token size of 16K in
> > StandardTokenizer, so that it never causes an IndexWriter
> > exception, with an option to reduce that size?
> >
> > The backward incompatibility is limited then - tokens exceeding 16K
> > will NOT cause an IndexWriter exception. In 3.0 we can reduce
> > that default to a useful size.
> >
> > The option to truncate the token can be useful, I think. It will
> > index the max size prefix of the long tokens. You can still find
> > them, pretty accurately - this becomes a prefix search, but is
> > unlikely to return multiple values because it's a long prefix. It
> > allows you to choose a relatively small max, such as 32 or 64,
> > reducing the overhead caused by junk in the documents while
> > minimizing the chance of not finding something.
> >
> > Gabi.
> >
> > Michael McCandless wrote:
> >> Gabi Steinberg wrote:
> >>> On balance, I think that dropping the document makes sense. I
> >>> think Yonik is right in that ensuring that keys are useful - and
> >>> indexable - is the tokenizer's job.
> >>>
> >>> StandardTokenizer, in my opinion, should behave similarly to a
> >>> person looking at a document and deciding which tokens should be
> >>> indexed. Few people would argue that a 16K block of binary data
> >>> is useful for searching, but it's reasonable to suggest that the
> >>> text around it is useful.
> >>>
> >>> I know that one can add the LengthFilter to avoid this problem,
> >>> but this is not really intuitive; one does not expect the
> >>> standard tokenizer to generate tokens that IndexWriter chokes on.
> >>>
> >>> My vote is to:
> >>> - drop documents with tokens longer than 16K, as Mike and Yonik
> >>> suggested
> > - because an uninformed user would start with StandardTokenizer, I
> >>> think it should limit token size to 128 bytes, and add options to
> >>> change that size, choose between truncating or dropping longer
> > tokens, and in no case produce tokens longer than what
> >>> IndexWriter can digest.
> >> I like this idea, though we probably can't do that until 3.0 so we
> >> don't break backwards compatibility?
> > ...
> >
Re: DocumentsWriter.checkMaxTermLength issues
Posted by Michael McCandless <lu...@mikemccandless.com>.
I think this is a good approach -- any objections?
This way, IndexWriter is in-your-face (throws TermTooLongException on
seeing a massive term), but StandardAnalyzer is robust (silently
skips or prefixes the too-long terms).
Mike
Gabi Steinberg wrote:
> How about defaulting to a max token size of 16K in
> StandardTokenizer, so that it never causes an IndexWriter
> exception, with an option to reduce that size?
>
> The backward incompatibility is limited then - tokens exceeding 16K
> will NOT cause an IndexWriter exception. In 3.0 we can reduce
> that default to a useful size.
>
> The option to truncate the token can be useful, I think. It will
> index the max size prefix of the long tokens. You can still find
> them, pretty accurately - this becomes a prefix search, but is
> unlikely to return multiple values because it's a long prefix. It
> allows you to choose a relatively small max, such as 32 or 64,
> reducing the overhead caused by junk in the documents while
> minimizing the chance of not finding something.
>
> Gabi.
>
> Michael McCandless wrote:
>> Gabi Steinberg wrote:
>>> On balance, I think that dropping the document makes sense. I
>>> think Yonik is right in that ensuring that keys are useful - and
>>> indexable - is the tokenizer's job.
>>>
>>> StandardTokenizer, in my opinion, should behave similarly to a
>>> person looking at a document and deciding which tokens should be
>>> indexed. Few people would argue that a 16K block of binary data
>>> is useful for searching, but it's reasonable to suggest that the
>>> text around it is useful.
>>>
>>> I know that one can add the LengthFilter to avoid this problem,
>>> but this is not really intuitive; one does not expect the
>>> standard tokenizer to generate tokens that IndexWriter chokes on.
>>>
>>> My vote is to:
>>> - drop documents with tokens longer than 16K, as Mike and Yonik
>>> suggested
> - because an uninformed user would start with StandardTokenizer, I
>>> think it should limit token size to 128 bytes, and add options to
>>> change that size, choose between truncating or dropping longer
> tokens, and in no case produce tokens longer than what
>>> IndexWriter can digest.
>> I like this idea, though we probably can't do that until 3.0 so we
>> don't break backwards compatibility?
> ...
>
Re: DocumentsWriter.checkMaxTermLength issues
Posted by Gabi Steinberg <ga...@comcast.net>.
How about defaulting to a max token size of 16K in StandardTokenizer, so
that it never causes an IndexWriter exception, with an option to reduce
that size?
The backward incompatibility is limited then - tokens exceeding 16K will
NOT cause an IndexWriter exception. In 3.0 we can reduce that default
to a useful size.
The option to truncate the token can be useful, I think. It will index
the max size prefix of the long tokens. You can still find them, pretty
accurately - this becomes a prefix search, but is unlikely to return
multiple values because it's a long prefix. It allows you to choose a
relatively small max, such as 32 or 64, reducing the overhead caused by
junk in the documents while minimizing the chance of not finding something.
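To illustrate that prefix effect (self-contained sketch; the 64-character cap and a sorted set standing in for the term dictionary are assumptions):

```java
import java.util.TreeSet;

// If long tokens are truncated to a fixed max at index time, an exact-term
// lookup still works as long as the query term is truncated the same way;
// effectively it becomes a long-prefix match, which rarely collides.
public class TruncatedLookup {
    static final int MAX = 64;   // illustrative cap

    static String truncate(String term) {
        return term.length() <= MAX ? term : term.substring(0, MAX);
    }

    public static void main(String[] args) {
        TreeSet<String> index = new TreeSet<>();   // stand-in term dictionary
        String longToken = "deadbeef".repeat(32);  // 256 chars of "junk"
        index.add(truncate(longToken));
        index.add(truncate("ordinary"));

        // Query-time truncation of the same long token finds the entry.
        System.out.println(index.contains(truncate(longToken)));
    }
}
```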
Gabi.
Michael McCandless wrote:
> Gabi Steinberg wrote:
>
>> On balance, I think that dropping the document makes sense. I think
>> Yonik is right in that ensuring that keys are useful - and indexable -
>> is the tokenizer's job.
>>
>> StandardTokenizer, in my opinion, should behave similarly to a person
>> looking at a document and deciding which tokens should be indexed.
>> Few people would argue that a 16K block of binary data is useful for
>> searching, but it's reasonable to suggest that the text around it is
>> useful.
>>
>> I know that one can add the LengthFilter to avoid this problem, but
>> this is not really intuitive; one does not expect the standard
>> tokenizer to generate tokens that IndexWriter chokes on.
>>
>> My vote is to:
>> - drop documents with tokens longer than 16K, as Mike and Yonik suggested
>> - because an uninformed user would start with StandardTokenizer, I think
>> it should limit token size to 128 bytes, and add options to change
>> that size, choose between truncating or dropping longer tokens, and in
>> no case produce tokens longer than what IndexWriter can digest.
>
> I like this idea, though we probably can't do that until 3.0 so we don't
> break backwards compatibility?
>
...
Re: DocumentsWriter.checkMaxTermLength issues
Posted by Michael McCandless <lu...@mikemccandless.com>.
Gabi Steinberg wrote:
> On balance, I think that dropping the document makes sense. I
> think Yonik is right in that ensuring that keys are useful - and
> indexable - is the tokenizer's job.
>
> StandardTokenizer, in my opinion, should behave similarly to a
> person looking at a document and deciding which tokens should be
> indexed. Few people would argue that a 16K block of binary data is
> useful for searching, but it's reasonable to suggest that the text
> around it is useful.
>
> I know that one can add the LengthFilter to avoid this problem, but
> this is not really intuitive; one does not expect the standard
> tokenizer to generate tokens that IndexWriter chokes on.
>
> My vote is to:
> - drop documents with tokens longer than 16K, as Mike and Yonik
> suggested
> - because an uninformed user would start with StandardTokenizer, I
> think it should limit token size to 128 bytes, and add options to
> change that size, choose between truncating or dropping longer
> tokens, and in no case produce tokens longer than what IndexWriter
> can digest.
I like this idea, though we probably can't do that until 3.0 so we
don't break backwards compatibility?
> - perhaps come up with a clear policy on when a tokenizer should throw
> an exception?
> Gabi Steinberg.
>
> Yonik Seeley wrote:
>> On Dec 20, 2007 11:57 AM, Michael McCandless
>> <lu...@mikemccandless.com> wrote:
>>> Yonik Seeley wrote:
>>>> On Dec 20, 2007 11:33 AM, Gabi Steinberg
>>>> <ga...@comcast.net> wrote:
>>>>> It might be a bit harsh to drop the document if it has a very long
>>>>> token
>>>>> in it.
>>>> There are really two issues here.
>>>> For long tokens, one could either ignore them or generate an
>>>> exception.
>>> I can see the argument both ways.
>> Me too.
>>> On the one hand, we want indexing
>>> to be robust/resilient, such that massive terms are quietly skipped
>>> (maybe w/ a log to infoStream if it's set).
>>>
>>> On the other hand, clearly there is something seriously wrong when
>>> your analyzer is producing a single 16+ KB term, and so it would be
>>> nice to be brittle/in-your-face so the user is forced to deal with/
>>> correct the situation.
>>>
>>> Also, it's really bad once these terms pollute your index. E.g.
>>> suddenly the TermInfos index can easily take tremendous amounts of
>>> RAM, slow down indexing/merging/searching, etc. This is why
>>> LUCENE-1052 was created. It's a lot better to catch this up front
>>> than to let it pollute your index.
>>>
>>> If we want to take the "in your face" solution, I think the cutoff
>>> should be less than 16 KB (16 KB is just the hard limit inside DW).
>>>
>>>> For all exceptions generated while indexing a document (that are
>>>> passed through to the user)
>>>> it seems like that document should not be in the index.
>>> I like this disposition because it means the index is in a known
>>> state. It's bad to have partial docs in the index: it can only lead
>>> to more confusion as people try to figure out why some terms work
>>> for
>>> retrieving the doc but others don't.
>> Right... and I think that was the behavior before the indexing code
>> was rewritten since the new single doc segment was only added after
>> the complete document was inverted (hence any exception would prevent
>> it from being added).
>> -Yonik
Re: DocumentsWriter.checkMaxTermLength issues
Posted by Gabi Steinberg <ga...@comcast.net>.
On balance, I think that dropping the document makes sense. I think
Yonik is right in that ensuring that keys are useful - and indexable -
is the tokenizer's job.
StandardTokenizer, in my opinion, should behave similarly to a person
looking at a document and deciding which tokens should be indexed. Few
people would argue that a 16K block of binary data is useful for
searching, but it's reasonable to suggest that the text around it is useful.
I know that one can add the LengthFilter to avoid this problem, but this
is not really intuitive; one does not expect the standard tokenizer to
generate tokens that IndexWriter chokes on.
My vote is to:
- drop documents with tokens longer than 16K, as Mike and Yonik suggested
- because an uninformed user would start with StandardTokenizer, I think it
should limit token size to 128 bytes, and add options to change that
size, choose between truncating or dropping longer tokens, and in no
case produce tokens longer than what IndexWriter can digest.
- perhaps come up with a clear policy on when a tokenizer should throw an
exception?
Gabi Steinberg.
Yonik Seeley wrote:
> On Dec 20, 2007 11:57 AM, Michael McCandless <lu...@mikemccandless.com> wrote:
>> Yonik Seeley wrote:
>>> On Dec 20, 2007 11:33 AM, Gabi Steinberg
>>> <ga...@comcast.net> wrote:
>>>> It might be a bit harsh to drop the document if it has a very long
>>>> token
>>>> in it.
>>> There are really two issues here.
>>> For long tokens, one could either ignore them or generate an
>>> exception.
>> I can see the argument both ways.
>
> Me too.
>
>> On the one hand, we want indexing
>> to be robust/resilient, such that massive terms are quietly skipped
>> (maybe w/ a log to infoStream if it's set).
>>
>> On the other hand, clearly there is something seriously wrong when
>> your analyzer is producing a single 16+ KB term, and so it would be
>> nice to be brittle/in-your-face so the user is forced to deal with/
>> correct the situation.
>>
>> Also, it's really bad once these terms pollute your index. E.g.
>> suddenly the TermInfos index can easily take tremendous amounts of
>> RAM, slow down indexing/merging/searching, etc. This is why
>> LUCENE-1052 was created. It's a lot better to catch this up front
>> than to let it pollute your index.
>>
>> If we want to take the "in your face" solution, I think the cutoff
>> should be less than 16 KB (16 KB is just the hard limit inside DW).
>>
>>> For all exceptions generated while indexing a document (that are
>>> passed through to the user)
>>> it seems like that document should not be in the index.
>> I like this disposition because it means the index is in a known
>> state. It's bad to have partial docs in the index: it can only lead
>> to more confusion as people try to figure out why some terms work for
>> retrieving the doc but others don't.
>
> Right... and I think that was the behavior before the indexing code
> was rewritten since the new single doc segment was only added after
> the complete document was inverted (hence any exception would prevent
> it from being added).
>
> -Yonik
>
Re: DocumentsWriter.checkMaxTermLength issues
Posted by Yonik Seeley <yo...@apache.org>.
On Dec 20, 2007 11:57 AM, Michael McCandless <lu...@mikemccandless.com> wrote:
> Yonik Seeley wrote:
> > On Dec 20, 2007 11:33 AM, Gabi Steinberg
> > <ga...@comcast.net> wrote:
> >> It might be a bit harsh to drop the document if it has a very long
> >> token
> >> in it.
> >
> > There are really two issues here.
> > For long tokens, one could either ignore them or generate an
> > exception.
>
> I can see the argument both ways.
Me too.
> On the one hand, we want indexing
> to be robust/resilient, such that massive terms are quietly skipped
> (maybe w/ a log to infoStream if it's set).
>
> On the other hand, clearly there is something seriously wrong when
> your analyzer is producing a single 16+ KB term, and so it would be
> nice to be brittle/in-your-face so the user is forced to deal with/
> correct the situation.
>
> Also, it's really bad once these terms pollute your index. E.g.
> suddenly the TermInfos index can easily take tremendous amounts of
> RAM, slow down indexing/merging/searching, etc. This is why
> LUCENE-1052 was created. It's a lot better to catch this up front
> than to let it pollute your index.
>
> If we want to take the "in your face" solution, I think the cutoff
> should be less than 16 KB (16 KB is just the hard limit inside DW).
>
> > For all exceptions generated while indexing a document (that are
> > passed through to the user)
> > it seems like that document should not be in the index.
>
> I like this disposition because it means the index is in a known
> state. It's bad to have partial docs in the index: it can only lead
> to more confusion as people try to figure out why some terms work for
> retrieving the doc but others don't.
Right... and I think that was the behavior before the indexing code
was rewritten since the new single doc segment was only added after
the complete document was inverted (hence any exception would prevent
it from being added).
-Yonik
Re: DocumentsWriter.checkMaxTermLength issues
Posted by Michael McCandless <lu...@mikemccandless.com>.
Yonik Seeley wrote:
> On Dec 20, 2007 11:33 AM, Gabi Steinberg
> <ga...@comcast.net> wrote:
>> It might be a bit harsh to drop the document if it has a very long
>> token
>> in it.
>
> There are really two issues here.
> For long tokens, one could either ignore them or generate an
> exception.
I can see the argument both ways. On the one hand, we want indexing
to be robust/resilient, such that massive terms are quietly skipped
(maybe w/ a log to infoStream if it's set).
On the other hand, clearly there is something seriously wrong when
your analyzer is producing a single 16+ KB term, and so it would be
nice to be brittle/in-your-face so the user is forced to deal with/
correct the situation.
Also, it's really bad once these terms pollute your index. E.g.
suddenly the TermInfos index can easily take tremendous amounts of
RAM, slow down indexing/merging/searching, etc. This is why
LUCENE-1052 was created. It's a lot better to catch this up front
than to let it pollute your index.
If we want to take the "in your face" solution, I think the cutoff
should be less than 16 KB (16 KB is just the hard limit inside DW).
> For all exceptions generated while indexing a document (that are
> passed through to the user)
> it seems like that document should not be in the index.
I like this disposition because it means the index is in a known
state. It's bad to have partial docs in the index: it can only lead
to more confusion as people try to figure out why some terms work for
retrieving the doc but others don't.
Mike
Re: DocumentsWriter.checkMaxTermLength issues
Posted by Yonik Seeley <yo...@apache.org>.
On Dec 20, 2007 11:33 AM, Gabi Steinberg <ga...@comcast.net> wrote:
> It might be a bit harsh to drop the document if it has a very long token
> in it.
There are really two issues here.
For long tokens, one could either ignore them or generate an exception.
For all exceptions generated while indexing a document (that are
passed through to the user)
it seems like that document should not be in the index.
-Yonik
Re: DocumentsWriter.checkMaxTermLength issues
Posted by Gabi Steinberg <ga...@comcast.net>.
It might be a bit harsh to drop the document if it has a very long token
in it. I can imagine documents with embedded binary data, where the
text around the binary data is still useful for search.
My feeling is that long tokens (longer than 128 or 256 bytes) are not
useful for search, and should be truncated or dropped.
Gabi.
Yonik Seeley wrote:
> On Dec 20, 2007 11:15 AM, Michael McCandless <lu...@mikemccandless.com> wrote:
>> Though ... we could simply immediately delete the document when any
>> exception occurs during its processing. So if we think whenever any
>> doc hits an exception, then it should be deleted, it's not so hard to
>> implement that policy...
>
> It does seem like you only want documents in the index that didn't
> generate exceptions... otherwise it doesn't seem like you would know
> exactly what got indexed.
>
> -Yonik
>
Re: DocumentsWriter.checkMaxTermLength issues
Posted by Yonik Seeley <yo...@apache.org>.
On Dec 20, 2007 11:15 AM, Michael McCandless <lu...@mikemccandless.com> wrote:
> Though ... we could simply immediately delete the document when any
> exception occurs during its processing. So if we think whenever any
> doc hits an exception, then it should be deleted, it's not so hard to
> implement that policy...
It does seem like you only want documents in the index that didn't
generate exceptions... otherwise it doesn't seem like you would know
exactly what got indexed.
-Yonik
Re: DocumentsWriter.checkMaxTermLength issues
Posted by Michael McCandless <lu...@mikemccandless.com>.
Yonik Seeley wrote:
> On Dec 20, 2007 9:41 AM, Grant Ingersoll <gs...@apache.org> wrote:
>> I'm wondering if the IndexWriter should throw an explicit
>> exception in
>> this case as opposed to a RuntimeException,
>
> RuntimeExceptions can happen in analysis components during indexing
> anyway, so it seems like indexing code should deal with exceptions
> just to be safe. As long as exceptions happening during indexing
> don't mess up the indexing code, everything should be OK.
>
>> as it seems to me really
>> long tokens should be handled more gracefully. It seems strange that
>> the message says the terms were skipped (which the code does in fact
>> do), but then there is a RuntimeException thrown which usually
>> indicates to me the issue is not recoverable.
>
> It does seem like the document shouldn't be added at all if it caused
> an exception.
> Is that what happens if one of the analyzers causes an exception to
> be thrown?
>
> The other option is to simply ignore tokens above 16K... I'm not sure
> what's right here.
Though ... we could simply immediately delete the document when any
exception occurs during its processing. So if we think whenever any
doc hits an exception, then it should be deleted, it's not so hard to
implement that policy...
Mike
Re: DocumentsWriter.checkMaxTermLength issues
Posted by Michael McCandless <lu...@mikemccandless.com>.
Yonik Seeley wrote:
>> as it seems to me really
>> long tokens should be handled more gracefully. It seems strange that
>> the message says the terms were skipped (which the code does in fact
>> do), but then there is a RuntimeException thrown which usually
>> indicates to me the issue is not recoverable.
>
> It does seem like the document shouldn't be added at all if it caused
> an exception.
> Is that what happens if one of the analyzers causes an exception to
> be thrown?
>
> The other option is to simply ignore tokens above 16K... I'm not sure
> what's right here.
Right now we are ignoring the too-long tokens and adding the rest.
Unfortunately, because DocumentsWriter directly updates the posting
lists in RAM, it's very difficult to "undo" those tokens we have
already successfully processed & added to the posting lists.
Mike
Re: DocumentsWriter.checkMaxTermLength issues
Posted by Yonik Seeley <yo...@apache.org>.
On Dec 20, 2007 9:41 AM, Grant Ingersoll <gs...@apache.org> wrote:
> I'm wondering if the IndexWriter should throw an explicit exception in
> this case as opposed to a RuntimeException,
RuntimeExceptions can happen in analysis components during indexing
anyway, so it seems like indexing code should deal with exceptions
just to be safe. As long as exceptions happening during indexing
don't mess up the indexing code, everything should be OK.
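In application code that means wrapping each add in its own try/catch so one bad document doesn't abort the batch. A self-contained analog of that pattern (the `addDocument` stand-in and the 16383 limit are assumptions, not the real IndexWriter API):

```java
import java.util.ArrayList;
import java.util.List;

// One bad document shouldn't abort the whole batch: catch per-document
// failures, record them, and keep indexing the rest.
public class BatchIndexer {
    static final int MAX_TERM_LEN = 16383;   // assumed limit
    final List<String> indexed = new ArrayList<>();
    final List<String> failed  = new ArrayList<>();

    // Stand-in for IndexWriter.addDocument: rejects docs with a huge "term".
    void addDocument(String doc) {
        for (String term : doc.split("\\s+")) {
            if (term.length() > MAX_TERM_LEN) {
                throw new IllegalArgumentException("term too long: " + term.length());
            }
        }
        indexed.add(doc);
    }

    void addAll(List<String> docs) {
        for (String doc : docs) {
            try {
                addDocument(doc);
            } catch (RuntimeException e) {
                failed.add(doc);             // skip this doc and continue
            }
        }
    }
}
```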
> as it seems to me really
> long tokens should be handled more gracefully. It seems strange that
> the message says the terms were skipped (which the code does in fact
> do), but then there is a RuntimeException thrown which usually
> indicates to me the issue is not recoverable.
It does seem like the document shouldn't be added at all if it caused
an exception.
Is that what happens if one of the analyzers causes an exception to be thrown?
The other option is to simply ignore tokens above 16K... I'm not sure
what's right here.
-Yonik