Posted to dev@lucene.apache.org by Grant Ingersoll <gs...@apache.org> on 2007/12/20 15:41:18 UTC

DocumentsWriter.checkMaxTermLength issues

I am getting the following exception when running against trunk:
java.lang.IllegalArgumentException: at least one term (length 20079) exceeds max term length 16383; these terms were skipped
    at org.apache.lucene.index.IndexWriter.checkMaxTermLength(IndexWriter.java:1545)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1451)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1411)
....

I'm wondering if the IndexWriter should throw an explicit exception in  
this case as opposed to a RuntimeException, as it seems to me really  
long tokens should be handled more gracefully.  It seems strange that  
the message says the terms were skipped (which the code does in fact  
do), but then there is a RuntimeException thrown which usually  
indicates to me the issue is not recoverable.  I am using the  
StandardTokenizer, but I don't think that much matters.

Any thoughts on this?

-Grant



Re: DocumentsWriter.checkMaxTermLength issues

Posted by Yonik Seeley <yo...@apache.org>.
On Dec 20, 2007 9:41 AM, Grant Ingersoll <gs...@apache.org> wrote:
> I am getting the following exception when running against trunk:
> java.lang.IllegalArgumentException: at least one term (length 20079) exceeds max term length 16383; these terms were skipped
>     at org.apache.lucene.index.IndexWriter.checkMaxTermLength(IndexWriter.java:1545)
>     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1451)
>     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1411)
> ....
>
> I'm wondering if the IndexWriter should throw an explicit exception in
> this case as opposed to a RuntimeException, as it seems to me really
> long tokens should be handled more gracefully.  It seems strange that
> the message says the terms were skipped (which the code does in fact
> do), but then there is a RuntimeException thrown which usually
> indicates to me the issue is not recoverable.  I am using the
> StandardTokenizer, but I don't think that much matters.
>
> Any thoughts on this?

I think it's good to bring attention to it and not sweep it under the rug.
It indicates a potential problem with the analysis or with the data.
The user can use a LengthFilter to explicitly throw long tokens away.
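
For example, something along these lines (just a sketch; the wrapper class
name and the 256-char cutoff are my own, not anything that ships with Lucene):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LengthFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// Delegates to StandardAnalyzer, then drops any token longer than 256
// chars before it ever reaches IndexWriter.
public class LengthLimitedAnalyzer extends Analyzer {
  private final Analyzer delegate = new StandardAnalyzer();

  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new LengthFilter(delegate.tokenStream(fieldName, reader), 1, 256);
  }
}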

-Yonik



Re: DocumentsWriter.checkMaxTermLength issues

Posted by Grant Ingersoll <gs...@apache.org>.
On Dec 20, 2007, at 10:55 AM, Yonik Seeley wrote:

> On Dec 20, 2007 9:41 AM, Grant Ingersoll <gs...@apache.org> wrote:
>> I'm wondering if the IndexWriter should throw an explicit exception  
>> in
>> this case as opposed to a RuntimeException,
>
> RuntimeExceptions can happen in analysis components during indexing
> anyway, so it seems like indexing code should deal with exceptions
> just to be safe.  As long as exceptions happening during indexing
> don't mess up the indexing code, everything should be OK.
>
>> as it seems to me really
>> long tokens should be handled more gracefully.  It seems strange that
>> the message says the terms were skipped (which the code does in fact
>> do), but then there is a RuntimeException thrown which usually
>> indicates to me the issue is not recoverable.
>
> It does seem like the document shouldn't be added at all if it caused
> an exception.
> Is that what happens if one of the analyzers causes an exception to  
> be thrown?
>
> The other option is to simply ignore tokens above 16K... I'm not sure
> what's right here.

+1.  The code already does ignore them, which is why the exception
seems so weird.  DocsWriter gracefully handles the problem, but then
throws up after the fact.  I would vote to just log it or let the user  
decide somehow.




Re: DocumentsWriter.checkMaxTermLength issues

Posted by Michael McCandless <lu...@mikemccandless.com>.
Yonik Seeley wrote:

> On Dec 20, 2007 11:15 AM, Michael McCandless  
> <lu...@mikemccandless.com> wrote:
>> Though ... we could simply immediately delete the document when any
>> exception occurs during its processing.  So if we think whenever any
>> doc hits an exception, then it should be deleted, it's not so hard to
>> implement that policy...
>
> It does seem like you only want documents in the index that didn't
> generate exceptions... otherwise it doesn't seem like you would know
> exactly what got indexed.

I agree -- I'll work on this.

Mike



Re: DocumentsWriter.checkMaxTermLength issues

Posted by Michael McCandless <lu...@mikemccandless.com>.
OK I will take this approach... create TermTooLongException  
(subclasses RuntimeException), listed in the javadocs but not the  
throws clause of add/updateDocument.  DW throws this if it encounters  
any term >= 16383 chars in length.
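
Roughly along these lines (a sketch only; the constructor arguments and
message wording here are made up):

// Sketch of the proposed exception; nothing like this exists yet.
public class TermTooLongException extends RuntimeException {
  public TermTooLongException(String field, int length, int maxLength) {
    super("term in field \"" + field + "\" is " + length
        + " chars, which exceeds the max term length " + maxLength);
  }
}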

Whenever that exception (or any other) is thrown from within DW, it
means that document will not be added to your index (well, perhaps
partially added and then deleted).

Probably won't get going on this one until early next year ... I'm  
mostly offline from 12/22 - 1/1.

Mike

Yonik Seeley wrote:

> On Dec 20, 2007 2:25 PM, Grant Ingersoll <gs...@apache.org> wrote:
> Makes sense.  I wasn't sure whether declaring new exceptions to be thrown
> violates back-compat or not (even if they are runtime
> exceptions)
>
> That's a good question... I know that declared RuntimeExceptions are
> contained in the bytecode (the method signature)... but I don't know
> if they need to match up exactly for things to work.
>
> To be safe I guess we should start out with it commented out (or just
> documented in the JavaDoc).
>
> -Yonik
>
>> On Dec 20, 2007, at 1:47 PM, Yonik Seeley wrote:
>>
>>> On Dec 20, 2007 1:36 PM, Grant Ingersoll <gs...@apache.org>  
>>> wrote:
>>>> But, I can see the value in the throw the exception
>>>> case too, except I think the API should declare the exception is
>>>> being
>>>> thrown.  It could throw an extension of IOException.
>>>
>>> To be robust, user indexing code needs to catch other types of
>>> exceptions that could be thrown from Analyzers anyway.
>>>
>>> I don't think this exception (if we choose to keep it as an  
>>> exception)
>>> fits in the class of IOException, where something is normally really
>>> wrong.
>>>
>>> We could declare addDocument() to throw something inherited from
>>> RuntimeException though, right?
>>>
>>> -Yonik
>>>
>>
>>>
>>
>>
>>
>>
>>
>
>




Re: DocumentsWriter.checkMaxTermLength issues

Posted by Yonik Seeley <yo...@apache.org>.
On Dec 20, 2007 2:25 PM, Grant Ingersoll <gs...@apache.org> wrote:
> Makes sense.  I wasn't sure whether declaring new exceptions to be thrown
> violates back-compat or not (even if they are runtime
> exceptions)

That's a good question... I know that declared RuntimeExceptions are
contained in the bytecode (the method signature)... but I don't know
if they need to match up exactly for things to work.

To be safe I guess we should start out with it commented out (or just
documented in the JavaDoc).

-Yonik

> On Dec 20, 2007, at 1:47 PM, Yonik Seeley wrote:
>
> > On Dec 20, 2007 1:36 PM, Grant Ingersoll <gs...@apache.org> wrote:
> >> But, I can see the value in the throw the exception
> >> case too, except I think the API should declare the exception is
> >> being
> >> thrown.  It could throw an extension of IOException.
> >
> > To be robust, user indexing code needs to catch other types of
> > exceptions that could be thrown from Analyzers anyway.
> >
> > I don't think this exception (if we choose to keep it as an exception)
> > fits in the class of IOException, where something is normally really
> > wrong.
> >
> > We could declare addDocument() to throw something inherited from
> > RuntimeException though, right?
> >
> > -Yonik
> >
>
> >
>
>
>
>
>



Re: DocumentsWriter.checkMaxTermLength issues

Posted by Grant Ingersoll <gs...@apache.org>.
Makes sense.  I wasn't sure whether declaring new exceptions to be thrown
violates back-compat or not (even if they are runtime
exceptions)

On Dec 20, 2007, at 1:47 PM, Yonik Seeley wrote:

> On Dec 20, 2007 1:36 PM, Grant Ingersoll <gs...@apache.org> wrote:
>> But, I can see the value in the throw the exception
>> case too, except I think the API should declare the exception is  
>> being
>> thrown.  It could throw an extension of IOException.
>
> To be robust, user indexing code needs to catch other types of
> exceptions that could be thrown from Analyzers anyway.
>
> I don't think this exception (if we choose to keep it as an exception)
> fits in the class of IOException, where something is normally really
> wrong.
>
> We could declare addDocument() to throw something inherited from
> RuntimeException though, right?
>
> -Yonik
>
>





Re: DocumentsWriter.checkMaxTermLength issues

Posted by Yonik Seeley <yo...@apache.org>.
On Dec 20, 2007 1:36 PM, Grant Ingersoll <gs...@apache.org> wrote:
> But, I can see the value in the throw the exception
> case too, except I think the API should declare the exception is being
> thrown.  It could throw an extension of IOException.

To be robust, user indexing code needs to catch other types of
exceptions that could be thrown from Analyzers anyway.

I don't think this exception (if we choose to keep it as an exception)
fits in the class of IOException, where something is normally really
wrong.

We could declare addDocument() to throw something inherited from
RuntimeException though, right?

-Yonik



Re: DocumentsWriter.checkMaxTermLength issues

Posted by Grant Ingersoll <gs...@apache.org>.
On Dec 20, 2007, at 11:57 AM, Michael McCandless wrote:

>
> Yonik Seeley wrote:
>
>> On Dec 20, 2007 11:33 AM, Gabi Steinberg  
>> <ga...@comcast.net> wrote:
>>> It might be a bit harsh to drop the document if it has a very long  
>>> token
>>> in it.
>>
>> There are really two issues here.
>> For long tokens, one could either ignore them or generate an  
>> exception.
>
> I can see the argument both ways.  On the one hand, we want indexing  
> to be robust/resilient, such that massive terms are quietly skipped  
> (maybe w/ a log to infoStream if its set).

This would be fine for me.  In some sense, it is just like applying
the LengthFilter, which also removes tokens silently, but it would work for
all analyzers.  But I can see the value in the throw-the-exception
case too, except I think the API should declare that the exception is being
thrown.  It could throw an extension of IOException.


>
>
> On the other hand, clearly there is something seriously wrong when  
> your analyzer is producing a single 16+ KB term, and so it would be  
> nice to be brittle/in-your-face so the user is forced to deal with/ 
> correct the situation.
>
> Also, it's really bad once these terms pollute your index.  EG  
> suddenly the TermInfos index can easily take tremendous amounts of
> RAM, slow down indexing/merging/searching, etc.  This is why
> LUCENE-1052 was created.  It's a lot better to catch this up
> front than to let it pollute your index.
>
> If we want to take the "in your face" solution, I think the cutoff  
> should be less than 16 KB (16 KB is just the hard limit inside DW).
>
>> For all exceptions generated while indexing a document (that are
>> passed through to the user)
>> it seems like that document should not be in the index.
>
> I like this disposition because it means the index is in a known  
> state.  It's bad to have partial docs in the index: it can only lead  
> to more confusion as people try to figure out why some terms work  
> for retrieving the doc but others don't.
>
> Mike
>
>
>
>





Re: DocumentsWriter.checkMaxTermLength issues

Posted by Michael McCandless <lu...@mikemccandless.com>.
Doron Cohen wrote:

> On Dec 31, 2007 7:54 PM, Michael McCandless  
> <lu...@mikemccandless.com>
> wrote:
>
>> I actually think indexing should try to be as robust as possible.   
>> You
>> could test like crazy and never hit a massive term, go into  
>> production
>> (say, ship your app to lots of your customers' computers) only to
>> suddenly see this exception.  In general it could be a long time
>> before
>> you or your users "accidentally" see this.
>>
>> So I'm thinking we should have the default behavior, in IndexWriter,
>> be to skip immense terms?
>>
>> Then people can use TokenFilter to change this behavior if they want.
>>
>
> +1

OK I will take this approach.

> At first I saw this as similar to IndexWriter.setMaxFieldLength(), but
> it was
> the wrong comparison, because #terms is a "real" indexing/search
> characteristic that many applications can benefit from being able
> to modify, whereas a huge token is in most cases a bug.
>
> Just to make sure on the scenario - the only change is to skip too  
> long
> tokens, while any other exception is thrown (not ignored.)

Exactly.  And, on any exception, we will immediately mark any  
partially indexed doc as deleted.

> Also, for a skipped token I think the position increment of the
> following token should be incremented.

Good point; I'll make sure we do.

Mike



Re: DocumentsWriter.checkMaxTermLength issues

Posted by Doron Cohen <cd...@gmail.com>.
On Dec 31, 2007 7:54 PM, Michael McCandless <lu...@mikemccandless.com>
wrote:

> I actually think indexing should try to be as robust as possible.  You
> could test like crazy and never hit a massive term, go into production
> (say, ship your app to lots of your customers' computers) only to
> suddenly see this exception.  In general it could be a long time before
> you or your users "accidentally" see this.
>
> So I'm thinking we should have the default behavior, in IndexWriter,
> be to skip immense terms?
>
> Then people can use TokenFilter to change this behavior if they want.
>

+1

At first I saw this as similar to IndexWriter.setMaxFieldLength(), but it was
the wrong comparison, because #terms is a "real" indexing/search
characteristic that many applications can benefit from being able
to modify, whereas a huge token is in most cases a bug.

Just to make sure on the scenario - the only change is to skip too-long
tokens, while any other exception is thrown (not ignored).

Also, for a skipped token I think the position increment of the
following token should be incremented.
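
For illustration, a filter doing the skipping on the analysis side could
track that roughly like this (a sketch against the current TokenStream API;
the class name and maxLen parameter are invented):

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Drops over-long tokens but carries their position increments over to
// the next token that is kept, so position/phrase info stays sane.
public class SkipLongTokensFilter extends TokenFilter {
  private final int maxLen;

  public SkipLongTokensFilter(TokenStream in, int maxLen) {
    super(in);
    this.maxLen = maxLen;
  }

  public Token next() throws IOException {
    int skipped = 0;
    for (Token t = input.next(); t != null; t = input.next()) {
      if (t.termText().length() <= maxLen) {
        t.setPositionIncrement(t.getPositionIncrement() + skipped);
        return t;
      }
      skipped += t.getPositionIncrement();  // remember the hole we left
    }
    return null;
  }
}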

Re: DocumentsWriter.checkMaxTermLength issues

Posted by Yonik Seeley <yo...@apache.org>.
On Dec 31, 2007 12:54 PM, Michael McCandless <lu...@mikemccandless.com> wrote:
> I actually think indexing should try to be as robust as possible.  You
> could test like crazy and never hit a massive term, go into production
> (say, ship your app to lots of your customers' computers) only to
> suddenly see this exception.  In general it could be a long time before
> you or your users "accidentally" see this.
>
> So I'm thinking we should have the default behavior, in IndexWriter,
> be to skip immense terms?
>
> Then people can use TokenFilter to change this behavior if they want.

+1

-Yonik



Re: DocumentsWriter.checkMaxTermLength issues

Posted by Michael McCandless <lu...@mikemccandless.com>.
Grant Ingersoll wrote:

>
> On Dec 31, 2007, at 12:54 PM, Michael McCandless wrote:
>
>> I actually think indexing should try to be as robust as possible.   
>> You
>> could test like crazy and never hit a massive term, go into  
>> production
>> (say, ship your app to lots of your customers' computers) only to
>> suddenly see this exception.  In general it could be a long time
>> before
>> you or your users "accidentally" see this.
>>
>> So I'm thinking we should have the default behavior, in IndexWriter,
>> be to skip immense terms?
>>
>> Then people can use TokenFilter to change this behavior if they want.
>>
> +1.  We could log it, right?

Yes, to IndexWriter's infoStream, if it's set.  I'll do that...
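
For anyone following along, that's the stream you hand to the writer,
something like this (a sketch; the directory and analyzer choices are
arbitrary):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class InfoStreamExample {
  public static void main(String[] args) throws Exception {
    IndexWriter writer = new IndexWriter(new RAMDirectory(),
        new StandardAnalyzer(), true);
    writer.setInfoStream(System.out);  // skipped-term messages would land here
    // ... addDocument calls ...
    writer.close();
  }
}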

Mike



Re: DocumentsWriter.checkMaxTermLength issues

Posted by Grant Ingersoll <gs...@apache.org>.
On Dec 31, 2007, at 12:54 PM, Michael McCandless wrote:

> I actually think indexing should try to be as robust as possible.  You
> could test like crazy and never hit a massive term, go into production
> (say, ship your app to lots of your customers' computers) only to
> suddenly see this exception.  In general it could be a long time
> before
> you or your users "accidentally" see this.
>
> So I'm thinking we should have the default behavior, in IndexWriter,
> be to skip immense terms?
>
> Then people can use TokenFilter to change this behavior if they want.
>
+1.  We could log it, right?



Re: DocumentsWriter.checkMaxTermLength issues

Posted by Michael McCandless <lu...@mikemccandless.com>.
I actually think indexing should try to be as robust as possible.  You
could test like crazy and never hit a massive term, go into production
(say, ship your app to lots of your customers' computers) only to
suddenly see this exception.  In general it could be a long time before
you or your users "accidentally" see this.

So I'm thinking we should have the default behavior, in IndexWriter,
be to skip immense terms?

Then people can use TokenFilter to change this behavior if they want.

Mike

Yonik Seeley <yo...@apache.org> wrote:
> On Dec 31, 2007 12:25 PM, Grant Ingersoll <gs...@apache.org> wrote:
> > Sure, but I mean in the >16K (in other words, in the case where
> > DocsWriter fails, which presumably only DocsWriter knows about) case.
> > I want the option to ignore tokens larger than that instead of failing/
> > throwing an exception.
>
> I think the issue here is what the default behavior for IndexWriter should be.
>
> If configuration is required because something other than the default
> is desired, then one could use a TokenFilter to change the behavior
> rather than changing options on IndexWriter.  Using a TokenFilter is
> much more flexible.
>
> > Imagine I am charged w/ indexing some data
> > that I don't know anything about (i.e. computer forensics), my goal
> > would be to index as much as possible in my first raw pass, so that I
> > can then begin to explore the dataset.  Having it completely discard
> > the document is not a good thing, but throwing away some large binary
> > tokens would be acceptable (especially if I get warnings about said
> > tokens) and robust.
>
> -Yonik
>
>
>



Re: DocumentsWriter.checkMaxTermLength issues

Posted by Yonik Seeley <yo...@apache.org>.
On Dec 31, 2007 12:25 PM, Grant Ingersoll <gs...@apache.org> wrote:
> Sure, but I mean in the >16K (in other words, in the case where
> DocsWriter fails, which presumably only DocsWriter knows about) case.
> I want the option to ignore tokens larger than that instead of failing/
> throwing an exception.

I think the issue here is what the default behavior for IndexWriter should be.

If configuration is required because something other than the default
is desired, then one could use a TokenFilter to change the behavior
rather than changing options on IndexWriter.  Using a TokenFilter is
much more flexible.

> Imagine I am charged w/ indexing some data
> that I don't know anything about (i.e. computer forensics), my goal
> would be to index as much as possible in my first raw pass, so that I
> can then begin to explore the dataset.  Having it completely discard
> the document is not a good thing, but throwing away some large binary
> tokens would be acceptable (especially if I get warnings about said
> tokens) and robust.

-Yonik



Re: DocumentsWriter.checkMaxTermLength issues

Posted by Grant Ingersoll <gs...@apache.org>.
On Dec 31, 2007, at 12:11 PM, Yonik Seeley wrote:

> On Dec 31, 2007 11:59 AM, Grant Ingersoll <gs...@apache.org> wrote:
>>
>> On Dec 31, 2007, at 11:44 AM, Yonik Seeley wrote:
>>> I meant (1)... it leaves the core smaller.
>>> I don't see any reason to have logic to truncate or discard tokens  
>>> in
>>> the core indexing code (except to handle tokens >16k as an error
>>> condition).
>>
>> I would agree here, with the exception that I want the option for it
>> to be treated as an error.
>
> That should also be possible via an analyzer component throwing an  
> exception.
>

Sure, but I mean in the >16K (in other words, in the case where  
DocsWriter fails, which presumably only DocsWriter knows about) case.   
I want the option to ignore tokens larger than that instead of failing/ 
throwing an exception.  Imagine I am charged w/ indexing some data  
that I don't know anything about (e.g. computer forensics), my goal
would be to index as much as possible in my first raw pass, so that I  
can then begin to explore the dataset.  Having it completely discard  
the document is not a good thing, but throwing away some large binary  
tokens would be acceptable (especially if I get warnings about said  
tokens) and robust.

-Grant




Re: DocumentsWriter.checkMaxTermLength issues

Posted by Yonik Seeley <yo...@apache.org>.
On Dec 31, 2007 11:59 AM, Grant Ingersoll <gs...@apache.org> wrote:
>
> On Dec 31, 2007, at 11:44 AM, Yonik Seeley wrote:
> > I meant (1)... it leaves the core smaller.
> > I don't see any reason to have logic to truncate or discard tokens in
> > the core indexing code (except to handle tokens >16k as an error
> > condition).
>
> I would agree here, with the exception that I want the option for it
> to be treated as an error.

That should also be possible via an analyzer component throwing an exception.

-Yonik



Re: DocumentsWriter.checkMaxTermLength issues

Posted by Grant Ingersoll <gs...@apache.org>.
On Dec 31, 2007, at 11:44 AM, Yonik Seeley wrote:

> On Dec 31, 2007 11:37 AM, Doron Cohen <cd...@gmail.com> wrote:
>>
>> On Dec 31, 2007 6:10 PM, Yonik Seeley <yo...@apache.org> wrote:
>>
>> I think I like the 3rd option - is this what you meant?
>
> I meant (1)... it leaves the core smaller.
> I don't see any reason to have logic to truncate or discard tokens in
> the core indexing code (except to handle tokens >16k as an error
> condition).

I would agree here, with the exception that I want the option for it  
to be treated as an error.  In some cases, I would be just as happy  
for it to silently ignore the token, or to log it.



Re: DocumentsWriter.checkMaxTermLength issues

Posted by Yonik Seeley <yo...@apache.org>.
On Dec 31, 2007 11:37 AM, Doron Cohen <cd...@gmail.com> wrote:
>
> On Dec 31, 2007 6:10 PM, Yonik Seeley <yo...@apache.org> wrote:
>
> > On Dec 31, 2007 5:53 AM, Michael McCandless <lu...@mikemccandless.com>
> > wrote:
> > > Doron Cohen <cd...@gmail.com> wrote:
> > > > I like the approach of configuration of this behavior in Analysis
> > > > (and so IndexWriter can throw an exception on such errors).
> > > >
> > > > It seems that this should be a property of Analyzer vs.
> > > > just StandardAnalyzer, right?
> > > >
> > > > It can probably be a "policy" property, with two parameters:
> > > > 1) maxLength, 2) action: chop/split/ignore/raiseException when
> > > > generating too long tokens.
> > >
> > > Agreed, this should be generic/shared to all analyzers.
> > >
> > > But maybe for 2.3, we just truncate any too-long term to the max
> > > allowed size, and then after 2.3 we make this a settable "policy"?
> >
> > But we already have a nice component model for analyzers...
> > why not just encapsulate truncation/discarding in a TokenFilter?
>
>
> Makes sense, especially for the implementation aspect.
> I'm not sure what API you have in mind:
>
> (1) leave that for applications, to append such a
>     TokenFilter to their Analyzer (== no change),
>
> (2) DocumentsWriter to create such a TokenFilter
>      under the cover, to force behavior that is defined (where?), or
>
> (3) have an IndexingTokenFilter assigned to IndexWriter,
>      make the default such filter trim/ignore/whatever as discussed
>      and then applications can set a different IndexingTokenFilter for
>      changing the default behavior?
>
> I think I like the 3rd option - is this what you meant?

I meant (1)... it leaves the core smaller.
I don't see any reason to have logic to truncate or discard tokens in
the core indexing code (except to handle tokens >16k as an error
condition).

Most of the time you want to catch those large tokens early on in the
chain anyway (put the filter right after the tokenizer).  Doing it
later could cause exceptions or issues with other token filters that
might not be expecting huge tokens.
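
In other words, roughly (a sketch; the class name and the 256 cutoff are
arbitrary):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LengthFilter;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// The length filter sits directly after the tokenizer, so nothing
// downstream ever sees a huge token.
public class EarlyLengthLimitAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream ts = new StandardTokenizer(reader);
    ts = new LengthFilter(ts, 1, 256);
    ts = new StandardFilter(ts);
    ts = new LowerCaseFilter(ts);
    ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
    return ts;
  }
}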

-Yonik



Re: DocumentsWriter.checkMaxTermLength issues

Posted by Doron Cohen <cd...@gmail.com>.
On Dec 31, 2007 6:10 PM, Yonik Seeley <yo...@apache.org> wrote:

> On Dec 31, 2007 5:53 AM, Michael McCandless <lu...@mikemccandless.com>
> wrote:
> > Doron Cohen <cd...@gmail.com> wrote:
> > > I like the approach of configuration of this behavior in Analysis
> > > (and so IndexWriter can throw an exception on such errors).
> > >
> > > It seems that this should be a property of Analyzer vs.
> > > just StandardAnalyzer, right?
> > >
> > > It can probably be a "policy" property, with two parameters:
> > > 1) maxLength, 2) action: chop/split/ignore/raiseException when
> > > generating too long tokens.
> >
> > Agreed, this should be generic/shared to all analyzers.
> >
> > But maybe for 2.3, we just truncate any too-long term to the max
> > allowed size, and then after 2.3 we make this a settable "policy"?
>
> But we already have a nice component model for analyzers...
> why not just encapsulate truncation/discarding in a TokenFilter?


Makes sense, especially for the implementation aspect.
I'm not sure what API you have in mind:

(1) leave that for applications, to append such a
    TokenFilter to their Analyzer (== no change),

(2) DocumentsWriter to create such a TokenFilter
     under the covers, to force behavior that is defined (where?), or

(3) have an IndexingTokenFilter assigned to IndexWriter,
     make the default such filter trim/ignore/whatever as discussed
     and then applications can set a different IndexingTokenFilter for
     changing the default behavior?

I think I like the 3rd option - is this what you meant?

Doron

Re: DocumentsWriter.checkMaxTermLength issues

Posted by Yonik Seeley <yo...@apache.org>.
On Dec 31, 2007 5:53 AM, Michael McCandless <lu...@mikemccandless.com> wrote:
> Doron Cohen <cd...@gmail.com> wrote:
> > I like the approach of configuration of this behavior in Analysis
> > (and so IndexWriter can throw an exception on such errors).
> >
> > It seems that this should be a property of Analyzer vs.
> > just StandardAnalyzer, right?
> >
> > It can probably be a "policy" property, with two parameters:
> > 1) maxLength, 2) action: chop/split/ignore/raiseException when
> > generating too long tokens.
>
> Agreed, this should be generic/shared to all analyzers.
>
> But maybe for 2.3, we just truncate any too-long term to the max
> allowed size, and then after 2.3 we make this a settable "policy"?

But we already have a nice component model for analyzers...
why not just encapsulate truncation/discarding in a TokenFilter?

-Yonik



Re: DocumentsWriter.checkMaxTermLength issues

Posted by Michael McCandless <lu...@mikemccandless.com>.
Doron Cohen <cd...@gmail.com> wrote:
> I like the approach of configuration of this behavior in Analysis
> (and so IndexWriter can throw an exception on such errors).
>
> It seems that this should be a property of Analyzer vs.
> just StandardAnalyzer, right?
>
> It can probably be a "policy" property, with two parameters:
> 1) maxLength, 2) action: chop/split/ignore/raiseException when
> generating too long tokens.

Agreed, this should be generic/shared across all analyzers.

But maybe for 2.3, we just truncate any too-long term to the max
allowed size, and then after 2.3 we make this a settable "policy"?

> Doron
>
> On Dec 21, 2007 10:46 PM, Michael McCandless <lu...@mikemccandless.com>
> wrote:
>
> >
> > I think this is a good approach -- any objections?
> >
> > This way, IndexWriter is in-your-face (throws TermTooLongException on
> > seeing a massive term), but StandardAnalyzer is robust (silently
> > skips or prefixes the too-long terms).
> >
> > Mike
> >
> > Gabi Steinberg wrote:
> >
> > > How about defaulting to a max token size of 16K in
> > > StandardTokenizer, so that it never causes an IndexWriter
> > > exception, with an option to reduce that size?
> > >
> > > The backward incompatibility is limited then - tokens exceeding 16K
> > > will NOT cause an IndexWriter exception.  In 3.0 we can reduce
> > > that default to a useful size.
> > >
> > > The option to truncate the token can be useful, I think.  It will
> > > index the max size prefix of the long tokens.  You can still find
> > > them, pretty accurately - this becomes a prefix search, but is
> > > unlikely to return multiple values because it's a long prefix.  It
> > > allows you to choose a relatively small max, such as 32 or 64,
> > > reducing the overhead caused by junk in the documents while
> > > minimizing the chance of not finding something.
> > >
> > > Gabi.
> > >
> > > Michael McCandless wrote:
> > >> Gabi Steinberg wrote:
> > >>> On balance, I think that dropping the document makes sense.  I
> > >>> think Yonik is right in that ensuring that keys are useful - and
> > >>> indexable - is the tokenizer's job.
> > >>>
> > >>> StandardTokenizer, in my opinion, should behave similarly to a
> > >>> person looking at a document and deciding which tokens should be
> > >>> indexed.  Few people would argue that a 16K block of binary data
> > >>> is useful for searching, but it's reasonable to suggest that the
> > >>> text around it is useful.
> > >>>
> > >>> I know that one can add the LengthFilter to avoid this problem,
> > >>> but this is not really intuitive; one does not expect the
> > >>> standard tokenizer to generate tokens that IndexWriter chokes on.
> > >>>
> > >>> My vote is to:
> > >>> - drop documents with tokens longer than 16K, as Mike and Yonik
> > >>> suggested
> > > - because an uninformed user would start with StandardTokenizer, I
> > > think it should limit token size to 128 bytes, and add options to
> > > change that size, choose between truncating or dropping longer
> > > tokens, and in no case produce tokens longer than what
> > >>> IndexWriter can digest.
> > >> I like this idea, though we probably can't do that until 3.0 so we
> > >> don't break backwards compatibility?
> > > ...
> > >
> > >
> >
> >
> >
> >
>



Re: DocumentsWriter.checkMaxTermLength issues

Posted by Doron Cohen <cd...@gmail.com>.
I like the approach of configuring this behavior in Analysis
(so that IndexWriter can throw an exception on such errors).

It seems that this should be a property of Analyzer vs.
just StandardAnalyzer, right?

It can probably be a "policy" property, with two parameters:
1) maxLength, 2) action: chop/split/ignore/raiseException when
generating too long tokens.
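
Shape-wise, maybe something like this (pure sketch; every name here is
invented, nothing like it exists yet):

// A "what to do with a too-long token" policy, as a plain value object.
public class LongTokenPolicy {
  public static final int CHOP = 0;
  public static final int SPLIT = 1;
  public static final int IGNORE = 2;
  public static final int RAISE_EXCEPTION = 3;

  private final int maxLength;
  private final int action;

  public LongTokenPolicy(int maxLength, int action) {
    this.maxLength = maxLength;
    this.action = action;
  }

  public int getMaxLength() { return maxLength; }
  public int getAction() { return action; }
}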

Doron

On Dec 21, 2007 10:46 PM, Michael McCandless <lu...@mikemccandless.com>
wrote:

>
> I think this is a good approach -- any objections?
>
> This way, IndexWriter is in-your-face (throws TermTooLongException on
> seeing a massive term), but StandardAnalyzer is robust (silently
> skips or prefixes the too-long terms).
>
> Mike
>
> Gabi Steinberg wrote:
>
> > How about defaulting to a max token size of 16K in
> > StandardTokenizer, so that it never causes an IndexWriter
> > exception, with an option to reduce that size?
> >
> > The backward incompatibility is limited then - tokens exceeding 16K
> > will NOT cause an IndexWriter exception.  In 3.0 we can reduce
> > that default to a useful size.
> >
> > The option to truncate the token can be useful, I think.  It will
> > index the max size prefix of the long tokens.  You can still find
> > them, pretty accurately - this becomes a prefix search, but is
> > unlikely to return multiple values because it's a long prefix.  It
> > allows you to choose a relatively small max, such as 32 or 64,
> > reducing the overhead caused by junk in the documents while
> > minimizing the chance of not finding something.
> >
> > Gabi.
> >
> > Michael McCandless wrote:
> >> Gabi Steinberg wrote:
> >>> On balance, I think that dropping the document makes sense.  I
> >>> think Yonik is right in that ensuring that keys are useful - and
> >>> indexable - is the tokenizer's job.
> >>>
> >>> StandardTokenizer, in my opinion, should behave similarly to a
> >>> person looking at a document and deciding which tokens should be
> >>> indexed.  Few people would argue that a 16K block of binary data
> >>> is useful for searching, but it's reasonable to suggest that the
> >>> text around it is useful.
> >>>
> >>> I know that one can add the LengthFilter to avoid this problem,
> >>> but this is not really intuitive; one does not expect the
> >>> standard tokenizer to generate tokens that IndexWriter chokes on.
> >>>
> >>> My vote is to:
> >>> - drop documents with tokens longer than 16K, as Mike and Yonik
> >>> suggested
> >>> - because an uninformed user would start with StandardTokenizer, I
> >>> think it should limit token size to 128 bytes, and add options to
> >>> change that size, choose between truncating or dropping longer
> >>> tokens, and in no case produce tokens longer than what
> >>> IndexWriter can digest.
> >> I like this idea, though we probably can't do that until 3.0 so we
> >> don't break backwards compatibility?
> > ...
> >
> >
>
>
>
>

Re: DocumentsWriter.checkMaxTermLength issues

Posted by Michael McCandless <lu...@mikemccandless.com>.
I think this is a good approach -- any objections?

This way, IndexWriter is in-your-face (throws TermTooLongException on  
seeing a massive term), but StandardAnalyzer is robust (silently  
skips or prefixes the too-long terms).

Mike

Gabi Steinberg wrote:

> How about defaulting to a max token size of 16K in  
> StandardTokenizer, so that it never causes an IndexWriter  
> exception, with an option to reduce that size?
>
> The backward incompatibility is limited then - tokens exceeding 16K
> will NOT cause an IndexWriter exception.  In 3.0 we can reduce
> that default to a useful size.
>
> The option to truncate the token can be useful, I think.  It will  
> index the max size prefix of the long tokens.  You can still find  
> them, pretty accurately - this becomes a prefix search, but is  
> unlikely to return multiple values because it's a long prefix.  It  
> allows you to choose a relatively small max, such as 32 or 64,
> reducing the overhead caused by junk in the documents while  
> minimizing the chance of not finding something.
>
> Gabi.
>
> Michael McCandless wrote:
>> Gabi Steinberg wrote:
>>> On balance, I think that dropping the document makes sense.  I  
>>> think Yonik is right in that ensuring that keys are useful - and  
>>> indexable - is the tokenizer's job.
>>>
>>> StandardTokenizer, in my opinion, should behave similarly to a  
>>> person looking at a document and deciding which tokens should be  
>>> indexed.  Few people would argue that a 16K block of binary data  
>>> is useful for searching, but it's reasonable to suggest that the  
>>> text around it is useful.
>>>
>>> I know that one can add the LengthFilter to avoid this problem,  
>>> but this is not really intuitive; one does not expect the  
>>> standard tokenizer to generate tokens that IndexWriter chokes on.
>>>
>>> My vote is to:
>>> - drop documents with tokens longer than 16K, as Mike and Yonik  
>>> suggested
>>> - because an uninformed user would start with StandardTokenizer, I
>>> think it should limit token size to 128 bytes, and add options to
>>> change that size, choose between truncating or dropping longer
>>> tokens, and in no case produce tokens longer than what
>>> IndexWriter can digest.
>> I like this idea, though we probably can't do that until 3.0 so we  
>> don't break backwards compatibility?
> ...
>
>




Re: DocumentsWriter.checkMaxTermLength issues

Posted by Gabi Steinberg <ga...@comcast.net>.
How about defaulting to a max token size of 16K in StandardTokenizer, so 
that it never causes an IndexWriter exception, with an option to reduce 
that size?

The backward incompatibility is limited then - tokens exceeding 16K will
NOT cause an IndexWriter exception.  In 3.0 we can reduce that default
to a useful size.

The option to truncate the token can be useful, I think.  It will index 
the max size prefix of the long tokens.  You can still find them, pretty 
accurately - this becomes a prefix search, but is unlikely to return 
multiple values because it's a long prefix.  It allows you to choose a
relatively small max, such as 32 or 64, reducing the overhead caused by 
junk in the documents while minimizing the chance of not finding something.
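
As a filter, the truncate option might look roughly like this (a sketch; the
class name is made up):

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Keeps only the first maxLen chars of an over-long token, so the
// prefix is still indexed and findable.
public class TruncateTokenFilter extends TokenFilter {
  private final int maxLen;

  public TruncateTokenFilter(TokenStream in, int maxLen) {
    super(in);
    this.maxLen = maxLen;
  }

  public Token next() throws IOException {
    Token t = input.next();
    if (t != null && t.termText().length() > maxLen) {
      t.setTermText(t.termText().substring(0, maxLen));
    }
    return t;
  }
}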

Gabi.

Michael McCandless wrote:
> Gabi Steinberg wrote:
> 
>> On balance, I think that dropping the document makes sense.  I think 
>> Yonik is right in that ensuring that keys are useful - and indexable - 
>> is the tokenizer's job.
>>
>> StandardTokenizer, in my opinion, should behave similarly to a person 
>> looking at a document and deciding which tokens should be indexed.  
>> Few people would argue that a 16K block of binary data is useful for 
>> searching, but it's reasonable to suggest that the text around it is 
>> useful.
>>
>> I know that one can add the LengthFilter to avoid this problem, but 
>> this is not really intuitive; one does not expect the standard 
>> tokenizer to generate tokens that IndexWriter chokes on.
>>
>> My vote is to:
>> - drop documents with tokens longer than 16K, as Mike and Yonik suggested
>> - because an uninformed user would start with StandardTokenizer, I think
>> it should limit token size to 128 bytes, and add options to change
>> that size, choose between truncating or dropping longer tokens, and in
>> no case produce tokens longer than what IndexWriter can digest.
> 
> I like this idea, though we probably can't do that until 3.0 so we don't 
> break backwards compatibility?
> 
...



Re: DocumentsWriter.checkMaxTermLength issues

Posted by Michael McCandless <lu...@mikemccandless.com>.
Gabi Steinberg wrote:

> On balance, I think that dropping the document makes sense.  I  
> think Yonik is right in that ensuring that keys are useful - and  
> indexable - is the tokenizer's job.
>
> StandardTokenizer, in my opinion, should behave similarly to a  
> person looking at a document and deciding which tokens should be  
> indexed.  Few people would argue that a 16K block of binary data is  
> useful for searching, but it's reasonable to suggest that the text  
> around it is useful.
>
> I know that one can add the LengthFilter to avoid this problem, but  
> this is not really intuitive; one does not expect the standard  
> tokenizer to generate tokens that IndexWriter chokes on.
>
> My vote is to:
> - drop documents with tokens longer than 16K, as Mike and Yonik  
> suggested
> - because an uninformed user would start with StandardTokenizer, I
> think it should limit token size to 128 bytes, and add options to
> change that size, choose between truncating or dropping longer
> tokens, and in no case produce tokens longer than what IndexWriter
> can digest.

I like this idea, though we probably can't do that until 3.0 so we  
don't break backwards compatibility?

> - perhaps come up with a clear policy on when a tokenizer should throw
> an exception?
> Gabi Steinberg.
>
> Yonik Seeley wrote:
>> On Dec 20, 2007 11:57 AM, Michael McCandless  
>> <lu...@mikemccandless.com> wrote:
>>> Yonik Seeley wrote:
>>>> On Dec 20, 2007 11:33 AM, Gabi Steinberg
>>>> <ga...@comcast.net> wrote:
>>>>> It might be a bit harsh to drop the document if it has a very long
>>>>> token
>>>>> in it.
>>> There are really two issues here.
>>>> For long tokens, one could either ignore them or generate an
>>>> exception.
>>> I can see the argument both ways.
>> Me too.
>>>  On the one hand, we want indexing
>>> to be robust/resilient, such that massive terms are quietly skipped
>>> (maybe w/ a log to infoStream if it's set).
>>>
>>> On the other hand, clearly there is something seriously wrong when
>>> your analyzer is producing a single 16+ KB term, and so it would be
>>> nice to be brittle/in-your-face so the user is forced to deal with/
>>> correct the situation.
>>>
>>> Also, it's really bad once these terms pollute your index.  EG
>>> suddenly the TermInfos index can easily take tremendous amounts of
>>> RAM, slow down indexing/merging/searching, etc.  This is why
>>> LUCENE-1052 was created.  It's a lot better to catch this up
>>> front
>>> than to let it pollute your index.
>>>
>>> If we want to take the "in your face" solution, I think the cutoff
>>> should be less than 16 KB (16 KB is just the hard limit inside DW).
>>>
>>>> For all exceptions generated while indexing a document (that are
>>>> passed through to the user)
>>>> it seems like that document should not be in the index.
>>> I like this disposition because it means the index is in a known
>>> state.  It's bad to have partial docs in the index: it can only lead
>>> to more confusion as people try to figure out why some terms work  
>>> for
>>> retrieving the doc but others don't.
>> Right... and I think that was the behavior before the indexing code
>> was rewritten since the new single doc segment was only added after
>> the complete document was inverted (hence any exception would prevent
>> it from being added).
>> -Yonik
>
>




Re: DocumentsWriter.checkMaxTermLength issues

Posted by Gabi Steinberg <ga...@comcast.net>.
On balance, I think that dropping the document makes sense.  I think 
Yonik is right in that ensuring that keys are useful - and indexable - 
is the tokenizer's job.

StandardTokenizer, in my opinion, should behave similarly to a person 
looking at a document and deciding which tokens should be indexed.  Few 
people would argue that a 16K block of binary data is useful for 
searching, but it's reasonable to suggest that the text around it is useful.

I know that one can add the LengthFilter to avoid this problem, but this 
is not really intuitive; one does not expect the standard tokenizer to 
generate tokens that IndexWriter chokes on.

My vote is to:
- drop documents with tokens longer than 16K, as Mike and Yonik suggested
- because an uninformed user would start with StandardTokenizer, I think it
should limit token size to 128 bytes, and add options to change that
size, choose between truncating or dropping longer tokens, and in no
case produce tokens longer than what IndexWriter can digest.
- perhaps come up with a clear policy on when a tokenizer should throw an
exception?

Gabi Steinberg.

Yonik Seeley wrote:
> On Dec 20, 2007 11:57 AM, Michael McCandless <lu...@mikemccandless.com> wrote:
>> Yonik Seeley wrote:
>>> On Dec 20, 2007 11:33 AM, Gabi Steinberg
>>> <ga...@comcast.net> wrote:
>>>> It might be a bit harsh to drop the document if it has a very long
>>>> token
>>>> in it.
>>> There are really two issues here.
>>> For long tokens, one could either ignore them or generate an
>>> exception.
>> I can see the argument both ways.
> 
> Me too.
> 
>>  On the one hand, we want indexing
>> to be robust/resilient, such that massive terms are quietly skipped
>> (maybe w/ a log to infoStream if it's set).
>>
>> On the other hand, clearly there is something seriously wrong when
>> your analyzer is producing a single 16+ KB term, and so it would be
>> nice to be brittle/in-your-face so the user is forced to deal with/
>> correct the situation.
>>
>> Also, it's really bad once these terms pollute your index.  EG
>> suddenly the TermInfos index can easily take tremendous amounts of
>> RAM, slow down indexing/merging/searching, etc.  This is why
>> LUCENE-1052 was created.  It's a lot better to catch this up front
>> than to let it pollute your index.
>>
>> If we want to take the "in your face" solution, I think the cutoff
>> should be less than 16 KB (16 KB is just the hard limit inside DW).
>>
>>> For all exceptions generated while indexing a document (that are
>>> passed through to the user)
>>> it seems like that document should not be in the index.
>> I like this disposition because it means the index is in a known
>> state.  It's bad to have partial docs in the index: it can only lead
>> to more confusion as people try to figure out why some terms work for
>> retrieving the doc but others don't.
> 
> Right... and I think that was the behavior before the indexing code
> was rewritten since the new single doc segment was only added after
> the complete document was inverted (hence any exception would prevent
> it from being added).
> 
> -Yonik
> 
> 
> 



Re: DocumentsWriter.checkMaxTermLength issues

Posted by Yonik Seeley <yo...@apache.org>.
On Dec 20, 2007 11:57 AM, Michael McCandless <lu...@mikemccandless.com> wrote:
> Yonik Seeley wrote:
> > On Dec 20, 2007 11:33 AM, Gabi Steinberg
> > <ga...@comcast.net> wrote:
> >> It might be a bit harsh to drop the document if it has a very long
> >> token
> >> in it.
> >
> > There are really two issues here.
> > For long tokens, one could either ignore them or generate an
> > exception.
>
> I can see the argument both ways.

Me too.

>  On the one hand, we want indexing
> to be robust/resilient, such that massive terms are quietly skipped
> (maybe w/ a log to infoStream if it's set).
>
> On the other hand, clearly there is something seriously wrong when
> your analyzer is producing a single 16+ KB term, and so it would be
> nice to be brittle/in-your-face so the user is forced to deal with/
> correct the situation.
>
> Also, it's really bad once these terms pollute your index.  EG
> suddenly the TermInfos index can easily take tremendous amounts of
> RAM, slow down indexing/merging/searching, etc.  This is why
> LUCENE-1052 was created.  It's a lot better to catch this up front
> than to let it pollute your index.
>
> If we want to take the "in your face" solution, I think the cutoff
> should be less than 16 KB (16 KB is just the hard limit inside DW).
>
> > For all exceptions generated while indexing a document (that are
> > passed through to the user)
> > it seems like that document should not be in the index.
>
> I like this disposition because it means the index is in a known
> state.  It's bad to have partial docs in the index: it can only lead
> to more confusion as people try to figure out why some terms work for
> retrieving the doc but others don't.

Right... and I think that was the behavior before the indexing code
was rewritten since the new single doc segment was only added after
the complete document was inverted (hence any exception would prevent
it from being added).

-Yonik



Re: DocumentsWriter.checkMaxTermLength issues

Posted by Michael McCandless <lu...@mikemccandless.com>.
Yonik Seeley wrote:

> On Dec 20, 2007 11:33 AM, Gabi Steinberg  
> <ga...@comcast.net> wrote:
>> It might be a bit harsh to drop the document if it has a very long  
>> token
>> in it.
>
> There are really two issues here.
> For long tokens, one could either ignore them or generate an  
> exception.

I can see the argument both ways.  On the one hand, we want indexing  
to be robust/resilient, such that massive terms are quietly skipped  
(maybe w/ a log to infoStream if it's set).

On the other hand, clearly there is something seriously wrong when  
your analyzer is producing a single 16+ KB term, and so it would be  
nice to be brittle/in-your-face so the user is forced to deal with/ 
correct the situation.

Also, it's really bad once these terms pollute your index.  EG  
suddenly the TermInfos index can easily take tremendous amounts of
RAM, slow down indexing/merging/searching, etc.  This is why
LUCENE-1052 was created.  It's a lot better to catch this up front
than to let it pollute your index.

If we want to take the "in your face" solution, I think the cutoff  
should be less than 16 KB (16 KB is just the hard limit inside DW).

> For all exceptions generated while indexing a document (that are
> passed through to the user)
> it seems like that document should not be in the index.

I like this disposition because it means the index is in a known  
state.  It's bad to have partial docs in the index: it can only lead  
to more confusion as people try to figure out why some terms work for  
retrieving the doc but others don't.

Mike





Re: DocumentsWriter.checkMaxTermLength issues

Posted by Yonik Seeley <yo...@apache.org>.
On Dec 20, 2007 11:33 AM, Gabi Steinberg <ga...@comcast.net> wrote:
> It might be a bit harsh to drop the document if it has a very long token
> in it.

There are really two issues here.
For long tokens, one could either ignore them or generate an exception.

For all exceptions generated while indexing a document (that are
passed through to the user)
it seems like that document should not be in the index.

-Yonik



Re: DocumentsWriter.checkMaxTermLength issues

Posted by Gabi Steinberg <ga...@comcast.net>.
It might be a bit harsh to drop the document if it has a very long token 
in it.  I can imagine documents with embedded binary data, where the 
text around the binary data is still useful for search.

My feeling is that long tokens (longer than 128 or 256 bytes) are not 
useful for search, and should be truncated or dropped.

Gabi.

Yonik Seeley wrote:
> On Dec 20, 2007 11:15 AM, Michael McCandless <lu...@mikemccandless.com> wrote:
>> Though ... we could simply immediately delete the document when any
>> exception occurs during its processing.  So if we think whenever any
>> doc hits an exception, then it should be deleted, it's not so hard to
>> implement that policy...
> 
> It does seem like you only want documents in the index that didn't
> generate exceptions... otherwise it doesn't seem like you would know
> exactly what got indexed.
> 
> -Yonik
> 
> 
> 



Re: DocumentsWriter.checkMaxTermLength issues

Posted by Yonik Seeley <yo...@apache.org>.
On Dec 20, 2007 11:15 AM, Michael McCandless <lu...@mikemccandless.com> wrote:
> Though ... we could simply immediately delete the document when any
> exception occurs during its processing.  So if we think whenever any
> doc hits an exception, then it should be deleted, it's not so hard to
> implement that policy...

It does seem like you only want documents in the index that didn't
generate exceptions... otherwise it doesn't seem like you would know
exactly what got indexed.

-Yonik



Re: DocumentsWriter.checkMaxTermLength issues

Posted by Michael McCandless <lu...@mikemccandless.com>.
Yonik Seeley wrote:

> On Dec 20, 2007 9:41 AM, Grant Ingersoll <gs...@apache.org> wrote:
>> I'm wondering if the IndexWriter should throw an explicit  
>> exception in
>> this case as opposed to a RuntimeException,
>
> RuntimeExceptions can happen in analysis components during indexing
> anyway, so it seems like indexing code should deal with exceptions
> just to be safe.  As long as exceptions happening during indexing
> don't mess up the indexing code, everything should be OK.
>
>> as it seems to me really
>> long tokens should be handled more gracefully.  It seems strange that
>> the message says the terms were skipped (which the code does in fact
>> do), but then there is a RuntimeException thrown which usually
>> indicates to me the issue is not recoverable.
>
> It does seem like the document shouldn't be added at all if it caused
> an exception.
> Is that what happens if one of the analyzers causes an exception to  
> be thrown?
>
> The other option is to simply ignore tokens above 16K... I'm not sure
> what's right here.

Though ... we could simply immediately delete the document when any
exception occurs during its processing.  So if we think that whenever any
doc hits an exception it should be deleted, it's not so hard to
implement that policy...

Mike



Re: DocumentsWriter.checkMaxTermLength issues

Posted by Michael McCandless <lu...@mikemccandless.com>.
Yonik Seeley wrote:

>> as it seems to me really
>> long tokens should be handled more gracefully.  It seems strange that
>> the message says the terms were skipped (which the code does in fact
>> do), but then there is a RuntimeException thrown which usually
>> indicates to me the issue is not recoverable.
>
> It does seem like the document shouldn't be added at all if it caused
> an exception.
> Is that what happens if one of the analyzers causes an exception to  
> be thrown?
>
> The other option is to simply ignore tokens above 16K... I'm not sure
> what's right here.

Right now we are ignoring the too-long tokens and adding the rest.

Unfortunately, because DocumentsWriter directly updates the posting  
lists in RAM, it's very difficult to "undo" those tokens we have  
already successfully processed & added to the posting lists.

Mike



Re: DocumentsWriter.checkMaxTermLength issues

Posted by Yonik Seeley <yo...@apache.org>.
On Dec 20, 2007 9:41 AM, Grant Ingersoll <gs...@apache.org> wrote:
> I'm wondering if the IndexWriter should throw an explicit exception in
> this case as opposed to a RuntimeException,

RuntimeExceptions can happen in analysis components during indexing
anyway, so it seems like indexing code should deal with exceptions
just to be safe.  As long as exceptions happening during indexing
don't mess up the indexing code, everything should be OK.
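
Concretely, on the application side that might look something like this (a
sketch; the field name and sample input are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class DefensiveIndexing {
  public static void main(String[] args) throws Exception {
    IndexWriter writer = new IndexWriter(new RAMDirectory(),
        new StandardAnalyzer(), true);
    String[] texts = { "normal text", "another doc" };  // stand-in for real input
    for (int i = 0; i < texts.length; i++) {
      Document doc = new Document();
      doc.add(new Field("body", texts[i], Field.Store.NO, Field.Index.TOKENIZED));
      try {
        writer.addDocument(doc);
      } catch (RuntimeException e) {
        // analysis blew up on this doc (e.g. a massive term): skip it, keep going
        System.err.println("skipping doc " + i + ": " + e.getMessage());
      }
    }
    writer.close();
  }
}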

> as it seems to me really
> long tokens should be handled more gracefully.  It seems strange that
> the message says the terms were skipped (which the code does in fact
> do), but then there is a RuntimeException thrown which usually
> indicates to me the issue is not recoverable.

It does seem like the document shouldn't be added at all if it caused
an exception.
Is that what happens if one of the analyzers causes an exception to be thrown?

The other option is to simply ignore tokens above 16K... I'm not sure
what's right here.

-Yonik
