Posted to java-user@lucene.apache.org by Daniel Noll <da...@nuix.com> on 2008/03/11 06:45:27 UTC

Document ID shuffling under 2.3.x (on merge?)

Hi all.

We're using the document ID to associate extra information stored outside 
Lucene.  Some of this information is being stored at load-time and some 
afterwards; later on it turns out the information stored at load-time is 
returning the wrong results when converting the database contents back into a 
BitSet for filtering.

Using version 2.2.x doesn't appear to cause the problem, so I have been 
wondering if something happened in 2.3.x to change the document IDs.  Having 
already looked to try and determine this myself, it doesn't appear to be 
reordering them in DocumentsWriter, but perhaps there is some subtle 
side-effect of the way segments are merged which has caused this?

Daniel



Re: Document ID shuffling under 2.3.x (on merge?)

Posted by Michael Busch <bu...@gmail.com>.
Daniel Noll wrote:
> 
> For interest's sake I also timed fetching the document with no FieldSelector, 
> that takes around 410ms for the same documents.  So there is still a big 
> benefit in using the field selector, it just isn't anywhere near enough to 
> get it close to the time it takes to retrieve the doc IDs.
> 
> Daniel
> 

Hi Daniel,

did you try to use Payloads for storing the UIDs in the index?

Check out this thread:
http://markmail.org/message/swkwzsww64tzfkdv#query:per-doc%20payloads+page:1+mid:gbrjaydhdu2dz3n4+state:results
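
For what it's worth, a rough sketch of that approach against the 2.3-era
payload API might look like the following; the field/term names ("uid",
"_UID") and the fixed 8-byte encoding are illustrative assumptions only,
not something from this thread:

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Payload;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermPositions;

    class UidPayloads {

      /** A one-token stream whose single term "_UID" carries the UID bytes as a payload. */
      static TokenStream uidTokenStream(final long uid) {
        return new TokenStream() {
          private boolean done = false;
          public Token next() {
            if (done) return null;
            done = true;
            Token t = new Token("_UID", 0, 0);
            byte[] bytes = new byte[8];
            for (int i = 0; i < 8; i++) {
              bytes[i] = (byte) (uid >>> (56 - 8 * i));
            }
            t.setPayload(new Payload(bytes));
            return t;
          }
        };
      }

      static void addUid(Document doc, long uid) {
        doc.add(new Field("uid", uidTokenStream(uid)));
      }

      /** Builds a docID -> UID map by walking the postings of the marker term. */
      static long[] readUids(IndexReader reader) throws IOException {
        long[] uids = new long[reader.maxDoc()];
        TermPositions tp = reader.termPositions(new Term("uid", "_UID"));
        byte[] buf = new byte[8];
        while (tp.next()) {
          tp.nextPosition();                 // payloads are attached to positions
          byte[] b = tp.getPayload(buf, 0);
          long uid = 0;
          for (int i = 0; i < 8; i++) {
            uid = (uid << 8) | (b[i] & 0xFF);
          }
          uids[tp.doc()] = uid;
        }
        tp.close();
        return uids;
      }
    }

Reading the UIDs this way is a sequential scan of a single posting list, so it
should be much closer to the docID-only timings than loading stored fields.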

-Michael



Re: Document ID shuffling under 2.3.x (on merge?)

Posted by Daniel Noll <da...@nuix.com>.
On Thursday 13 March 2008 00:42:59 Erick Erickson wrote:
> I certainly found that lazy loading changed my speed dramatically, but
> that was on a particularly field-heavy index.
>
> I wonder if TermEnum/TermDocs would be fast enough on an indexed
> (UN_TOKENIZED???) field for a unique id.
>
> Mostly, I'm hoping you'll try this and tell me if it works so I don't have
> to sometime <G>....

I added a "uid" field to our existing fields.  After the load there were some 
gaps in the values for this field; presumably those were documents where 
adding the doc failed and adding the fallback doc also failed.  The index 
contains 20004 documents.  I ran each test over 10 iterations; the times below 
are an average of the last 5, as it took around 5 rounds to warm up.

Filter building, for a filter returning 1000 documents randomly selected:

   Time to build filter by UID (100% Derby) - 93ms
   Additional time to build filter by DocID - 12ms (13% penalty)

13% penalty is acceptable IMO.  The problem comes next.

Bulk operation building, for a query returning around 2800 documents:

   Time to build the bulkop by DocID (100% Hits) - 6ms
   Time to fetch the "uid" field from the document - 152ms (2600% penalty)
   Time to do the DB query (not counting commit though) - 10ms

For interest's sake I also timed fetching the document with no FieldSelector, 
that takes around 410ms for the same documents.  So there is still a big 
benefit in using the field selector, it just isn't anywhere near enough to 
get it close to the time it takes to retrieve the doc IDs.
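
A rough sketch of the per-document field fetch being timed here, assuming a
stored "uid" field (illustrative only, not the exact code behind the numbers
above):

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.FieldSelector;
    import org.apache.lucene.document.MapFieldSelector;
    import org.apache.lucene.index.IndexReader;

    class UidFieldFetch {
      /** Fetches the stored "uid" field for each matched docID, loading no other fields. */
      static long[] fetchUids(IndexReader reader, int[] docIds) throws IOException {
        FieldSelector uidOnly = new MapFieldSelector(new String[] { "uid" });
        long[] uids = new long[docIds.length];
        for (int i = 0; i < docIds.length; i++) {
          Document doc = reader.document(docIds[i], uidOnly);  // skips other stored fields
          uids[i] = Long.parseLong(doc.get("uid"));
        }
        return uids;
      }
    }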

Daniel



Re: Document ID shuffling under 2.3.x (on merge?)

Posted by Erick Erickson <er...@gmail.com>.
I certainly found that lazy loading changed my speed dramatically, but
that was on a particularly field-heavy index.

I wonder if TermEnum/TermDocs would be fast enough on an indexed
(UN_TOKENIZED???) field for a unique id.

Mostly, I'm hoping you'll try this and tell me if it works so I don't have
to sometime <G>....
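
Something along these lines is what I mean, as a sketch only, assuming the uid
is indexed as a single UN_TOKENIZED term:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    class UidTermLookup {
      /** Resolves an indexed, un-tokenized "uid" term to its current docID, or -1 if absent. */
      static int docIdForUid(IndexReader reader, String uid) throws IOException {
        TermDocs td = reader.termDocs(new Term("uid", uid));
        try {
          return td.next() ? td.doc() : -1;
        } finally {
          td.close();
        }
      }
    }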

Erick

On Tue, Mar 11, 2008 at 9:26 PM, Daniel Noll <da...@nuix.com> wrote:

> On Wednesday 12 March 2008 09:53:58 Erick Erickson wrote:
> > But to me, it always seems...er...fraught to even *think* about relying
> > on doc ids. I know you've been around the block with Lucene, but do you
> > have a compelling reason to use the doc ID and not your own unique ID?
>
> From memory it was around 10 times slower to use a text field
> for
> this; I haven't tested it recently, and the case of retrieving the Document
> should be slightly faster now that we have FieldSelector, but it certainly
> won't be faster, since to get the document you need the ID in the first place.
>
> For single documents it wasn't a problem, the use cases are:
>  1. Bulk database operations based on the matched documents.
>  2. Creating a filter BitSet based on a database query.
>
> Effectively this is required because Lucene offered no way to update a
> Document after it was indexed; if it had that feature we would never have
> needed a database. ;-)
>
> Daniel
>

Re: Document ID shuffling under 2.3.x (on merge?)

Posted by Daniel Noll <da...@nuix.com>.
On Wednesday 12 March 2008 09:53:58 Erick Erickson wrote:
> But to me, it always seems...er...fraught to even *think* about relying
> on doc ids. I know you've been around the block with Lucene, but do you
> have a compelling reason to use the doc ID and not your own unique ID?

From memory it was around 10 times slower to use a text field for 
this; I haven't tested it recently, and the case of retrieving the Document 
should be slightly faster now that we have FieldSelector, but it certainly 
won't be faster, since to get the document you need the ID in the first place.

For single documents it wasn't a problem, the use cases are:
  1. Bulk database operations based on the matched documents.
  2. Creating a filter BitSet based on a database query.

Effectively this is required because Lucene offered no way to update a 
Document after it was indexed; if it had that feature we would never have 
needed a database. ;-)
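
For use case 2, the filter is essentially a BitSet of docIDs pulled from the
database; a sketch against the 2.3 Filter API (the whole point of this thread
being that those stored docIDs have to stay valid):

    import java.io.IOException;
    import java.util.BitSet;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Filter;

    /** Restricts a search to docIDs previously recorded in the database. */
    class StoredDocIdFilter extends Filter {
      private final int[] docIds;   // docIDs loaded from the database query

      StoredDocIdFilter(int[] docIds) {
        this.docIds = docIds;
      }

      public BitSet bits(IndexReader reader) throws IOException {
        BitSet bits = new BitSet(reader.maxDoc());
        for (int i = 0; i < docIds.length; i++) {
          bits.set(docIds[i]);
        }
        return bits;
      }
    }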

Daniel



Re: Document ID shuffling under 2.3.x (on merge?)

Posted by Michael McCandless <lu...@mikemccandless.com>.
Daniel Noll wrote:

> On Monday 17 March 2008 19:38:46 Michael McCandless wrote:
>> Well ... expungeDeletes() first forces a flush, at which point the
>> deletions are flushed as a .del file against the just flushed
>> segment.  Still, if you call expungeDeletes after every flush
>> (commit) then it's only 1 segment whose deletions need to be expunged
>> so it should be fast.
>
> Now I'm calling it after every failure.  It adds about 15% time if  
> every
> addDocument fails, but because very few documents actually fail the  
> real
> penalty isn't too great.
>
> I can confirm that it fixed the issue, anyway.

OK, glad to hear it!

Mike



Re: Document ID shuffling under 2.3.x (on merge?)

Posted by Daniel Noll <da...@nuix.com>.
On Monday 17 March 2008 19:38:46 Michael McCandless wrote:
> Well ... expungeDeletes() first forces a flush, at which point the
> deletions are flushed as a .del file against the just flushed
> segment.  Still, if you call expungeDeletes after every flush
> (commit) then it's only 1 segment whose deletions need to be expunged
> so it should be fast.

Now I'm calling it after every failure.  It adds about 15% time if every 
addDocument fails, but because very few documents actually fail the real 
penalty isn't too great.

I can confirm that it fixed the issue, anyway.
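
Roughly, the failure handling now looks like this sketch (the fallback
document and the assumption that the failure surfaces as an IOException come
from earlier in the thread; expungeDeletes() is the backported trunk method):

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    class SafeAdd {
      /**
       * Adds a document; on failure, expunges the partially indexed (deleted)
       * document straight away so that later merges/optimize cannot shift the
       * docIDs recorded in the database, then adds the fallback document.
       */
      static void addWithFallback(IndexWriter writer, Document doc, Document fallback)
          throws IOException {
        try {
          writer.addDocument(doc);
        } catch (IOException e) {
          // the failed add may have left a partial document behind, marked for deletion
          writer.expungeDeletes();       // forces a flush, then merges the deletion away
          writer.addDocument(fallback);  // re-add minus the problematic reader fields
        }
      }
    }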

Daniel



Re: Document ID shuffling under 2.3.x (on merge?)

Posted by Michael McCandless <lu...@mikemccandless.com>.
Daniel Noll wrote:

> On Thursday 13 March 2008 19:46:20 Michael McCandless wrote:
>> But, when a normal merge of segments with deletions completes, your
>> docIDs will shift.  In trunk we now explicitly compute the docID
>> shifting that happens after a merge, because we don't always flush
>> pending deletes when flushing added docs, but this is all done
>> privately to IndexWriter.
>
> I don't need to worry about deleted documents as such things don't  
> exist in
> our system, hence the optimisation based on document IDs.
>
>> I'm a little confused: you said optimize() introduces the problem,
>> but, it sounds like optimize() should be fixing the problem because
>> it compacts all docIDs to match what you were "guessing" outside of
>> Lucene?  Can you post the full stack trace of the exceptions you're
>> hitting?
>
> You're misunderstanding how we're getting the ID, that's all.   
> We're getting
> it by calling docCount() (after adding) and subtracting 1, which is
> guaranteed to give the right ID at the time of indexing, although  
> of course,
> later is another matter entirely.  We were operating from now out  
> of date
> information which says the IDs don't shift unless you call delete...
>
> Example:
>
>   add document, assume ID 0 (docCount = 1)
>   add document, assume ID 1 (docCount = 2)
>   add document, FAILS - assumed not added
>   re-add document minus reader fields, assume ID 3 (docCount = 4)
>
> So the ID assumptions are correct at this point; when optimize() is  
> called, it
> shifts the third document such that it then has ID 2, and our
> internal
> counts become wrong.

OK now I understand: docCount() is correct at the time, but then when  
a merge or optimize merges a segment that has a document that hit an  
exception, the IDs shift.

> I've backported the expungeDeletes() patch into 2.3 and will be
> testing it out
> next; seems it will do more or less what we want and merging the
> deleted
> document should be relatively quick as it will only ever exist in
> the
> DocumentsWriter's in-memory buffer.

Well ... expungeDeletes() first forces a flush, at which point the  
deletions are flushed as a .del file against the just flushed  
segment.  Still, if you call expungeDeletes after every flush  
(commit) then it's only 1 segment whose deletions need to be expunged  
so it should be fast.

Mike



Re: Document ID shuffling under 2.3.x (on merge?)

Posted by Daniel Noll <da...@nuix.com>.
On Thursday 13 March 2008 19:46:20 Michael McCandless wrote:
> But, when a normal merge of segments with deletions completes, your
> docIDs will shift.  In trunk we now explicitly compute the docID
> shifting that happens after a merge, because we don't always flush
> pending deletes when flushing added docs, but this is all done
> privately to IndexWriter.

I don't need to worry about deleted documents as such things don't exist in 
our system, hence the optimisation based on document IDs.

> I'm a little confused: you said optimize() introduces the problem,
> but, it sounds like optimize() should be fixing the problem because
> it compacts all docIDs to match what you were "guessing" outside of
> Lucene?  Can you post the full stack trace of the exceptions you're
> hitting?

You're misunderstanding how we're getting the ID, that's all.  We're getting 
it by calling docCount() (after adding) and subtracting 1, which is 
guaranteed to give the right ID at the time of indexing, although of course, 
later is another matter entirely.  We were operating from now-out-of-date 
information, which says the IDs don't shift unless you call delete...

Example:

  add document, assume ID 0 (docCount = 1)
  add document, assume ID 1 (docCount = 2)
  add document, FAILS - assumed not added
  re-add document minus reader fields, assume ID 3 (docCount = 4)

So the ID assumptions are correct at this point; when optimize() is called, it 
shifts the third document such that it then has ID 2, and our internal 
counts become wrong.
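
In code, the pattern is roughly this sketch of the assumption, which (as this
thread shows) only holds while nothing ever deletes or merges documents away:

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    class GuessedDocIds {
      /** Correct at indexing time; fragile as soon as deletions and merges are involved. */
      static int addAndGuessId(IndexWriter writer, Document doc) throws IOException {
        writer.addDocument(doc);
        return writer.docCount() - 1;   // docID assumed for the document just added
      }
    }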

I've backported the expungeDeletes() patch into 2.3 and will be testing it out 
next; it seems it will do more or less what we want, and merging the deleted 
document should be relatively quick as it will only ever exist in the 
DocumentsWriter's in-memory buffer.

Daniel



Re: Document ID shuffling under 2.3.x (on merge?)

Posted by Doron Cohen <cd...@gmail.com>.
On Thu, Mar 13, 2008 at 9:30 PM, Doron Cohen <cd...@gmail.com> wrote:

> Hi Daniel, LUCENE-1228 fixes a problem in IndexWriter.commit().
> I suspect this can be related to the problem you see though I am not sure.
> Could you try with the patch there?
> Thanks,
> Doron


Daniel, I was wrong about this - LUCENE-1228 cannot be related to your issue 
because the problem it solves does not exist in 2.3.x (only later).  Mike, 
thanks for pointing this out.
- Doron

Re: Document ID shuffling under 2.3.x (on merge?)

Posted by Doron Cohen <cd...@gmail.com>.
Hi Daniel, LUCENE-1228 fixes a problem in IndexWriter.commit().
I suspect this can be related to the problem you see though I am not sure.
Could you try with the patch there?
Thanks,
Doron

On Thu, Mar 13, 2008 at 10:46 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

>
> Daniel Noll wrote:
>
> > On Wednesday 12 March 2008 19:36:57 Michael McCandless wrote:
> >> OK, I think very likely this is the issue: when IndexWriter hits an
> >> exception while processing a document, the portion of the document
> >> already indexed is left in the index, and then its docID is marked
> >> for deletion.  You can see these deletions in your infoStream:
> >>
> >>    flush 0 buffered deleted terms and 30 deleted docIDs on 20
> >> segments
> >>
> >> This means you have deletions in your index, by docID, and so when
> >> you optimize the docIDs are then compacted.
> >
> > Aha.  Under 2.2, a failure would result in nothing being added to
> > the text
> > index so this would explain the problem.  It would also explain why
> > smaller
> > data sets are less likely to cause the problem (it's less likely
> > for there to
> > be an error in it.)
>
> Yes.
>
> > Workarounds?
> >   - flush() after any IOException from addDocument()  (overhead?)
>
> What exceptions are you actually hitting (is it really an
> IOException)?  I thought something was going wrong in retrieving or
> tokenizing  the document.
>
> I don't think flush() helps because it just flushes the pending
> deletes as well?
>
> >   - use ++ to determine the next document ID instead of
> >     index.getWriter().docCount()  (out of sync after an error but
> > fixes itself
> >     on optimize().
>
> I think this would work, but you're definitely still in the realm of
> "guessing how Lucene assigns docIDs under the hood" so it's risky
> over time.  Likely this is the highest performance option.
>
> But, when a normal merge of segments with deletions completes, your
> docIDs will shift.  In trunk we now explicitly compute the docID
> shifting that happens after a merge, because we don't always flush
> pending deletes when flushing added docs, but this is all done
> privately to IndexWriter.
>
> I'm a little confused: you said optimize() introduces the problem,
> but, it sounds like optimize() should be fixing the problem because
> it compacts all docIDs to match what you were "guessing" outside of
> Lucene?  Can you post the full stack trace of the exceptions you're
> hitting?
>
> >   - Use a field for a separate ID (slower later when reading the
> > index)
>
> Looks too slow based on your results.
>
> Can you pre-load the UID into the FieldCache?  There were also
> discussions recently about adding "column-stride" fields to Lucene,
> basically a faster FieldCache (to load initially), which would apply
> here I think.
>
> >   - ???
>
>
> Trunk has a new expungeDeletes method which should be lower cost than
> optimize, but not necessarily that much lower cost.
>
> Mike
>

Re: Document ID shuffling under 2.3.x (on merge?)

Posted by Michael McCandless <lu...@mikemccandless.com>.
Daniel Noll wrote:

> On Wednesday 12 March 2008 19:36:57 Michael McCandless wrote:
>> OK, I think very likely this is the issue: when IndexWriter hits an
>> exception while processing a document, the portion of the document
>> already indexed is left in the index, and then its docID is marked
>> for deletion.  You can see these deletions in your infoStream:
>>
>>    flush 0 buffered deleted terms and 30 deleted docIDs on 20  
>> segments
>>
>> This means you have deletions in your index, by docID, and so when
>> you optimize the docIDs are then compacted.
>
> Aha.  Under 2.2, a failure would result in nothing being added to  
> the text
> index so this would explain the problem.  It would also explain why  
> smaller
> data sets are less likely to cause the problem (it's less likely  
> for there to
> be an error in it.)

Yes.

> Workarounds?
>   - flush() after any IOException from addDocument()  (overhead?)

What exceptions are you actually hitting (is it really an  
IOException)?  I thought something was going wrong in retrieving or  
tokenizing  the document.

I don't think flush() helps because it just flushes the pending  
deletes as well?

>   - use ++ to determine the next document ID instead of
>     index.getWriter().docCount()  (out of sync after an error but  
> fixes itself
>     on optimize().

I think this would work, but you're definitely still in the realm of  
"guessing how Lucene assigns docIDs under the hood" so it's risky  
over time.  Likely this is the highest performance option.

But, when a normal merge of segments with deletions completes, your  
docIDs will shift.  In trunk we now explicitly compute the docID  
shifting that happens after a merge, because we don't always flush  
pending deletes when flushing added docs, but this is all done  
privately to IndexWriter.

I'm a little confused: you said optimize() introduces the problem,  
but, it sounds like optimize() should be fixing the problem because  
it compacts all docIDs to match what you were "guessing" outside of  
Lucene?  Can you post the full stack trace of the exceptions you're  
hitting?

>   - Use a field for a separate ID (slower later when reading the  
> index)

Looks too slow based on your results.

Can you pre-load the UID into the FieldCache?  There were also  
discussions recently about adding "column-stride" fields to Lucene,  
basically a faster FieldCache (to load initially), which would apply  
here I think.
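
A sketch of the FieldCache idea (assuming the uid is indexed as a plain
integer-parsable, un-tokenized field; the field name is illustrative):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.FieldCache;

    class UidCache {
      /** One array entry per docID, loaded once and cached per reader. */
      static int[] loadUids(IndexReader reader) throws IOException {
        return FieldCache.DEFAULT.getInts(reader, "uid");
      }
    }

The first call walks the field's terms to fill the array, so it is slow once
per reader, but afterwards docID -> UID is a plain array lookup.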

>   - ???


Trunk has a new expungeDeletes method which should be lower cost than  
optimize, but not necessarily that much lower cost.

Mike



Re: Document ID shuffling under 2.3.x (on merge?)

Posted by Daniel Noll <da...@nuix.com>.
On Wednesday 12 March 2008 19:36:57 Michael McCandless wrote:
> OK, I think very likely this is the issue: when IndexWriter hits an
> exception while processing a document, the portion of the document
> already indexed is left in the index, and then its docID is marked
> for deletion.  You can see these deletions in your infoStream:
>
>    flush 0 buffered deleted terms and 30 deleted docIDs on 20 segments
>
> This means you have deletions in your index, by docID, and so when
> you optimize the docIDs are then compacted.

Aha.  Under 2.2, a failure would result in nothing being added to the text 
index so this would explain the problem.  It would also explain why smaller 
data sets are less likely to cause the problem (it's less likely for there to 
be an error in it.)

Workarounds?
  - flush() after any IOException from addDocument()  (overhead?)
  - use ++ to determine the next document ID instead of
    index.getWriter().docCount()  (out of sync after an error but fixes itself
    on optimize()).
  - Use a field for a separate ID (slower later when reading the index)
  - ???

Daniel



Re: Document ID shuffling under 2.3.x (on merge?)

Posted by Michael McCandless <lu...@mikemccandless.com>.
Daniel Noll wrote:

> I have filtered out lines in the log which indicated an exception  
> adding the
> document; these occur when our Reader throws an IOException and  
> there were so
> many that it bloated the file.

OK, I think very likely this is the issue: when IndexWriter hits an  
exception while processing a document, the portion of the document  
already indexed is left in the index, and then its docID is marked  
for deletion.  You can see these deletions in your infoStream:

   flush 0 buffered deleted terms and 30 deleted docIDs on 20 segments

This means you have deletions in your index, by docID, and so when  
you optimize the docIDs are then compacted.

Mike



Re: Document ID shuffling under 2.3.x (on merge?)

Posted by Daniel Noll <da...@nuix.com>.
On Wednesday 12 March 2008 10:20:12 Michael McCandless wrote:
> Oh, so you do not see the problem with SerialMergeScheduler but you
> do with ConcurrentMergeScheduler?
[...]
> Oh, there are no deletions?  Then this is very strange.  Is it
> optimize that messes up the docIDs?  Or, is it when you add docs
> after having done an optimize that the newly added docs are messed up?

Ah, my bad.  It happens with both merge schedulers actually, but not during 
normal merges while indexing, only with optimize().  Also, we're not 
adding docs after calling optimize either.  We're adding them all, merging 
along the way, and then calling optimize() once at the end.  If I comment out 
that one call to optimize() the problem seems to go away entirely.  Although, 
to be honest, it was happening once before and looked like it had gone away, 
and we only just discovered it had returned.

> Hmmm ... optimize does record which segments need to be merged away
> in a HashSet.  Then ConcurrentMergeScheduler will run the necessary
> merges (possibly several at once).  But the merges are still done on
> "contiguous" segments, and when committed the newly merged segment
> replaces that range of segments.  So I don't think this should be re-
> ordering documents.  Can you try running with infoStream set such
> that you get the problem to occur and then post the resulting output?

Attached.

I have filtered out lines in the log which indicated an exception adding the 
document; these occur when our Reader throws an IOException and there were so 
many that it bloated the file.

Daniel

Re: Document ID shuffling under 2.3.x (on merge?)

Posted by Erick Erickson <er...@gmail.com>.
But to me, it always seems...er...fraught to even *think* about relying
on doc ids. I know you've been around the block with Lucene, but do you
have a compelling reason to use the doc ID and not your own unique ID?

Best
Erick

On Tue, Mar 11, 2008 at 5:39 PM, Daniel Noll <da...@nuix.com> wrote:

> On Tuesday 11 March 2008 19:55:39 Michael McCandless wrote:
> > Hi Daniel,
> >
> > 2.3 should be no different from 2.2 in that docIDs only "shift" when
> > a merge of segments with deletions completes.
> >
> > Could it be the ConcurrentMergeScheduler?  Merges now run in the
> > background by default and commit whenever they complete.  You can get
> > back to the previous (blocking) behavior by using
> > SerialMergeScheduler instead.
>
> That was my first thought, but SerialMergeScheduler doesn't cause the
> problem.
> I've done a little more investigation since; it turns out that if I don't
> call optimize() then the problem doesn't occur.
>
> Could it be that optimize(int,boolean) is storing the segments to optimise
> in
> a HashSet, which by its nature reorders the segments?
>
> > If it's not that ... can you provide more details about how your
> > application is relying on docIDs?
>
> As far as that, we assume that if there are N documents in the index then
> the
> next document ID will be N (we determine this before adding the document.)
> As we're only doing this in a single thread and we never delete documents,
> this was previously safe.
>
> Daniel
>

Re: Document ID shuffling under 2.3.x (on merge?)

Posted by Michael McCandless <lu...@mikemccandless.com>.
Daniel Noll wrote:

> On Tuesday 11 March 2008 19:55:39 Michael McCandless wrote:
>> Hi Daniel,
>>
>> 2.3 should be no different from 2.2 in that docIDs only "shift" when
>> a merge of segments with deletions completes.
>>
>> Could it be the ConcurrentMergeScheduler?  Merges now run in the
>> background by default and commit whenever they complete.  You can get
>> back to the previous (blocking) behavior by using
>> SerialMergeScheduler instead.
>
> That was my first thought, but SerialMergeScheduler doesn't cause  
> the problem.
> I've done a little more investigation since; it turns out that if I  
> don't
> call optimize() then the problem doesn't occur.

Oh, so you do not see the problem with SerialMergeScheduler but you  
do with ConcurrentMergeScheduler?

> Could it be that optimize(int,boolean) is storing the segments to  
> optimise in
> a HashSet, which by its nature reorders the segments?

Hmmm ... optimize does record which segments need to be merged away  
in a HashSet.  Then ConcurrentMergeScheduler will run the necessary  
merges (possibly several at once).  But the merges are still done on  
"contiguous" segments, and when committed the newly merged segment  
replaces that range of segments.  So I don't think this should be re- 
ordering documents.  Can you try running with infoStream set such  
that you get the problem to occur and then post the resulting output?
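
Turning on infoStream is just the following (the log file name here is only an
example):

    import java.io.FileNotFoundException;
    import java.io.FileOutputStream;
    import java.io.PrintStream;
    import org.apache.lucene.index.IndexWriter;

    class InfoStreamSetup {
      /** Logs every flush, merge and deletion decision the writer makes. */
      static void enableDiagnostics(IndexWriter writer) throws FileNotFoundException {
        writer.setInfoStream(new PrintStream(new FileOutputStream("lucene-infostream.log")));
      }
    }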

>> If it's not that ... can you provide more details about how your
>> application is relying on docIDs?
>
> As far as that, we assume that if there are N documents in the  
> index then the
> next document ID will be N (we determine this before adding the  
> document.)
> As we're only doing this in a single thread and we never delete  
> documents,
> this was previously safe.

Oh, there are no deletions?  Then this is very strange.  Is it  
optimize that messes up the docIDs?  Or, is it when you add docs  
after having done an optimize that the newly added docs are messed up?

Mike



Re: Document ID shuffling under 2.3.x (on merge?)

Posted by Daniel Noll <da...@nuix.com>.
On Tuesday 11 March 2008 19:55:39 Michael McCandless wrote:
> Hi Daniel,
>
> 2.3 should be no different from 2.2 in that docIDs only "shift" when
> a merge of segments with deletions completes.
>
> Could it be the ConcurrentMergeScheduler?  Merges now run in the
> background by default and commit whenever they complete.  You can get
> back to the previous (blocking) behavior by using
> SerialMergeScheduler instead.

That was my first thought, but SerialMergeScheduler doesn't cause the problem.  
I've done a little more investigation since; it turns out that if I don't 
call optimize() then the problem doesn't occur.

Could it be that optimize(int,boolean) is storing the segments to optimise in 
a HashSet, which by its nature reorders the segments?

> If it's not that ... can you provide more details about how your
> application is relying on docIDs?

As far as that, we assume that if there are N documents in the index then the 
next document ID will be N (we determine this before adding the document.)  
As we're only doing this in a single thread and we never delete documents, 
this was previously safe.

Daniel



Re: Document ID shuffling under 2.3.x (on merge?)

Posted by Michael McCandless <lu...@mikemccandless.com>.
Hi Daniel,

2.3 should be no different from 2.2 in that docIDs only "shift" when  
a merge of segments with deletions completes.

Could it be the ConcurrentMergeScheduler?  Merges now run in the  
background by default and commit whenever they complete.  You can get  
back to the previous (blocking) behavior by using  
SerialMergeScheduler instead.
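
A sketch of switching back (the analyzer and index path are just placeholders):

    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.SerialMergeScheduler;

    class BlockingMerges {
      static IndexWriter openWriter(String indexPath) throws IOException {
        IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), true);
        // merges now run synchronously in the thread that triggers them, as in 2.2
        writer.setMergeScheduler(new SerialMergeScheduler());
        return writer;
      }
    }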

If it's not that ... can you provide more details about how your  
application is relying on docIDs?

Mike

Daniel Noll wrote:

> Hi all.
>
> We're using the document ID to associate extra information stored  
> outside
> Lucene.  Some of this information is being stored at load-time and  
> some
> afterwards; later on it turns out the information stored at load- 
> time is
> returning the wrong results when converting the database contents  
> back into a
> BitSet for filtering.
>
> Using version 2.2.x doesn't appear to cause the problem, so I have  
> been
> wondering if something happened in 2.3.x to change the document  
> IDs.  Having
> already looked to try and determine this myself, it doesn't appear  
> to be
> reordering them in DocumentsWriter, but perhaps there is some subtle
> side-effect of the way segments are merged which has caused this?
>
> Daniel
>

