Posted to solr-user@lucene.apache.org by Roman Chyla <ro...@gmail.com> on 2013/11/23 06:40:54 UTC

building custom cache - using lucene docids

Hi,
docids are 'ephemeral', but i'd still like to build a search cache with
them (they allow for the fastest joins).

i'm seeing docids keep changing with updates (especially, in the last index
segment) - as per
https://issues.apache.org/jira/browse/LUCENE-2897

That would be fine, because i could build the cache from diff (of index
state) + reading the latest index segment in its entirety. But can I assume
that docids in other segments (other than the last one) will be relatively
stable? (ie. when an old doc is deleted, the docid is marked as removed;
update doc = delete old & create a new docid)?

thanks

roman

Re: building custom cache - using lucene docids

Posted by Roman Chyla <ro...@gmail.com>.
OK, I've spent some time reading the solr/lucene4x classes, and this is
my understanding (feel free to correct me ;-))

DirectoryReader holds the opened segments -- each segment has its own
reader; the BaseCompositeReader (or a subclass thereof) stores the
offset of each segment, e.g. [0, 5, 22] - meaning there are 2 segments,
with 5 and 17 docs respectively
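
A minimal sketch of the offset arithmetic described above, in plain Java
with no Lucene dependency. The class and method names (DocBases, starts,
segmentOf, localDoc) are made up for illustration and are not Lucene API;
the sketch only assumes that a global docid is the segment's offset plus
the segment-local docid.

```java
// Sketch only: mimics how a composite reader could map a global docid
// to a segment. All names are hypothetical, not Lucene API.
final class DocBases {

    /** Builds the offsets array from per-segment doc counts, e.g. {5, 17} -> {0, 5, 22}. */
    static int[] starts(int[] maxDocPerSegment) {
        int[] starts = new int[maxDocPerSegment.length + 1];
        for (int i = 0; i < maxDocPerSegment.length; i++) {
            starts[i + 1] = starts[i] + maxDocPerSegment[i];
        }
        return starts;
    }

    /** Returns the index of the segment that contains the given global docid. */
    static int segmentOf(int globalDoc, int[] starts) {
        if (globalDoc < 0 || globalDoc >= starts[starts.length - 1]) {
            throw new IllegalArgumentException("docid out of range: " + globalDoc);
        }
        int lo = 0;
        int hi = starts.length - 2; // index of the last segment
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (starts[mid] <= globalDoc) {
                lo = mid;     // segment mid starts at or before the docid
            } else {
                hi = mid - 1; // segment mid starts after the docid
            }
        }
        return lo;
    }

    /** Converts a global docid to its segment-local docid. */
    static int localDoc(int globalDoc, int[] starts) {
        return globalDoc - starts[segmentOf(globalDoc, starts)];
    }

    public static void main(String[] args) {
        int[] starts = starts(new int[] {5, 17});
        System.out.println(java.util.Arrays.toString(starts));                  // [0, 5, 22]
        System.out.println(segmentOf(4, starts) + " / " + localDoc(4, starts)); // 0 / 4
        System.out.println(segmentOf(5, starts) + " / " + localDoc(5, starts)); // 1 / 0
    }
}
```

The binary search is just one way to do the lookup; a linear scan would
do equally well for an index with a handful of segments.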

The segments are listed in the segments_N file,
http://lucene.apache.org/core/3_0_3/fileformats.html#Segments
File

So theoretically, order of segments could change when merge happens - yet,
every SegmentReader is identified by unique name and this name doesn't
change unless the segment itself changed (ie. docs were deleted; or got
more docs) - so it is possible to rely on this name to know what has not
changed

the name is coming from SegmentInfo (check its toString method) -- the
SegmentInfo has a method equals() that will consider as equal the readers
with the same name and the same dir (which is useful to know - two readers,
one with deletes, one without, are equal)

Lucene's FieldCache itself is rather complex, but it shows there is a very
clever mechanism (a few actually!) -- a class can register a listener that
will be called whenever an index segment is being closed (this could be
used to invalidate portions of a cache), the relevant classes are:
SegmentReader.CoreClosedListener, IndexReader.ReaderClosedListener

But Lucene is using this mechanism only to purge the cache - so
effectively, every commit triggers a cache rebuild. This is the interesting
bit: lots of work could be spared if segment data were reused (but
admittedly, only sometimes - for data that was fully read into memory; for
anything else, such as terms, the cache reads only some values and
fetches the rest from the index - so Lucene must close the reader and
rebuild the cache on every commit; but that is not my case, as I intend to
copy values from an index, and store them in memory...)

the weird 'recycling' of docids I've observed can probably be explained
by the fact that the index reader contains segments and near realtime
readers (but I'm not sure about this)

To conclude: it is possible to build a cache that updates itself (with only
the changes committed since the last build) - this will affect how fast a
new searcher is ready to serve requests
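
A rough sketch of such a self-updating cache, in plain Java. It assumes,
as argued above, that a segment name identifies unchanged segment data;
the class IncrementalSegmentCache, its methods and the loader function are
all hypothetical, and deleted docs are not handled here (they would have
to be tracked separately, since readers with and without deletes compare
as equal).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Sketch only: reuses per-segment data across searchers, keyed by segment name.
// Nothing here is Lucene or Solr API.
final class IncrementalSegmentCache {
    // segment name -> values indexed by segment-local docid
    private Map<String, int[]> bySegment = new HashMap<>();

    /**
     * Rebuilds the cache for a new searcher. Segments already cached are
     * reused; only unseen segments go through the (expensive) loader.
     * Returns the number of segments that actually had to be loaded.
     */
    int refresh(List<String> currentSegments, Function<String, int[]> loader) {
        Map<String, int[]> next = new HashMap<>();
        int loaded = 0;
        for (String name : currentSegments) {
            int[] values = bySegment.get(name);
            if (values == null) {
                values = loader.apply(name); // read the whole segment
                loaded++;
            }
            next.put(name, values);
        }
        bySegment = next; // segments absent from the new reader are dropped
        return loaded;
    }

    boolean contains(String segmentName) {
        return bySegment.containsKey(segmentName);
    }
}
```

With segments _0 and _1 cached, a commit that adds _2 would load one
segment instead of three; a merge that replaces _0 and _1 by _3 would
load _3 and drop the two old entries.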

HTH somebody else too :)

  roman



On Mon, Nov 25, 2013 at 7:54 PM, Roman Chyla <ro...@gmail.com> wrote:

>
>
>
> On Mon, Nov 25, 2013 at 12:54 AM, Mikhail Khludnev <
> mkhludnev@griddynamics.com> wrote:
>
>> Roman,
>>
>> I don't fully understand your question. After a segment is flushed it
>> never changes, hence segment-local docids are always the same. Due to a
>> merge, a segment can go away, and its docs become new ones in another
>> segment. This is true for 'global' (Solr-style) docnums, which can change
>> after a merge happens in the middle of the segment chain.
>> Since you are talking about a segmented cache, I suggest you look at
>> CachingWrapperFilter and NoOpRegenerator as patterns for such data
>> structures.
>>
>
> Thanks Mikhail, the CWF confirms that the idea of regenerating just part
> of the cache is doable. The CacheRegenerators, on the other hand, make no
> sense to me - and they are not given any 'signals', so they don't know if
> they are in the middle of some regeneration or not, and they should not
> keep a state (of previous index) - as they can be shared by threads that
> build the cache
>
> Best,
>
>   roman
>

Re: building custom cache - using lucene docids

Posted by Roman Chyla <ro...@gmail.com>.
On Mon, Nov 25, 2013 at 12:54 AM, Mikhail Khludnev <
mkhludnev@griddynamics.com> wrote:

> Roman,
>
> I don't fully understand your question. After a segment is flushed it
> never changes, hence segment-local docids are always the same. Due to a
> merge, a segment can go away, and its docs become new ones in another
> segment. This is true for 'global' (Solr-style) docnums, which can change
> after a merge happens in the middle of the segment chain.
> Since you are talking about a segmented cache, I suggest you look at
> CachingWrapperFilter and NoOpRegenerator as patterns for such data
> structures.
>

Thanks Mikhail, the CWF confirms that the idea of regenerating just part of
the cache is doable. The CacheRegenerators, on the other hand, make no
sense to me - and they are not given any 'signals', so they don't know if
they are in the middle of some regeneration or not, and they should not
keep a state (of previous index) - as they can be shared by threads that
build the cache

Best,

  roman



Re: building custom cache - using lucene docids

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
Roman,

I don't fully understand your question. After a segment is flushed it
never changes, hence segment-local docids are always the same. Due to a
merge, a segment can go away, and its docs become new ones in another
segment. This is true for 'global' (Solr-style) docnums, which can change
after a merge happens in the middle of the segment chain.
Since you are talking about a segmented cache, I suggest you look at
CachingWrapperFilter and NoOpRegenerator as patterns for such data
structures.






-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: building custom cache - using lucene docids

Posted by Roman Chyla <ro...@gmail.com>.
On Sun, Nov 24, 2013 at 10:44 AM, Jack Krupansky <ja...@basetechnology.com> wrote:

> We should probably talk about "internal" Lucene document IDs and
> "external" or "rebased" Lucene document IDs. The internal document IDs are
> always "per-segment" and never, ever change for that closed segment. But...
> the application would not normally see these IDs. Usually the externally
> visible Lucene document IDs have been "rebased" to add the sum total count
> of documents (both existing and deleted) of all preceding segments to the
> document IDs of a given segment, producing a "global" (across the full
> index of all segments) Lucene document ID.
>
> So, if you have those three segments, with deleted documents in the first
> two segments, and then merge those first two segments, the
> externally-visible Lucene document IDs for the third segment will suddenly
> all be different, shifted lower by the number of deleted documents that
> were just merged away, even though nothing changed in the third segment
> itself.
>

That's right, and I'm starting to think that if i keep the segment id and
the original offset, i don't need to rebuild that part of the cache,
because it has not been rebased (but I can always update the deleted docs).
It seems simple, so I suspect there is a catch somewhere. But if it
works, that could potentially speed up any cache building

Do you have information on where the docbase of a segment is stored? Or
which java class I should start my exploration from? [it is a somewhat
sprawling complex, so I'm a bit lost :)]
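
The idea can be sketched in plain Java as a cache keyed by segment name
and segment-local docid, which is translated to and from global docids
using whatever docBases the current reader reports. SegmentKeyedCache and
its methods are hypothetical names, and the segment names and values are
made up; this is an illustration of the bookkeeping, not a Lucene API.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: values are stored per segment under local docids, so a
// change of docBase does not invalidate them. Not Lucene or Solr API.
final class SegmentKeyedCache {
    // segment name -> cached values, indexed by segment-local docid
    private final Map<String, int[]> valuesBySegment = new HashMap<>();

    void put(String segmentName, int[] valuesByLocalDoc) {
        valuesBySegment.put(segmentName, valuesByLocalDoc);
    }

    /**
     * Resolves a global docid against the segment order and docBases of the
     * current reader, then reads the value stored under the local docid.
     */
    int get(int globalDoc, String[] segmentNames, int[] docBases) {
        // Scan from the last segment: the first docBase <= globalDoc wins.
        for (int i = segmentNames.length - 1; i >= 0; i--) {
            if (docBases[i] <= globalDoc) {
                int[] values = valuesBySegment.get(segmentNames[i]);
                int local = globalDoc - docBases[i];
                if (values == null || local >= values.length) {
                    throw new IllegalStateException("no cached value for docid " + globalDoc);
                }
                return values[local];
            }
        }
        throw new IllegalArgumentException("docid below every docBase: " + globalDoc);
    }

    public static void main(String[] args) {
        SegmentKeyedCache cache = new SegmentKeyedCache();
        cache.put("_0", new int[] {10, 11, 12, 13, 14});
        cache.put("_1", new int[] {20, 21, 22});
        // First reader: _0 at docBase 0, _1 at docBase 5.
        System.out.println(cache.get(6, new String[] {"_0", "_1"}, new int[] {0, 5})); // 21
        // Later reader: _0 was replaced by a smaller segment _2, so _1 moved to docBase 3.
        cache.put("_2", new int[] {30, 31, 32});
        System.out.println(cache.get(4, new String[] {"_2", "_1"}, new int[] {0, 3})); // 21
    }
}
```

The same document in _1 is found under global docid 6 before the change
and under 4 after it, without touching the cached array for _1.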


>
> Maybe these should be called "local" (to the segment) Lucene document IDs
> and "global" (across all segments) Lucene document IDs. Or, maybe internal
> vs. external is good enough.
>
> In short, it is completely safe to use and save Lucene document IDs, but
> only as long as no merging of segments is performed. Even one tiny merge
> and all subsequent saved document IDs are invalidated. Be careful with your
> merge policy - normally merges are happening in the background,
> automatically.
>

my tests, as per previous email, showed that the last segment's docids are
not that stable. I don't know if it matters that I used the RAMDirectory
for the test, but the docids were being 'recycled' - the deleted docs were
in the previous segment, then suddenly their docids were assigned to newly
added documents (so maybe solr/lucene is not counting deleted docs, if they
are at the end of a segment...?) i don't know. i'll need to explore the
index segments to understand what was going on there, thanks for any
possible pointers


  roman





Re: building custom cache - using lucene docids

Posted by Jack Krupansky <ja...@basetechnology.com>.
We should probably talk about "internal" Lucene document IDs and "external" 
or "rebased" Lucene document IDs. The internal document IDs are always 
"per-segment" and never, ever change for that closed segment. But... the 
application would not normally see these IDs. Usually the externally visible 
Lucene document IDs have been "rebased" to add the sum total count of 
documents (both existing and deleted) of all preceding segments to the 
document IDs of a given segment, producing a "global" (across the full index 
of all segments) Lucene document ID.

So, if you have those three segments, with deleted documents in the first 
two segments, and then merge those first two segments, the 
externally-visible Lucene document IDs for the third segment will suddenly 
all be different, shifted lower by the number of deleted documents that were 
just merged away, even though nothing changed in the third segment itself.
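
A small worked example of that shift, in plain Java. The segment sizes and
delete counts are invented for illustration, and RebaseDemo.globalId is a
hypothetical helper, not a Lucene method; it only encodes the rule stated
above (global ID = doc count of all preceding segments, live and deleted,
plus the local ID).

```java
// Sketch only: shows how a merge of earlier segments shifts the global
// docids of a later, untouched segment. Not Lucene API.
final class RebaseDemo {

    /** Global docid = docs in all preceding segments (live and deleted) + local docid. */
    static int globalId(int[] maxDocs, int segment, int localDoc) {
        int base = 0;
        for (int i = 0; i < segment; i++) {
            base += maxDocs[i];
        }
        return base + localDoc;
    }

    public static void main(String[] args) {
        // Three segments with 10, 8 and 6 docs; 3 and 2 of the docs in the
        // first two segments are marked deleted (made-up numbers).
        int[] before = {10, 8, 6};
        int deleted = 3 + 2;
        // Merging the first two segments drops the deleted docs: 10 + 8 - 5 = 13.
        int[] after = {10 + 8 - deleted, 6};

        System.out.println(globalId(before, 2, 0)); // 18: first doc of the third segment
        System.out.println(globalId(after, 1, 0));  // 13: same doc, shifted lower by 5
    }
}
```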

Maybe these should be called "local" (to the segment) Lucene document IDs
and "global" (across all segments) Lucene document IDs. Or, maybe internal
vs. external is good enough.

In short, it is completely safe to use and save Lucene document IDs, but 
only as long as no merging of segments is performed. Even one tiny merge and 
all subsequent saved document IDs are invalidated. Be careful with your 
merge policy - normally merges are happening in the background, 
automatically.

-- Jack Krupansky



Re: building custom cache - using lucene docids

Posted by Roman Chyla <ro...@gmail.com>.
On Sun, Nov 24, 2013 at 8:31 AM, Erick Erickson <er...@gmail.com> wrote:

> bq: Do i understand you correctly that when two segments get merged, the
> docids (of the original segments) remain the same?
>
> The original segments are unchanged, segments are _never_ changed after
> they're closed. But they'll be thrown away. Say you have segment1 and
> segment2 that get merged into segment3. As soon as the last searcher
> that is looking at segment1 and segment2 is closed, those two segments
> will be deleted from your disk.
>
> But for any given doc, the docid in segment3 will very likely be different
> than it was in segment1 or 2.
>

i'm trying to figure this out - i'll have to dig, i suppose. for example,
if the docbase (the docid offset per searcher) was stored together with the
index segment, that would be an indication of 'relative stability of docids'


>
> I think you're reading too much into LUCENE-2897. I'm pretty sure the
> segment in question is not available to you anyway before this rewrite is
> done, but freely admit I don't know much about it.
>

i've done tests, committing and overwriting a document and saw (SOLR4.0)
that docids are being recycled. I deleted 2 docs, then added a new document
and guess what: the new document had the docid of the previously deleted
document (but different fields).

That was new to me, so I searched and found the LUCENE-2897 which seemed to
explain that behaviour.


>
> You're probably going to get into the whole PerSegment family of
> operations, which is something I'm not all that familiar with so I'll
> leave explanations to others.
>

Thank you, it is useful to get insights from various sides,

  roman



Re: building custom cache - using lucene docids

Posted by Erick Erickson <er...@gmail.com>.
bq: Do i understand you correctly that when two segments get merged, the
docids (of the original segments) remain the same?

The original segments are unchanged, segments are _never_ changed after
they're closed. But they'll be thrown away. Say you have segment1 and
segment2 that get merged into segment3. As soon as the last searcher
that is looking at segment1 and segment2 is closed, those two segments
will be deleted from your disk.

But for any given doc, the docid in segment3 will very likely be different
than it was in segment1 or 2.

I think you're reading too much into LUCENE-2897. I'm pretty sure the
segment in question is not available to you anyway before this rewrite is
done, but freely admit I don't know much about it.

You're probably going to get into the whole PerSegment family of operations,
which is something I'm not all that familiar with so I'll leave
explanations to others.


On Sat, Nov 23, 2013 at 8:22 PM, Roman Chyla <ro...@gmail.com> wrote:

> Hi Erick,
> Many thanks for the info. An additional question:
>
> Do i understand you correctly that when two segments get merged, the docids
> (of the original segments) remain the same?
>
> (unless, perhaps, in the situation where they were merged using the last
> index segment, which was opened for writing and where the docids could have
> suddenly changed in a commit just before the merge)
>
> Yes, you guessed right that I am putting my code into the custom cache - so
> it gets notified on index changes. I don't know yet how, but I think I can
> find the way to the current active, opened (last) index segment. Which is
> actively updated (as opposed to just being merged) -- so my definition of
> 'not last ones' is: where docids don't change. I'd be grateful if someone
> could spot any problem with such an assumption.
>
> roman

Re: building custom cache - using lucene docids

Posted by Roman Chyla <ro...@gmail.com>.
Hi Erick,
Many thanks for the info. An additional question:

Do i understand you correctly that when two segments get merged, the docids
(of the original segments) remain the same?

(unless, perhaps, in the situation where they were merged using the last
index segment, which was opened for writing and where the docids could have
suddenly changed in a commit just before the merge)

Yes, you guessed right that I am putting my code into the custom cache - so
it gets notified on index changes. I don't know yet how, but I think I can
find the way to the current active, opened (last) index segment. Which is
actively updated (as opposed to just being merged) -- so my definition of
'not last ones' is: where docids don't change. I'd be grateful if someone
could spot any problem with such an assumption.

roman




On Sat, Nov 23, 2013 at 7:39 PM, Erick Erickson <er...@gmail.com> wrote:

> bq: But can I assume
> that docids in other segments (other than the last one) will be relatively
> stable?
>
> Kinda. Maybe. Maybe not. It depends on how you define "other than the
> last one".
>
> The key is that the internal doc IDs may change when segments are
> merged. And old segments get merged. Doc IDs will _never_ change
> in a segment once it's closed (although as you note they may be
> marked as deleted). But that segment may be written to a new segment
> when merging and the internal ID for a given document in the new
> segment bears no relationship to internal ID in the old segment.
>
> BTW, I think you only really care when opening a new searcher. There is
> a UserCache (see solrconfig.xml) that gets notified when a new searcher
> is being opened to give it an opportunity to refresh itself, is that
> useful?
>
> As long as a searcher is open, it's guaranteed that nothing is changing.
> Hard commits with openSearcher=false don't open new searchers, which
> is why changes aren't visible until a softCommit or a hard commit with
> openSearcher=true despite the fact that the segments are closed.
>
> FWIW,
> Erick
>
> Best
> Erick
>
>

Re: building custom cache - using lucene docids

Posted by Erick Erickson <er...@gmail.com>.
bq: But can I assume
that docids in other segments (other than the last one) will be relatively
stable?

Kinda. Maybe. Maybe not. It depends on how you define "other than the
last one".

The key is that the internal doc IDs may change when segments are
merged. And old segments get merged. Doc IDs will _never_ change
in a segment once it's closed (although as you note they may be
marked as deleted). But that segment may be written to a new segment
when merging and the internal ID for a given document in the new
segment bears no relationship to internal ID in the old segment.
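
To make the renumbering concrete, here is a toy model in plain Java. It
assumes the simplest case, where a merge concatenates the source segments
in order and drops deleted docs; that assumption, the class MergeRemap and
the sample liveDocs flags are all illustrative and not taken from Lucene's
merge code.

```java
// Sketch only: computes old-local-docid -> new-docid for a merge that
// concatenates segments in order and skips deleted docs. Not Lucene API.
final class MergeRemap {

    /**
     * For each source segment, returns the docid each old local docid gets
     * in the merged segment, or -1 if the doc was deleted and merged away.
     */
    static int[][] remap(boolean[][] liveDocs) {
        int[][] map = new int[liveDocs.length][];
        int next = 0; // next free docid in the merged segment
        for (int s = 0; s < liveDocs.length; s++) {
            map[s] = new int[liveDocs[s].length];
            for (int d = 0; d < liveDocs[s].length; d++) {
                map[s][d] = liveDocs[s][d] ? next++ : -1;
            }
        }
        return map;
    }

    public static void main(String[] args) {
        boolean[][] live = {
            {true, false, true}, // segment1: local doc 1 is deleted
            {false, true}        // segment2: local doc 0 is deleted
        };
        // Prints [[0, -1, 1], [-1, 2]]
        System.out.println(java.util.Arrays.deepToString(remap(live)));
    }
}
```

Even in this simple model, doc 1 of segment2 ends up as doc 2 of the
merged segment, so anything cached under the old ID has to be re-keyed or
rebuilt for the new segment.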

BTW, I think you only really care when opening a new searcher. There is
a UserCache (see solrconfig.xml) that gets notified when a new searcher
is being opened to give it an opportunity to refresh itself; is that useful?

As long as a searcher is open, it's guaranteed that nothing is changing.
Hard commits with openSearcher=false don't open new searchers, which
is why changes aren't visible until a softCommit or a hard commit with
openSearcher=true despite the fact that the segments are closed.

FWIW,
Erick

Best
Erick


