Posted to dev@lucene.apache.org by Michael Busch <bu...@gmail.com> on 2008/01/22 12:07:16 UTC

Unique doc ids

Hi Team,

the question of how to delete with IndexWriter using doc ids is
currently being discussed on java-user
(http://www.gossamer-threads.com/lists/lucene/java-user/57228), so I
thought this is a good time to mention an idea that I recently had. I'm
planning to work on column-stored fields soon (I used to call them
per-document payloads). Then we'll have the ability to store metadata
for each document very efficiently in the index.

This new data structure could be used to store a unique ID for each doc
in the index. The IndexReader would then get an API that provides a
mapping from the dynamic doc ids to the new unique ones. We would also
have to store a reverse mapping (UID -> ID) in the index - we could use
a VInt list + skip list for that.

Then we should be able to make IndexReaders "read-only" (LUCENE-1030)
and provide a new API in IndexWriter "delete by UID". This would allow
us to "delete by query" as well. The disadvantage is that the index would
become bigger, but that should still be ok: 8 bytes per doc for the
ID->UID map (assuming we took long for the UID, which I'd suggest). The
UID->ID map might even be a bit smaller initially (using VInts and
VLongs), but might become bigger when the index has lots of deleted
docs, because then the delta encoding wouldn't be as efficient anymore
for the UIDs.
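
To make this concrete, here is a rough sketch of how the proposed APIs might
look. Every method below (getUID, deleteByUID) is hypothetical; nothing like
this exists in the core yet:

   // Hypothetical sketch only: getUID/deleteByUID are the proposed
   // APIs, not existing Lucene methods.
   void deleteByDocIDs(IndexReader reader, IndexWriter writer, int[] docIDs)
       throws IOException {
     for (int docID : docIDs) {
       long uid = reader.getUID(docID);  // ID -> UID via the column-stored field
       writer.deleteByUID(uid);          // UIDs stay stable across flushes/merges
     }
   }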

If RAM permits, the maps could also be cached in memory (optional,
configurable). The FieldCache overhaul (LUCENE-831) with column fields
as source can help here.

After all this is implemented (column fields, UIDs, "read-only"
IndexReaders, FieldCache overhaul) I'd like to make the column fields
(and norms) updateable via IndexWriter.

OK, lots of food for thought.

-Michael

Re: JBoss Cache as a store

Posted by Manik Surtani <ma...@jboss.org>.
Bump.  Anyone?


On 24 Jan 2008, at 14:07, Manik Surtani wrote:

> Hi guys
>
> I've just written a plugin for Lucene to use JBoss Cache as an index  
> store.  The benefits of something like this are:
>
> 1.  Faster access to indexes as they will be in memory
> 2.  Indexes replicated across a cluster of servers
> 3.  Indexes "persisted" in clustered memory - faster than
> persistence to disk
>
> The implementation I have is pretty basic for now.
>
> Is there a set of tests in the Lucene sources I could use to test  
> the "JBCDirectory", as I call it?  Perhaps some way I could
> change the "index store provider" and re-run some existing tests,  
> and perhaps add some clustered tests specific to my plugin?
>
> Finally, regarding hosting, I am happy to contribute this to Lucene  
> (alongside the JEDirectory, etc) but if licensing (JBoss Cache is  
> LGPL, although the plugin code can be ASL if need be) or language  
> levels (the plugin depends on JBoss Cache 2.x, which requires JDK 5)  
> then I'm happy to host the plugin externally.
>
> Cheers,
> --
> Manik Surtani
> Lead, JBoss Cache
> manik@jboss.org

--
Manik Surtani
Lead, JBoss Cache
manik@jboss.org

Re: JBoss Cache as a store

Posted by Karl Wettin <ka...@gmail.com>.
On 29 Jan 2008, at 23:30, Chris Hostetter wrote:

> I think most of the existing tests have the Directory impl hardcoded  
> in
> them ... the best thing to do might be to refactor the existing  
> tests so
> Directory creation comes from an overridable function in a subclass...
> come to think of it, Karl may have already done this as part of his

I did, but the patch was for old code and was removed as an artifact, as I
came up with a simpler scheme: populate my store with the contents of an
FSDirectory and then assert the behaviour of two index readers.

See TestCompareIndices.java in LUCENE-550.
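
A rough sketch of that scheme (not the actual LUCENE-550 code; MyDirectory
stands in for whatever store is under test):

   // Populate the store under test from a known-good FSDirectory, then
   // open readers over both and assert that they agree.
   Directory fsDir = FSDirectory.getDirectory("/path/to/known/index");
   Directory testDir = new MyDirectory();   // the implementation under test
   Directory.copy(fsDir, testDir, false);   // copy all index files over

   IndexReader expected = IndexReader.open(fsDir);
   IndexReader actual = IndexReader.open(testDir);
   assertEquals(expected.maxDoc(), actual.maxDoc());
   assertEquals(expected.numDocs(), actual.numDocs());
   // ... then walk terms, postings and stored fields of both readers ...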


   karl

Re: JBoss Cache as a store

Posted by Manik Surtani <ma...@jboss.org>.
On 29 Jan 2008, at 22:30, Chris Hostetter wrote:

>
> : Is there a set of tests in the Lucene sources I could use to test  
> the
> : "JBCDirectory", as I call it?  Perhaps some way I could
> change the "index
> : store provider" and re-run some existing tests, and perhaps add  
> some clustered
> : tests specific to my plugin?
>
> I think most of the existing tests have the Directory impl hardcoded  
> in
> them ... the best thing to do might be to refactor the existing  
> tests so
> Directory creation comes from an overridable function in a subclass...
> come to think of it, Karl may have already done this as part of his
> InstantiatedIndex patch (check jira) but i'm not sure ... the  
> conversation
> sounds familiar, but i think he was looking at facading the entire
> IndexReader impl not just the directory, so any refactoring approach  
> he
> might have taken may not have gone far enough to work in this case.
>
> It would certainly be nice if there were an easy way to run every
> test in
> the test suite against an arbitrary Directory implementation.

Cool.  Well, for now, I'll follow Mark Harwood's recommendation to  
copy the relevant tests that use RAMDirectory and change the directory  
implementation.

>
>
> : Finally, regarding hosting, I am happy to contribute this to  
> Lucene (alongside
> : the JEDirectory, etc) but if licensing (JBoss Cache is LGPL,  
> although the
> : plugin code can be ASL if need be) or language levels (the plugin  
> depends on
> : JBoss Cache 2.x, which requires JDK 5) then I'm happy to host the  
> plugin
> : externally.
>
> contribs can require 1.5 already ... and soon the trunk will move
> to
> 1.5 so that's not really an issue, the licensing may be, but it  
> depends on
> how the integration with JBoss winds up working (ie: i don't know if
> having the build scripts download JBoss at build time to compile  
> against
> them is allowed or not)
>
>

Who would the best person be to contact about this?  I'm assuming this  
is not a problem since the JEDirectory pulls down BDBJE stuff which  
certainly isn't Apache-licensed.

Cheers,
--
Manik Surtani
Lead, JBoss Cache
manik@jboss.org

Re: JBoss Cache as a store

Posted by Chris Hostetter <ho...@fucit.org>.
: Is there a set of tests in the Lucene sources I could use to test the
: "JBCDirectory", as I call it?  Perhaps some way I could change the "index
: store provider" and re-run some existing tests, and perhaps add some clustered
: tests specific to my plugin?

I think most of the existing tests have the Directory impl hardcoded in 
them ... the best thing to do might be to refactor the existing tests so 
Directory creation comes from an overridable function in a subclass...  
come to think of it, Karl may have already done this as part of his
InstantiatedIndex patch (check jira) but i'm not sure ... the conversation 
sounds familiar, but i think he was looking at facading the entire 
IndexReader impl not just the directory, so any refactoring approach he 
might have taken may not have gone far enough to work in this case.

It would certainly be nice if there were an easy way to run every test in
the test suite against an arbitrary Directory implementation.
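
For illustration, the refactoring could be as small as a factory method in a
shared base class; DirectoryTestCase and the test body below are made up:

   // Sketch: existing tests keep their RAMDirectory default, while a
   // Directory implementor overrides newDirectory() in a subclass and
   // inherits the whole suite.
   public abstract class DirectoryTestCase extends TestCase {
     protected Directory newDirectory() throws IOException {
       return new RAMDirectory();   // the current hardcoded choice
     }

     public void testWriteThenSearch() throws Exception {
       Directory dir = newDirectory();   // a subclass may return anything
       IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(), true);
       Document doc = new Document();
       doc.add(new Field("f", "hello world", Field.Store.YES,
                         Field.Index.TOKENIZED));
       writer.addDocument(doc);
       writer.close();

       IndexSearcher searcher = new IndexSearcher(dir);
       assertEquals(1, searcher.search(new TermQuery(new Term("f", "hello"))).length());
       searcher.close();
     }
   }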

: Finally, regarding hosting, I am happy to contribute this to Lucene (alongside
: the JEDirectory, etc) but if licensing (JBoss Cache is LGPL, although the
: plugin code can be ASL if need be) or language levels (the plugin depends on
: JBoss Cache 2.x, which requires JDK 5) then I'm happy to host the plugin
: externally.

contribs can require 1.5 already ... and soon the trunk will move to
1.5, so that's not really an issue; the licensing may be, but it depends on
how the integration with JBoss winds up working (ie: i don't know if 
having the build scripts download JBoss at build time to compile against 
them is allowed or not)

-Hoss

JBoss Cache as a store

Posted by Manik Surtani <ma...@jboss.org>.
Hi guys

I've just written a plugin for Lucene to use JBoss Cache as an index  
store.  The benefits of something like this are:

1.  Faster access to indexes as they will be in memory
2.  Indexes replicated across a cluster of servers
3.  Indexes "persisted" in clustered memory - faster than persistence
to disk

The implementation I have is pretty basic for now.
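
(For context: a Directory plugs into the rest of Lucene unchanged. Sketched
usage below, with a made-up constructor since the JBCDirectory internals
aren't shown here:)

   // A custom Directory is a drop-in replacement for FSDirectory or
   // RAMDirectory anywhere Lucene takes a Directory.
   Directory dir = new JBCDirectory(cache);   // hypothetical constructor
   IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
   Document doc = new Document();
   doc.add(new Field("body", "indexed into clustered memory",
                     Field.Store.YES, Field.Index.TOKENIZED));
   writer.addDocument(doc);
   writer.close();
   IndexSearcher searcher = new IndexSearcher(dir);   // reads from the cache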

Is there a set of tests in the Lucene sources I could use to test the
"JBCDirectory", as I call it?  Perhaps some way I could change
the "index store provider" and re-run some existing tests, and perhaps  
add some clustered tests specific to my plugin?

Finally, regarding hosting, I am happy to contribute this to Lucene  
(alongside the JEDirectory, etc) but if licensing (JBoss Cache is  
LGPL, although the plugin code can be ASL if need be) or language  
levels (the plugin depends on JBoss Cache 2.x, which requires JDK 5)  
then I'm happy to host the plugin externally.

Cheers,
--
Manik Surtani
Lead, JBoss Cache
manik@jboss.org

Re: Unique doc ids

Posted by Michael McCandless <lu...@mikemccandless.com>.
Yonik Seeley wrote:

> On Jan 24, 2008 5:47 AM, Michael McCandless  
> <lu...@mikemccandless.com> wrote:
>>
>> Yonik Seeley wrote:
>>
>>> On Jan 23, 2008 6:34 AM, Michael McCandless
>>> <lu...@mikemccandless.com> wrote:
>>>>    writer.freezeDocIDs();
>>>>    try {
>>>>      get docIDs from somewhere & call writer.deleteByDocID
>>>>    } finally {
>>>>      writer.unfreezeDocIDs();
>>>>    }
>>>
>>> Interesting idea, but would require the IndexWriter to flush the
>>> buffered docs so an IndexReader could be created for them.  (or
>>> would
>>> require the existence of an UnflushedDocumentsIndexReader)
>>
>> True.
>>
>> Actually, an UnflushedDocumentsIndexReader would not be hard!
>>
>> DocumentsWriter already has an IndexInput (ByteSliceReader) that can
>> read the postings for a single term from the RAM buffer (this is used
>> when flushing the segment).  I think it'd be straightforward to get
>> TermEnum/TermDocs/TermPositions iterators on the buffered docs.
>> Norms are already stored as byte arrays in memory.  FieldInfos is
>> already available.  The stored fields & term vectors are already
>> flushed to the directory so they could be read normally.
>>
>> Hmm, buffered delete terms are tricky.  I guess freezeDocIDs would
>> have to flush deleted terms (and queries, if we add that) before
>> making a reader accessible,
>
> If we buffer queries, that would seem to take care of 99% of the
> usecases that need an IndexReader, right?   A custom query could get
> ids from an index however it wanted.

I think so?

So, if we add only buffered "deleteByQuery" (and setNorm) to  
IndexWriter, is that enough to deprecate deleteDocument, setNorm in  
IndexReader?

>> though, the cost is shared because the
>> readers need to be opened anyway (so the app can find docIDs).
>>
>> So maybe this approach becomes this:
>>
>>    // Returns a "point in time" frozen view of index...
>>    IndexReader reader = writer.getReader();
>>    try {
>>      <get docIDs from reader, delete by docID>
>>    } finally {
>>      writer.releaseReader();
>>    }
>>
>> ?
>>
>> We may even be able to implement this w/o actually freezing the
>> writer,
>> ie, still allowing add/updateDocument calls to proceed.
>> Merging could certainly still proceed.  This way you could at any
>> time ask a writer for a "point in time" reader, independent of what
>> else you are doing with the writer.  This would require, on flushing,
>> that writer goes and swaps in a "real" segment reader, limited to a
>> specified docID, for any point in time readers that are open.
>
> Wow... sounds complex.

I think it may not be so bad ... the raw ingredients are already done  
(like ByteSliceReader) ... need to ponder it some more.

I think one very powerful side effect of doing this would be that you  
could have extremely low latency indexing ("highly interactive  
indexing").  You would add/delete docs using the writer, then quickly  
re-open the reader, and be able to search the buffered docs without  
the cost of flushing a new segment, assuming it's all within one JVM.
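
Sketched usage, reusing the hypothetical getReader/releaseReader from above
(not an existing API):

   // Hypothetical low-latency loop: update, then search immediately,
   // with no explicit flush in between.
   writer.updateDocument(new Term("id", "42"), newDoc);
   IndexReader reader = writer.getReader();   // cheap point-in-time view
   try {
     IndexSearcher searcher = new IndexSearcher(reader);
     Hits hits = searcher.search(query);      // already sees the update
   } finally {
     writer.releaseReader();
   }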

This reader (that searches both on-disk segments and the writer's  
buffered docs) would do reopen extremely efficiently.  In the  
[distant?] future, it could even do searching "live", meaning the  
full buffer is always searched rather than a point-in-time snapshot.   
But we couldn't really do this until we re-work the FieldCache API to  
belong to each segment & be incrementally updateable such that if a  
new doc is added to the writer, we could efficiently update the  
FieldCache, if present.  That would be a big change :)

Lots to think through ....

>>>> If we went that route, we'd need to expose methods in  
>>>> IndexWriter to
>>>> let you get reader(s), and, to then delete by docID.
>>>
>>> Right... I had envisioned a callback that was called after a new
>>> segment was created/flushed that passed IndexReader[].  In an
>>> environment of mixed deletes and adds, it would avoid slowing  
>>> down the
>>> indexing part by limiting where the deletes happen.
>>
>> This would certainly be less work :)  I guess the question is how
>> severely are we limiting the application by requiring that you can
>> only do deletes when IW decides to flush, or, by forcing the
>> application to flush when it wants to do deletes.
>
> Seems like more work, rather than limiting... "when" really isn't as
> important as long as it's before a new external IndexReader is opened
> for searching.

Right but if you want very low latency indexing (or even essentially  
0) then you can't really afford to buffer deletes (or adds) for that  
long...

>>> It does put a little more burden on the user, but a slightly harder
>>> (but more powerful / more efficient) API is preferable since easier
>>> APIs can always be built on top (but not vice-versa).
>>
>> True, though emulating the easier API on top of the "you get to
>> delete only when IW flushes" means you are forcing a flush, right?
>
> I was thinking via buffering (the same way term deletes are handled  
> now).
> You keep track of maxDoc() at the time of the delete and defer it  
> until later.

Oh, right, OK.

Mike

Re: Unique doc ids

Posted by Yonik Seeley <yo...@apache.org>.
On Jan 24, 2008 5:47 AM, Michael McCandless <lu...@mikemccandless.com> wrote:
>
> Yonik Seeley wrote:
>
> > On Jan 23, 2008 6:34 AM, Michael McCandless
> > <lu...@mikemccandless.com> wrote:
> >>    writer.freezeDocIDs();
> >>    try {
> >>      get docIDs from somewhere & call writer.deleteByDocID
> >>    } finally {
> >>      writer.unfreezeDocIDs();
> >>    }
> >
> > Interesting idea, but would require the IndexWriter to flush the
> > buffered docs so an IndexReader could be created for them.  (or would
> > require the existence of an UnflushedDocumentsIndexReader)
>
> True.
>
> Actually, an UnflushedDocumentsIndexReader would not be hard!
>
> DocumentsWriter already has an IndexInput (ByteSliceReader) that can
> read the postings for a single term from the RAM buffer (this is used
> when flushing the segment).  I think it'd be straightforward to get
> TermEnum/TermDocs/TermPositions iterators on the buffered docs.
> Norms are already stored as byte arrays in memory.  FieldInfos is
> already available.  The stored fields & term vectors are already
> flushed to the directory so they could be read normally.
>
> Hmm, buffered delete terms are tricky.  I guess freezeDocIDs would
> have to flush deleted terms (and queries, if we add that) before
> making a reader accessible,

If we buffer queries, that would seem to take care of 99% of the
usecases that need an IndexReader, right?   A custom query could get
ids from an index however it wanted.

> though, the cost is shared because the
> readers need to be opened anyway (so the app can find docIDs).
>
> So maybe this approach becomes this:
>
>    // Returns a "point in time" frozen view of index...
>    IndexReader reader = writer.getReader();
>    try {
>      <get docIDs from reader, delete by docID>
>    } finally {
>      writer.releaseReader();
>    }
>
> ?
>
> We may even be able to implement this w/o actually freezing the
> writer,
> ie, still allowing add/updateDocument calls to proceed.
> Merging could certainly still proceed.  This way you could at any
> time ask a writer for a "point in time" reader, independent of what
> else you are doing with the writer.  This would require, on flushing,
> that writer goes and swaps in a "real" segment reader, limited to a
> specified docID, for any point in time readers that are open.

Wow... sounds complex.

> >> If we went that route, we'd need to expose methods in IndexWriter to
> >> let you get reader(s), and, to then delete by docID.
> >
> > Right... I had envisioned a callback that was called after a new
> > segment was created/flushed that passed IndexReader[].  In an
> > environment of mixed deletes and adds, it would avoid slowing down the
> > indexing part by limiting where the deletes happen.
>
> This would certainly be less work :)  I guess the question is how
> severely are we limiting the application by requiring that you can
> only do deletes when IW decides to flush, or, by forcing the
> application to flush when it wants to do deletes.

Seems like more work, rather than limiting... "when" really isn't as
important as long as it's before a new external IndexReader is opened
for searching.

> > It does put a little more burden on the user, but a slightly harder
> > (but more powerful / more efficient) API is preferable since easier
> > APIs can always be built on top (but not vice-versa).
>
> True, though emulating the easier API on top of the "you get to
> delete only when IW flushes" means you are forcing a flush, right?

I was thinking via buffering (the same way term deletes are handled now).
You keep track of maxDoc() at the time of the delete and defer it until later.
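
(A sketch of that bookkeeping; the class name is illustrative, but it mirrors
how buffered term deletes are applied today:)

   // Remember maxDoc() at delete time; when the buffer is later applied,
   // only docIDs below the remembered limit are deleted, so documents
   // added afterwards can never be hit by a stale docID.
   class BufferedDocIDDelete {
     final int docID;
     final int maxDocWhenDeleted;
     BufferedDocIDDelete(int docID, int maxDoc) {
       this.docID = docID;
       this.maxDocWhenDeleted = maxDoc;
     }
   }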

-Yonik

Re: Unique doc ids

Posted by Michael McCandless <lu...@mikemccandless.com>.
Yonik Seeley wrote:

> On Jan 23, 2008 6:34 AM, Michael McCandless  
> <lu...@mikemccandless.com> wrote:
>>    writer.freezeDocIDs();
>>    try {
>>      get docIDs from somewhere & call writer.deleteByDocID
>>    } finally {
>>      writer.unfreezeDocIDs();
>>    }
>
> Interesting idea, but would require the IndexWriter to flush the
> buffered docs so an IndexReader could be created for them.  (or would
> require the existence of an UnflushedDocumentsIndexReader)

True.

Actually, an UnflushedDocumentsIndexReader would not be hard!

DocumentsWriter already has an IndexInput (ByteSliceReader) that can  
read the postings for a single term from the RAM buffer (this is used  
when flushing the segment).  I think it'd be straightforward to get  
TermEnum/TermDocs/TermPositions iterators on the buffered docs.   
Norms are already stored as byte arrays in memory.  FieldInfos is  
already available.  The stored fields & term vectors are already  
flushed to the directory so they could be read normally.

Hmm, buffered delete terms are tricky.  I guess freezeDocIDs would  
have to flush deleted terms (and queries, if we add that) before  
making a reader accessible, though, the cost is shared because the  
readers need to be opened anyway (so the app can find docIDs).

So maybe this approach becomes this:

   // Returns a "point in time" frozen view of index...
   IndexReader reader = writer.getReader();
   try {
     <get docIDs from reader, delete by docID>
   } finally {
     writer.releaseReader();
   }

?

We may even be able to implement this w/o actually freezing the  
writer, ie, still allowing add/updateDocument calls to proceed.   
Merging could certainly still proceed.  This way you could at any  
time ask a writer for a "point in time" reader, independent of what  
else you are doing with the writer.  This would require, on flushing,  
that writer goes and swaps in a "real" segment reader, limited to a  
specified docID, for any point in time readers that are open.

>> If we went that route, we'd need to expose methods in IndexWriter to
>> let you get reader(s), and, to then delete by docID.
>
> Right... I had envisioned a callback that was called after a new
> segment was created/flushed that passed IndexReader[].  In an
> environment of mixed deletes and adds, it would avoid slowing down the
> indexing part by limiting where the deletes happen.

This would certainly be less work :)  I guess the question is how  
severely are we limiting the application by requiring that you can  
only do deletes when IW decides to flush, or, by forcing the  
application to flush when it wants to do deletes.

> It does put a little more burden on the user, but a slightly harder
> (but more powerful / more efficient) API is preferable since easier
> APIs can always be built on top (but not vice-versa).

True, though emulating the easier API on top of the "you get to  
delete only when IW flushes" means you are forcing a flush, right?

Mike

Re: Unique doc ids

Posted by Yonik Seeley <yo...@apache.org>.
On Jan 23, 2008 6:34 AM, Michael McCandless <lu...@mikemccandless.com> wrote:
>    writer.freezeDocIDs();
>    try {
>      get docIDs from somewhere & call writer.deleteByDocID
>    } finally {
>      writer.unfreezeDocIDs();
>    }

Interesting idea, but would require the IndexWriter to flush the
buffered docs so an IndexReader could be created for them.  (or would
require the existence of an UnflushedDocumentsIndexReader)

> If we went that route, we'd need to expose methods in IndexWriter to
> let you get reader(s), and, to then delete by docID.

Right... I had envisioned a callback that was called after a new
segment was created/flushed that passed IndexReader[].  In an
environment of mixed deletes and adds, it would avoid slowing down the
indexing part by limiting where the deletes happen.

It does put a little more burden on the user, but a slightly harder
(but more powerful / more efficient) API is preferable since easier
APIs can always be built on top (but not vice-versa).

> I do like the idea of a UID field, but I'm a bit nervous about having
> the "core" maintain it

+1

-Yonik

Re: Unique doc ids

Posted by Grant Ingersoll <gs...@apache.org>.
On Jan 23, 2008, at 6:34 AM, Michael McCandless wrote:

>
> At first it might be optional,

+1

There are still applications that don't require a UID, or are static  
for long enough periods of time that the Lucene internal id is  
sufficient, so I would hate to impose this on those apps.

I think the "per doc payloads" is a good idea, but I don't know if we  
need to provide explicit UID functionality on top of that.  Or, if we  
do, it could be an optional layer on top of the existing functionality.

-Grant 

Re: Unique doc ids

Posted by Michael McCandless <lu...@mikemccandless.com>.
Michael,

Couldn't we add deleteByQuery to IndexWriter without adding the UID  
field?

Would that be "enough" to make IndexReader read-only (ie, do we still  
really need to delete by docID from IndexWriter?).

If we still need that ... maybe we could extend IndexWriter so that  
you can hold a lock on docIDs changing while you do your stuff, eg:

   writer.freezeDocIDs();
   try {
     get docIDs from somewhere & call writer.deleteByDocID
   } finally {
     writer.unfreezeDocIDs();
   }

If we went that route, we'd need to expose methods in IndexWriter to  
let you get reader(s), and, to then delete by docID.

I'm not certain this will work :)  I'm just throwing alternative  
ideas out...

I do like the idea of a UID field, but I'm a bit nervous about having  
the "core" maintain it and then have things in the core that depend  
on its presence.  At first it might be optional, but I could see us  
over time making more and more functionality that requires UID to be
present, to the point where it's eventually not really optional...

Mike

Michael Busch wrote:

> Paul Elschot wrote:
>> Michael,
>>
>> How would IndexWriter.addIndexes() work with unique doc ids?
>
> Hi Paul,
>
> it would probably be a limitation of this design. The only way I can
> think of right now to ensure that during an addIndexes() the UIDs  
> don't
> change is an API in IndexWriter like setMinUID(long). When you  
> create an
> index and you know that you'll add it to another one via addIndexes(),
> then you could use this method to set the min UID value in that  
> index to
> the max number of add/update operations you'd expect in the other  
> index.
>
> Please note that the UIDs that I'm thinking about here would actually
> not affect the index order. All postings would still be stored in
> (dynamic) doc id order.
> This means, with this design the search results would not be  
> returned in
> UID order, so the UIDs couldn't be used efficiently e. g. for a join
> operation with an external data structure (e. g. database). I think in
> this regard my proposed UID design differs from what was discussed  
> here
> some time ago.
>
> The main use case here is to get rid of readers that do write
> operations.
> I think that this would be very desirable when we implement
> updateable
> column-fields. Then you could use the UIDs that an IndexReader  
> returned
> to delete or update docs or the column fields/norms, and you wouldn't
> have to worry about IndexReaders being "in sync" with the  
> IndexWriters.
>
> Maybe this UID design that I'm thinking out loud here is total
> overkill for the mentioned use cases. I'm open and interested in other
> alternative ideas!
> -Michael

Re: Unique doc ids

Posted by Michael Busch <bu...@gmail.com>.
Paul Elschot wrote:
> Michael,
> 
> How would IndexWriter.addIndexes() work with unique doc ids?

Hi Paul,

it would probably be a limitation of this design. The only way I can
think of right now to ensure that during an addIndexes() the UIDs don't
change is an API in IndexWriter like setMinUID(long). When you create an
index and you know that you'll add it to another one via addIndexes(),
then you could use this method to set the min UID value in that index to
the max number of add/update operations you'd expect in the other index.
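
Sketched usage of that hypothetical setMinUID(long):

   // Reserve UID headroom in the small index so its UIDs cannot collide
   // with any UID the target index will assign before the addIndexes().
   IndexWriter small = new IndexWriter(smallDir, analyzer, true);
   small.setMinUID(10000000L);   // hypothetical API, value is a guess
   // ... add documents, close, and later on the target index:
   target.addIndexes(new Directory[] { smallDir });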

Please note that the UIDs that I'm thinking about here would actually
not affect the index order. All postings would still be stored in
(dynamic) doc id order.
This means, with this design the search results would not be returned in
UID order, so the UIDs couldn't be used efficiently e. g. for a join
operation with an external data structure (e. g. database). I think in
this regard my proposed UID design differs from what was discussed here
some time ago.

The main use case here is to get rid of readers that do write operations.
I think that this would be very desirable when we implement updateable
column-fields. Then you could use the UIDs that an IndexReader returned
to delete or update docs or the column fields/norms, and you wouldn't
have to worry about IndexReaders being "in sync" with the IndexWriters.

Maybe this UID design that I'm thinking out loud here is total
overkill for the mentioned use cases. I'm open and interested in other
alternative ideas!

-Michael

Re: Unique doc ids

Posted by Paul Elschot <pa...@xs4all.nl>.
Michael,

How would IndexWriter.addIndexes() work with unique doc ids?

Regards,
Paul Elschot


On Tuesday 22 January 2008 12:07:16, Michael Busch wrote:
> Hi Team,
> 
> the question of how to delete with IndexWriter using doc ids is
> currently being discussed on java-user
> (http://www.gossamer-threads.com/lists/lucene/java-user/57228), so I
> thought this is a good time to mention an idea that I recently had. I'm
> planning to work on column-stored fields soon (I used to call them
> per-document payloads). Then we'll have the ability to store metadata
> for each document very efficiently in the index.
> 
> This new data structure could be used to store a unique ID for each doc
> in the index. The IndexReader would then get an API that provides a
> mapping from the dynamic doc ids to the new unique ones. We would also
> have to store a reverse mapping (UID -> ID) in the index - we could use
> a VInt list + skip list for that.
> 
> Then we should be able to make IndexReaders "read-only" (LUCENE-1030)
> and provide a new API in IndexWriter "delete by UID". This would allow
> us to "delete by query" as well. The disadvantage is that the index would
> become bigger, but that should still be ok: 8 bytes per doc for the
> ID->UID map (assuming we took long for the UID, which I'd suggest). The
> UID->ID map might even be a bit smaller initially (using VInts and
> VLongs), but might become bigger when the index has lots of deleted
> docs, because then the delta encoding wouldn't be as efficient anymore
> for the UIDs.
> 
> If RAM permits, the maps could also be cached in memory (optional,
> configurable). The FieldCache overhaul (LUCENE-831) with column fields
> as source can help here.
> 
> After all this is implemented (column fields, UIDs, "read-only"
> IndexReaders, FieldCache overhaul) I'd like to make the column fields
> (and norms) updateable via IndexWriter.
> 
> OK, lots of food for thought.
> 
> -Michael

Re: Unique doc ids

Posted by Michael Busch <bu...@gmail.com>.
Terry Yang wrote:
> Hi Michael,
> Your idea is good! But I have a question, and thanks for your help!
> 

Hi Terry,

> Can you explain more about how you store the reverse UID -> ID mapping?
> How do you guarantee that a UID can be mapped to the correct dynamic ID?
> I mean, if a docid = 5 is later changed to 60 for some reason, wouldn't
> you still have UID -> 5 stored in a file/in memory?
> 
> 

Good question!

You can think of a UID as a special, unique term that every document
has. Let's say we have the following segment:

S1:
UID -> ID
  0 ->  0
  1 ->  1
  2 ->  2

Now we flush the segment, add two docs, update the document with UID=1,
add another doc, and then we'll have these two segments:

S1:
UID -> ID
  0 ->  0
  1 ->  1 (deleted)
  2 ->  2

S12:
UID -> ID
  1 ->  2
  3 ->  0
  4 ->  1
  5 ->  3

You can view the UIDs as terms with a posting list, each list containing
just one posting. Now we want to find the ID for UID=1: in the example
we have two segments with the same UID=1. However, we know that the doc
in S1 with ID=1 is deleted, so we keep looking in the other segment(s)
for the UID until we find one whose corresponding ID is not deleted.
There can only be one valid entry at any time for one UID.
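
A sketch of that lookup (docIDForUID and the per-segment reader list are
hypothetical; isDeleted and maxDoc are the existing IndexReader methods):

   // Probe each segment's UID -> ID structure, skipping entries whose
   // doc is deleted; at most one segment holds a live entry per UID.
   int lookupDocID(long uid) {
     int docBase = 0;
     for (IndexReader segment : segmentReaders) {
       int docID = docIDForUID(segment, uid);   // hypothetical: -1 if absent
       if (docID != -1 && !segment.isDeleted(docID)) {
         return docBase + docID;                // make it index-wide
       }
       docBase += segment.maxDoc();
     }
     return -1;                                 // no live doc for this UID
   }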

Of course we shouldn't really use a term + posting list for the UIDs,
because this would be quite inefficient with the data structures we
currently have. We wouldn't want to store the UIDs as Strings and we
wouldn't need to store e. g. freq or positions. Also we might be able to
implement some heuristics to optimize the order in which we iterate the
segments for the UID lookup.

I believe this should work?

-Michael

Re: Unique doc ids

Posted by Terry Yang <my...@gmail.com>.
Hi Michael,
Your idea is good! But I have a question, and thanks for your help!

How do you plan to store a unique ID for each doc? My understanding is that
we would add a field (i.e. "uniqueid") to each doc, where the field has one
identical token value for every document.
We can add the unique ID as a payload for that token before indexing. So we
can use IndexReader.termPositions() to get all the unique IDs and IDs.
Can you explain more about how you store the reverse UID -> ID mapping?
How do you guarantee that a UID can be mapped to the correct dynamic ID?
I mean, if a docid = 5 is later changed to 60 for some reason, wouldn't you
still have UID -> 5 stored in a file/in memory?
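
(For reference, reading such token payloads back with the current API would
look roughly like this, assuming every doc gets one marker token whose payload
is the 8-byte UID; the field/term names and decodeLong are made up:)

   // Walk the single posting list of the marker term and decode the
   // 8-byte UID payload for each document.
   TermPositions tp = reader.termPositions(new Term("uid", "_UID_"));
   byte[] buf = new byte[8];
   while (tp.next()) {
     int docID = tp.doc();
     tp.nextPosition();                   // must advance before getPayload
     buf = tp.getPayload(buf, 0);
     long uid = decodeLong(buf);          // hypothetical 8-byte decoder
     // ... record the docID <-> uid pair ...
   }
   tp.close();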

On 1/22/08, Michael Busch <bu...@gmail.com> wrote:
> Hi Team,
>
> the question of how to delete with IndexWriter using doc ids is
> currently being discussed on java-user
> (http://www.gossamer-threads.com/lists/lucene/java-user/57228), so I
> thought this is a good time to mention an idea that I recently had. I'm
> planning to work on column-stored fields soon (I used to call them
> per-document payloads). Then we'll have the ability to store metadata
> for each document very efficiently in the index.
>
> This new data structure could be used to store a unique ID for each doc
> in the index. The IndexReader would then get an API that provides a
> mapping from the dynamic doc ids to the new unique ones. We would also
> have to store a reverse mapping (UID -> ID) in the index - we could use
> a VInt list + skip list for that.
>
> Then we should be able to make IndexReaders "read-only" (LUCENE-1030)
> and provide a new API in IndexWriter "delete by UID". This would allow
> us to "delete by query" as well. The disadvantage is that the index would
> become bigger, but that should still be ok: 8 bytes per doc for the
> ID->UID map (assuming we took long for the UID, which I'd suggest). The
> UID->ID map might even be a bit smaller initially (using VInts and
> VLongs), but might become bigger when the index has lots of deleted
> docs, because then the delta encoding wouldn't be as efficient anymore
> for the UIDs.
>
> If RAM permits, the maps could also be cached in memory (optional,
> configurable). The FieldCache overhaul (LUCENE-831) with column fields
> as source can help here.
>
> After all this is implemented (column fields, UIDs, "read-only"
> IndexReaders, FieldCache overhaul) I'd like to make the column fields
> (and norms) updateable via IndexWriter.
>
> OK, lots of food for thought.
>
> -Michael

Re: Unique doc ids

Posted by Nadav Har'El <ny...@math.technion.ac.il>.
Hi Michael,

On Tue, Jan 22, 2008, Michael Busch wrote about "Unique doc ids":
> the question of how to delete with IndexWriter using doc ids is
>...
> mapping from the dynamic doc ids to the new unique ones. We would also
> have to store a reverse mapping (UID -> ID) in the index - we could use
> a VInt list + skip list for that.
> Then we should be able to make IndexReaders "read-only" (LUCENE-1030)
> and provide a new API in IndexWriter "delete by UID".

It sounds to me that this list would be split according to segments, right?
In that case, whoever wants to read this list will need to behave like an
IndexReader (which opens all segments), not an IndexWriter (which writes to
only one segment). So it still makes some sort of twisted sense to have
"delete by UID" in the indexReader (like delete document was originally).

In any case, I'm afraid I don't understand how your proposal to add special
"UIDs" differs from the existing situation, where you can put your UIDs in
a certain field (e.g., call that field "UID" if you want) and then you can
use IndexWriter.deleteDocuments(term) to delete documents (in your case,
just one) with this term. How is your new suggestion better, or more efficient?
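
(The existing path, for comparison:)

   // What already works today: index the UID as an untokenized field,
   // then delete through the writer by term - no docIDs involved.
   doc.add(new Field("uid", "42", Field.Store.NO, Field.Index.UN_TOKENIZED));
   writer.addDocument(doc);
   // ... later ...
   writer.deleteDocuments(new Term("uid", "42"));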

> This would allow
> to "delete by query" as well.

Again, I don't understand how this makes a difference. In the existing Lucene,
you can also theoretically run a query, get a list of docids, and then delete
them all. I said "theoretically" because unfortunately, the current
IndexWriter interface doesn't support the necessary calls (either a
deleteDocuments(Query) or a deleteDocuments(int docid) call), but I don't
see why this can't be fixed without adding new concepts (like UID) to the
index. Or maybe I'm missing something?


-- 
Nadav Har'El                        |   Wednesday, Jan 23 2008, 17 Shevat 5768
IBM Haifa Research Lab              |-----------------------------------------
                                    |War doesn't determine who's right but
http://nadav.harel.org.il           |who's left.
