Posted to dev@jackrabbit.apache.org by Ard Schrijvers <a....@hippo.nl> on 2008/01/22 14:13:03 UTC

FW: Unique doc ids

From the lucene dev list:

Jukka recently pointed me to per-document payloads, and I just happened to
read the mail below on the lucene dev list. It might be an interesting
development for us to follow. I think the hierarchy resolver might benefit
from unique doc ids (we currently lose the computed, cached hierarchy after
index merging), and there are probably quite a few more interesting uses.

-Ard

> 
> Hi Team,
> 
> the question of how to delete with IndexWriter using doc ids 
> is currently being discussed on java-user 
> (http://www.gossamer-threads.com/lists/lucene/java-user/57228),
> so I thought this is a good time to mention an idea that I 
> recently had. I'm planning to work on column-stored fields 
> soon (I used to call them per-document payloads). Then we'll 
> have the ability to store metadata for each document very 
> efficiently in the index.
> 
> This new data structure could be used to store a unique ID 
> for each doc in the index. The IndexReader would then get an 
> API that provides a mapping from the dynamic doc ids to the 
> new unique ones. We would also have to store a reverse 
> mapping (UID -> ID) in the index - we could use a VInt list + 
> skip list for that.
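
[Ard: to make the two mappings concrete, here is a minimal in-memory sketch.
It is not Lucene API and not the proposed on-disk format; the class
UidMapSketch and its methods are hypothetical. It assumes UIDs are handed out
in increasing order and that merges keep documents in their original order,
so the ID->UID array stays sorted and UID->ID can be a binary search.]

```java
import java.util.Arrays;

// Hypothetical illustration only; none of these names exist in Lucene.
public class UidMapSketch {
    // Index = current (dynamic) doc id, value = stable UID.
    // Assumed to be strictly ascending (see the assumptions above).
    private final long[] uidByDocId;

    public UidMapSketch(long[] uidByDocId) {
        this.uidByDocId = uidByDocId.clone();
    }

    // ID -> UID: a plain array lookup.
    public long uid(int docId) {
        return uidByDocId[docId];
    }

    // UID -> ID: binary search over the ascending UIDs; -1 if the UID is gone.
    public int docId(long uid) {
        int pos = Arrays.binarySearch(uidByDocId, uid);
        return pos >= 0 ? pos : -1;
    }

    // Simulates a merge that drops deleted docs: doc ids are renumbered,
    // UIDs of the surviving docs stay the same.
    public UidMapSketch afterMerge(boolean[] deleted) {
        int survivors = 0;
        for (boolean d : deleted) {
            if (!d) survivors++;
        }
        long[] merged = new long[survivors];
        int next = 0;
        for (int docId = 0; docId < uidByDocId.length; docId++) {
            if (!deleted[docId]) {
                merged[next++] = uidByDocId[docId];
            }
        }
        return new UidMapSketch(merged);
    }

    public static void main(String[] args) {
        UidMapSketch before = new UidMapSketch(new long[] {10, 11, 12, 13});
        UidMapSketch after = before.afterMerge(new boolean[] {false, true, false, false});
        System.out.println("UID 12 before merge: doc " + before.docId(12));
        System.out.println("UID 12 after merge:  doc " + after.docId(12));
    }
}
```

[Ard: the doc id of UID 12 changes across the merge while the UID itself does
not, which is why a cache keyed by UID, such as our computed hierarchy, could
survive index merging.]
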
> 
> Then we should be able to make IndexReaders "read-only" 
> (LUCENE-1030) and provide a new API in IndexWriter, "delete by 
> UID". This would allow "delete by query" as well. The 
> disadvantage is that the index would become bigger, but that 
> should still be OK: 8 bytes per doc for the ID->UID map 
> (assuming we used long for the UID, which I'd suggest). The 
> UID->ID map might even be a bit smaller initially (using VInts 
> and VLongs), but might become bigger when the index has lots of 
> deleted docs, because then the delta encoding wouldn't be as 
> efficient anymore for the UIDs.
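
[Ard: a rough sketch of the size argument above. It only models the UID side
of the list as delta-encoded variable-length longs (7 data bits per byte, high
bit set when more bytes follow). The class UidDeltaSketch is hypothetical and
the layout is an assumption, not the format Michael is proposing; the skip
list is left out.]

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration only; not a Lucene class or file format.
public class UidDeltaSketch {
    // Writes a non-negative long using 7 data bits per byte;
    // the high bit of a byte is set when another byte follows.
    static void writeVLong(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7FL) | 0x80L));
            value >>>= 7;
        }
        out.write((int) value);
    }

    // Encodes an ascending UID list as the gaps between neighbours.
    public static byte[] encode(long[] ascendingUids) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long previous = 0;
        for (long uid : ascendingUids) {
            writeVLong(out, uid - previous);
            previous = uid;
        }
        return out.toByteArray();
    }

    // Reverses encode(); count is the number of UIDs that were written.
    public static long[] decode(byte[] bytes, int count) {
        long[] uids = new long[count];
        int pos = 0;
        long previous = 0;
        for (int i = 0; i < count; i++) {
            long delta = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[pos++];
                delta |= (long) (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            previous += delta;
            uids[i] = previous;
        }
        return uids;
    }

    public static void main(String[] args) {
        long[] dense = new long[1000];   // no deletions: UIDs 1, 2, ..., 1000
        long[] sparse = new long[1000];  // many deletions: UIDs 1000, 2000, ...
        for (int i = 0; i < 1000; i++) {
            dense[i] = i + 1;
            sparse[i] = (i + 1) * 1000L;
        }
        System.out.println("fixed 8-byte longs: " + (8 * dense.length) + " bytes");
        System.out.println("dense, delta VLong: " + encode(dense).length + " bytes");
        System.out.println("sparse, delta VLong: " + encode(sparse).length + " bytes");
    }
}
```

[Ard: for 1000 docs this gives 8000 bytes for fixed longs, 1000 bytes when the
UIDs are consecutive, and 2000 bytes when every surviving UID is 1000 apart,
so the gaps left by deleted docs are what make the encoded list grow.]
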
> 
> If RAM permits, the maps could also be cached in memory 
> (optional, configurable). The FieldCache overhaul 
> (LUCENE-831) with column fields as source can help here.
> 
> After all this is implemented (column fields, UIDs, "read-only"
> IndexReaders, FieldCache overhaul) I'd like to make the 
> column fields (and norms) updateable via IndexWriter.
> 
> OK, lots of food for thought.
> 
> -Michael
> 