Posted to java-user@lucene.apache.org by Olivier Binda <ol...@wanadoo.fr> on 2014/05/12 18:01:22 UTC

remapping docIds in a read only offline built index

In a 1-segment (parallel) read-only index that is built offline once
(and then frozen),
is it possible to remap the docIds as the last step (i.e., to have the
exact same index, except that each docId equals the order in which the
doc was added to the index)?

Say I have the read-only index

docId   : document
1 : bookB
2 : sentenceB
3 : linkA
4 : linkC
5 : sentenceC
6 : sentenceA
7 : bookA
...
300000 : linkD

I would like to have instead the read-only index

docId   : document
1 : bookA
2 : bookB
...
M : linkA
M+1 : linkB
...
N+1 : sentenceA
N+2 : sentenceB
...
300000 : sentenceZZZ

This would allow me to reduce the amount of RAM needed to cache the type
of each document:

-> without remapping, I need at least ceil(log2(types)) * documents bits,
here (3 types) 2 * 300000 bits

-> with remapping, I only need to remember the ints M and N

Also, if I need to cache 1 byte of metadata for each book

-> without remapping, I would need 1 byte * documents
here 300000 bytes

-> with remapping, I would only need 1 byte * books
here M - 1 bytes
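
The saving described above can be sketched in plain Java. This is only an illustration under stated assumptions: the class DocLayout, its methods and the toy boundary values are hypothetical, docIds are 1-based as in the listing above, books occupy 1..M-1, links M..N, and sentences N+1 and up. Lucene's own docIDs start at 0, so the offsets would shift by one there.

```java
/** Hypothetical sketch: derive a document's type from two boundaries instead of a per-document cache. */
final class DocLayout {
    enum Type { BOOK, LINK, SENTENCE }

    private final int m;            // first link docId (books are 1 .. m-1)
    private final int n;            // last link docId (sentences start at n+1)
    private final byte[] bookMeta;  // one byte per book only, not one per document

    DocLayout(int m, int n) {
        if (m < 1 || n < m - 1) throw new IllegalArgumentException("need m >= 1 and n >= m - 1");
        this.m = m;
        this.n = n;
        this.bookMeta = new byte[m - 1];
    }

    /** Two comparisons replace a lookup in a per-document type array. */
    Type typeOf(int docId) {
        if (docId < 1) throw new IllegalArgumentException("docIds are 1-based in this sketch");
        if (docId < m) return Type.BOOK;
        if (docId <= n) return Type.LINK;
        return Type.SENTENCE;
    }

    void setBookMeta(int docId, byte value) { bookMeta[checkBook(docId) - 1] = value; }

    byte bookMeta(int docId) { return bookMeta[checkBook(docId) - 1]; }

    int metaBytes() { return bookMeta.length; }

    private int checkBook(int docId) {
        if (typeOf(docId) != Type.BOOK) throw new IllegalArgumentException(docId + " is not a book");
        return docId;
    }

    public static void main(String[] args) {
        DocLayout layout = new DocLayout(4, 6); // toy values: books 1-3, links 4-6, sentences 7+
        layout.setBookMeta(2, (byte) 42);
        System.out.println(layout.typeOf(2) + " " + layout.typeOf(5) + " " + layout.typeOf(7));
        System.out.println(layout.bookMeta(2) + " " + layout.metaBytes());
    }
}
```

With the toy values, the metadata array holds 3 bytes (M - 1) no matter how many links and sentences follow.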


I tried building such an index with
LogMergePolicy/NoMergePolicy/a larger RAM buffer, but (maybe I did
something wrong)
the docIds were always reshuffled (maybe because my index was big and I
was over a threshold).



Best regards,
Olivier

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: remapping docIds in a read only offline built index

Posted by Olivier Binda <ol...@wanadoo.fr>.

Very nice! That is exactly what I needed. Thank you very much!


On 06/02/2014 09:26 AM, Michael McCandless wrote:
> The index sorting APIs (in lucene/misc) can do this.  E.g. you could
> make a SortingAtomicReader, with your sort criteria, then use
> addIndexes(IR[]) to add it to a new index.  That resulting index would
> have 1 segment and the docIDs would be in your order.

Re: remapping docIds in a read only offline built index

Posted by Michael McCandless <lu...@mikemccandless.com>.

The index sorting APIs (in lucene/misc) can do this.  E.g. you could
make a SortingAtomicReader, with your sort criteria, then use
addIndexes(IR[]) to add it to a new index.  That resulting index would
have 1 segment and the docIDs would be in your order.

Mike McCandless

http://blog.mikemccandless.com
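
To make the effect of such a sort concrete, here is a plain-Java sketch of the old -> new docID permutation that sorting by document type would induce. It is not the lucene/misc API and does not touch an index; the class DocIdRemap, the method oldToNew and the use of type names as the sort key are assumptions made for illustration, and docIDs are 0-based here.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical sketch, not Lucene code: the old -> new docID map induced by a stable sort on type. */
final class DocIdRemap {

    /** types[oldDocId] is the sort key; the result satisfies oldToNew[oldDocId] = newDocId. */
    static int[] oldToNew(String[] types) {
        Integer[] newToOld = new Integer[types.length];
        for (int i = 0; i < newToOld.length; i++) newToOld[i] = i;
        // Arrays.sort on object arrays is stable, so docs of the same type keep their relative order.
        Arrays.sort(newToOld, Comparator.comparing((Integer old) -> types[old]));
        int[] oldToNew = new int[types.length];
        for (int newId = 0; newId < newToOld.length; newId++) oldToNew[newToOld[newId]] = newId;
        return oldToNew;
    }

    public static void main(String[] args) {
        // Same shape as the first listing in the thread: bookB, sentenceB, linkA, linkC, ...
        String[] types = {"book", "sentence", "link", "link", "sentence", "sentence", "book"};
        System.out.println(Arrays.toString(oldToNew(types))); // prints [0, 4, 2, 3, 5, 6, 1]
    }
}
```

After the remap, books occupy the first contiguous block of docIDs, then links, then sentences, which is the layout asked for in the original question.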


Re: remapping docIds in a read only offline built index

Posted by Olivier Binda <ol...@wanadoo.fr>.

Hello, I'm still interested in an answer to the following question:

In a 1-segment read-only index (that is built offline once and then
frozen), is it possible to remap the docIds?



I may have a (working but not optimal) answer to my original problem: I
could use a MultiReader over 3 indexes to get the following composite index

docId   : document
-------------------------
1       : bookA
2       : bookB
...
M       : linkA
M+1     : linkB
...
N+1     : sentenceA
N+2     : sentenceB
...
300000  : sentenceZZZ


This solution should be slower than if I only built 1 index while having
the docId equal to the order in which I added the documents.
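
One way to picture the composite numbering: each sub-index gets a contiguous docId range, and every lookup first has to find the range a docId falls in. The plain-Java sketch below shows that resolution step only. It is not MultiReader itself; the class CompositeDocIds and its methods are hypothetical names, the sub-index sizes are toy values, and docIds are 0-based.

```java
import java.util.Arrays;

/** Hypothetical sketch, not Lucene code: resolving a composite docId to (sub-index, local docId). */
final class CompositeDocIds {
    // starts[i] = first composite docId of sub-index i; the last entry is the total doc count.
    private final int[] starts;

    CompositeDocIds(int... subIndexSizes) {
        starts = new int[subIndexSizes.length + 1];
        for (int i = 0; i < subIndexSizes.length; i++) starts[i + 1] = starts[i] + subIndexSizes[i];
    }

    /** Finds the sub-index holding this composite docId by binary search over the range starts. */
    int subIndexOf(int docId) {
        if (docId < 0 || docId >= starts[starts.length - 1]) {
            throw new IndexOutOfBoundsException("docId " + docId);
        }
        int pos = Arrays.binarySearch(starts, docId);
        if (pos < 0) return -pos - 2;            // not a range start: step back to the containing range
        while (starts[pos + 1] == docId) pos++;  // exact hit: skip empty sub-indexes sharing this start
        return pos;
    }

    int localDocId(int docId) { return docId - starts[subIndexOf(docId)]; }

    public static void main(String[] args) {
        // Toy sizes: 3 books (docIds 0-2), 3 links (3-5), 4 sentences (6-9).
        CompositeDocIds ids = new CompositeDocIds(3, 3, 4);
        System.out.println(ids.subIndexOf(4) + " " + ids.localDocId(4)); // prints 1 1
        System.out.println(ids.subIndexOf(7) + " " + ids.localDocId(7)); // prints 2 1
    }
}
```

This extra search per lookup is presumably part of the overhead a single, correctly ordered index would avoid.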









