You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Thang Luong Minh <lu...@gmail.com> on 2007/02/01 19:45:43 UTC

Advices on a replacement of Lucene gap encoding scheme?

Dear all

I am happy to send my first email to Lucene community after some time
standing aside, following many interesting discussions.

As part of my school project, I am intending to make some improvements in
Lucene source code, and I need some advices on how significance my
modification work would be. What I am interested so far is the gap encoding
scheme in Lucene which is used in DocumentWriter.writePostings() to record
the gap positions of a term within a document. The writePostings(), in turn,
calls the writeVInt() method to record the gap, which is the byte-aligned
coding scheme.

I'm thinking of  replace the byte-aligned scheme with the "fixed binary"
coding scheme mentioned in the paper "Index compression using Fixed Binary
Codewords" by Vo Ngoc Anh and Alistair Moffat (the abstract can be found
here http://www.cs.mu.oz.au/~vo/abstracts/am04:adc.html<http://www.cs.mu.oz.au/%7Evo/abstracts/am04:adc.html>).
This scheme basically breaks the list of gaps into segments whose gaps (in
one segment) will be coded in a  fixed data width w (bits). The number of
gaps in each segment is recored in a span variable s, and the pair (w,s)
form a selector assigned for that segment. By effectively decompose the
list, reduce the number of selectors into 16 combinations of relative data
size (vs. previous segment) and span, and use greedy algorithm to find
suboptimal solutions, the authors claimed that they could achieve better
compression effectiveness (measured in bits per pointer averaged across the
wholde index), and retrieval time compared to Golomb, interpolative,
byte-aligned, and word aligned code schemes.

What I wonder at this time is that in the case of Lucene, how possible it is
to implement the "fixed binary" scheme that could enhance the performance,
and whether there are other parts which I could also consider replacing the
gap-encoding scheme.

As I've started playing around with Lucene recently, I hope to have your
helps to understand Lucene better  ^_^

PS: for this type of discussion, which mailing list is most appropriate for
my emai?

Best regards,

Luong Minh Thang

Advices on a replacement of Lucene gap encoding scheme?

Posted by Thang Luong Minh <lu...@gmail.com>.
Dear all,

I am happy to send my first email to Lucene community as after subscribing
to the mailing list, I haven't actually joined the community, just standing
aside and following many intersting threads.

As part of my school project, I am intending to make some improvements in
Lucene source code, and I need some advices on how significance my
modification work would be. What I am interested so far is the gap encoding
scheme in Lucene which is used in DocumentWriter.writePostings() to record
the gap positions of a term within a document. The writePostings(), in turn,
calls the writeVInt() method to record the gap, which is the byte-aligned
coding scheme.

I'm thinking of  replace the byte-aligned scheme with the "fixed binary"
coding scheme mentioned in the paper "Index compression using Fixed Binary
Codewords" by Vo Ngoc Anh and Alistair Moffat (the abstract can be found
here http://www.cs.mu.oz.au/~vo/abstracts/am04:adc.html<http://www.cs.mu.oz.au/%7Evo/abstracts/am04:adc.html>).
This scheme basically breaks the list of gaps into segments whose gaps (in
one segment) will be coded in a  fixed data width w (bits). The number of
gaps in each segment is recored in a span variable s, and the pair (w,s)
form a selector assigned for that segment. By effectively decompose the
list, reduce the number of selectors into 16 combinations of relative data
size (vs. previous segment) and span, and use greedy algorithm to find
suboptimal solutions, the authors claimed that they could achieve better
compression effectiveness (measured in bits per pointer averaged across the
wholde index), and retrieval time compared to Golomb, interpolative,
byte-aligned, and word aligned code schemes.

What I wonder at this time is that in the case of Lucene, how possible it is
to implement the "fixed binary" scheme that could enhance the performance,
and whether there are other parts which I could also consider replacing the
gap-encoding scheme.

As I've started playing around with Lucene recently, I hope to have your
helps to understand Lucene better  ^_^

Best regards,

Luong Minh Thang

Re: Advices on a replacement of Lucene gap encoding scheme?

Posted by Thang Luong Minh <lu...@gmail.com>.
Hi,

Here is the link to the actual paper:
http://crpit.com/confpapers/CRPITV27Anh.pdf

Regards,

Thang.

On 2/2/07, Yonik Seeley <yo...@apache.org> wrote:
>
> On 2/1/07, Thang Luong Minh <lu...@gmail.com> wrote:
> > I'm thinking of  replace the byte-aligned scheme with the "fixed binary"
> > coding scheme mentioned in the paper "Index compression using Fixed
> Binary
> > Codewords" by Vo Ngoc Anh and Alistair Moffat (the abstract can be found
> > here http://www.cs.mu.oz.au/~vo/abstracts/am04:adc.html<
> http://www.cs.mu.oz.au/%7Evo/abstracts/am04:adc.html>).
>
> Do you have an example, or a link to the actual paper?
> I can't tell what the actual encoding scheme is from your description.
>
> -Yonik
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

Re: Advices on a replacement of Lucene gap encoding scheme?

Posted by Yonik Seeley <yo...@apache.org>.
On 2/1/07, Thang Luong Minh <lu...@gmail.com> wrote:
> I'm thinking of  replace the byte-aligned scheme with the "fixed binary"
> coding scheme mentioned in the paper "Index compression using Fixed Binary
> Codewords" by Vo Ngoc Anh and Alistair Moffat (the abstract can be found
> here http://www.cs.mu.oz.au/~vo/abstracts/am04:adc.html<http://www.cs.mu.oz.au/%7Evo/abstracts/am04:adc.html>).

Do you have an example, or a link to the actual paper?
I can't tell what the actual encoding scheme is from your description.

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org