Posted to java-user@lucene.apache.org by Vishwas Jain <vj...@gmail.com> on 2016/03/28 11:51:55 UTC

Compression algorithm for posting lists

Hello,

          We are trying to implement better compression techniques in the
lucene54 codec of Apache Lucene. Currently there is no such compression for
posting lists in the lucene54 codec, but the LZ4 compression technique is
used for stored fields. Does anyone know why there is no compression
technique for posting lists, and what possible compression schemes would be
beneficial if implemented?

Thanks

RE: Compression algorithm for posting lists

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

> Hey Adrien,
>                  We are thinking of implementing XZ compression instead of
> LZ4 for stored fields. Will it serve our purpose of saving disk space
> while trading off speed? We were eager to know why XZ compression
> is not offered as an option.

XZ is not a compression algorithm. XZ is just a container format for compressed streams, which can hold anything, including LZ4.
I think you mean LZMA compression, which is the default inside XZ containers.

Lucene chose LZ4 for its balance of speed and space. But you are free to use any other compression algorithm, such as LZMA, if you implement a custom codec. I don't think it's worth the trouble, though.

Uwe
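
The custom-codec route would start from something like the skeleton below (illustrative only and not from the thread: the codec name is made up here, the LZMA-backed StoredFieldsFormat itself is the hard part and is left to the implementer, and this assumes the Lucene 5.4 APIs with lucene-core on the classpath):

```java
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.StoredFieldsFormat;
import org.apache.lucene.codecs.lucene54.Lucene54Codec;

// Keep everything from Lucene54Codec but swap the stored-fields format.
// To be readable at search time, the codec name must also be registered
// via SPI (META-INF/services/org.apache.lucene.codecs.Codec).
public class LzmaStoredFieldsCodec extends FilterCodec {
    private final StoredFieldsFormat lzmaStoredFields;

    public LzmaStoredFieldsCodec(StoredFieldsFormat lzmaStoredFields) {
        super("LzmaStoredFields", new Lucene54Codec());
        this.lzmaStoredFields = lzmaStoredFields;
    }

    @Override
    public StoredFieldsFormat storedFieldsFormat() {
        return lzmaStoredFields; // everything else delegates to Lucene54Codec
    }
}
```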



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Compression algorithm for posting lists

Posted by Vishwas Jain <vj...@gmail.com>.
Hey Adrien,
                 We are thinking of implementing XZ compression instead of
LZ4 for stored fields. Will it serve our purpose of saving disk space
while trading off speed? We were eager to know why XZ compression
is not offered as an option.

Thanks

Re: Compression algorithm for posting lists

Posted by Adrien Grand <jp...@gmail.com>.
Are posting lists the biggest disk user of your index? Usually it is rather
stored fields or term vectors. You can tell Lucene to compress stored
fields more aggressively by passing BEST_COMPRESSION to the Lucene54Codec
constructor. Also maybe there are some features of the index that you do
not need, that you could disable and save space. For instance, if you do
not run phrase queries, you could disable the indexing of positions, and if
you do not need scoring, you could disable norms. Finally, sparsity is
something that Lucene does not handle well, and at times this can cause
disk requirements to increase significantly.
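
These suggestions map to configuration along these lines (a sketch only, not from the thread, written against the Lucene 5.4 APIs with lucene-core and lucene-analyzers-common assumed on the classpath):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.lucene50.Lucene50StoredFieldsFormat;
import org.apache.lucene.codecs.lucene54.Lucene54Codec;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexWriterConfig;

class SpaceSavingSetup {
    // DEFLATE-based stored-field compression instead of the default LZ4.
    static IndexWriterConfig config() {
        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
        iwc.setCodec(new Lucene54Codec(
                Lucene50StoredFieldsFormat.Mode.BEST_COMPRESSION));
        return iwc;
    }

    // If phrase queries and scoring are not needed, index less per field:
    static FieldType leanFieldType() {
        FieldType type = new FieldType();
        type.setTokenized(true);
        type.setIndexOptions(IndexOptions.DOCS_AND_FREQS); // no positions
        type.setOmitNorms(true);                           // no norms
        type.freeze();
        return type;
    }
}
```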


Re: Compression algorithm for posting lists

Posted by Vishwas Jain <vj...@gmail.com>.
Hi Adrien,
               Thanks for the help. We are actually trying to compress the
posting lists themselves. Our main aim is to reduce, as much as possible, the
disk space occupied by the index. Will compressing only the posting lists
suffice, or do we have to explore more options?

Yours,
Vishwas Jain
13CS10053
Computer Science and Engineering
IIT Kharagpur

Contact - +91 9800168231


Re: Compression algorithm for posting lists

Posted by Adrien Grand <jp...@gmail.com>.
BlockTreeTermsWriter.TermsWriter.finish writes an FST that serves as an
index of the terms dictionary. It will be used at search time when seeking
terms in the terms dictionary.
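
A much-simplified stand-in for that index (the real one is an FST from term-prefix bytes to the file pointer of a block in the on-disk terms dictionary; here a sorted array of each block's first term plays the same role, purely for illustration):

```java
import java.util.Arrays;

// Locate which on-disk block could contain a target term, the same job
// the FST terms index does when seeking in the terms dictionary.
final class ToyTermsIndex {
    private final String[] blockFirstTerms; // first term of each block, sorted

    ToyTermsIndex(String[] blockFirstTerms) {
        this.blockFirstTerms = blockFirstTerms;
    }

    // Index of the block whose range covers `term`
    // (-1 if the term sorts before the first block).
    int blockFor(String term) {
        int i = Arrays.binarySearch(blockFirstTerms, term);
        return i >= 0 ? i : -i - 2; // insertion point minus one = preceding block
    }
}
```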


Re: Compression algorithm for posting lists

Posted by Vishwas Jain <vj...@gmail.com>.
Thanks for the reply and information.
              While going through the code, I have come across some doubts
regarding the implementation of the lucene54 codec when writing the posting
lists using Lucene50PostingsWriter. What exactly does the finish() method in
the TermsWriter class of BlockTreeTermsWriter.java do? I have come to
understand that the posting lists (document IDs, frequencies, etc.) are
mainly written using the writeBlock method in ForUtil.java...

Thanks..



Re: Compression algorithm for posting lists

Posted by Greg Bowyer <gb...@fastmail.co.uk>.
The posting list is compressed using a specialised technique aimed at
pure numbers. Currently the codec uses a variant of Patched Frame of
Reference (PFOR) coding to perform this compression.

A good survey of such techniques can be found in the standard IR books
(https://mitpress.mit.edu/books/information-retrieval,
http://www.amazon.com/Managing-Gigabytes-Compressing-Multimedia-Information/dp/1558605703,
http://nlp.stanford.edu/IR-book/) as well as in this paper:
http://eprints.gla.ac.uk/93572/1/93572.pdf.

Interestingly, there are potentially some wins in finding better integer
codings (and one of my personal projects is aimed at doing exactly
this), but I doubt LZ4 compressing the posting list would help all that
much.
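
The Frame of Reference idea can be shown with a small self-contained sketch (a simplification for illustration, not Lucene's actual ForUtil: the real code packs fixed 128-doc blocks with optimized layouts, and the "patched" variant stores rare large gaps as exceptions so one outlier does not inflate the whole block's bit width):

```java
// Toy Frame-of-Reference coder for an ascending block of docIDs:
// store the gaps (deltas) between consecutive IDs, bit-packed at the
// minimum width needed by the largest gap in the block.
final class ForBlock {

    // Returned layout: [0] = bits per delta, [1] = first docID,
    // [2..] = deltas packed little-endian into 64-bit words.
    static long[] encode(int[] docIds) {
        int n = docIds.length;
        int[] deltas = new int[n];
        int maxDelta = 0;
        for (int i = 1; i < n; i++) {
            deltas[i] = docIds[i] - docIds[i - 1]; // gaps are small numbers
            maxDelta = Math.max(maxDelta, deltas[i]);
        }
        int bits = Math.max(1, 32 - Integer.numberOfLeadingZeros(maxDelta));
        long[] out = new long[2 + (n * bits + 63) / 64];
        out[0] = bits;
        out[1] = docIds[0];
        for (int i = 1; i < n; i++) {
            long bitPos = (long) (i - 1) * bits;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            out[2 + word] |= (long) deltas[i] << shift;
            if (shift + bits > 64) { // delta straddles a word boundary
                out[3 + word] |= (long) deltas[i] >>> (64 - shift);
            }
        }
        return out;
    }

    static int[] decode(long[] packed, int n) {
        int bits = (int) packed[0];
        long mask = (1L << bits) - 1;
        int[] docIds = new int[n];
        docIds[0] = (int) packed[1];
        for (int i = 1; i < n; i++) {
            long bitPos = (long) (i - 1) * bits;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long v = packed[2 + word] >>> shift;
            if (shift + bits > 64) {
                v |= packed[3 + word] << (64 - shift);
            }
            docIds[i] = docIds[i - 1] + (int) (v & mask);
        }
        return docIds;
    }
}
```

For example, docIDs {100, 103, 110, 120} have gaps {3, 7, 10}, so each packs into 4 bits instead of 32 bits per raw ID.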

Hope this helps
