You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-user@lucene.apache.org by Carlos Gonzalez-Cadenas <cg...@experienceon.com> on 2012/04/12 12:13:16 UTC

codecs for sorted indexes

Hello,

We're using a sorted index in order to implement early termination
efficiently over an index of hundreds of millions of documents. As of now,
we're using the default codecs coming with Lucene 4, but we believe that
due to the fact that the docids are sorted, we should be able to do much
better in terms of storage and achieve much better performance, especially
decompression performance.

In particular, Robert Muir is commenting on these lines here:

https://issues.apache.org/jira/browse/LUCENE-2482?focusedCommentId=12982411&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12982411

We're aware that the in the bulkpostings branch there are different codecs
being implemented and different experiments being done. We don't know
whether we should implement our own codec (i.e. using some RLE-like
techniques) or we should use one of the codecs implemented there (PFOR,
Simple64, ...).

Can you please give us some advice on this?

Thanks
Carlos

Carlos Gonzalez-Cadenas
CEO, ExperienceOn - New generation search
http://www.experienceon.com

Mobile: +34 652 911 201
Skype: carlosgonzalezcadenas
LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas

Re: codecs for sorted indexes

Posted by Robert Muir <rc...@gmail.com>.
On Thu, Apr 12, 2012 at 6:35 PM, Carlos Gonzalez-Cadenas
<cg...@experienceon.com> wrote:
> Hello Michael,
>
> Yes, we are pre-sorting the documents before adding them to the index. We
> have a score associated to every document (not an IR score but a
> document-related score that reflects its "importance"). Therefore, the
> document with the biggest score will have the lowest docid (we add it first
> to the index). We do this in order to apply early termination effectively.
> With the actual coded, we haven't seen much of a difference in terms of
> space when we have the index sorted vs not sorted.

I wouldn't expect that you will see space savings when you sort this way.

The techniques I was mentioning involve sorting documents by other
factors instead (such as grouping related documents from the same
website together: idea being they probably share many of the same
terms): this hopefully creates smaller document deltas that require
less bits to represent.

-- 
lucidimagination.com

Re: codecs for sorted indexes

Posted by Carlos Gonzalez-Cadenas <cg...@experienceon.com>.
Hello Michael,

Yes, we are pre-sorting the documents before adding them to the index. We
have a score associated to every document (not an IR score but a
document-related score that reflects its "importance"). Therefore, the
document with the biggest score will have the lowest docid (we add it first
to the index). We do this in order to apply early termination effectively.
With the actual coded, we haven't seen much of a difference in terms of
space when we have the index sorted vs not sorted.

So, the question would be: if we force the docids to be sorted, what is the
best way to encode them?. We don't really care if the codec doesn't work
for cases where the documents are not sorted (i.e. if it throws an
exception if documents are not ordered when creating the index). Our idea
here is that it may be possible to trade off generality but achieve very
significant improvements for the specific case.

Would something along the lines of RLE coding work? i.e. if we have to
store docids 1 to 1500, we can represent it as "1::1499" (it would be 2
ints to represent 1500 docids).

Thanks a lot for your help,
Carlos

On Thu, Apr 12, 2012 at 6:19 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> Do you mean you are pre-sorting the documents (by what criteria?)
> yourself, before adding them to the index?



> In which case... you should already be seeing some benefits (smaller
> index size) than had you "randomly" added them (ie the vInts should
> take fewer bytes), I think.  (Probably the savings would be greater
> for better intblock codecs like PForDelta, SimpleX, but I'm not
> sure...).
>
> Or do you mean having a codec re-sort the documents (on flush/merge)?
> I think this should be possible w/ the Codec API... but nobody has
> tried it yet that I know of.
>
> Note that the bulkpostings branch is effectively dead (nobody is
> iterating on it, and we've removed the old bulk API from trunk), but
> there is likely a GSoC project to add a PForDelta codec to trunk:
>
>    https://issues.apache.org/jira/browse/LUCENE-3892
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
>
> On Thu, Apr 12, 2012 at 6:13 AM, Carlos Gonzalez-Cadenas
> <cg...@experienceon.com> wrote:
> > Hello,
> >
> > We're using a sorted index in order to implement early termination
> > efficiently over an index of hundreds of millions of documents. As of
> now,
> > we're using the default codecs coming with Lucene 4, but we believe that
> > due to the fact that the docids are sorted, we should be able to do much
> > better in terms of storage and achieve much better performance,
> especially
> > decompression performance.
> >
> > In particular, Robert Muir is commenting on these lines here:
> >
> >
> https://issues.apache.org/jira/browse/LUCENE-2482?focusedCommentId=12982411&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12982411
> >
> > We're aware that the in the bulkpostings branch there are different
> codecs
> > being implemented and different experiments being done. We don't know
> > whether we should implement our own codec (i.e. using some RLE-like
> > techniques) or we should use one of the codecs implemented there (PFOR,
> > Simple64, ...).
> >
> > Can you please give us some advice on this?
> >
> > Thanks
> > Carlos
> >
> > Carlos Gonzalez-Cadenas
> > CEO, ExperienceOn - New generation search
> > http://www.experienceon.com
> >
> > Mobile: +34 652 911 201
> > Skype: carlosgonzalezcadenas
> > LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
>

Re: codecs for sorted indexes

Posted by Michael McCandless <lu...@mikemccandless.com>.
Do you mean you are pre-sorting the documents (by what criteria?)
yourself, before adding them to the index?

In which case... you should already be seeing some benefits (smaller
index size) than had you "randomly" added them (ie the vInts should
take fewer bytes), I think.  (Probably the savings would be greater
for better intblock codecs like PForDelta, SimpleX, but I'm not
sure...).

Or do you mean having a codec re-sort the documents (on flush/merge)?
I think this should be possible w/ the Codec API... but nobody has
tried it yet that I know of.

Note that the bulkpostings branch is effectively dead (nobody is
iterating on it, and we've removed the old bulk API from trunk), but
there is likely a GSoC project to add a PForDelta codec to trunk:

    https://issues.apache.org/jira/browse/LUCENE-3892

Mike McCandless

http://blog.mikemccandless.com



On Thu, Apr 12, 2012 at 6:13 AM, Carlos Gonzalez-Cadenas
<cg...@experienceon.com> wrote:
> Hello,
>
> We're using a sorted index in order to implement early termination
> efficiently over an index of hundreds of millions of documents. As of now,
> we're using the default codecs coming with Lucene 4, but we believe that
> due to the fact that the docids are sorted, we should be able to do much
> better in terms of storage and achieve much better performance, especially
> decompression performance.
>
> In particular, Robert Muir is commenting on these lines here:
>
> https://issues.apache.org/jira/browse/LUCENE-2482?focusedCommentId=12982411&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12982411
>
> We're aware that the in the bulkpostings branch there are different codecs
> being implemented and different experiments being done. We don't know
> whether we should implement our own codec (i.e. using some RLE-like
> techniques) or we should use one of the codecs implemented there (PFOR,
> Simple64, ...).
>
> Can you please give us some advice on this?
>
> Thanks
> Carlos
>
> Carlos Gonzalez-Cadenas
> CEO, ExperienceOn - New generation search
> http://www.experienceon.com
>
> Mobile: +34 652 911 201
> Skype: carlosgonzalezcadenas
> LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas