
Posted to dev@lucene.apache.org by John Wang <jo...@gmail.com> on 2008/09/10 23:08:40 UTC

docid set compression and boolean docid set operations

Hi guys:

     We have built this on top of the lucene 1.4 API refactoring for docid
sets and DocIdSetIterator.

     We've implemented the p4Delta compression algorithm presented at
www2008: http://www2008.org/papers/fp618.html

     We've been using this in production here at LinkedIn and would love to
contribute it into lucene.

     We have open-sourced it at:
http://code.google.com/p/lucene-ext/wiki/Kamikaze

     Please let us know if this is something you would like to proceed with
and, if so, what steps we should take.

Thanks

-John

Re: docid set compression and boolean docid set operations

Posted by Paul Elschot <pa...@xs4all.nl>.

On Monday 15 September 2008 18:58:50, John Wang wrote:
> Hi Paul and Eks:
>      Thanks for your responses.
>
>      Kamikaze is split into 2 parts: compression (p4delta) and set
> boolean logic docIdSet iterators (OR, AND, NOT).
>
>      I do see the patch for the DisjunctionDocsetIterator, and I agree
> that there is no need to replicate the effort.

A bit of duplication is not bad, especially when the new code is
faster than the existing code.

> What about AND and 
> NOT?

In the patch at LUCENE-1345, AND with some DocIdSetIterators
and at least one Scorer is built into ConjunctionScorer.
NOT is done by ReqExclScorer; in the patch, an excluding
Scorer is changed to an excluding DocIdSetIterator.

As DocIdSetIterator is the superclass of Scorer, it's possible
that some of the ...Scorer functionality in the patch could
be factored out into a ...DISI (DocIdSetIterator) superclass,
but that is still open.
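
As a rough illustration of AND and NOT over docid iterators (independent
of the LUCENE-1345 patch itself), the sketch below intersects two sorted
docid streams and then excludes a third. All names here (SimpleDisi,
ArrayDisi, AndDisi, AndNotDisi, collect) are hypothetical, and the
single-method advance contract is a simplifying assumption, not the
Lucene DocIdSetIterator API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; SimpleDisi is a simplified stand-in, not Lucene's DocIdSetIterator. */
final class BooleanDisiSketch {
    static final int NO_MORE = Integer.MAX_VALUE;

    interface SimpleDisi {
        /** Returns the first docid >= target, or NO_MORE. Targets must not decrease. */
        int advance(int target);
    }

    /** Iterates over a sorted, duplicate-free array of docids. */
    static final class ArrayDisi implements SimpleDisi {
        private final int[] docs;
        private int pos = 0;
        ArrayDisi(int... docs) { this.docs = docs; }
        @Override public int advance(int target) {
            while (pos < docs.length && docs[pos] < target) pos++;
            return pos < docs.length ? docs[pos] : NO_MORE;
        }
    }

    /** AND: leapfrog both sides until they agree on a docid. */
    static final class AndDisi implements SimpleDisi {
        private final SimpleDisi a, b;
        AndDisi(SimpleDisi a, SimpleDisi b) { this.a = a; this.b = b; }
        @Override public int advance(int target) {
            int da = a.advance(target);
            int db = b.advance(target);
            while (da != db) {
                if (da < db) da = a.advance(db);
                else db = b.advance(da);
            }
            return da; // NO_MORE once either side is exhausted
        }
    }

    /** NOT: skip required docids that the excluded iterator also contains. */
    static final class AndNotDisi implements SimpleDisi {
        private final SimpleDisi required, excluded;
        AndNotDisi(SimpleDisi required, SimpleDisi excluded) {
            this.required = required;
            this.excluded = excluded;
        }
        @Override public int advance(int target) {
            int doc = required.advance(target);
            while (doc != NO_MORE && excluded.advance(doc) == doc) {
                doc = required.advance(doc + 1);
            }
            return doc;
        }
    }

    /** Drains an iterator into a list, for demonstration only. */
    static List<Integer> collect(SimpleDisi it) {
        List<Integer> out = new ArrayList<>();
        for (int d = it.advance(0); d != NO_MORE; d = it.advance(d + 1)) out.add(d);
        return out;
    }
}
```

For example, wrapping AndDisi over {1,3,5,7,9} and {3,4,5,9,10} in an
AndNotDisi that excludes {5} yields the docids 3 and 9.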

This focus on Scorers is because these are currently used for
BooleanQueries.

Btw, there is also some in-place boolean logic in
OpenBitSetDISI in the o.a.l.util package in the trunk.

> Also, do you know when these patches will be applied? 

As soon as there is something useful in there for the current
code base, I'd hope.

>     Are you guys interested in helping out on kamikaze?

I am.

Regards,
Paul Elschot


> Thanks
>
> -John
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: docid set compression and boolean docid set operations

Posted by John Wang <jo...@gmail.com>.

Hi Paul and Eks:
     Thanks for your responses.

     Kamikaze is split into 2 parts: compression (p4delta) and set boolean
logic docIdSet iterators (OR, AND, NOT).

     I do see the patch for the DisjunctionDocsetIterator, and I agree that
there is no need to replicate the effort. What about AND and NOT? Also, do
you know when these patches will be applied?

    Are you guys interested in helping out on kamikaze?

Thanks

-John

On Sat, Sep 13, 2008 at 1:45 AM, Paul Elschot <pa...@xs4all.nl> wrote:

>
> On Saturday 13 September 2008 09:21:21, Anmol Bhasin wrote:
> > Hi,
> >
> >  Michael :
> >
> > True, the table is a placeholder right now. I will run my performance
> > tests and update the table in the next day or two.
> >
> > Paul :
> >
> >  Thanks for skimming over the code. As John mentioned in his email,
> > we currently use Kamikaze for in-memory caching of document hits.
> > The Kamikaze project aims to provide a docset implementation
> > using either an integer array, an OpenBitSet, or a P4Delta data
> > structure using the OpenBitSet (lucene), depending on the size and
> > range of the docset. There is a utility function exposed in
> > com.kamikaze.docidset.utils.DocSetFactory.java which can decide on
> > the fly, given the parameters, which underlying data structure
> > should be employed; however, it is not tested thoroughly.
>
> Such a factory is precisely what is left open here:
> https://issues.apache.org/jira/browse/LUCENE-1296
> Basically, we need a subclass of CachingWrapperFilter to implement
> such a factory.
>
> >
> > Moreover, we have implementations of Logic operators on the
> > underlying docsets which act like filters in Lucene. Again, these can
> > be employed to perform complex logic ops on the underlying docsets
> > which in turn could themselves be composite docsets generated using
> > AND|OR|NOT operations. The implementations do not materialize the
> > interim or final structures but simply expose an iterator to walk the
> > docsets.
>
> Further to what Eks said, the patch at LUCENE-1345 extends
> existing lucene scorers to walk docsets. It does this by allowing
> a mix of DocIdSetIterators (for Filters) and Scorers (for Queries).
> The class structure for this is still in its infancy; I'm trying to
> figure out how much inheritance to use to implement the
> various Scorers as subclasses of DocIdSetIterators.
>
> > It would be wonderful if the indexing structure could be augmented
> > using Kamikaze. I can start proactively modifying/improving the
> > implementation/packaging to ease the process.
>
> Be sure to take small steps. The p4delta code from kamikaze that I
> have seen so far still needs some performance improvements.
> For example, the main decompression loop contains an if statement
> for the exceptions, but the whole point of p4delta decompression is
> to avoid that. Nevertheless, this implementation has the advantage
> of being relatively easy to understand, so it could be very useful for
> testing.
> I don't know whether I have seen all of the code, though.
>
> > As for storing the term frequencies and positions with
> > this datastructure, let me revisit the literature to see how best, if
> > at all we can assimilate them.
>
> This fits nicely with the recent flexible indexing efforts.
> Most of the reported performance improvements come from the
> positions, so we might try and start there. Alternatively, to get going,
> the p4delta data structure might be initially used to support boolean
> docid set operations.
>
> Regards,
> Paul Elschot
>
>

Re: docid set compression and boolean docid set operations

Posted by Paul Elschot <pa...@xs4all.nl>.

On Saturday 13 September 2008 09:21:21, Anmol Bhasin wrote:
> Hi,
>
>  Michael :
>
> True, the table is a placeholder right now. I will run my performance
> tests and update the table in the next day or two.
>
> Paul :
>
>  Thanks for skimming over the code. As John mentioned in his email,
> we currently use Kamikaze for in-memory caching of document hits.
> The Kamikaze project aims to provide a docset implementation
> using either an integer array, an OpenBitSet, or a P4Delta data
> structure using the OpenBitSet (lucene), depending on the size and
> range of the docset. There is a utility function exposed in
> com.kamikaze.docidset.utils.DocSetFactory.java which can decide on
> the fly, given the parameters, which underlying data structure
> should be employed; however, it is not tested thoroughly.

Such a factory is precisely what is left open here:
https://issues.apache.org/jira/browse/LUCENE-1296
Basically, we need a subclass of CachingWrapperFilter to implement
such a factory.

>
> Moreover, we have implementations of Logic operators on the
> underlying docsets which act like filters in Lucene. Again, these can
> be employed to perform complex logic ops on the underlying docsets
> which in turn could themselves be composite docsets generated using
> AND|OR|NOT operations. The implementations do not materialize the
> interim or final structures but simply expose an iterator to walk the
> docsets.

Further to what Eks said, the patch at LUCENE-1345 extends
existing lucene scorers to walk docsets. It does this by allowing
a mix of DocIdSetIterators (for Filters) and Scorers (for Queries).
The class structure for this is still in its infancy; I'm trying to
figure out how much inheritance to use to implement the
various Scorers as subclasses of DocIdSetIterators.

> It would be wonderful if the indexing structure could be augmented
> using Kamikaze. I can start proactively modifying/improving the
> implementation/packaging to ease the process.

Be sure to take small steps. The p4delta code from kamikaze that I
have seen so far still needs some performance improvements.
For example, the main decompression loop contains an if statement
for the exceptions, but the whole point of p4delta decompression is
to avoid that. Nevertheless, this implementation has the advantage
of being relatively easy to understand, so it could be very useful for
testing.
I don't know whether I have seen all of the code, though.
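
To make the point about the exception test concrete, here is a minimal
sketch of a two-pass decode: the first loop unpacks every fixed-width
slot the same way, and a second, short loop overwrites the exception
slots afterwards. The class name, the layout (64 / bits values per long)
and the explicit exception position/value arrays are assumptions made
for illustration; this is neither Kamikaze's format nor the exact scheme
from the paper.

```java
/** Hypothetical sketch of a two-pass "patched" decode; not Kamikaze's actual format. */
final class PatchedDecodeSketch {

    /** Packs the low `bits` bits of each value, 64 / bits values per long (1 <= bits <= 31). */
    static long[] pack(int[] values, int bits) {
        int perWord = 64 / bits;
        long mask = (1L << bits) - 1;
        long[] packed = new long[(values.length + perWord - 1) / perWord];
        for (int i = 0; i < values.length; i++) {
            packed[i / perWord] |= (values[i] & mask) << ((i % perWord) * bits);
        }
        return packed;
    }

    /**
     * Decodes `count` values. excPos/excVal hold the positions and full values
     * of the exceptions, i.e. the values that did not fit in `bits` bits.
     */
    static int[] decode(long[] packed, int bits, int count, int[] excPos, int[] excVal) {
        int perWord = 64 / bits;
        long mask = (1L << bits) - 1;
        int[] out = new int[count];
        // Pass 1: unpack every slot identically; no per-value exception test.
        for (int i = 0; i < count; i++) {
            out[i] = (int) ((packed[i / perWord] >>> ((i % perWord) * bits)) & mask);
        }
        // Pass 2: overwrite the few exception slots with their real values.
        for (int j = 0; j < excPos.length; j++) {
            out[excPos[j]] = excVal[j];
        }
        return out;
    }
}
```

With 3-bit slots, the values {3, 1, 200, 2, 7, 5000, 0} have two
exceptions (200 and 5000); passing their positions and values to decode
restores the original array.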

> As for storing the term frequencies and positions with 
> this datastructure, let me revisit the literature to see how best, if
> at all we can assimilate them.

This fits nicely with the recent flexible indexing efforts.
Most of the reported performance improvements come from the
positions, so we might try and start there. Alternatively, to get going,
the p4delta data structure might be initially used to support boolean
docid set operations.

Regards,
Paul Elschot


> Thanks for the interest in Kamikaze. I will keep you posted once
> I have updated performance numbers on the wiki.
>
> Thanks,
> Anmol

Re: docid set compression and boolean docid set operations

Posted by Anmol Bhasin <an...@gmail.com>.

Hi,

 Michael :

True, the table is a placeholder right now. I will run my performance
tests and update the table in the next day or two.

Paul :

 Thanks for skimming over the code. As John mentioned in his email, we
currently use Kamikaze for in-memory caching of document hits. The
Kamikaze project aims to provide a docset implementation using
either an integer array, an OpenBitSet, or a P4Delta data structure using
the OpenBitSet (lucene), depending on the size and range of the
docset. There is a utility function exposed in
com.kamikaze.docidset.utils.DocSetFactory.java which can decide on
the fly, given the parameters, which underlying data structure
should be employed; however, it is not tested thoroughly.
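
A decision of this kind could look roughly like the sketch below. The
class name, the Kind enum, and both thresholds are illustrative guesses
only; they are not taken from DocSetFactory and have not been tuned.

```java
/** Hypothetical chooser; thresholds are illustrative guesses, not DocSetFactory's logic. */
final class DocSetChooserSketch {
    enum Kind { INT_ARRAY, BIT_SET, P4DELTA }

    /** Assumed cutoff below which a plain int[] is considered cheap enough. */
    static final int SMALL_SET = 1024;

    static Kind choose(int count, int minDoc, int maxDoc) {
        long range = (long) maxDoc - minDoc + 1;
        if (count <= SMALL_SET) {
            return Kind.INT_ARRAY;
        }
        // A bit set costs about range / 8 bytes; prefer it when that is
        // at most roughly one byte per docid (an assumed break-even point).
        if (range <= 8L * count) {
            return Kind.BIT_SET;
        }
        // Large and sparse: fall back to the compressed representation.
        return Kind.P4DELTA;
    }
}
```

Under these assumed thresholds, 100 docids map to an int array, 500,000
docids within a range of one million map to a bit set, and 10,000 docids
spread over ten million map to the compressed structure.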

Moreover, we have implementations of Logic operators on the underlying
docsets which act like filters in Lucene. Again, these can be employed
to perform complex logic ops on the underlying docsets which in turn
could themselves be composite docsets generated using AND|OR|NOT
operations. The implementations do not materialize the interim or
final structures but simply expose an iterator to walk the docsets.
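
As a small illustration of composing docsets without materializing
them, the sketch below merges two sorted docid iterators lazily, and the
result can itself be fed into another OR. The names (LazyOrSketch,
DocIterator, of, or, drain) are hypothetical and do not reflect the
Kamikaze classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a lazy OR over sorted docid iterators; not the Kamikaze API. */
final class LazyOrSketch {
    static final int NO_MORE = Integer.MAX_VALUE;

    interface DocIterator {
        /** Returns the next docid in ascending order, or NO_MORE when exhausted. */
        int next();
    }

    /** Wraps a sorted, duplicate-free array of docids. */
    static DocIterator of(int... sortedDocs) {
        return new DocIterator() {
            private int pos = 0;
            @Override public int next() {
                return pos < sortedDocs.length ? sortedDocs[pos++] : NO_MORE;
            }
        };
    }

    /** Lazy union: only the current head of each input is held in memory. */
    static DocIterator or(DocIterator a, DocIterator b) {
        return new DocIterator() {
            private int da = a.next();
            private int db = b.next();
            @Override public int next() {
                int doc = Math.min(da, db);
                if (doc == NO_MORE) return NO_MORE;
                if (da == doc) da = a.next(); // advance every input that produced doc
                if (db == doc) db = b.next();
                return doc;
            }
        };
    }

    /** Drains an iterator into a list, for demonstration only. */
    static List<Integer> drain(DocIterator it) {
        List<Integer> out = new ArrayList<>();
        for (int d = it.next(); d != NO_MORE; d = it.next()) out.add(d);
        return out;
    }
}
```

For instance, or(or(of(1, 4), of(2, 4, 8)), of(3)) walks the docids
1, 2, 3, 4, 8 without building an intermediate set.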

It would be wonderful if the indexing structure could be augmented
using Kamikaze. I can start proactively modifying/improving the
implementation/packaging to ease the process. As for storing the term
frequencies and positions with this data structure, let me revisit the
literature to see how best, if at all, we can assimilate them.

Thanks for the interest in Kamikaze. I will keep you posted once I
have updated performance numbers on the wiki.

Thanks,
Anmol


On Thu, Sep 11, 2008 at 5:23 PM, John Wang <jo...@gmail.com> wrote:
> Hi guys:
>
>     I will let the author, Anmol Bhasin, respond with details.
>
>     In our use case, we are not making changes to the index because we do
> not want to diverge from the lucene code base (though it'd be great if we
> could enhance the indexing structure with this). We load the docIdSets into
> memory for caching reasons.
>
> Thanks
>
> -John
>

-- 
Courage doesn't always roar. Sometimes courage is the quiet voice at
the end of the day saying, "I will try again tomorrow"

Anmol Bhasin
SSE Data Platform
www.linkedin.com


Re: docid set compression and boolean docid set operations

Posted by John Wang <jo...@gmail.com>.

Hi guys:

    I will let the author, Anmol Bhasin, respond with details.

    In our use case, we are not making changes to the index because we do
not want to diverge from the lucene code base (though it'd be great if we
could enhance the indexing structure with this). We load the docIdSets into
memory for caching reasons.

Thanks

-John

On Thu, Sep 11, 2008 at 3:28 AM, Paul Elschot <pa...@xs4all.nl> wrote:

> John,
>
> I've taken a first look at the code, and I have a few questions.
>
> Did I understand correctly that it is basically a two way
> conversion between an integer array and an (Open)BitSet
> representing a p4delta data structure?
>
> In that case it would still be necessary to extend the lucene
> index structure to make it understand the p4delta data structure
> at the appropriate places.
> I can help getting the code integrated into lucene, but I've
> never done an index structure extension, so I'd like to have
> some support from this list for that.
> The code would initially need some package restructuring and
> layout changes, and then it could move forward to an index
> structure extension.
>
> Would you have some ideas on how to use the p4delta structure
> to store docIds, term frequencies and term positions?
> The references give some insights there, but it seems that there
> is still quite a bit of work to do to get such "details" right.
> Fortunately, the existing Lucene TermDocs and TermPositions
> appear to be just right for this.
>
> Regards,
> Paul Elschot
>
>

Re: docid set compression and boolean docid set operations

Posted by Paul Elschot <pa...@xs4all.nl>.

John,

I've taken a first look at the code, and I have a few questions.

Did I understand correctly that it is basically a two way
conversion between an integer array and an (Open)BitSet
representing a p4delta data structure?

In that case it would still be necessary to extend the lucene
index structure to make it understand the p4delta data structure
at the appropriate places.
I can help getting the code integrated into lucene, but I've
never done an index structure extension, so I'd like to have
some support from this list for that.
The code would initially need some package restructuring and
layout changes, and then it could move forward to an index
structure extension.

Would you have some ideas on how to use the p4delta structure
to store docIds, term frequencies and term positions?
The references give some insights there, but it seems that there
is still quite a bit of work to do to get such "details" right.
Fortunately, the existing Lucene TermDocs and TermPositions
appear to be just right for this.

Regards,
Paul Elschot

On Wednesday 10 September 2008 23:09:18, John Wang wrote:
> Sorry, I meant lucene 2.4
>
> -John
>

Re: docid set compression and boolean docid set operations

Posted by Michael McCandless <lu...@mikemccandless.com>.

Hi John,

I would love to see this added to Lucene!

Do you have actual performance numbers?  (Looks like the table in that  
link below isn't "real"?).

Mike

John Wang wrote:

> Sorry, I meant lucene 2.4
>
> -John
>
> On Wed, Sep 10, 2008 at 2:08 PM, John Wang <jo...@gmail.com>  
> wrote:
> Hi guys:
>
>      We have built this on top of the lucene 1.4 API refactoring
> for docid sets and DocIdSetIterator.
>
>      We've implemented the p4Delta compression algorithm presented
> at www2008: http://www2008.org/papers/fp618.html
>
>      We've been using this in production here at LinkedIn and would
> love to contribute it into lucene.
>
>      We have open-sourced it at:
> http://code.google.com/p/lucene-ext/wiki/Kamikaze
>
>      Please let us know if this is something you would like to proceed
> with and, if so, what steps we should take.
>
> Thanks
>
> -John
>
>



Re: docid set compression and boolean docid set operations

Posted by John Wang <jo...@gmail.com>.

Sorry, I meant lucene 2.4

-John

On Wed, Sep 10, 2008 at 2:08 PM, John Wang <jo...@gmail.com> wrote:

> Hi guys:
>
>      We have built this on top of the lucene 1.4 API refactoring for docid
> sets and DocIdSetIterator.
>
>      We've implemented the p4Delta compression algorithm presented at
> www2008: http://www2008.org/papers/fp618.html
>
>      We've been using this in production here at LinkedIn and would love to
> contribute it into lucene.
>
>      We have open-sourced it at:
> http://code.google.com/p/lucene-ext/wiki/Kamikaze
>
>      Please let us know if this is something you would like to proceed with
> and, if so, what steps we should take.
>
> Thanks
>
> -John
>
>