You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by jian chen <ch...@gmail.com> on 2005/10/16 03:36:48 UTC

skipInterval

Hi, All,

I was reading some research papers regarding quick inverted index lookups.
The classical approach to skipping dictates that a skip should be positioned
every sqrt(df) document pointers.

I looked at the the current Lucene implementation. The skipInterval is
hardcoded as follows in TermInfosWriter.java class:
int skipInterval = 16;

Therefore, I have two questions:

1) Would it be a good idea and feasible to use sqrt(df) to be the
skipInterval, rather than hardcode it?

2) When merging segments, for every term, the skip table is buffered first
in the RAMOutputStream and then written to the output stream. If there are
lot of documents for a term, this seems to consume a lot of memory, right?
If instead, we use sqrt(df) to be the skipInterval, the memory consumed will
be a lot less, as it is logarithmic.

Hope some one could shed more light on this. Thanks in advance,

Jian

Re: Fwd: skipInterval

Posted by jian chen <ch...@gmail.com>.

Hi, Paul,

Thanks for your email. I am not sure how the sqrt vs. constant for
skipInterval will pan out for two or multiple required terms. That needs
some experiments I guess.

Cheers,

Jian

On 10/16/05, Paul Elschot <pa...@xs4all.nl> wrote:
>
> Jian,


> ---------- Forwarded message ----------
> > From: jian chen <ch...@gmail.com>
> > Date: Oct 15, 2005 6:36 PM
> > Subject: skipInterval
> > To: Lucene Developers List <lu...@jakarta.apache.org>
> >
> > Hi, All,
> >
> > I was reading some research papers regarding quick inverted index
> lookups.
> > The classical approach to skipping dictates that a skip should be
> positioned
> > every sqrt(df) document pointers.
>
> The typical use of skipping info in Lucene is in ConjunctionScorer, for a
> query with two required terms. There it helps for the case when one
> term occurs much less frequently than another.
> Iirc the sqrt() is optimal for a single lookup in a single level index,
> reducing the complexity from linear to logarithmic.
> Does the sqrt() also apply in the case of searching for two required terms
> and returning all the documents in which they both occur?
>
> Regards,
> Paul Elschot
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>
>

Re: Fwd: skipInterval

Posted by Paul Elschot <pa...@xs4all.nl>.

Jian,

On Sunday 16 October 2005 03:42, jian chen wrote:
> Hi, All,
> 
> I should have sent to this email address rather than the old jakarta email
> address. Sorry if double-posted.
> 
> Jian
> 
> ---------- Forwarded message ----------
> From: jian chen <ch...@gmail.com>
> Date: Oct 15, 2005 6:36 PM
> Subject: skipInterval
> To: Lucene Developers List <lu...@jakarta.apache.org>
> 
> Hi, All,
> 
> I was reading some research papers regarding quick inverted index lookups.
> The classical approach to skipping dictates that a skip should be positioned
> every sqrt(df) document pointers.

The typical use of skipping info in Lucene is in ConjunctionScorer, for a
query with two required terms. There it helps for the case when one
term occurs much less frequently than another.
Iirc the sqrt() is optimal for a single lookup in a single level index,
reducing the complexity from linear to logarithmic.
Does the sqrt() also apply in the case of searching for two required terms
and returning all the documents in which they both occur?

Regards,
Paul Elschot

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Fwd: skipInterval

Posted by jian chen <ch...@gmail.com>.

Hi, All,

I should have sent to this email address rather than the old jakarta email
address. Sorry if double-posted.

Jian

---------- Forwarded message ----------
From: jian chen <ch...@gmail.com>
Date: Oct 15, 2005 6:36 PM
Subject: skipInterval
To: Lucene Developers List <lu...@jakarta.apache.org>

Hi, All,

I was reading some research papers regarding quick inverted index lookups.
The classical approach to skipping dictates that a skip should be positioned
every sqrt(df) document pointers.

I looked at the the current Lucene implementation. The skipInterval is
hardcoded as follows in TermInfosWriter.java class:
int skipInterval = 16;

Therefore, I have two questions:

1) Would it be a good idea and feasible to use sqrt(df) to be the
skipInterval, rather than hardcode it?

2) When merging segments, for every term, the skip table is buffered first
in the RAMOutputStream and then written to the output stream. If there are
lot of documents for a term, this seems to consume a lot of memory, right?
If instead, we use sqrt(df) to be the skipInterval, the memory consumed will
be a lot less, as it is logarithmic.

Hope some one could shed more light on this. Thanks in advance,

Jian