Posted to java-user@lucene.apache.org by Karl Koch <Th...@gmx.net> on 2004/03/19 17:24:19 UTC

Lucene index - information

If I create a standard index, what does Lucene store in it?

What should an index store at minimum? Just a link to the file and
keywords? Word numbers as well? What else?

Does somebody know a paper that discusses this problem of "what to put in
a good universal IR index"?

Cheers,
Karl



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: Lucene index - information

Posted by David Spencer <da...@tropo.com>.

Karl Koch wrote:

> If I create a standard index, what does Lucene store in it?
> 
> What should an index store at minimum? Just a link to the file and
> keywords? Word numbers as well? What else?
> 
> Does somebody know a paper that discusses this problem of "what to put in
> a good universal IR index"?


Well, if you want a textbook, I found "Managing Gigabytes" to have
excellent coverage of the internals and messy details of search and indexes.

http://www.amazon.com/exec/obidos/ASIN/1558605703/tropoA
http://www.cs.mu.oz.au/mg/



> 
> Cheers,
> Karl
> 



Re: Cover density ranking?

Posted by Doug Cutting <cu...@apache.org>.

Boris Goldowsky wrote:
> How difficult would it be to implement something like Cover Density
> ranking for Lucene?  Has anyone tried it?  
> 
> Cover density is described at http://citeseer.ist.psu.edu/558750.html ,
> and is supposed to be particularly good for short queries of the type
> that you get in many web applications.

I just glanced at the paper, so my analysis may be wrong, but I think 
one could implement cover density ranking in Lucene with spans (only in 
CVS, not in 1.3).  I think spans correspond to covers in this paper. 
But you'd need to alter SpanScorer.java to implement the cover scoring 
described in that paper.  And you'd probably need to use a custom 
Similarity implementation, which disables most other scoring (tf=1.0, 
idf=1.0, etc.), but exaggerates coordination.  Finally, you'd need to 
construct span queries.  Or something like that.
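
To make the cover idea concrete, here is a minimal sketch outside Lucene. It assumes one common formulation of cover density scoring: a cover is a shortest interval of word positions containing every query term, and each cover contributes 1 if it spans at most K positions and K/length otherwise. The constant K = 16, the function names, and the sample positions are assumptions for illustration and should be checked against the paper. Ranking by coordination level first is not shown.

```python
K = 16  # assumed cut-off length; covers no longer than K score 1.0


def covers(positions):
    """Return minimal covers as (start, end) position pairs.

    positions maps each query term to its sorted word positions in one document.
    """
    events = sorted((pos, term) for term, plist in positions.items() for pos in plist)
    last = {}        # most recent position seen for each term
    result = []
    prev_left = -1
    for pos, term in events:
        last[term] = pos
        if len(last) == len(positions):      # every query term seen at least once
            left = min(last.values())        # shortest interval ending at pos
            if left > prev_left:             # skip intervals that contain an earlier cover
                result.append((left, pos))
                prev_left = left
    return result


def cover_density_score(positions, k=K):
    """Sum the per-cover contributions for one document."""
    score = 0.0
    for start, end in covers(positions):
        length = end - start + 1
        score += 1.0 if length <= k else k / length
    return score


# Hypothetical term positions for a two-term query in one document.
doc_positions = {"cover": [0, 10], "density": [1, 30]}
doc_covers = covers(doc_positions)
doc_score = cover_density_score(doc_positions)
```

In Lucene terms, each (start, end) pair plays the role that a span would play, and the per-cover contribution is what a modified span scorer would have to compute.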

If someone tries this, please tell us how it works.

Doug


Cover density ranking?

Posted by Boris Goldowsky <bo...@alum.mit.edu>.

Since there have been a few discussions recently of overriding various
aspects of Lucene's ranking formula, I got to wondering how difficult it
might be to implement something quite different from the base tf/idf
ranking system that Lucene has built in.

How difficult would it be to implement something like Cover Density
ranking for Lucene?  Has anyone tried it?  

Cover density is described at http://citeseer.ist.psu.edu/558750.html ,
and is supposed to be particularly good for short queries of the type
that you get in many web applications.

Boris




Re: VSpace Model Index <-> Prob. Model Index - Difference?

Posted by Ype Kingma <yk...@xs4all.nl>.

Karl,

On Friday 19 March 2004 18:24, Karl Koch wrote:
> Hello group,
>
> coming back to the discussion about probabilistic and vector space model
> (which occurred here some time ago), I would like to ask something related.
>
> I only know the index structure Lucene offers. Does an IR system based on
> the probabilistic model (e.g. Okapi) look different from a VS model? If
> yes, why?
>
> I hope this question is not too stupid. I am mainly interested because of
> some theoretical background...
>
> Karl

First off: I don't know about the fine points between probabilistic and VS 
models.

Some time ago I made a quick comparison between
the default scoring method of Lucene and the Okapi model.
Off the top of my head I remember this (it is
not complete):

Similarities:
- both do term weighting by inverse document frequency,
- both normalize for document length, effectively using term density,
- both have a saturation for this term density.
Differences:
- Okapi can also use the document length by itself.
- Lucene has a factor (coord) for the overlap between a query and a document
  (i.e. the number of matching query terms present in a document).
- The term density saturation functions are different, too:
  Lucene uses a square root, Okapi uses an (increasing) reciprocal; however,
  in practice the limit of the reciprocal is far from reached.

When the overlap is ignored, from a practical viewpoint, I would
be surprised if the two methods ordered a given set of
docs much differently for the same query.
I'd expect most differences in the 'middle' due to the differences
in the form (2nd derivative) of the saturation functions.
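
The two saturation shapes can be compared side by side with a small sketch. It isolates only the term-frequency part: Lucene's default square root, and a BM25-style reciprocal form standing in for Okapi. Document length normalization, idf and coord are left out, and k1 = 1.2 is an assumed, commonly used value, not something taken from this thread.

```python
import math


def lucene_tf(freq):
    """Square-root saturation, as in Lucene's default Similarity."""
    return math.sqrt(freq)


def okapi_tf(freq, k1=1.2):
    """BM25-style saturation: increasing in freq, bounded above by k1 + 1.

    k1 is an assumed tuning value; length normalization is omitted here.
    """
    return freq * (k1 + 1) / (freq + k1)


# Tabulate both functions for a few term frequencies.
rows = [(f, lucene_tf(f), okapi_tf(f)) for f in (1, 2, 4, 8, 16)]
for freq, l, o in rows:
    print(f"tf={freq:2d}  sqrt={l:.3f}  okapi={o:.3f}")
```

Both columns increase with tf. The reciprocal form flattens toward its limit k1 + 1, while the square root keeps growing, which is where the differences in curvature show up.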

Coming back to your question:

> I only know the index structure Lucene offers. Does an IR system based on
> the probabilistic model (e.g. Okapi) look different from a VS model? If
> yes, why?

My guess is that, in practice (ie. in the orderings of documents for queries),
the two systems are much more similar than different.

> I hope this question is not too stupid. I am mainly interested because of
> some theoretical background...

Do you intend to do a theoretical comparison of the scoring
functions of Lucene and Okapi? AFAIK this has not been investigated.

Kind regards,
Ype Kingma


VSpace Model Index <-> Prob. Model Index - Difference?

Posted by Karl Koch <Th...@gmx.net>.

Hello group,

coming back to the discussion about probabilistic and vector space model
(which occurred here some time ago), I would like to ask something related.

I only know the index structure Lucene offers. Does an IR system based on
the probabilistic model (e.g. Okapi) look different from a VS model? If yes,
why?

I hope this question is not too stupid. I am mainly interested because of
some theoretical background...

Karl





Re: Lucene index - information

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Uh, there are lots of ways to construct an inverted index.
Citeseer will give you more than you can read on this topic.

As for Lucene, see File Formats section on the site.
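
As a rough illustration of one such construction (a toy sketch, not Lucene's actual file format), the index below maps each term to the documents containing it and to the word positions inside each document. The function name, the whitespace tokenizer, and the sample file names are assumptions made for this example.

```python
from collections import defaultdict


def build_index(docs):
    """Build a positional inverted index.

    docs maps a document id (e.g. a file name) to its text.
    The result maps term -> {doc_id: [word positions]}.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive tokenizer: lowercase and split on whitespace.
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


# Hypothetical two-document collection.
docs = {
    "a.txt": "lucene stores terms",
    "b.txt": "terms and positions of terms",
}
index = build_index(docs)

postings = index["terms"]        # documents containing "terms", with positions
doc_freq = len(postings)         # number of documents containing the term
term_freq_b = len(postings["b.txt"])  # occurrences of the term in b.txt
```

The document id covers the "link to the file", the dictionary keys are the keywords, and the position lists are what phrase and span matching need. Document frequency and term frequency fall out of the same structure.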

Otis

--- Karl Koch <Th...@gmx.net> wrote:
> If I create an standard index, what does Lucene store in this index?
> 
> What should be stored in an index at least? Just a link to the file
> and
> keywords? Or also wordnumbers? What else?
> 
> Does somebody know a paper which discusses this problem of "what to
> put in
> an good universal IR index" ?
> 
> Cheers,
> Karl

