Posted to java-user@lucene.apache.org by Igor Perisic <ip...@entopia.com> on 2004/07/02 23:25:59 UTC

Indexing documents through a set of (word, weight) pairs

Hi Lucene experts:

   We are trying to build a simple document index on top of Lucene.

   What we have:
	For each document, there is a list of terms (e.g. word/weight pairs).
	The queries we want to be able to handle are:
		* Given a document, what are its terms?
		* Given some terms, which documents contain them?
	Note that the weights above are used for our own customized scoring.

We want to use Lucene as much as possible rather than reinvent the wheel. What are our options?

We can reuse some of the classes, such as TermInfosWriter/Reader, to store our lexicon, but is there anything else in Lucene we can take advantage of? Are we going in the right direction?



Cheers,

		Igor




---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: Indexing documents through a set of (word, weight) pairs

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Hi Igor,

For the first use case, you are really looking for what's called a
Forward Index.  If my memory serves me well, there was a project that
used Lucene at MIT called Haystack, and its author developed code that
worked with Lucene and created forward indices.  I never actually tried
it.

If that doesn't do it for you, see IndexReader.getTermFreqVectors(int):
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#getTermFreqVectors(int)

For the second use case - well, that is precisely what Lucene does: its
inverted index maps each term to the documents that contain it. :)
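
To make the two use cases concrete, here is a minimal sketch in plain
Java. It does not use any Lucene classes; WeightedIndex and its methods
are hypothetical names invented for illustration, and the scoring (sum
of the weights of the matching query terms) is only an assumption
standing in for whatever customized scoring is actually wanted. It
keeps a forward map (document -> terms) next to an inverted map
(term -> documents):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration class; not part of Lucene.
class WeightedIndex {
    // Forward index: document id -> (term -> weight).
    private final Map<String, Map<String, Double>> forward = new HashMap<>();
    // Inverted index: term -> (document id -> weight).
    private final Map<String, Map<String, Double>> inverted = new HashMap<>();

    // Record one (word, weight) pair for a document in both maps.
    void add(String docId, String term, double weight) {
        forward.computeIfAbsent(docId, k -> new LinkedHashMap<>()).put(term, weight);
        inverted.computeIfAbsent(term, k -> new LinkedHashMap<>()).put(docId, weight);
    }

    // Use case 1: given a document, what are its terms (and weights)?
    Map<String, Double> termsOf(String docId) {
        return forward.getOrDefault(docId, Collections.<String, Double>emptyMap());
    }

    // Use case 2: given some terms, which documents contain them?
    // Assumed scoring: sum of the weights of the query terms found in each document.
    Map<String, Double> search(String... terms) {
        Map<String, Double> scores = new HashMap<>();
        for (String term : terms) {
            Map<String, Double> postings =
                inverted.getOrDefault(term, Collections.<String, Double>emptyMap());
            for (Map.Entry<String, Double> e : postings.entrySet()) {
                scores.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        return scores;
    }
}
```

Calling add("doc1", "lucene", 2.0) and then termsOf("doc1") answers the
first question; search("lucene", "index") answers the second. Keeping
both maps doubles the storage, which is the usual price of a forward
index.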

It sounds like you are going in the right direction.  I suggest reading
a few Lucene articles (links on the Lucene Wiki) to get started.

Otis




