Posted to java-user@lucene.apache.org by Mike O'Leary <tm...@uw.edu> on 2012/03/02 00:15:23 UTC

Lucene's use of vectors

In the Javadoc page for the Similarity class, it says,

"Lucene combines Boolean model (BM) of Information Retrieval with Vector Space Model (VSM) of Information Retrieval - documents "approved" by BM are scored by VSM."

Is the Vector Space Model that is referred to here different than the term vectors that can optionally be stored in index fields? It sounds like the vector space model is used by Lucene in all cases in order to determine ranking of returned results, not only when indexing with term vectors is enabled. If you have indexed without term vectors, what does Lucene use to score "approved" documents? And if you have indexed with term vectors, what does that enable you to do that you couldn't do with an index without term vectors?

Is there a kind of search in Lucene in which documents are "approved" by VSM as well as scored by it, or does that even make sense? I understand how similarity works when comparing two documents, but I can't imagine that it would work to search by comparing a term vector built from a set of search terms against each of the term vectors in an index, one at a time. Is there a more efficient way of searching using a term vector of search terms, other than using its terms in a Boolean search, that is?

I am asking because my boss asked me to list all of the ways that Lucene uses vectors in indexing and search, and my answer revealed a lot of gaps in my understanding of it.
Thanks,
Mike

Re: Lucene's use of vectors

Posted by Robert Muir <rc...@gmail.com>.
On Thu, Mar 1, 2012 at 6:15 PM, Mike O'Leary <tm...@uw.edu> wrote:
> In the Javadoc page for the Similarity class, it says,
>
> "Lucene combines Boolean model (BM) of Information Retrieval with Vector Space Model (VSM) of Information Retrieval - documents "approved" by BM are scored by VSM."
>
> Is the Vector Space Model that is referred to here different than the term vectors that can optionally be stored in index fields?

Yes, it refers to http://en.wikipedia.org/wiki/Vector_space_model,
which uses statistics stored in the index. Term vectors are not used
here.
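
To make "statistics stored in the index" concrete, here is a rough sketch
in plain Java, not Lucene code. The class VsmSketch and its methods are
made up for illustration. It assumes a simplified version of the classic
tf-idf weighting (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)),
length norm = 1 / sqrt(number of terms)) and leaves out boosts, coord and
query normalization. The point is only that the postings, document
frequencies and length norms are enough to score; no term vectors are
involved.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Toy inverted index; every name here is hypothetical, not a Lucene class. */
class VsmSketch {
    // term -> (docId -> frequency of the term in that document)
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();
    // docId -> number of terms in the document (used for length normalization)
    private final Map<Integer, Integer> docLengths = new HashMap<>();

    void add(int docId, String text) {
        String[] terms = text.trim().toLowerCase().split("\\s+");
        docLengths.put(docId, terms.length);
        for (String t : terms) {
            postings.computeIfAbsent(t, k -> new HashMap<>()).merge(docId, 1, Integer::sum);
        }
    }

    /** OR query: a document is "approved" if it contains any query term, then scored. */
    Map<Integer, Double> search(String... queryTerms) {
        Map<Integer, Double> scores = new TreeMap<>();
        int numDocs = docLengths.size();
        for (String term : queryTerms) {
            Map<Integer, Integer> docs = postings.get(term);
            if (docs == null) {
                continue; // term is not in the index, so it approves nothing
            }
            // Assumed simplified idf; docs.size() is the document frequency.
            double idf = 1.0 + Math.log((double) numDocs / (docs.size() + 1));
            for (Map.Entry<Integer, Integer> e : docs.entrySet()) {
                double tf = Math.sqrt(e.getValue());
                double norm = 1.0 / Math.sqrt(docLengths.get(e.getKey()));
                scores.merge(e.getKey(), tf * idf * idf * norm, Double::sum);
            }
        }
        return scores;
    }
}
```

Only documents that appear in the postings of some query term ever get a
score, which is the "approved by BM, scored by VSM" split from the
Javadoc.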

Instead, term vectors are really just like storing a separate
individual inverted index for each document. For example, they are
used by MoreLikeThis to retrieve the terms and frequencies from just
that one document.
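
As a rough illustration of that idea, again in plain Java and not the
actual MoreLikeThis implementation: the sketch below keeps a per-document
term -> frequency map (the "term vector") and uses it to pick the
highest-weighted terms of one document, which could then be run as an
ordinary Boolean query. The class TermVectorSketch, the method
interestingTerms and the weighting (freq * idf) are assumptions made for
the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy per-document term vectors; every name here is hypothetical. */
class TermVectorSketch {
    // docId -> (term -> frequency): a tiny "inverted index" for one document
    private final Map<Integer, Map<String, Integer>> termVectors = new HashMap<>();
    // term -> number of documents that contain it
    private final Map<String, Integer> docFreq = new HashMap<>();

    void add(int docId, String text) {
        Map<String, Integer> vector = new HashMap<>();
        for (String t : text.trim().toLowerCase().split("\\s+")) {
            vector.merge(t, 1, Integer::sum);
        }
        termVectors.put(docId, vector);
        for (String t : vector.keySet()) {
            docFreq.merge(t, 1, Integer::sum);
        }
    }

    /** Returns up to maxTerms terms of one document, highest freq * idf first. */
    List<String> interestingTerms(int docId, int maxTerms) {
        Map<String, Integer> vector = termVectors.get(docId);
        if (vector == null) {
            return new ArrayList<>(); // no term vector stored for this document
        }
        int numDocs = termVectors.size();
        List<String> terms = new ArrayList<>(vector.keySet());
        // Sort by descending weight; ties are broken alphabetically.
        terms.sort(Comparator
                .comparingDouble((String t) ->
                        -vector.get(t) * (1.0 + Math.log((double) numDocs / (docFreq.get(t) + 1))))
                .thenComparing(Comparator.naturalOrder()));
        return new ArrayList<>(terms.subList(0, Math.min(maxTerms, terms.size())));
    }
}
```

Without the stored per-document map, getting the terms of one document
back would mean re-analyzing its text or walking the whole index, which
is the practical difference term vectors make.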

-- 
lucidimagination.com