Posted to dev@lucene.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2006/02/21 04:28:56 UTC

TermVector usage

Greets,

KinoSearch 0.05, which for now I'm calling a "loose port" of Lucene,  
was published to CPAN a few weeks ago.  It's nice and fast, but  
missing some features, most notably multiple segment support and  
incremental indexing.  Before I get to that though, I'm adding  
excerpting and highlighting.

The version of KinoSearch which preceded the Lucene-based rewrite  
also had a highlighter which depended on what were effectively  
TermVectors with stored offsets. However, unlike Lucene, these were  
stored along with the stored fields.  As I've been preparing to port  
all the support apparatus for TermVectors, I've been wondering  
whether I shouldn't go back to that.  It sure would be less work to  
code up.  Theoretically there ought to be less disk activity, too.
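
To make the idea concrete, here is a minimal sketch in Python of a term vector with offsets stored in the same record as the field text, so that one record fetch is enough to drive a highlighter.  It is not KinoSearch or Lucene code; the record layout and the names build_term_vector, make_stored_doc and highlight are assumptions made up for illustration.

```python
import re

def build_term_vector(text):
    """Map each lowercased term to a list of (start, end) character offsets."""
    vector = {}
    for match in re.finditer(r"\w+", text):
        vector.setdefault(match.group().lower(), []).append(match.span())
    return vector

def make_stored_doc(text):
    # Hypothetical layout: the term vector rides along with the stored field,
    # so a single record fetch supplies everything the highlighter needs.
    return {"content": text, "term_vector": build_term_vector(text)}

def highlight(doc, query_terms, pre="<b>", post="</b>"):
    """Wrap every occurrence of the query terms, using only stored offsets."""
    # A set removes duplicate spans if a query term is repeated.
    spans = sorted({
        span
        for term in query_terms
        for span in doc["term_vector"].get(term.lower(), [])
    })
    text = doc["content"]
    pieces, cursor = [], 0
    for start, end in spans:
        pieces.append(text[cursor:start])
        pieces.append(pre + text[start:end] + post)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)

doc = make_stored_doc("The blessed damozel leaned out")
print(highlight(doc, ["damozel"]))
```

The highlighter never re-tokenizes the text; it only slices the stored content at the recorded offsets.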

From following the Lucene lists off and on, I've gotten the  
impression that lots of people use TermVectors to feed the  
highlighter, but I haven't seen many applications for them besides  
that.  LSI-type ideas percolate every once in a while.  Besides  
highlighting, how many people are using TermVectors and how are they  
using them?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: TermVector usage

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Feb 20, 2006, at 9:47 PM, Otis Gospodnetic wrote:

> As far as I can tell, most people use TermVectors for "more like  
> this" queries (see MoreLikeThis class in contrib/ somewhere)

On Feb 21, 2006, at 5:39 AM, Erik Hatcher wrote:

> I use term vectors for "more like this" queries, such as the links  
> you'll see here:
>
> 	<http://www.rossettiarchive.org/rose/?query=%2B%28%2Bblessed+%2Bdamozel%29+%2B%28archivetype%3Arad%29>

Thanks, Otis and Erik.  (MoreLikeThis is under contrib/similarity.)   
Looking at the way MoreLikeThis is implemented, my impression is that  
it wouldn't hurt and might help a smidge to store the term vector  
with the stored document.

What I don't yet see is a benefit to having all TermVectors reside  
side-by-side in the same file.  A full vector-space search which  
compares complete document vectors and thus needs to scan through all  
TermVectors for each query is the only application I've thought of so  
far.  Of course such a beast is impractical for a search engine of  
any reasonable size, so you need some method of data reduction.   
LSI's decomposition is one way of hacking at that problem, but you  
don't do that on the fly at search-time. :)  Another is the heuristic  
process applied by the MoreLikeThis class, but MoreLikeThis only  
needs a single document's TermVectors.
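
As a rough sketch of that kind of data reduction: take one document's term frequencies, weigh them against corpus-wide document frequencies, and keep only the few highest-scoring terms as the basis for a query.  This is not the MoreLikeThis implementation; the function name, the tf*idf weighting, and the sample numbers below are all assumptions for illustration.

```python
import math

def interesting_terms(term_freqs, doc_freqs, num_docs, max_terms=3):
    """Pick the highest tf*idf terms from a single document's term vector.

    term_freqs: term -> frequency within the one source document
    doc_freqs:  term -> number of documents in the corpus containing the term
    """
    scored = []
    for term, tf in term_freqs.items():
        df = doc_freqs.get(term, 0)
        if df == 0:
            continue  # no corpus statistics for this term, so skip it
        idf = math.log(num_docs / df) + 1.0
        scored.append((tf * idf, term))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:max_terms]]

# Made-up numbers: a common word scores low despite its high frequency.
vector = {"the": 5, "damozel": 3, "blessed": 2}
corpus = {"the": 1000, "damozel": 2, "blessed": 50}
print(interesting_terms(vector, corpus, num_docs=1000, max_terms=2))
```

Only the source document's vector and the per-term document frequencies are consulted; no other document's term vector is read.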

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


Re: TermVector usage

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

I use term vectors for "more like this" queries, such as the links  
you'll see here:

	<http://www.rossettiarchive.org/rose/?query=%2B%28%2Bblessed+%2Bdamozel%29+%2B%28archivetype%3Arad%29>

I am using the MoreLikeThis class.

	Erik



On Feb 21, 2006, at 12:47 AM, Otis Gospodnetic wrote:
> As far as I can tell, most people use TermVectors for "more like  
> this" queries (see MoreLikeThis class in contrib/ somewhere)
>
> Otis
>
> ----- Original Message ----
> From: Marvin Humphrey <ma...@rectangular.com>
> To: java-dev@lucene.apache.org
> Sent: Mon 20 Feb 2006 10:28:56 PM EST
> Subject: TermVector usage



Re: TermVector usage

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Hi Marvin,

As far as I can tell, most people use TermVectors for "more like this" queries (see MoreLikeThis class in contrib/ somewhere)

Otis
