Posted to solr-user@lucene.apache.org by Mike Hughes <mf...@gmail.com> on 2010/02/10 01:49:08 UTC

Bigram term vectors and weights possible with Solr?

Hello,

One of the commercial search platforms I work with has the concept of
'document vectors', which are 1-gram and 2-gram phrases and their
associated tf/idf weights on a 0-1 scale, e.g. ["banana pie", 0.99]
means "banana pie" is very relevant for this document.

During the ingest/indexing process you can configure the engine to
store the top N vectors (those with the highest weights) from a
document into a field that is indexed along with the original content
and is returned in a result set.  This is great for reporting and
other statistical analysis, and even some basic result clustering at
query time.
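
To make the idea concrete, here is a minimal Python sketch of what such
"document vectors" might look like. Everything in it is an assumption for
illustration: the commercial platform's actual weighting formula is not
known here, so the sketch uses a smoothed tf*idf and divides by the largest
weight in the document to land on a 0-1 scale. The names terms() and
top_vectors() are hypothetical.

```python
import math
from collections import Counter


def terms(text):
    """1-gram and 2-gram terms of a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    unigrams = list(tokens)
    bigrams = [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]
    return unigrams + bigrams


def top_vectors(doc, corpus, top_n=3):
    """Top-N (term, weight) pairs for doc, weights scaled into 0-1.

    Assumed weighting: tf * (log((1 + N) / (1 + df)) + 1), then divided
    by the largest weight in the document. Not the platform's formula.
    """
    tf = Counter(terms(doc))
    if not tf:
        return []
    # Document frequency: how many corpus documents contain each term.
    df = Counter(t for d in corpus for t in set(terms(d)))
    n_docs = len(corpus)
    raw = {
        t: count * (math.log((1 + n_docs) / (1 + df[t])) + 1)
        for t, count in tf.items()
    }
    peak = max(raw.values())
    scaled = {t: w / peak for t, w in raw.items()}
    # Highest weight first; ties broken alphabetically for stable output.
    return sorted(scaled.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
```

Calling top_vectors("banana pie banana pie", corpus) against a small corpus
returns a short list of (term, weight) pairs that could be stored in a
field next to the original content.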

I've been looking at the Solr TermVectorComponent
(http://wiki.apache.org/solr/TermVectorComponent) and it seems to have
something similar to this, but it looks to me like this is a component
that is processed at query time (?) and is limited to 1-gram terms.
Also, the tf/idf scores are a little different as they come back in
integer values as separate components.

Does anyone know if Solr/Lucene has anything like what the commercial
platform has as I described above?

Thanks, appreciate any responses.

Michael Hughes
Lightcrest LLC

Re: Bigram term vectors and weights possible with Solr?

Posted by Mike Hughes <mf...@gmail.com>.
Thank you Ahmet, this is exactly what I was looking for.  Looks like
the shingle filter can produce 3+-gram terms as well, that's great.
I'm going to try this with both western and CJK language tokenizers
and see how it turns out.

On Tue, Feb 9, 2010 at 5:07 PM, Ahmet Arslan <io...@yahoo.com> wrote:
>> I've been looking at the Solr TermVectorComponent
>> (http://wiki.apache.org/solr/TermVectorComponent) and it
>> seems to have
>> something similar to this, but it looks to me like this is
>> a component
>> that is processed at query time (?) and is limited to
>> 1-gram terms.
>
> If you use <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="false"/> it can give you info about 2-gram terms.
>
>> Also, the tf/idf scores are a little different as they come
>> back in integer values as separate components.
>
> In the wiki's example output, only the tf and df values (which are integers) are displayed. You can also have tf*idf (a double) calculated with these parameters:
>
> &qt=tvrh&tv=true&fl=yourFieldName&tv.tf=true&tv.df=true&tv.tf_idf=true

Re: Bigram term vectors and weights possible with Solr?

Posted by Ahmet Arslan <io...@yahoo.com>.
> I've been looking at the Solr TermVectorComponent
> (http://wiki.apache.org/solr/TermVectorComponent) and it
> seems to have
> something similar to this, but it looks to me like this is
> a component
> that is processed at query time (?) and is limited to
> 1-gram terms.

If you use <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="false"/> it can give you info about 2-gram terms.
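
As a rough illustration of what word shingling does to a token stream, here
is a small Python sketch. It only approximates the behaviour of the two
attributes shown above (maxShingleSize and outputUnigrams); the function
name shingles() is hypothetical, and details of the real filter such as
token positions and the exact output order are not modelled.

```python
def shingles(tokens, max_size=2, output_unigrams=False):
    """Word shingles (n-grams of 2..max_size words) from a token list.

    Approximates a shingle filter: for each start position, emit the
    optional unigram, then every shingle from 2 words up to max_size.
    """
    out = []
    for i in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[i])
        for n in range(2, max_size + 1):
            if i + n <= len(tokens):
                out.append(" ".join(tokens[i:i + n]))
    return out


tokens = "please divide this sentence".split()
print(shingles(tokens))              # 2-grams only
print(shingles(tokens, max_size=3))  # 2-grams and 3-grams
```

With a filter like this in the field's analysis chain, the terms stored for
the field are the shingles themselves, so term vectors report on them.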

> Also, the tf/idf scores are a little different as they come
> back in integer values as separate components.

In the wiki's example output, only the tf and df values (which are integers) are displayed. You can also have tf*idf (a double) calculated with these parameters:

&qt=tvrh&tv=true&fl=yourFieldName&tv.tf=true&tv.df=true&tv.tf_idf=true
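
For anyone scripting this, a minimal Python sketch that assembles the same
parameters into a request URL. The base URL, the q=*:* query and the helper
name term_vector_url() are assumptions for illustration; the tv.* parameters
and qt=tvrh are the ones listed above.

```python
from urllib.parse import urlencode


def term_vector_url(base_url, field, query="*:*"):
    """Build a URL asking for tf, df and tf*idf of one field's terms."""
    params = [
        ("q", query),        # assumed: match-all query
        ("qt", "tvrh"),      # request handler name used above
        ("tv", "true"),
        ("fl", field),
        ("tv.tf", "true"),
        ("tv.df", "true"),
        ("tv.tf_idf", "true"),
    ]
    return base_url + "?" + urlencode(params)


# Hypothetical host and path; adjust to your own installation.
print(term_vector_url("http://localhost:8983/solr/select", "yourFieldName"))
```

The field needs to have term vectors stored for anything to come back.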