Posted to user@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2012/03/10 18:58:57 UTC

Vector based queries

I have a case where I'd like to get the documents that most closely match 
a particular vector. Mahout's RowSimilarityJob is ideal for precalculating 
similarity between existing documents, but in my case the query is 
constructed at run time: the UI builds a vector to be used as the query. 
We have this running in a prototype using a run-time calculation of cosine 
similarity, but that implementation does not scale to large doc stores.

One thought is to calculate fairly small clusters. The UI will know 
which cluster to target for the vector query. So we might be able to 
narrow down the number of docs per query to a reasonable size.
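
A minimal sketch of the cluster idea, assuming documents and queries are sparse {term: weight} dicts and that cluster membership was computed offline. The names `search_cluster`, `clusters`, and `docs` are hypothetical and not part of Mahout; the point is only that cosine similarity is computed against one cluster's documents rather than the whole store.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search_cluster(query, cluster_id, clusters, docs, top_n=10):
    """Score only the documents that belong to one precomputed cluster."""
    scored = [(cosine(query, docs[doc_id]), doc_id)
              for doc_id in clusters[cluster_id]]
    scored.sort(reverse=True)  # best match first
    return scored[:top_n]

# Toy data: three documents, two clusters (hypothetical).
docs = {
    "d1": {"mahout": 1.0, "cluster": 2.0},
    "d2": {"mahout": 1.0, "solr": 1.0},
    "d3": {"lucene": 3.0, "solr": 1.0},
}
clusters = {"c0": ["d1", "d2"], "c1": ["d3"]}

# The UI is assumed to know that this query belongs to cluster "c0".
results = search_cluster({"mahout": 1.0, "cluster": 1.0}, "c0", clusters, docs)
```

The per-query cost then depends on the size of the chosen cluster, not on the size of the doc store, at the price of missing good matches that landed in a neighbouring cluster.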

It seems like a place for multiple hash functions maybe? Could we use 
some kind of hack of the boost feature of Solr or some other approach?
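
One possible reading of "multiple hash functions" is locality-sensitive hashing with random hyperplanes, which approximates cosine similarity. The sketch below is an assumption about what that could look like, not something Mahout or Solr provides under these names: each table uses its own set of random hyperplanes, and a query only has to be compared against documents that share a bucket with it in at least one table.

```python
import random

def make_hyperplanes(vocabulary, n_bits, seed=0):
    """One random hyperplane (as a {term: coefficient} dict) per signature bit."""
    rng = random.Random(seed)
    return [{t: rng.gauss(0.0, 1.0) for t in vocabulary} for _ in range(n_bits)]

def signature(vector, hyperplanes):
    """Bit i records which side of hyperplane i the vector falls on."""
    bits = 0
    for plane in hyperplanes:
        dot = sum(w * plane.get(t, 0.0) for t, w in vector.items())
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits

def build_tables(docs, tables_planes):
    """One bucket map (signature -> set of doc ids) per hash table."""
    tables = []
    for planes in tables_planes:
        buckets = {}
        for doc_id, vec in docs.items():
            buckets.setdefault(signature(vec, planes), set()).add(doc_id)
        tables.append(buckets)
    return tables

def candidates(query, tables, tables_planes):
    """Union of the buckets the query hashes to, across all tables."""
    found = set()
    for buckets, planes in zip(tables, tables_planes):
        found |= buckets.get(signature(query, planes), set())
    return found

vocabulary = ["mahout", "lucene", "solr", "vector", "query"]
docs = {
    "d1": {"mahout": 1.0, "vector": 2.0},
    "d2": {"lucene": 1.0, "solr": 1.0},
    "d3": {"vector": 1.0, "query": 1.0},
}
tables_planes = [make_hyperplanes(vocabulary, n_bits=4, seed=s) for s in range(3)]
tables = build_tables(docs, tables_planes)
found = candidates(docs["d1"], tables, tables_planes)
```

The candidate set would still be ranked with an exact cosine calculation; the hashing only decides which documents are worth scoring.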

Does anyone have a suggestion?

Re: Vector based queries

Posted by Bill Bell <bi...@gmail.com>.

It is way too slow

On Mar 11, 2012, at 12:07 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> I found a description here: http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/
> 
> If it is the same four years later, it looks like Lucene does an index lookup for each important term in the example doc, boosting each term based on the term weights. My guess would be that this is a little slower than a 2-3 word query but still scalable.
> 
> Has anyone used this on a very large index?
> 
> Thanks,
> Pat

Re: Vector based queries

Posted by Pat Ferrel <pa...@occamsmachete.com>.

I found a description here: 
http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/

If it is the same four years later, it looks like Lucene does an index 
lookup for each important term in the example doc, boosting each term 
based on the term weights. My guess would be that this is a little 
slower than a 2-3 word query but still scalable.
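
A rough sketch of that idea, not Lucene's actual MoreLikeThis code: pick the highest-weighted terms of the example doc, do one inverted-index lookup per term, and add up boost times weight for every doc found. The in-memory index and the names `build_index` and `more_like_this` are hypothetical stand-ins, and the real feature selects terms and scores hits with Lucene's own tf-idf based logic.

```python
from collections import defaultdict

def build_index(docs):
    """Inverted index: term -> list of (doc_id, weight) postings."""
    index = defaultdict(list)
    for doc_id, vec in docs.items():
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
    return index

def more_like_this(example, index, max_terms=3, exclude=None):
    """Score docs that share the example's most important terms."""
    # Keep only the highest-weighted terms of the example document.
    top_terms = sorted(example.items(), key=lambda kv: kv[1], reverse=True)[:max_terms]
    scores = defaultdict(float)
    for term, boost in top_terms:
        # One index lookup per selected term.
        for doc_id, weight in index.get(term, []):
            if doc_id != exclude:
                scores[doc_id] += boost * weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

docs = {
    "d1": {"vector": 2.0, "query": 1.0, "solr": 0.5},
    "d2": {"vector": 1.5, "query": 1.0},
    "d3": {"hadoop": 2.0, "solr": 1.0},
}
index = build_index(docs)
hits = more_like_this(docs["d1"], index, max_terms=2, exclude="d1")
```

The work per query grows with the number of selected terms and the length of their postings lists, which is why it should behave like a somewhat longer keyword query rather than a scan of every doc.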

Has anyone used this on a very large index?

Thanks,
Pat

Re: Vector based queries

Posted by Pat Ferrel <pa...@occamsmachete.com>.

MoreLikeThis looks exactly like what I need. I would probably create a new "like" method that takes a Mahout vector and builds a search from it. I build the vector by starting from a doc and reweighting certain terms. The prototype just reweights words, but I may experiment with Dirichlet clusters and reweighting an entire cluster of words, so you could boost the importance of a topic in the results. Either way, the result of the algorithm would be a Mahout vector.
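
A small sketch of the reweighting step, assuming the doc vector is a sparse {term: weight} dict rather than an actual Mahout vector. The function `reweight` and its parameters are hypothetical: `term_boosts` scales individual words, while `topic_terms` and `topic_boost` scale a whole group of words, for example the words belonging to one cluster.

```python
def reweight(vector, term_boosts=None, topic_terms=(), topic_boost=1.0):
    """Return a new sparse vector with selected terms scaled up or down."""
    term_boosts = term_boosts or {}
    result = {}
    for term, weight in vector.items():
        factor = term_boosts.get(term, 1.0)  # per-word reweighting
        if term in topic_terms:
            factor *= topic_boost            # reweight a whole topic at once
        result[term] = weight * factor
    return result

doc_vector = {"mahout": 1.0, "cluster": 0.5, "dirichlet": 0.5, "ui": 0.2}
query_vector = reweight(
    doc_vector,
    term_boosts={"mahout": 2.0},
    topic_terms={"cluster", "dirichlet"},
    topic_boost=3.0,
)
```

The original doc vector is left untouched, so the same doc can be used as the starting point for several differently weighted queries.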

Is there a description of how this works somewhere? Is it basically an index lookup? I always thought the Google feature used precalculated results (and it probably does). I'm curious, but I'm mainly asking to see how fast it is.

Thanks
Pat

Re: Vector based queries

Posted by Paul Libbrecht <pa...@hoplahup.net>.

Maybe that's exactly it, but... given a document with n tokens A and m tokens B, wouldn't a query A^n B^m find what you're looking for?
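
For illustration, such a query string could be produced from token counts as below. This is only a sketch: `boosted_query` is a hypothetical helper, it uses the `term^boost` notation of the Lucene query syntax, and it does no escaping of special characters, which real terms would need.

```python
from collections import Counter

def boosted_query(tokens):
    """Turn a token list into 'term^count' clauses, most frequent first."""
    counts = Counter(tokens)
    return " ".join(f"{term}^{count}" for term, count in counts.most_common())

# A document with 3 tokens A and 2 tokens B.
query = boosted_query(["A", "A", "A", "B", "B"])
```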

paul

PS: I've always viewed queries as linear forms on the vector space, and I'd like to see this written out in proper mathematical form one day...

On 11 March 2012 at 07:23, Lance Norskog wrote:

> Look at the MoreLikeThis feature in Lucene. I believe it does roughly
> what you describe.

Re: Vector based queries

Posted by Lance Norskog <go...@gmail.com>.

Look at the MoreLikeThis feature in Lucene. I believe it does roughly
what you describe.


-- 
Lance Norskog
goksron@gmail.com