Posted to solr-user@lucene.apache.org by David Parks <da...@yahoo.com> on 2013/05/14 13:04:16 UTC

Boosting documents with terms derived from clustering - good idea?

We have a number of queries that produce good results based on the textual
data, but are contextually wrong (for example, an "SSD hard drive" search
matches the music album "SSD hip hop drives us crazy").

Textually a fair match, but SSD is a term that strongly relates to technical
documents.

We'd like to be able to direct this query more strictly in the direction of
the technical documents based on the term "SSD".  I am considering whether
it would be worth trying to cluster all documents, thus tending to group the
music with the music and tech items with the tech items. I would then pull
out the term vectors that define each group, have a human review that data,
and plug it back into the documents of each cluster as a separate search
field that gets boosted.
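
To make the "pull out the terms that define each group" step concrete, here
is a rough sketch in Python. It assumes the clustering itself has already
been done by some other tool and only shows one possible way to rank terms
per cluster. The function name, the scoring formula (in-cluster coverage
times an IDF-style rarity across clusters), and the sample data are all
illustrative assumptions, not anything Solr provides.

```python
import math
from collections import Counter, defaultdict


def top_cluster_terms(docs, labels, n=3):
    """Return the n terms that best distinguish each cluster.

    docs   -- list of token lists (already analyzed/lowercased)
    labels -- one cluster id per document, from whatever clustering was used
    """
    # Count, per cluster, how many documents contain each term.
    df_in_cluster = defaultdict(Counter)
    cluster_sizes = Counter(labels)
    for tokens, label in zip(docs, labels):
        df_in_cluster[label].update(set(tokens))

    # Count how many clusters each term shows up in at all.
    clusters_with_term = Counter()
    for counts in df_in_cluster.values():
        clusters_with_term.update(counts.keys())

    n_clusters = len(cluster_sizes)
    result = {}
    for label, counts in df_in_cluster.items():
        scores = {}
        for term, df in counts.items():
            coverage = df / cluster_sizes[label]  # how common inside the cluster
            rarity = math.log(1 + n_clusters / clusters_with_term[term])  # how exclusive
            scores[term] = coverage * rarity
        # Highest score first; ties broken alphabetically for stable output.
        result[label] = sorted(scores, key=lambda t: (-scores[t], t))[:n]
    return result


# Hypothetical toy data: two tech items and two music items.
docs = [
    ["ssd", "hard", "drive"],
    ["ssd", "sata", "drive"],
    ["ssd", "hip", "hop", "album"],
    ["hip", "hop", "album", "drives"],
]
labels = ["tech", "tech", "music", "music"]
candidates = top_cluster_terms(docs, labels, n=2)
```

The per-cluster lists in `candidates` would be the input to the human
review, and whatever survives review is what would go into the separate
boosted field.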

In my head it seems like a plausible way to weight terms like SSD toward the
cluster of items they are most closely associated with.

Should I spend the effort to find out?

Yea or nay?

Re: Boosting documents with terms derived from clustering - good idea?

Posted by Otis Gospodnetic <ot...@gmail.com>.

Hi,

I would take a different approach.  Track users' queries and their
clicks.  Aggregate the queries by the documents they led to, and start
thinking of those queries as tags/labels: use the top N queries per
document to tag your docs.  Alternatively/additionally, extract significant
terms and phrases from clicked-to docs and use those to tag your docs.
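
As a rough illustration of the query/click aggregation, here is a minimal
Python sketch. The log format (pairs of query string and clicked document
id), the function name, and the light whitespace/case normalization are
assumptions made for the example; a real click log would need its own
parsing.

```python
from collections import Counter, defaultdict


def top_query_tags(click_log, n=2):
    """Map each document id to its n most frequent clicked-through queries.

    click_log -- iterable of (query, clicked_doc_id) pairs (assumed format)
    """
    queries_per_doc = defaultdict(Counter)
    for query, doc_id in click_log:
        # Lowercase and collapse whitespace so trivially different
        # spellings of the same query are counted together.
        normalized = " ".join(query.lower().split())
        queries_per_doc[doc_id][normalized] += 1

    tags = {}
    for doc_id, counts in queries_per_doc.items():
        # Most frequent first; ties broken alphabetically for stable output.
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        tags[doc_id] = [query for query, _ in ranked[:n]]
    return tags


# Hypothetical click log entries.
click_log = [
    ("SSD hard drive", "d1"),
    ("ssd hard drive ", "d1"),
    ("solid state disk", "d1"),
    ("hip hop", "d2"),
    ("ssd", "d1"),
    ("solid state disk", "d1"),
]
doc_tags = top_query_tags(click_log, n=2)
```

Each document's list in `doc_tags` could then be indexed into a tag field
on that document.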

Otis
--
Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html



