Posted to solr-user@lucene.apache.org by Michael Lorz <mi...@yahoo.co.uk> on 2011/08/05 12:55:18 UTC

"Weighted" facet strings

Hi all,

I have documents which are (manually) tagged with categories. Each 
category-document relation has a weight between 1 and 5: 

5: document fits perfectly in this category,
.
. 
1: document may be considered as belonging to this category. 


I would now like to use this information with solr. At the moment, I don't use 
the weight at all:

<field name="category" type="string" indexed="true" stored="true" 
multiValued="true"/>

Both the category as well as the document body are specified as query fields 
(<str name="qf"> in solrconfig.xml).


What I would like is the following:

- filter: category=some_category_name, query: *:*  - Results should be scored by 
the above mentioned weight
- filter: category=some_category_name, query: some_keyword - Results should be 
scored by a combination of the score of 'some_keyword' and the above mentioned 
weight
- filter: none, query: some_category_name - Documents with category 
'some_category_name' should be found as well as documents which contain the term 
'some_category_name'. Results should be scored by a combination of the score of 
'some_keyword' and the above mentioned weight


Do you have any ideas how this could be done?

Thanks in advance
Michi

Re: "Weighted" facet strings

Posted by Chris Hostetter <ho...@fucit.org>.

: Subject: "Weighted" facet strings

First off: a terminology clarification.  what you are describing has very
little to do with facets.  it's true that your "category" field is a
"facet" of your documents, but in the context of your question, you aren't
asking about any facet related features of solr.
 
what you are really asking about is specifying weighted importance on
individual values indexed in the category field of your documents.

The suggestion in another reply to use multiple fields (cat_weight_1, 
cat_weight_2, etc...) and then boost those fields accordingly is a 
classic, easy to implement solution to this type of problem that works 
really well when the cardinality of "weights" is low and fixed (in your 
case 1-5).
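
A minimal sketch of the indexing side of that multi-field approach, assuming the documents are prepared in Python before being sent to Solr. The helper name `to_weight_fields` and the input dictionary are hypothetical; only the `cat_weight_N` field naming comes from the text above.

```python
def to_weight_fields(category_weights, prefix="cat_weight_"):
    """Group category names into one multi-valued field per weight.

    category_weights maps a category name to its weight (1 to 5).
    The returned dict maps field names such as "cat_weight_5" to the
    list of categories that carry that weight.
    """
    doc_fields = {}
    for category, weight in category_weights.items():
        if not 1 <= weight <= 5:
            raise ValueError(f"weight out of range for {category!r}: {weight}")
        doc_fields.setdefault(f"{prefix}{weight}", []).append(category)
    return doc_fields


fields = to_weight_fields({"foo": 5, "bar": 3, "yak": 1})
print(fields)
```

Each resulting field can then be listed in qf with its own boost, which is why this only stays manageable while the set of weights is small and fixed.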

Another way people have dealt with problems like this historically is to 
"keyword stuff" the category field -- so if a document has category 
weights: foo=5, bar=3, yak=1 you index "foo foo foo foo foo bar bar bar 
yak" in the category field.  As long as you use a similarity that defines 
tf() as an identity function, and doesn't use length norm, this also works 
really well.  (There are also tricks you can do using custom update 
processors or tokenizers to let you send "foo=5" over the wire and have it 
index the "foo" token with a termFreq of 5)
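
A minimal sketch of the client-side "keyword stuffing" step, assuming the field value is built in Python before indexing and that category names are single tokens (a name containing whitespace would be split by a whitespace tokenizer). The helper name `stuff_categories` is hypothetical.

```python
def stuff_categories(category_weights):
    """Repeat each category name as many times as its weight.

    With a similarity whose tf() is the identity and that ignores
    length norm, the term frequency of a category then equals its weight.
    """
    tokens = []
    for category, weight in category_weights.items():
        tokens.extend([category] * weight)
    return " ".join(tokens)


stuffed = stuff_categories({"foo": 5, "bar": 3, "yak": 1})
print(stuffed)
```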

Looking forward: the "best" way to solve this problem in theory is using 
Payloads, but there aren't a lot of options currently available for 
leveraging payloads in Solr's query APIs / Parsers, so you'd probably have 
to write something custom.


How you actually execute the queries depends on the approach you take at 
indexing -- let's assume you do the keyword stuffing approach...

: - filter: category=some_category_name, query: *:*  - Results should be scored by 
: the above mentioned weight

	q=cat:some_category_name 
	& sort=score desc

...with a simple tf() func the default score will do exactly what you want

or you could use the same {!boost} solution as below with "*:*" ....

: - filter: category=some_category_name, query: some_keyword - Results should be 
: scored by a combination of the score of 'some_keyword' and the above mentioned 
: weight

you just have to define what you mean by "combination" in terms of solr 
query functions.  easiest is multiplicatively with the {!boost} parser...

	q={!boost b=tf(cat,'some_category_name')}some_keyword 
	& fq=cat:some_category_name 
	& sort = score desc
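
A minimal sketch of assembling those request parameters from a client, assuming Python and the standard library only. The helper name `boosted_query_params` is hypothetical, the field name `cat` is taken from the example above, and no escaping of special characters in the keyword or category is attempted.

```python
from urllib.parse import urlencode


def boosted_query_params(keyword, category, field="cat"):
    """Build q/fq/sort so the keyword score is multiplied by tf(field, category)."""
    return {
        "q": "{!boost b=tf(%s,'%s')}%s" % (field, category, keyword),
        "fq": "%s:%s" % (field, category),
        "sort": "score desc",
    }


params = boosted_query_params("some_keyword", "some_category_name")
query_string = urlencode(params)  # URL-encoded form, ready to append to a select URL
print(query_string)
```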

: - filter: none, query: some_category_name - Documents with category 
: 'some_category_name' should be found as well as documents which contain the term 
: 'some_category_name'. Results should be scored by a combination of the score of 
: 'some_keyword' and the above mentioned weight

...you could do this by including your category field in the qf of a 
dismax search.

assuming you want a single solution that works for all of these, and your 
"query: some_keyword" example includes the possibility that some_keyword 
is also a category name (and you want its weight taken into account as 
well) then an all inclusive solution would probably be something like...

	q={!boost b=tf(cat,'some_category_name') defType=}some_keyword
	& qf = cat^10 otherfields^5
	& fq=cat:some_category_name
	& sort = score desc




-Hoss

Re: "Weighted" facet strings

Posted by Jonathan Rochkind <ro...@jhu.edu>.

Ah wait, I forgot about dismax 'bq' parameter!  That might be a way to 
accomplish your first and second use cases. You probably still need the 
separate _text_weight_X fields for your third use case.

Sorry I don't have a complete recipe for you, but hopefully these tools 
will help get you somewhere.

http://wiki.apache.org/solr/DisMaxQParserPlugin#bq_.28Boost_Query.29
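
One possible reading of the 'bq' idea for the second use case, sketched in Python under several assumptions: the category_text_weight_N fields from the quoted message below exist, the boost values (1, 10, 20) are placeholders to be tuned, and the helper name `bq_clauses` is hypothetical. Note that bq contributes additively to the score rather than multiplying it.

```python
from urllib.parse import urlencode


def bq_clauses(category, boosts, field_prefix="category_text_weight_"):
    """One boost query per weight field, e.g. category_text_weight_3:foo^20."""
    return [
        f"{field_prefix}{weight}:{category}^{boost}"
        for weight, boost in sorted(boosts.items())
    ]


clauses = bq_clauses("some_category", {1: 1, 2: 10, 3: 20})

# bq may be given several times, so the parameters are kept as a list of pairs.
params = [
    ("defType", "dismax"),
    ("q", "some_keyword"),
    ("fq", "category:some_category"),
] + [("bq", clause) for clause in clauses]

query_string = urlencode(params)
print(query_string)
```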

On 8/8/2011 11:16 AM, Jonathan Rochkind wrote:
> One kind of hacky way to accomplish some of those tasks involves 
> creating a lot more Solr fields. (This kind of 'de-normalization' is 
> often the answer to how to make Solr do something).
>
> So facet fields are ordinarily not tokenized or normalized at all. But 
> that doesn't work very well for matching query terms.  So if you want 
> actual queries to match on these categories, you probably want an 
> additional field that is tokenized/analyzed.  If you want to boost 
> different category assignments differently, you probably want 
> _multiple_ additional tokenized/analyzed fields.
>
> So for instance, create separate analyzed fields for each category 
> 'weight', perhaps using the default 'text' analysis type.
>
> category_text_weight_1
> category_text_weight_2
> etc
>
> Then use dismax to query, include all those category_text_* fields in 
> the 'qf', and boost the higher weight ones more than the lower weight 
> ones.
>
> That will handle a number of your use cases, but not all of them.
>
> Your first two cases are the most problematic:
>
> "filter: category=some_category_name, query: *:* - Results should be 
> scored by the above mentioned weight"
>
> So Solr doesn't really work like that. Normally a filter does not 
> affect the scoring of the actual results _at all_. But if you change 
> the query to:
>
> &fq=category:some_category
> &q=some_category
> &defType=dismax
> &qf=category_text_weight_1 category_text_weight_2^10 category_text_weight_3^20
>
> THEN, with the multiple analyzed category_text_weight_* fields, as 
> described above, I think it should do what you want. You may have to 
> play with exactly what boost to give to each field.
>
> But your second use case is still tricky.
>
> Solr doesn't really do exactly what you ask, but by using this method 
> I think you can figure out hacky ways to accomplish it.  I'm not sure 
> if it will solve all of your use cases, but maybe this will give you a 
> start to figuring it out.

Re: "Weighted" facet strings

Posted by Jonathan Rochkind <ro...@jhu.edu>.

One kind of hacky way to accomplish some of those tasks involves 
creating a lot more Solr fields. (This kind of 'de-normalization' is 
often the answer to how to make Solr do something).

So facet fields are ordinarily not tokenized or normalized at all. But 
that doesn't work very well for matching query terms.  So if you want 
actual queries to match on these categories, you probably want an 
additional field that is tokenized/analyzed.  If you want to boost 
different category assignments differently, you probably want _multiple_ 
additional tokenized/analyzed fields.

So for instance, create separate analyzed fields for each category 
'weight', perhaps using the default 'text' analysis type.

category_text_weight_1
category_text_weight_2
etc

Then use dismax to query, include all those category_text_* fields in 
the 'qf', and boost the higher weight ones more than the lower weight ones.

That will handle a number of your use cases, but not all of them.

Your first two cases are the most problematic:

"filter: category=some_category_name, query: *:* - Results should be 
scored by the above mentioned weight"

So Solr doesn't really work like that. Normally a filter does not affect 
the scoring of the actual results _at all_. But if you change the query to:

&fq=category:some_category
&q=some_category
&defType=dismax
&qf=category_text_weight_1 category_text_weight_2^10 category_text_weight_3^20

THEN, with the multiple analyzed category_text_weight_* fields, as 
described above, I think it should do what you want. You may have to 
play with exactly what boost to give to each field.
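
A minimal sketch of generating that qf value, assuming Python on the client side. The helper name `weighted_qf` is hypothetical and the boosts (1, 10, 20) are just the placeholder values from the example; dismax expects the qf entries to be separated by whitespace.

```python
def weighted_qf(boosts, field_prefix="category_text_weight_"):
    """Build a dismax qf string with one boosted field per category weight.

    boosts maps a weight (1, 2, 3, ...) to the boost for its field.
    A boost of 1 is the default, so it is left implicit.
    """
    parts = []
    for weight in sorted(boosts):
        field = f"{field_prefix}{weight}"
        boost = boosts[weight]
        parts.append(field if boost == 1 else f"{field}^{boost}")
    return " ".join(parts)


qf = weighted_qf({1: 1, 2: 10, 3: 20})
print(qf)
```

Keeping the boosts in one mapping makes it easier to experiment with different values per field.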

But your second use case is still tricky.

Solr doesn't really do exactly what you ask, but by using this method I 
think you can figure out hacky ways to accomplish it.  I'm not sure if 
it will solve all of your use cases, but maybe this will give you a 
start to figuring it out.
