Posted to solr-user@lucene.apache.org by Michael Lorz <mi...@yahoo.co.uk> on 2011/08/05 12:55:18 UTC
"Weighted" facet strings
Hi all,
I have documents which are (manually) tagged with categories. Each
category-document relation has a weight between 1 and 5:
5: document fits perfectly in this category,
.
.
1: document may be considered as belonging to this category.
I would now like to use this information with solr. At the moment, I don't use
the weight at all:
<field name="category" type="string" indexed="true" stored="true"
multiValued="true"/>
Both the category as well as the document body are specified as query fields
(<str name="qf"> in solrconfig.xml).
What I would like is the following:
- filter: category=some_category_name, query: *:* - Results should be scored by
the above mentioned weight
- filter: category=some_category_name, query: some_keyword - Results should be
scored by a combination of the score of 'some_keyword' and the above mentioned
weight
- filter: none, query: some_category_name - Documents with category
'some_category_name' should be found as well as documents which contain the term
'some_category_name'. Results should be scored by a combination of the score of
'some_keyword' and the above mentioned weight
Do you have any ideas how this could be done?
Thanks in advance
Michi
Re: "Weighted" facet strings
Posted by Chris Hostetter <ho...@fucit.org>.
: Subject: "Weighted" facet strings
First off: a terminology clarification. what you are describing has very
little to do with facets. it's true that your "category" field is a
"facet" of your documents, but in the context of your question, you aren't
asking about any facet related features of solr.
what you are really asking about is specifying weighted importance on
individual values indexed in the category field of your documents.
The suggestion in another reply to use multiple fields (cat_weight_1,
cat_weight_2, etc...) and then boost those fields accordingly is a
classic, easy to implement solution to this type of problem that works
really well when the cardinality of "weights" is low and fixed (in your
case 1-5).
Another way people have dealt with problems like this historically is to
"keyword stuff" the category field -- so if a document has category
weights: foo=5, bar=3 yak=1 you index "foo foo foo foo foo bar bar bar
yak" in the category field. As long as you use a similarity that defines
tf() as an identity function, and doesn't use length norm, this also works
really well. (There are also tricks you can do using custom update
processors or tokenizers to let you send "foo=5" over the wire and have it
index the "foo" token with a termFreq of 5)
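A minimal sketch of the keyword stuffing step on the indexing side, in Python. The function name and the document layout are made up for illustration. It assumes category names contain no whitespace and that the target field is whitespace-tokenized (a plain "string" field would index the whole stuffed value as a single term, so the type would have to change).

```python
def stuff_categories(weights):
    """Repeat each category name `weight` times, e.g. foo=5 -> 'foo foo foo foo foo'."""
    tokens = []
    for name, weight in weights.items():
        # The thread uses integer weights from 1 to 5; reject anything else.
        if not 1 <= weight <= 5:
            raise ValueError(f"weight for {name!r} must be between 1 and 5")
        tokens.extend([name] * weight)
    return " ".join(tokens)


# Hypothetical document; only the stuffed 'category' value matters here.
doc = {"id": "doc1", "category": stuff_categories({"foo": 5, "bar": 3, "yak": 1})}
print(doc["category"])  # foo foo foo foo foo bar bar bar yak
```

With an identity tf() and no length norm, the term frequency of a category in this field is then exactly its weight.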
Looking forward: the "best" way to solve this problem in theory is using
Payloads, but there aren't a lot of options currently available for
leveraging payloads in Solr's query APIs / Parsers, so you'd probably have
to write something custom.
How you actually execute the queries depends on the approach you take at
indexing -- let's assume you do the keyword stuffing approach...
: - filter: category=some_category_name, query: *:* - Results should be scored by
: the above mentioned weight
q=cat:some_category_name
& sort=score desc
...with a simple tf() func the default score will do exactly what you want,
or you could use the same {!boost} solution as below with "*:*" as the query....
: - filter: category=some_category_name, query: some_keyword - Results should be
: scored by a combination of the score of 'some_keyword' and the above mentioned
: weight
you just have to define what you mean by "combination" in terms of solr
query functions. the easiest is multiplicatively with the {!boost} parser...
q={!boost b=tf(cat,'some_category_name')}some_keyword
& fq=cat:some_category_name
& sort = score desc
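To make "multiplicatively" concrete, here is a toy calculation in Python. It is not Solr's scoring code: the keyword scores are invented numbers, and category_tf stands for the stuffed term frequency (the 1-5 weight). It only illustrates that {!boost} multiplies the wrapped query's score by the value of the boost function.

```python
def boosted_score(keyword_score, category_tf):
    # Multiplicative combination: score of the wrapped query * boost function value.
    return keyword_score * category_tf


# Invented example scores; category_tf is the category weight (1-5).
docs = [
    {"id": "A", "keyword_score": 0.9, "category_tf": 1},
    {"id": "B", "keyword_score": 0.5, "category_tf": 5},
    {"id": "C", "keyword_score": 0.7, "category_tf": 3},
]

ranked = sorted(
    docs,
    key=lambda d: boosted_score(d["keyword_score"], d["category_tf"]),
    reverse=True,
)
print([d["id"] for d in ranked])  # ['B', 'C', 'A']
```

Document A has the best keyword match but the weakest category fit, so it drops to last place. If that is too aggressive, the boost function can be dampened, which is the sense in which you have to define the "combination" yourself.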
: - filter: none, query: some_category_name - Documents with category
: 'some_category_name' should be found as well as documents which contain the term
: 'some_category_name'. Results should be scored by a combination of the score of
: 'some_keyword' and the above mentioned weight
...you could do this by including your category field in the qf of a
dismax search.
assuming you want a single solution that works for all of these, and your
"query: some_keyword" example includes the possibility that some_keyword
is also a category name (and you want its weight taken into account as
well) then an all inclusive solution would probably be something like...
q={!boost b=tf(cat,'some_category_name') defType=dismax}some_keyword
& qf = cat^10 otherfields^5
& fq=cat:some_category_name
& sort = score desc
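For anyone building this request from client code, a small Python sketch of assembling the parameters. The function name is hypothetical, "cat" and "otherfields" are the placeholder field names used above, and defType=dismax is an assumption based on the mention of a dismax search. No escaping of query syntax is done, so it assumes plain category names without quotes or special characters.

```python
from urllib.parse import urlencode


def all_in_one_params(user_query, category, qf="cat^10 otherfields^5"):
    """Build request parameters for the boosted, filtered dismax query."""
    return {
        # {{ and }} are literal braces inside an f-string.
        "q": f"{{!boost b=tf(cat,'{category}') defType=dismax}}{user_query}",
        "qf": qf,
        "fq": f"cat:{category}",
        "sort": "score desc",
    }


params = all_in_one_params("some_keyword", "some_category_name")
print(params["q"])
print(urlencode(params))  # URL-encoded query string to append to the select URL
```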
-Hoss
Re: "Weighted" facet strings
Posted by Jonathan Rochkind <ro...@jhu.edu>.
Ah wait, I forgot about the dismax 'bq' parameter! That might be a way to
accomplish your first and second use cases. You probably still need the
separate category_text_weight_X fields for your third use case.
Sorry I don't have a complete recipe for you, but hopefully these tools
will help get you somewhere.
http://wiki.apache.org/solr/DisMaxQParserPlugin#bq_.28Boost_Query.29
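A rough Python sketch of what a 'bq' for the second use case might look like. Everything here is illustrative: the helper name, the "body" field, and the boost values (simply equal to the weight) are assumptions, and the per-weight fields are the category_text_weight_X fields from my earlier message. Note that bq adds to the score rather than multiplying it.

```python
def bq_for_category(category, max_weight=5, prefix="category_text_weight_"):
    """One optional boost clause per weight field; higher weights get a larger boost."""
    return " ".join(
        f"{prefix}{w}:{category}^{w}" for w in range(1, max_weight + 1)
    )


# Hypothetical request parameters for "filter on a category, search a keyword".
params = {
    "defType": "dismax",
    "q": "some_keyword",
    "qf": "body",  # assumed name of the document body field
    "fq": "category:some_category_name",
    "bq": bq_for_category("some_category_name"),
}
print(params["bq"])
```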
On 8/8/2011 11:16 AM, Jonathan Rochkind wrote:
> One kind of hacky way to accomplish some of those tasks involves
> creating a lot more Solr fields. (This kind of 'de-normalization' is
> often the answer to how to make Solr do something).
>
> So facet fields are ordinarily not tokenized or normalized at all. But
> that doesn't work very well for matching query terms. So if you want
> actual queries to match on these categories, you probably want an
> additional field that is tokenized/analyzed. If you want to boost
> different category assignments differently, you probably want
> _multiple_ additional tokenized/analyzed fields.
>
> So for instance, create separate analyzed fields for each category
> 'weight', perhaps using the default 'text' analysis type.
>
> category_text_weight_1
> category_text_weight_2
> etc
>
> Then use dismax to query, include all those category_text_* fields in
> the 'qf', and boost the higher weight ones more than the lower weight
> ones.
>
> That will handle a number of your use cases, but not all of them.
>
> Your first two cases are the most problematic:
>
> "filter: category=some_category_name, query: *:* - Results should be
> scored by the above mentioned weight"
>
> So Solr doesn't really work like that. Normally a filter does not
> affect the scoring of the actual results _at all_. But if you change
> the query to:
>
> &fq=category:some_category
> &q=some_category
> &defType=dismax
> &qf=category_text_weight_1 category_text_weight_2^10 category_text_weight_3^20
>
> THEN, with the multiple analyzed category_text_weight_* fields, as
> described above, I think it should do what you want. You may have to
> play with exactly what boost to give to each field.
>
> But your second use case is still tricky.
>
> Solr doesn't really do exactly what you ask, but by using this method
> I think you can figure out hacky ways to accomplish it. I'm not sure
> if it will solve all of your use cases, but maybe this will give you a
> start to figuring it out.
Re: "Weighted" facet strings
Posted by Jonathan Rochkind <ro...@jhu.edu>.
One kind of hacky way to accomplish some of those tasks involves
creating a lot more Solr fields. (This kind of 'de-normalization' is
often the answer to how to make Solr do something).
So facet fields are ordinarily not tokenized or normalized at all. But
that doesn't work very well for matching query terms. So if you want
actual queries to match on these categories, you probably want an
additional field that is tokenized/analyzed. If you want to boost
different category assignments differently, you probably want _multiple_
additional tokenized/analyzed fields.
So for instance, create separate analyzed fields for each category
'weight', perhaps using the default 'text' analysis type.
category_text_weight_1
category_text_weight_2
etc
Then use dismax to query, include all those category_text_* fields in
the 'qf', and boost the higher weight ones more than the lower weight ones.
That will handle a number of your use cases, but not all of them.
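A small Python sketch of the indexing and 'qf' side of this. The helper names and the boost values are made up; the only thing taken from the description above is the category_text_weight_N naming and the idea of boosting higher weights more.

```python
from collections import defaultdict


def weighted_fields(weights, prefix="category_text_weight_"):
    """Group category names into one multi-valued field per weight."""
    fields = defaultdict(list)
    for name, weight in weights.items():
        fields[f"{prefix}{weight}"].append(name)
    return dict(fields)


def qf_for_weights(boosts, prefix="category_text_weight_"):
    """Build a space-separated dismax 'qf' value from {weight: boost}."""
    return " ".join(f"{prefix}{w}^{b}" for w, b in sorted(boosts.items()))


# Keep the untokenized 'category' field for filtering, add the weighted ones.
doc = {
    "id": "doc1",
    "category": ["foo", "bar", "yak"],
    **weighted_fields({"foo": 5, "bar": 3, "yak": 1}),
}
print(doc)
print(qf_for_weights({1: 1, 2: 2, 3: 4, 4: 8, 5: 16}))
```

The boosts here double with each weight step purely as an example; you would tune them against real queries.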
Your first two cases are the most problematic:
"filter: category=some_category_name, query: *:* - Results should be
scored by the above mentioned weight"
So Solr doesn't really work like that. Normally a filter does not affect
the scoring of the actual results _at all_. But if you change the query to:
&fq=category:some_category
&q=some_category
&defType=dismax
&qf=category_text_weight_1 category_text_weight_2^10 category_text_weight_3^20
THEN, with the multiple analyzed category_text_weight_* fields, as
described above, I think it should do what you want. You may have to
play with exactly what boost to give to each field.
But your second use case is still tricky.
Solr doesn't really do exactly what you ask, but by using this method I
think you can figure out hacky ways to accomplish it. I'm not sure if
it will solve all of your use cases, but maybe this will give you a
start to figuring it out.