Posted to solr-user@lucene.apache.org by Daniel Shane <sh...@lexum.com> on 2014/01/21 17:41:57 UTC

Interesting search question! How to match documents based on the least number of fields that match all query terms?

I have an interesting Solr/Lucene question, and it's quite possible that some new features in Solr might make this much easier than what I am about to try. If anyone has a clever idea on how to do this search, please let me know!

Basically, let's say that I have an index in which each document has a content field and several metadata fields.

Document Fields:

content
metadata1
metadata2
.....
metadataN
allMetadatas (all the terms indexed in metadata1...N are concatenated in this field) 

Assuming that I am searching for documents that contain a given set of terms (term1 to termN) in their metadata fields, I would like to build a search query that returns documents satisfying these requirements:

a) Every search term must be present in some metadata field. This is quite easy: we can simply search the allMetadatas field and that will work fine.

b) Now for the hard part: we prefer documents in which the search terms are found in the *least number of different fields*. So if one document contains all the search terms spread across 10 different fields, while another contains all of them in only 8 fields, we would like the second one to sort first.

My first idea was to index the terms in allMetadatas using payloads. Each indexed term would carry, as its payload, the specific metadataN field it originates from. Then I could write a scorer that scores based on these payloads.

However, if there is a way to do this without payloads I'm all ears!
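
To make the payload idea concrete, here is a minimal Python sketch. It is a plain simulation, not Lucene code: the FIELD_ORDINALS mapping, the one-byte payload encoding and all function names are assumptions made up for illustration. It only shows what information a payload-aware scorer would have at hand, namely the set of source fields for each query term.

```python
from collections import defaultdict

# Hypothetical field ordinals; a real schema would define its own mapping.
FIELD_ORDINALS = {"metadata1": 0, "metadata2": 1, "metadata3": 2}


def encode_payload(field_name):
    """Encode the source field as a one-byte payload."""
    return bytes([FIELD_ORDINALS[field_name]])


def decode_payload(payload):
    """Recover the source field ordinal from a one-byte payload."""
    return payload[0]


def index_all_metadatas(doc):
    """Build the allMetadatas token stream as (term, payload) pairs."""
    tokens = []
    for field_name, text in doc.items():
        for term in text.split():
            tokens.append((term, encode_payload(field_name)))
    return tokens


def fields_per_term(tokens, query_terms):
    """Map each query term to the set of field ordinals it was seen in."""
    seen = defaultdict(set)
    for term, payload in tokens:
        if term in query_terms:
            seen[term].add(decode_payload(payload))
    return dict(seen)


doc = {"metadata1": "a b", "metadata2": "b c", "metadata3": ""}
tokens = index_all_metadatas(doc)
term_fields = fields_per_term(tokens, {"a", "b", "c"})
print(term_fields)
```

A scorer would then still have to decide how to turn these per-term field sets into a single "number of fields" value for the document.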

-- 
Daniel Shane
Lexum (www.lexum.com)
shaned@lexum.com

Re: Interesting search question! How to match documents based on the least number of fields that match all query terms?

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
Hello Daniel,

I have an idea to try to use coord() here. Check
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
and
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/package-summary.html

So, if you can override the similarity to neutralize all scoring factors,
leaving only coord() meaningful, and form a query like
metadata1:(a b c...) metadata2:(a b c...) metadata3:(a b c...)...
you can count the number of metadata# fields that were hit. Mind that you
might need to disable coord for the nested disjunctions (a b c...) by
creating them with new BooleanQuery(true).
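
As a rough illustration of what such a coord-only score would measure, here is a small Python simulation. It is not Lucene code: it assumes the usual overlap/maxOverlap form of coord(), treats each field as one top-level clause, and uses made-up documents and function names.

```python
def coord(overlap, max_overlap):
    """Fraction of top-level clauses that matched (overlap / maxOverlap)."""
    return overlap / max_overlap


def fields_hit(doc, terms):
    """Count the fields that contain at least one query term.

    Each field plays the role of one top-level clause such as metadata1:(a b c).
    """
    wanted = set(terms)
    return sum(1 for text in doc.values() if wanted & set(text.split()))


def coord_score(doc, terms):
    # With every other scoring factor neutralized, the score is coord() alone.
    return coord(fields_hit(doc, terms), len(doc))


docs = {
    "one_field": {"metadata1": "a b c", "metadata2": "", "metadata3": ""},
    "three_fields": {"metadata1": "a", "metadata2": "b", "metadata3": "c"},
}
terms = ["a", "b", "c"]
# Fewer fields hit is better here, so sort by ascending score.
ranked = sorted(docs, key=lambda name: coord_score(docs[name], terms))
print(ranked)
```

Two caveats on this sketch: it counts every field that contains any query term, which is not necessarily the smallest number of fields needed to cover all terms, and since results are normally sorted by descending score, a real similarity would presumably have to invert the factor.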

A related question was discussed in
http://www.youtube.com/watch?v=1ZRmqtPoAj4
but it mostly covers norms (which are a sort of primitive form of payloads).




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: Interesting search question! How to match documents based on the least number of fields that match all query terms?

Posted by Daniel Shane <sh...@lexum.com>.
Thanks Franck, Mikhail & Robert for your input!

I'm looking into your ideas and running a few test queries to see how they work out. I have a feeling that it is trickier than it sounds. For example, let's say I have 3 docs in my index:

Doc 1:

m1: a b c d
m2: a b c
m3: a b
m4: a
mAll: a b c d / a b c / a b / a

Doc 2:

m1: a b c 
m2: b c d
m3: 
m4:
mAll: a b c / b c d

Doc 3:

m1: a 
m2: b
m3: c
m4: d
mAll: a / b / c / d

If the search terms are a b c d, then all 3 docs will match, since every search term appears somewhere in their metadata fields. However, the sorting should give this order:

doc1 (1 field is enough to match all terms)
doc2 (2 fields are needed to match all terms)
doc3 (4 fields are needed to match all terms)
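
The expected order can be checked with a small Python sketch that computes, for each of the three example documents, the smallest number of fields whose combined terms cover the whole query. This is a brute-force illustration of the desired ranking only, not a Solr or Lucene implementation, and the function name is made up.

```python
from itertools import combinations


def min_fields_to_cover(doc, terms):
    """Smallest number of fields whose combined terms include every query term.

    Returns None when the document does not contain all the terms.
    Brute force over field subsets, so only suitable for a handful of fields.
    """
    wanted = set(terms)
    field_terms = [set(text.split()) for text in doc.values()]
    for size in range(1, len(field_terms) + 1):
        for subset in combinations(field_terms, size):
            if wanted <= set().union(*subset):
                return size
    return None


docs = {
    "doc1": {"m1": "a b c d", "m2": "a b c", "m3": "a b", "m4": "a"},
    "doc2": {"m1": "a b c", "m2": "b c d", "m3": "", "m4": ""},
    "doc3": {"m1": "a", "m2": "b", "m3": "c", "m4": "d"},
}
terms = ["a", "b", "c", "d"]
counts = {name: min_fields_to_cover(doc, terms) for name, doc in docs.items()}
ranking = sorted(counts, key=counts.get)
print(counts)
print(ranking)
```

Note that doc1 has hits in all four of its fields, so simply counting the fields that contain any query term would not separate it from doc3.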

I'll try out your ideas and let you know how it works out!

Daniel Shane



----- Original Message -----
From: "Franck Brisbart" <fb...@techmedianetwork.com>
To: solr-user@lucene.apache.org
Sent: Thursday, January 23, 2014 3:12:36 AM
Subject: RE: Interesting search question! How to match documents based on the least number of fields that match all query terms?

Hi Daniel,

you can also consider using negative boosts.
Solr doesn't support them directly, but docs which don't match a metadata
field can be boosted instead.

This might do what you want:
-metadata1:(term1 AND ... AND termN)^2
-metadata2:(term1 AND ... AND termN)^2
.....
-metadataN:(term1 AND ... AND termN)^2
allMetadatas:(term1 AND ... AND termN)^0.5


Franck Brisbart




RE: Interesting search question! How to match documents based on the least number of fields that match all query terms?

Posted by Franck Brisbart <fb...@techmedianetwork.com>.
Hi Daniel,

you can also consider using negative boosts.
Solr doesn't support them directly, but docs which don't match a metadata
field can be boosted instead.

This might do what you want:
-metadata1:(term1 AND ... AND termN)^2
-metadata2:(term1 AND ... AND termN)^2
.....
-metadataN:(term1 AND ... AND termN)^2
allMetadatas:(term1 AND ... AND termN)^0.5
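
One caveat on the syntax: in the Lucene query syntax a leading "-" marks a prohibited clause, so the lines above would exclude matching docs rather than lower their score. A commonly used idiom for boosting the docs that do not match a clause is (*:* -clause)^boost. The Python sketch below builds a query string in that form; it assumes this idiom is what was intended, and the function names and default boosts are only illustrative.

```python
def all_terms_clause(field, terms):
    """Build field:(t1 AND t2 AND ...) for one field."""
    return f"{field}:({' AND '.join(terms)})"


def boost_non_matching(field, terms, boost):
    """Boost the docs that do NOT match the clause, via (*:* -clause)^boost."""
    return f"(*:* -{all_terms_clause(field, terms)})^{boost}"


def build_query(metadata_fields, terms, miss_boost=2, all_boost=0.5):
    """Join one 'boost if not matching' clause per field with the allMetadatas clause."""
    clauses = [boost_non_matching(f, terms, miss_boost) for f in metadata_fields]
    clauses.append(f"{all_terms_clause('allMetadatas', terms)}^{all_boost}")
    return " ".join(clauses)


query = build_query(["metadata1", "metadata2"], ["term1", "term2"])
print(query)
```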


Franck Brisbart



On Wednesday, 22 January 2014 at 19:38 +0000, Petersen, Robert wrote:
> Hi Daniel,
> 
> How about trying something like this (you'll have to play with the boosts to tune it): search all the fields with all the terms using edismax and its minimum-should-match (mm) parameter, but require all terms to match in the allMetadatas field. See https://wiki.apache.org/solr/ExtendedDisMax#mm_.28Minimum_.27Should.27_Match.29
> 
> The Lucene query syntax below gives the general idea; note that with this query a document only gets the boost when all the terms occur together in one of the metadata fields.
> 
> metadata1:(term1 AND ... AND termN)^2
> metadata2:(term1 AND ... AND termN)^2
> .....
> metadataN:(term1 AND ... AND termN)^2
> allMetadatas:(term1 AND ... AND termN)^0.5
> 
> That should do approximately what you want,
> Robi
> 


RE: Interesting search question! How to match documents based on the least number of fields that match all query terms?

Posted by "Petersen, Robert" <ro...@mail.rakuten.com>.
Hi Daniel,

How about trying something like this (you'll have to play with the boosts to tune it): search all the fields with all the terms using edismax and its minimum-should-match (mm) parameter, but require all terms to match in the allMetadatas field. See https://wiki.apache.org/solr/ExtendedDisMax#mm_.28Minimum_.27Should.27_Match.29
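
As a rough sketch of what such a request could look like, the Python snippet below assembles edismax parameters. defType, q, qf, mm and fq are standard Solr parameters, but the way they are combined here is an assumption: the thread gives no concrete values, so the mm value and the boosts are arbitrary placeholders, and using fq to require all terms in allMetadatas is just one possible reading.

```python
from urllib.parse import urlencode


def edismax_params(terms, metadata_fields, mm, field_boost=2):
    """Assemble request parameters for an edismax search over the metadata fields."""
    all_terms = " AND ".join(terms)
    return {
        "defType": "edismax",
        "q": " ".join(terms),
        # Search every metadata field, each with the same (placeholder) boost.
        "qf": " ".join(f"{f}^{field_boost}" for f in metadata_fields),
        "mm": mm,
        # Assumption: a filter query enforces "all terms in allMetadatas".
        "fq": f"allMetadatas:({all_terms})",
    }


params = edismax_params(["term1", "term2"], ["metadata1", "metadata2"], mm="1")
print(urlencode(params))
```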

The Lucene query syntax below gives the general idea; note that with this query a document only gets the boost when all the terms occur together in one of the metadata fields.

metadata1:(term1 AND ... AND termN)^2
metadata2:(term1 AND ... AND termN)^2
.....
metadataN:(term1 AND ... AND termN)^2
allMetadatas:(term1 AND ... AND termN)^0.5
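
For an arbitrary number of fields and terms, a query of this shape can be generated with a few lines of Python. This is only a sketch: the function names are made up, the boosts default to the values shown above, and the optional "+" prefix on the allMetadatas clause is an assumption about how "require all terms to match" would be expressed.

```python
def all_terms_clause(field, terms):
    """Build field:(t1 AND t2 AND ...) for one field."""
    return f"{field}:({' AND '.join(terms)})"


def build_boost_query(metadata_fields, terms, field_boost=2, all_boost=0.5,
                      require_all=True):
    """One boosted clause per metadata field, plus the allMetadatas clause."""
    clauses = [f"{all_terms_clause(f, terms)}^{field_boost}" for f in metadata_fields]
    # Assumption: "+" marks the allMetadatas clause as mandatory.
    prefix = "+" if require_all else ""
    clauses.append(f"{prefix}{all_terms_clause('allMetadatas', terms)}^{all_boost}")
    return " ".join(clauses)


query = build_boost_query(["metadata1", "metadata2"], ["term1", "term2"])
print(query)
```

Each additional field that contains all the terms adds its boosted clause to the score, so the boosts need tuning against real data.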

That should do approximately what you want,
Robi
