Posted to solr-user@lucene.apache.org by AHMET ARSLAN <io...@yahoo.com> on 2009/11/01 01:20:53 UTC

Re: Match all terms in doc

> Hi
> 
> How do I restrict hits to documents containing all words
> (regardless of order) of a query in particular field?
> 
> Suppose I have two documents with a field called name in my
> index:
> 
> doc1 => name: Pink
> doc2 => name: Pink Floyd
> 
> When querying for "Pink" I want only doc1 and when querying
> for "Pink Floyd" or "Floyd Pink" I want doc2.
> 
> Thanks
> 
> - Magnus


I would implement this by preprocessing both documents and queries: count the number of unique terms in each one before sending it to Solr, and store the document's count in an extra integer field.

For example, when indexing:

doc1 => 
name: Pink  
numberOfuniqueTerms: 1

doc2 => 
name: Pink Floyd 
numberOfuniqueTerms: 2

Set the query parser's default operator to AND; that guarantees that every query term appears in each returned document. The numberOfuniqueTerms clause then guarantees that the returned document does not contain any additional terms.

query: Pink will be expanded to => name:Pink AND numberOfuniqueTerms:1
query: Pink Floyd will be expanded to => name:(Pink AND Floyd) AND numberOfuniqueTerms:2
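
The query-side expansion could be sketched roughly like this. It is only an illustration: the class and method names are made up, and lowercasing plus whitespace splitting stands in for whatever analyzer the name field really uses in schema.xml, so the count only matches the index if the document side is counted the same way. Query syntax characters in the user input are not escaped here.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Solr or Lucene.
class UniqueTermQueryBuilder {

    // Assumption: lowercase + whitespace split approximates the field's analyzer.
    static Set<String> uniqueTerms(String text) {
        Set<String> terms = new LinkedHashSet<>();
        for (String token : text.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(token); // the set drops repeated terms
            }
        }
        return terms;
    }

    // Builds name:(t1 AND t2 ...) AND numberOfuniqueTerms:N for a user query.
    static String expandQuery(String userQuery) {
        Set<String> terms = uniqueTerms(userQuery);
        if (terms.isEmpty()) {
            throw new IllegalArgumentException("query contains no terms");
        }
        return "name:(" + String.join(" AND ", terms) + ")"
                + " AND numberOfuniqueTerms:" + terms.size();
    }

    public static void main(String[] args) {
        System.out.println(expandQuery("Pink"));
        System.out.println(expandQuery("Floyd Pink"));
    }
}
```

The same uniqueTerms(...).size() value would be written into the numberOfuniqueTerms field when the document is indexed.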


Your preprocessor program can use the Lucene term vector API. Since you are only interested in the size of the vector,

// nameTV is null if term vectors were not stored for the "name" field
TermFreqVector nameTV = indexSearcher.getIndexReader().getTermFreqVector(docId, "name");
numberOfuniqueTerms = nameTV.size();

should give you that number.

But this requires first indexing each document in Lucene with the same analyzer that is defined in schema.xml, just to get the number of unique terms in it. Obviously it is not the best solution, and it ties you to Java.


A second solution, without preprocessing and without adding an integer field:


Since storing term vectors at index time lets you access them at query time, there should be an easier way (TermVectorComponent) to get the term vector size of a returned document, but I do not know how to query that size.

http://wiki.apache.org/solr/TermVectorComponent will give you the unique terms in a particular field of a returned document, but you will need to iterate over that list to check that it contains all query terms and nothing else.
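
The check itself could look something like the sketch below. It assumes the unique terms of the name field have already been pulled out of the TermVectorComponent response into a collection (that parsing is not shown), and that the query terms went through the same analysis as the indexed ones. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;

// Hypothetical client-side filter, not part of Solr or Lucene.
class ExactTermFilter {

    // True only if the document has every query term and no other term.
    // Comparing as sets ignores term order and repeated terms.
    static boolean hasExactlyTheseTerms(Collection<String> docTerms,
                                        Collection<String> queryTerms) {
        return new HashSet<>(docTerms).equals(new HashSet<>(queryTerms));
    }

    public static void main(String[] args) {
        // Query "pink" against doc2 (pink floyd): extra term, so rejected.
        System.out.println(hasExactlyTheseTerms(
                Arrays.asList("pink", "floyd"), Arrays.asList("pink")));
        // Query "floyd pink" against doc2: same terms in another order, so kept.
        System.out.println(hasExactlyTheseTerms(
                Arrays.asList("pink", "floyd"), Arrays.asList("floyd", "pink")));
    }
}
```

Documents that fail the check would simply be dropped from the result list on the client side.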


Hope this helps.


      

Re: Match all terms in doc

Posted by Magnus Eklund <ma...@gmail.com>.

Thank you very much for the reply. Sorry for the late answer; it took
some time before I had a chance to try your suggestions.

I decided to try your second solution and it works very well!

- Magnus