Posted to java-user@lucene.apache.org by Kun Hong <kh...@promptu.com> on 2007/04/27 08:21:35 UTC

Search for docs containing only a certain word in a specified field?

Hi,

I wonder if there is a way to search for documents containing only
a certain word in a specified field.

For example, I would like to search for documents that contain only
"the" in the title field. Some titles consist of just a single stop word,
and I really need to find them. If the index is created without removing
stop words, I should be able to find documents containing "the"
in the title field. But what if I want only those documents whose title
contains no words other than "the"? Is that possible with a Lucene query?
I know I can fetch all the results and filter them myself, but that could
be too expensive if the result set is enormous (e.g. if one million
docs contain "the").

TIA,

Kun



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Search for docs containing only a certain word in a specified field?

Posted by Kun Hong <kh...@promptu.com>.


karl wettin wrote:
>
> 30 apr 2007 kl. 02.05 skrev Kun Hong:
>
>>> I'm not sure if you mean that it should treat all repetitive tokens
>>> as only one token? Then you are better off using a filter when
>>> analyzing text you insert into the index: rather than creating one
>>> token for each "the" in "the the the the the the" you only create one.
>>> You might also want to use this filter when parsing user queries.
>>> (It will be hard to find the band 'the the'.)
>> I can't just use filters because I have to cater for other titles
>> that are not just stop words, which should be analyzed normally.
>> (I know this requirement is a bit fussy).
>
> You might want to consider using two fields. One with "normal
> analysis" and one that filters out everything but titles that contain
> nothing but one word repeating over and over again. Then a single
> TermQuery would be sufficient to match, and there is no more need to
> hack in the extra start and stop tokens.
>

That's right. I should have thought of it.
Thanks.


Re: Search for docs containing only a certain word in a specified field?

Posted by karl wettin <ka...@gmail.com>.

30 apr 2007 kl. 02.05 skrev Kun Hong:

>> I'm not sure if you mean that it should treat all repetitive
>> tokens as only one token? Then you are better off using a filter
>> when analyzing text you insert into the index: rather than creating
>> one token for each "the" in "the the the the the the" you only
>> create one. You might also want to use this filter when parsing
>> user queries. (It will be hard to find the band 'the the'.)
> I can't just use filters because I have to cater for other titles
> that are not just stop words, which should be analyzed normally.
> (I know this requirement is a bit fussy).

You might want to consider using two fields. One with "normal
analysis" and one that filters out everything but titles that contain
nothing but one word repeating over and over again. Then a single
TermQuery would be sufficient to match, and there is no more need to
hack in the extra start and stop tokens.
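
A minimal sketch of what the second field could hold, written in plain Java with no Lucene classes. The class and method names are made up for illustration, and it assumes the title has already been analyzed into lowercase tokens. If every token is the same word, that word becomes the value of the extra field; otherwise the extra field is left out for that document.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical helper; not part of Lucene.
final class SingleWordTitleSketch {
    // Returns the repeated word when the title consists of nothing but
    // that one word (one or more times); otherwise returns empty.
    static Optional<String> singleWordKey(List<String> titleTokens) {
        if (titleTokens.isEmpty()) {
            return Optional.empty();
        }
        String first = titleTokens.get(0);
        for (String token : titleTokens) {
            if (!first.equals(token)) {
                return Optional.empty();
            }
        }
        return Optional.of(first);
    }
}
```

With this, "the", "the the" and "the the the" all yield the key "the", while "the end" yields nothing, so one term lookup on the extra field would be enough.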



-- 
karl


Re: Search for docs containing only a certain word in a specified field?

Posted by Kun Hong <kh...@promptu.com>.


karl wettin wrote:
>
> 28 apr 2007 kl. 07.52 skrev Kun Hong:
>
>> karl wettin wrote:
>>>
>>> 27 apr 2007 kl. 14.11 skrev Erik Hatcher:
>>>
>>>>
>>>> On Apr 27, 2007, at 6:39 AM, karl wettin wrote:
>>>>> 27 apr 2007 kl. 12.36 skrev Erik Hatcher:
>>>>>
>>>>>> Unless someone has some other tricks I'm not aware of, that is.
>>>>>
>>>>> I guess it would be possible to add start/stop-tokens such as ^ 
>>>>> and $ to the indexed text: "^ the $" and place a phrase query with 
>>>>> 0 slop.
>>>>
>>>> True true.   That'd work too.
>>
>> Thanks for the replies and discussion.
>>
>> I think I didn't express my problem correctly.  The problem is that I
>> want to find documents containing only the "the" token in the title
>> field, but not necessarily with only one occurrence.  For example, if
>> the query is "the", I want to find documents whose title is "the",
>> "the the" or "the the the".
>
> I'm not sure if you mean that it should treat all repetitive tokens as
> only one token? Then you are better off using a filter when analyzing
> text you insert into the index: rather than creating one token for each
> "the" in "the the the the the the" you only create one. You might also
> want to use this filter when parsing user queries. (It will be hard to
> find the band 'the the'.)
I can't just use filters because I have to cater for other titles that
are not just stop words, which should be analyzed normally. (I know this
requirement is a bit fussy).

> If not and what you write above is all you want to match, nothing 
> more, nothing less, then you could do something like this:
>
> (dry coded and untested.)
>
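> import org.apache.lucene.index.Term;
> import org.apache.lucene.search.BooleanClause;
> import org.apache.lucene.search.BooleanQuery;
> import org.apache.lucene.search.PhraseQuery;
>
> int n = 3; // matches: the; the the; the the the
> String field = "title";
> String token = "the";
> BooleanQuery bq = new BooleanQuery();
> for (int i = 1; i <= n; i++) {
>   // One exact phrase per repetition count: ^ followed by i tokens, then $
>   PhraseQuery pq = new PhraseQuery();
>   pq.add(new Term(field, "^"));
>   for (int j = 0; j < i; j++) {
>     pq.add(new Term(field, token));
>   }
>   pq.add(new Term(field, "$"));
>   pq.setSlop(0);
>   bq.add(pq, BooleanClause.Occur.SHOULD);
> }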
>
>

This seems to be a solution, but with n fixed. But I think it is good 
enough for me now. :)

Thanks a lot.




Re: Search for docs containing only a certain word in a specified field?

Posted by karl wettin <ka...@gmail.com>.

28 apr 2007 kl. 07.52 skrev Kun Hong:

> karl wettin wrote:
>>
>> 27 apr 2007 kl. 14.11 skrev Erik Hatcher:
>>
>>>
>>> On Apr 27, 2007, at 6:39 AM, karl wettin wrote:
>>>> 27 apr 2007 kl. 12.36 skrev Erik Hatcher:
>>>>
>>>>> Unless someone has some other tricks I'm not aware of, that is.
>>>>
>>>> I guess it would be possible to add start/stop-tokens such as ^  
>>>> and $ to the indexed text: "^ the $" and place a phrase query  
>>>> with 0 slop.
>>>
>>> True true.   That'd work too.
>
> Thanks for the replies and discussion.
>
> I think I didn't express my problem correctly.  The problem is that I
> want to find documents containing only the "the" token in the title
> field, but not necessarily with only one occurrence.  For example, if
> the query is "the", I want to find documents whose title is "the",
> "the the" or "the the the".

I'm not sure if you mean that it should treat all repetitive tokens
as only one token? Then you are better off using a filter when
analyzing text you insert into the index: rather than creating one
token for each "the" in "the the the the the the" you only create one.
You might also want to use this filter when parsing user queries. (It
will be hard to find the band 'the the'.)
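
A rough sketch of such a filter in plain Java, working on a list of tokens instead of a Lucene TokenStream. The class name is invented for illustration, and it assumes "repetitive" means directly adjacent repeats; a real version would be a TokenFilter in the analyzer chain.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; models the filter on plain token lists.
final class CollapseRepeatsSketch {
    // Emits a token only when it differs from the previously emitted one,
    // so "the the the" becomes "the" and "the end end" becomes "the end".
    static List<String> collapseRepeats(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (out.isEmpty() || !out.get(out.size() - 1).equals(token)) {
                out.add(token);
            }
        }
        return out;
    }
}
```

Applying the same function to the query text keeps index and query in step, which is also why a title like 'the the' can no longer be told apart from 'the'.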

If not and what you write above is all you want to match, nothing  
more, nothing less, then you could do something like this:

(dry coded and untested.)

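import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;

int n = 3; // matches: the; the the; the the the
String field = "title";
String token = "the";
BooleanQuery bq = new BooleanQuery();
for (int i = 1; i <= n; i++) {
   // One exact phrase per repetition count: ^ followed by i tokens, then $
   PhraseQuery pq = new PhraseQuery();
   pq.add(new Term(field, "^"));
   for (int j = 0; j < i; j++) {
     pq.add(new Term(field, token));
   }
   pq.add(new Term(field, "$"));
   pq.setSlop(0);
   bq.add(pq, BooleanClause.Occur.SHOULD);
}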


I hope this helps.



Re: Search for docs containing only a certain word in a specified field?

Posted by Kun Hong <kh...@promptu.com>.

karl wettin wrote:
>
> 27 apr 2007 kl. 14.11 skrev Erik Hatcher:
>
>>
>> On Apr 27, 2007, at 6:39 AM, karl wettin wrote:
>>> 27 apr 2007 kl. 12.36 skrev Erik Hatcher:
>>>
>>>> Unless someone has some other tricks I'm not aware of, that is.
>>>
>>> I guess it would be possible to add start/stop-tokens such as ^ and 
>>> $ to the indexed text: "^ the $" and place a phrase query with 0 slop.
>>
>> True true.   That'd work too.

Thanks for the replies and discussion.

I think I didn't express my problem correctly.  The problem is that I want to
find documents containing only the "the" token in the title field, but not
necessarily with only one occurrence.  For example, if the query is "the",
I want to find documents whose title is "the", "the the" or "the the the".

I am not sure whether the start/stop tokens or indexing the whole field
as a keyword will help; it seems not.  Sorry if I misled you. From what has
been discussed, it seems there is no easy way.  I think this is just a corner
case for me, since nobody else seems to find this a problem. But if you are
still interested in solving this problem, I will be very glad. :)


Re: Search for docs containing only a certain word in a specified field?

Posted by karl wettin <ka...@gmail.com>.

27 apr 2007 kl. 14.11 skrev Erik Hatcher:

>
> On Apr 27, 2007, at 6:39 AM, karl wettin wrote:
>> 27 apr 2007 kl. 12.36 skrev Erik Hatcher:
>>
>>> Unless someone has some other tricks I'm not aware of, that is.
>>
>> I guess it would be possible to add start/stop-tokens such as ^  
>> and $ to the indexed text: "^ the $" and place a phrase query with  
>> 0 slop.
>
> True true.   That'd work too.

I was thinking about this today. I'm still thinking, so don't take
this too seriously. I just want to see if I can implement this in a less
hacky way.

The number of terms in the field is what is missing in order to implement
a Query the "correct way", right? Clone the norms code.
SpanCompleteFieldQuery would extend SpanNearQuery, have slop 0 and
require [tokens in field] clauses. To me this is more compelling than
the ^$ hack. However, if there are no other features one can think of
that this information will yield, the hack might just turn out to be better.
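
To make the idea concrete, here is a small plain-Java model, not Lucene code. It assumes the exact token count of the field is available (which is the missing piece) and treats a slop-0 match as a contiguous run of tokens; the class and method names are invented for illustration.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical model of a "complete field" match; not part of Lucene.
final class CompleteFieldSketch {
    // A span [start, end) covers the field when it begins at the first
    // token and ends at the stored token count.
    static boolean coversWholeField(int start, int end, int fieldTokenCount) {
        return start == 0 && end == fieldTokenCount;
    }

    // Finds the first contiguous occurrence of the clauses (slop 0) and
    // checks that this occurrence spans the entire field.
    static boolean matchesCompleteField(List<String> fieldTokens, List<String> clauses) {
        if (clauses.isEmpty()) {
            return false;
        }
        int start = Collections.indexOfSubList(fieldTokens, clauses);
        if (start < 0) {
            return false;
        }
        return coversWholeField(start, start + clauses.size(), fieldTokens.size());
    }
}
```

Without the token count only the start condition can be checked, which is roughly what SpanFirstQuery offers; the end condition is the part that needs the extra stored number.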

I can't think of anything I'd call a feature:

Norms could be calculated in a higher resolution instead of being
stored as a float. Which is more expensive: converting the byte to
a float, or doing a bunch of divisions at query time?

Rebuilding term vectors using skipTo() might save some work by not seeking
more than necessary.

Match only terms in fields that are between n and m tokens long.
However, this might be better off discretized into a few bins, or
perhaps it could even be estimated from the (existing
implementation) norm value?

What else is there?

-- 

karl


Re: Search for docs containing only a certain word in a specified field?

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Apr 27, 2007, at 6:39 AM, karl wettin wrote:
> 27 apr 2007 kl. 12.36 skrev Erik Hatcher:
>
>> Unless someone has some other tricks I'm not aware of, that is.
>
> I guess it would be possible to add start/stop-tokens such as ^ and  
> $ to the indexed text: "^ the $" and place a phrase query with 0 slop.

True true.   That'd work too.

> But that might screw up SpanFirstQuery etc.?

That's true also, but with your mentioned technique SpanFirstQuery  
wouldn't be needed.

	Erik



Re: Search for docs containing only a certain word in a specified field?

Posted by karl wettin <ka...@gmail.com>.

27 apr 2007 kl. 12.36 skrev Erik Hatcher:

> Unless someone has some other tricks I'm not aware of, that is.

I guess it would be possible to add start/stop tokens such as ^ and $
to the indexed text: "^ the $" and place a phrase query with 0 slop.
But that might screw up SpanFirstQuery etc.?
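
A minimal plain-Java sketch of how that would behave, without any Lucene classes. It assumes the analyzer leaves ^ and $ intact as tokens, and it models a 0-slop phrase match as a contiguous sublist; all names here are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model of the start/stop-token idea; not part of Lucene.
final class BoundaryTokenSketch {
    static final String START = "^";
    static final String END = "$";

    // Index time: wrap the analyzed title tokens in boundary markers.
    static List<String> withBoundaries(List<String> tokens) {
        List<String> out = new ArrayList<>();
        out.add(START);
        out.addAll(tokens);
        out.add(END);
        return out;
    }

    // Query time: with slop 0 the phrase must appear as a contiguous run.
    static boolean matchesExactPhrase(List<String> indexed, List<String> phrase) {
        return Collections.indexOfSubList(indexed, phrase) >= 0;
    }

    // True only when the whole title is exactly the given word.
    static boolean titleIsOnly(String word, List<String> titleTokens) {
        List<String> phrase = Arrays.asList(START, word, END);
        return matchesExactPhrase(withBoundaries(titleTokens), phrase);
    }
}
```

Note that the markers shift every real token one position to the right, which is where position-based queries such as SpanFirstQuery could be affected.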

-- 
karl


Re: Search for docs containing only a certain word in a specified field?

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Apr 27, 2007, at 6:08 AM, karl wettin wrote:
> 27 apr 2007 kl. 08.21 skrev Kun Hong:
>
>> I just want that one document which
>> contains no other words than "the". Is it possible using Lucene  
>> query?
>
> Take a look at SpanFirstQuery. Perhaps you would need to implement a
> SpanLastQuery too.
>
> Perhaps the easiest way about it would be a RegexQuery that looks  
> something like this: "^the$"

The RegexQuery won't work in this case, as it is only matching on  
terms, not original field values.  So it'd still pick up other titles  
with "the" in them (provided stop words were not removed).
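
The difference can be illustrated with java.util.regex alone. This is only a model of term-level versus value-level matching, not the RegexQuery implementation, and the class and method names are invented for the example.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration; does not use or reproduce RegexQuery.
final class TermRegexSketch {
    // Term-level matching: the field matches if ANY single indexed term
    // matches the pattern, regardless of what else is in the field.
    static boolean anyTermMatches(List<String> terms, Pattern pattern) {
        for (String term : terms) {
            if (pattern.matcher(term).matches()) {
                return true;
            }
        }
        return false;
    }

    // Value-level matching: the pattern is applied to the whole original
    // field value, which is what the question actually asks for.
    static boolean wholeValueMatches(String value, Pattern pattern) {
        return pattern.matcher(value).matches();
    }
}
```

For the title "the end", tokenized into the terms "the" and "end", the pattern ^the$ matches at term level but not at value level.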

If you need exact matching on original value, the easiest way is to  
index that value without tokenization (perhaps lowercasing though).   
Unless someone has some other tricks I'm not aware of, that is.
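
As a sketch of the untokenized approach, the following plain-Java model keeps a map from the normalized full title to document ids. It is not Lucene code; the names are hypothetical, and trimming whitespace is an extra assumption on top of the lowercasing mentioned above. In Lucene 2.x the equivalent would presumably be a separate field indexed with Field.Index.UN_TOKENIZED and queried with a TermQuery.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical in-memory model of an untokenized, lowercased title field.
final class ExactTitleIndexSketch {
    private final Map<String, List<Integer>> byExactTitle = new HashMap<>();

    // The whole title is a single key: lowercased, and trimmed (assumption).
    static String normalize(String title) {
        return title.trim().toLowerCase(Locale.ROOT);
    }

    void add(int docId, String title) {
        byExactTitle.computeIfAbsent(normalize(title), k -> new ArrayList<>()).add(docId);
    }

    // Returns ids of documents whose entire title equals the query.
    List<Integer> lookup(String title) {
        return byExactTitle.getOrDefault(normalize(title), Collections.emptyList());
    }
}
```

Looking up "the" then returns only documents titled exactly "the" (in any case), not "The End".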

	Erik


Re: Search for docs containing only a certain word in a specified field?

Posted by karl wettin <ka...@gmail.com>.

27 apr 2007 kl. 08.21 skrev Kun Hong:

> I just want that one document which
> contains no other words than "the". Is it possible using Lucene query?

Take a look at SpanFirstQuery. Perhaps you would need to implement a
SpanLastQuery too.

Perhaps the easiest way about it would be a RegexQuery that looks  
something like this: "^the$"


-- 
karl


