Posted to java-user@lucene.apache.org by karl øie <ka...@gan.no> on 2002/09/26 13:50:11 UTC

Problems with exact matches on non-tokenized fields...

Hi, I have a problem getting an exact match on a non-tokenized 
field.

I have a Lucene Document with a field named "element" which is stored 
and indexed but not tokenized. The value of the field is "POST" 
(uppercase). But the only way I can match the field is by entering 
"element:POST?" or "element:POST*" in the QueryParser class.

Has anyone here run into this problem?

I am using the 1.2 release version of Lucene.

Best regards, Karl Øie




RE: Problems with exact matches on non-tokenized fields...

Posted by Alex Murzaku <li...@lissus.com>.
Thanks! Now that I think of it, I was searching the documentation for
a method to reset the document 'd' to "empty" once it is indexed, so that
it could be reused, but I didn't find one and the bug slipped through.
I was afraid that all these objects might not be garbage collected in
time. In a test much smaller than infinite:
        for (int i = 0; i <= 100000000; i++) {
            Document d = new Document();                  // a fresh Document each iteration
            d.add(Field.Keyword("nr", Integer.toString(i)));
            d.add(Field.Keyword("element", "POST"));
            writer.addDocument(d);
        }
I very soon got a java.lang.OutOfMemoryError, but by just forcing garbage
collection at the end of the cycle, the memory usage is now a very flat
line... Sorry for bothering you.

-----Original Message-----
From: Doug Cutting [mailto:cutting@lucene.com] 
Sent: Friday, September 27, 2002 2:24 PM
To: Lucene Users List
Subject: Re: Problems with exact matches on non-tokenized fields...


Alex Murzaku wrote:
> I was trying this as well but now I get something I can't understand:
> My query (Query: +element:POST +nr:3) is supposed to match only one
> record. Indeed Lucene returns that record with the highest score, but
> it also returns others that shouldn't be there at all, even if it were
> an OR query. Another observation: it returns all records where "nr" >= 3.
> Notice the last record returned contains neither "POST" nor "3".
> I am attaching a self-contained running example with this problem
> and would appreciate any comment.
>  
> 0.6869936 Keyword<nr:3> Keyword<element:POST>
> 0.63916886 Keyword<nr:4> Keyword<element:POST>
> 0.6044586 Keyword<nr:6> Keyword<element:POST>
> 0.5773442 Keyword<nr:5> Keyword<element:POST>
> 0.56318253 Keyword<nr:9> Keyword<element:POST>
> 0.54449975 Keyword<nr:8> Keyword<element:POST>
> 0.5247468 Keyword<nr:7> Keyword<element:POST>
> 0.45054603 Keyword<nr:10> Keyword<element:GET>

Phew!  It took me a while to spot this one...

The bug is with your test program.  You keep adding fields to the same 
document instance.  If you change your program to print the entire 
document, you'll see:

Query: +element:POST +nr:3
0.6869936 Document<Keyword<element:POST> Keyword<nr:3> 
Keyword<element:POST> Keyword<nr:2> Keyword<element:POST> Keyword<nr:1> 
Keyword<element:POST> Keyword<nr:0>>
0.63916886 Document<Keyword<element:POST> Keyword<nr:4> 
Keyword<element:POST> Keyword<nr:3> Keyword<element:POST> Keyword<nr:2> 
Keyword<element:POST> Keyword<nr:1> Keyword<element:POST> Keyword<nr:0>>
0.6044586 Document<Keyword<element:POST> Keyword<nr:6> 
Keyword<element:POST> Keyword<nr:5> Keyword<element:POST> Keyword<nr:4> 
Keyword<element:POST> Keyword<nr:3> Keyword<element:POST> Keyword<nr:2> 
Keyword<element:POST> Keyword<nr:1> Keyword<element:POST> Keyword<nr:0>>
0.5773442 Document<Keyword<element:POST> Keyword<nr:5> 
Keyword<element:POST> Keyword<nr:4> Keyword<element:POST> Keyword<nr:3> 
Keyword<element:POST> Keyword<nr:2> Keyword<element:POST> Keyword<nr:1> 
Keyword<element:POST> Keyword<nr:0>>
0.56318253 Document<Keyword<element:POST> Keyword<nr:9> 
Keyword<element:POST> Keyword<nr:8> Keyword<element:POST> Keyword<nr:7> 
Keyword<element:POST> Keyword<nr:6> Keyword<element:POST> Keyword<nr:5> 
Keyword<element:POST> Keyword<nr:4> Keyword<element:POST> Keyword<nr:3> 
Keyword<element:POST> Keyword<nr:2> Keyword<element:POST> Keyword<nr:1> 
Keyword<element:POST> Keyword<nr:0>>
0.54449975 Document<Keyword<element:POST> Keyword<nr:8> 
Keyword<element:POST> Keyword<nr:7> Keyword<element:POST> Keyword<nr:6> 
Keyword<element:POST> Keyword<nr:5> Keyword<element:POST> Keyword<nr:4> 
Keyword<element:POST> Keyword<nr:3> Keyword<element:POST> Keyword<nr:2> 
Keyword<element:POST> Keyword<nr:1> Keyword<element:POST> Keyword<nr:0>>
0.5247468 Document<Keyword<element:POST> Keyword<nr:7> 
Keyword<element:POST> Keyword<nr:6> Keyword<element:POST> Keyword<nr:5> 
Keyword<element:POST> Keyword<nr:4> Keyword<element:POST> Keyword<nr:3> 
Keyword<element:POST> Keyword<nr:2> Keyword<element:POST> Keyword<nr:1> 
Keyword<element:POST> Keyword<nr:0>>
0.45054603 Document<Keyword<element:GET> Keyword<nr:10> 
Keyword<element:POST> Keyword<nr:9> Keyword<element:POST> Keyword<nr:8> 
Keyword<element:POST> Keyword<nr:7> Keyword<element:POST> Keyword<nr:6> 
Keyword<element:POST> Keyword<nr:5> Keyword<element:POST> Keyword<nr:4> 
Keyword<element:POST> Keyword<nr:3> Keyword<element:POST> Keyword<nr:2> 
Keyword<element:POST> Keyword<nr:1> Keyword<element:POST> Keyword<nr:0>>

So you need to create a new document instance each time.  I've attached 
a modified version of your test program that does this and gives the 
results you desire:

Query: +element:POST +nr:3
1.0 Document<Keyword<element:POST> Keyword<nr:3>>

Doug




Re: Problems with exact matches on non-tokenized fields...

Posted by Doug Cutting <cu...@lucene.com>.
Alex Murzaku wrote:
> I was trying this as well but now I get something I can't understand:
> My query (Query: +element:POST +nr:3) is supposed to match only one
> record. Indeed Lucene returns that record with the highest score, but it
> also returns others that shouldn't be there at all, even if it were an OR
> query. Another observation: it returns all records where "nr" >= 3.
> Notice the last record returned contains neither "POST" nor "3".
> I am attaching a self-contained running example with this problem and
> would appreciate any comment.
>  
> 0.6869936 Keyword<nr:3> Keyword<element:POST>
> 0.63916886 Keyword<nr:4> Keyword<element:POST>
> 0.6044586 Keyword<nr:6> Keyword<element:POST>
> 0.5773442 Keyword<nr:5> Keyword<element:POST>
> 0.56318253 Keyword<nr:9> Keyword<element:POST>
> 0.54449975 Keyword<nr:8> Keyword<element:POST>
> 0.5247468 Keyword<nr:7> Keyword<element:POST>
> 0.45054603 Keyword<nr:10> Keyword<element:GET>

Phew!  It took me a while to spot this one...

The bug is with your test program.  You keep adding fields to the same 
document instance.  If you change your program to print the entire 
document, you'll see:

Query: +element:POST +nr:3
0.6869936 Document<Keyword<element:POST> Keyword<nr:3> 
Keyword<element:POST> Keyword<nr:2> Keyword<element:POST> Keyword<nr:1> 
Keyword<element:POST> Keyword<nr:0>>
0.63916886 Document<Keyword<element:POST> Keyword<nr:4> 
Keyword<element:POST> Keyword<nr:3> Keyword<element:POST> Keyword<nr:2> 
Keyword<element:POST> Keyword<nr:1> Keyword<element:POST> Keyword<nr:0>>
0.6044586 Document<Keyword<element:POST> Keyword<nr:6> 
Keyword<element:POST> Keyword<nr:5> Keyword<element:POST> Keyword<nr:4> 
Keyword<element:POST> Keyword<nr:3> Keyword<element:POST> Keyword<nr:2> 
Keyword<element:POST> Keyword<nr:1> Keyword<element:POST> Keyword<nr:0>>
0.5773442 Document<Keyword<element:POST> Keyword<nr:5> 
Keyword<element:POST> Keyword<nr:4> Keyword<element:POST> Keyword<nr:3> 
Keyword<element:POST> Keyword<nr:2> Keyword<element:POST> Keyword<nr:1> 
Keyword<element:POST> Keyword<nr:0>>
0.56318253 Document<Keyword<element:POST> Keyword<nr:9> 
Keyword<element:POST> Keyword<nr:8> Keyword<element:POST> Keyword<nr:7> 
Keyword<element:POST> Keyword<nr:6> Keyword<element:POST> Keyword<nr:5> 
Keyword<element:POST> Keyword<nr:4> Keyword<element:POST> Keyword<nr:3> 
Keyword<element:POST> Keyword<nr:2> Keyword<element:POST> Keyword<nr:1> 
Keyword<element:POST> Keyword<nr:0>>
0.54449975 Document<Keyword<element:POST> Keyword<nr:8> 
Keyword<element:POST> Keyword<nr:7> Keyword<element:POST> Keyword<nr:6> 
Keyword<element:POST> Keyword<nr:5> Keyword<element:POST> Keyword<nr:4> 
Keyword<element:POST> Keyword<nr:3> Keyword<element:POST> Keyword<nr:2> 
Keyword<element:POST> Keyword<nr:1> Keyword<element:POST> Keyword<nr:0>>
0.5247468 Document<Keyword<element:POST> Keyword<nr:7> 
Keyword<element:POST> Keyword<nr:6> Keyword<element:POST> Keyword<nr:5> 
Keyword<element:POST> Keyword<nr:4> Keyword<element:POST> Keyword<nr:3> 
Keyword<element:POST> Keyword<nr:2> Keyword<element:POST> Keyword<nr:1> 
Keyword<element:POST> Keyword<nr:0>>
0.45054603 Document<Keyword<element:GET> Keyword<nr:10> 
Keyword<element:POST> Keyword<nr:9> Keyword<element:POST> Keyword<nr:8> 
Keyword<element:POST> Keyword<nr:7> Keyword<element:POST> Keyword<nr:6> 
Keyword<element:POST> Keyword<nr:5> Keyword<element:POST> Keyword<nr:4> 
Keyword<element:POST> Keyword<nr:3> Keyword<element:POST> Keyword<nr:2> 
Keyword<element:POST> Keyword<nr:1> Keyword<element:POST> Keyword<nr:0>>

So you need to create a new document instance each time.  I've attached 
a modified version of your test program that does this and gives the 
results you desire:

Query: +element:POST +nr:3
1.0 Document<Keyword<element:POST> Keyword<nr:3>>
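
For the archives, here is a minimal sketch of such an indexing loop (this is 
only an illustration against the Lucene 1.2 API, not the attached program; 
the index path is just a placeholder):

   import org.apache.lucene.analysis.standard.StandardAnalyzer;
   import org.apache.lucene.document.Document;
   import org.apache.lucene.document.Field;
   import org.apache.lucene.index.IndexWriter;

   public class IndexTest {
     public static void main(String[] args) throws Exception {
       // "/tmp/test-index" is only a placeholder path
       IndexWriter writer =
           new IndexWriter("/tmp/test-index", new StandardAnalyzer(), true);
       for (int i = 0; i <= 10; i++) {
         Document d = new Document();          // a fresh Document each iteration
         d.add(Field.Keyword("nr", Integer.toString(i)));
         d.add(Field.Keyword("element", i < 10 ? "POST" : "GET"));
         writer.addDocument(d);                // each doc keeps only its own two fields
       }
       writer.close();
     }
   }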

Doug

RE: Problems with exact matches on non-tokenized fields...

Posted by Alex Murzaku <li...@lissus.com>.
I was trying this as well but now I get something I can't understand:
My query (Query: +element:POST +nr:3) is supposed to match only one
record. Indeed Lucene returns that record with the highest score, but it
also returns others that shouldn't be there at all, even if it were an OR
query. Another observation: it returns all records where "nr" >= 3.
Notice the last record returned contains neither "POST" nor "3".
I am attaching a self-contained running example with this problem and
would appreciate any comment.
 
0.6869936 Keyword<nr:3> Keyword<element:POST>
0.63916886 Keyword<nr:4> Keyword<element:POST>
0.6044586 Keyword<nr:6> Keyword<element:POST>
0.5773442 Keyword<nr:5> Keyword<element:POST>
0.56318253 Keyword<nr:9> Keyword<element:POST>
0.54449975 Keyword<nr:8> Keyword<element:POST>
0.5247468 Keyword<nr:7> Keyword<element:POST>
0.45054603 Keyword<nr:10> Keyword<element:GET>


-----Original Message-----
From: Doug Cutting [mailto:cutting@lucene.com] 
Sent: Thursday, September 26, 2002 12:44 PM
To: Lucene Users List
Subject: Re: Problems with exact matches on non-tokenized fields...


karl øie wrote:
> I have a Lucene Document with a field named "element" which is stored
> and indexed but not tokenized. The value of the field is "POST" 
> (uppercase). But the only way i can match the field is by entering 
> "element:POST?" or "element:POST*" in the QueryParser class.

There are two ways to do this.

If this must be entered by users in the query string, then you need to 
use a non-lowercasing analyzer for this field.  The way to do this if 
you're currently using StandardAnalyzer, is to do something like:

   public class MyAnalyzer extends Analyzer {
     private Analyzer standard = new StandardAnalyzer();
     public TokenStream tokenStream(String field, final Reader reader) {
       if ("element".equals(field)) {        // don't tokenize
         return new CharTokenizer(reader) {
           protected boolean isTokenChar(char c) { return true; }
         };
       } else {                              // use standard analyzer
         return standard.tokenStream(field, reader);
       }
     }
   }

   Analyzer analyzer = new MyAnalyzer();
   Query query = queryParser.parse("... +element:POST", analyzer);

Alternately, if this query field is added by a program, then this can be 
done by bypassing the analyzer for this class, building this clause 
directly instead:

   Analyzer analyzer = new StandardAnalyzer();
   BooleanQuery query = (BooleanQuery)queryParser.parse("...",
analyzer);

   // now add the element clause
   query.add(new TermQuery(new Term("element", "POST")), true, false);

Perhaps this should become an FAQ...

Doug



Re: Problems with exact matches on non-tokenized fields...

Posted by karl øie <ka...@gan.no>.
It works :-) When I see this I understand that the term being parsed by 
the QueryParser is sent through the analyzer as well... thanks!

best regards, karl øie

On Thursday, Sep 26, 2002, at 18:44 Europe/Oslo, Doug Cutting wrote:

> karl øie wrote:
>> I have a Lucene Document with a field named "element" which is stored 
>> and indexed but not tokenized. The value of the field is "POST" 
>> (uppercase). But the only way i can match the field is by entering 
>> "element:POST?" or "element:POST*" in the QueryParser class.
>
> There are two ways to do this.
>
> If this must be entered by users in the query string, then you need to 
> use a non-lowercasing analyzer for this field.  The way to do this if 
> you're currently using StandardAnalyzer, is to do something like:
>
>   public class MyAnalyzer extends Analyzer {
>     private Analyzer standard = new StandardAnalyzer();
>     public TokenStream tokenStream(String field, final Reader reader) {
>       if ("element".equals(field)) {        // don't tokenize
>         return new CharTokenizer(reader) {
>           protected boolean isTokenChar(char c) { return true; }
>         };
>       } else {                              // use standard analyzer
>         return standard.tokenStream(field, reader);
>       }
>     }
>   }
>
>   Analyzer analyzer = new MyAnalyzer();
>   Query query = queryParser.parse("... +element:POST", analyzer);
>
> Alternately, if this query field is added by a program, then this can 
> be done by bypassing the analyzer for this class, building this clause 
> directly instead:
>
>   Analyzer analyzer = new StandardAnalyzer();
>   BooleanQuery query = (BooleanQuery)queryParser.parse("...", 
> analyzer);
>
>   // now add the element clause
>   query.add(new TermQuery(new Term("element", "POST")), true, false);
>
> Perhaps this should become an FAQ...
>
> Doug
>
>




Re: Problems with exact matches on non-tokenized fields...

Posted by Stefanos Karasavvidis <st...@msc.gr>.
 > Doesn't that one do just that - treats fields differently, based on
 > their name?

yes it does, but look at the question's title
"How do I write my own Analyzer?"

If someone has a problem with a non-tokenized field (which was the 
problem of the mail thread that started this), then he doesn't know that 
he has to write a custom analyzer, and so he won't be able to find the 
correct FAQ entry.

Moreover, the second solution Doug has proposed fits better in some 
cases and should be included, too. (Doug has written these solutions in 
a mail to the users list on 27/9/2002 9:24 p.m.)

I still think that there should be an FAQ entry as I proposed in my 
previous email.

Moreover, there should be an addition to the FAQ entry
http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q15

It states there that it is important to use the same analyzer during 
indexing and searching. Again, this may lead to problems if a field is 
not tokenized (during indexing it will _not_ get passed through the 
analyzer, but during searching it gets passed through. If the analyzer 
does not treat that field as a special case, there will be a problem.)

I don't know, maybe I'm missing something here, but it seems obvious to 
me that non-tokenized fields in conjunction with analyzers produce 
problems which should be mentioned in the documentation/FAQ etc.
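
To make that concrete, here is a small sketch (my own illustration, assuming 
the Lucene 1.2 API used in this thread; RAMDirectory and the field names are 
just placeholders) showing that even the _same_ analyzer on both sides is not 
enough when the field is a Keyword field:

   import org.apache.lucene.analysis.standard.StandardAnalyzer;
   import org.apache.lucene.document.Document;
   import org.apache.lucene.document.Field;
   import org.apache.lucene.index.IndexWriter;
   import org.apache.lucene.queryParser.QueryParser;
   import org.apache.lucene.search.IndexSearcher;
   import org.apache.lucene.store.RAMDirectory;

   public class KeywordMismatch {
     public static void main(String[] args) throws Exception {
       RAMDirectory dir = new RAMDirectory();
       StandardAnalyzer analyzer = new StandardAnalyzer();

       // Indexing: Field.Keyword is indexed but not tokenized, so the
       // analyzer is never applied and the term "POST" is indexed verbatim.
       IndexWriter writer = new IndexWriter(dir, analyzer, true);
       Document doc = new Document();
       doc.add(Field.Keyword("element", "POST"));
       writer.addDocument(doc);
       writer.close();

       // Searching: QueryParser _does_ run the term through the analyzer,
       // which lowercases it to "post" -- so the same analyzer yields 0 hits.
       IndexSearcher searcher = new IndexSearcher(dir);
       System.out.println(searcher.search(
           QueryParser.parse("element:POST", "element", analyzer)).length());
     }
   }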

Stefanos

Otis Gospodnetic wrote:

>Not sure which FAQ entry you are refering to.
>This one http://www.jguru.com/faq/view.jsp?EID=1006122 ?
>
>Doesn't that one do just that - treats fields differently, based on
>their name?
>
>Otis
>
>--- Stefanos Karasavvidis <st...@msc.gr> wrote:
>  
>
>>I came accross the same problem and I think that the faq entry you 
>>(Otis) propose should get a better title so that users can find more 
>>easily an answer to this problem.
>>
>>Correct me if I'm wrong (and please forgive any wrong assumptions I
>>may 
>>have made), put the problem is on "how to query on a non tokenized
>>field?"
>>
>>Problem explanation:
>>If a field is not tokenized than it is not passed through the
>>analyzer, 
>>independently of the used analyzer (that's what I understand by
>>looking 
>>into DocumentWriter.invertDocument()).
>>If  you construct a query with a given analyzer  (for example with 
>>QueryParser.parse(query, field, analyzer))  with this field, the 
>>queryparser does not know that this field is not tokenized and passes
>>it 
>>through the analyzer. Ther analyzer may alter the query (for example
>>if 
>>the analyzer has a stemming algorithm) and the document is not
>>matched 
>>uppon the query.
>>
>>The solution:
>>The solution is to make sure that fields that aren't tokenized during
>>
>>indexig, are not passed through the analyzer during searching. This
>>can 
>>be done in 2 ways, either by making an analyzer that takes care of
>>this 
>>according to the field,  or by constructing a TermQuery with this
>>field 
>>and adding it to the rest of the query
>>
>>Example:
>>put here the 2 examples from Doug
>>
>>Stefanos 
>>
>>
>>
>>Otis Gospodnetic wrote:
>>
>>    
>>
>>>Thanks, it's a FAQ entry now:
>>>
>>>How do I write my own Analyzer?
>>>http://www.jguru.com/faq/view.jsp?EID=1006122
>>>
>>>Otis
>>>
>>>
>>>--- Doug Cutting <cu...@lucene.com> wrote:
>>>
>>>>karl øie wrote:
>>>>
>>>>>I have a Lucene Document with a field named "element" which is stored
>>>>>and indexed but not tokenized. The value of the field is "POST"
>>>>>(uppercase). But the only way i can match the field is by entering
>>>>>"element:POST?" or "element:POST*" in the QueryParser class.
>>>>
>>>>There are two ways to do this.
>>>>
>>>>If this must be entered by users in the query string, then you need to
>>>>use a non-lowercasing analyzer for this field.  The way to do this if
>>>>you're currently using StandardAnalyzer, is to do something like:
>>>>
>>>>  public class MyAnalyzer extends Analyzer {
>>>>    private Analyzer standard = new StandardAnalyzer();
>>>>    public TokenStream tokenStream(String field, final Reader reader) {
>>>>      if ("element".equals(field)) {        // don't tokenize
>>>>        return new CharTokenizer(reader) {
>>>>          protected boolean isTokenChar(char c) { return true; }
>>>>        };
>>>>      } else {                              // use standard analyzer
>>>>        return standard.tokenStream(field, reader);
>>>>      }
>>>>    }
>>>>  }
>>>>
>>>>  Analyzer analyzer = new MyAnalyzer();
>>>>  Query query = queryParser.parse("... +element:POST", analyzer);
>>>>
>>>>Alternately, if this query field is added by a program, then this can be
>>>>done by bypassing the analyzer for this class, building this clause
>>>>directly instead:
>>>>
>>>>  Analyzer analyzer = new StandardAnalyzer();
>>>>  BooleanQuery query = (BooleanQuery)queryParser.parse("...", analyzer);
>>>>
>>>>  // now add the element clause
>>>>  query.add(new TermQuery(new Term("element", "POST")), true, false);
>>>>
>>>>Perhaps this should become an FAQ...
>>>>
>>>>Doug

-- 
======================================================================
Stefanos Karasavvidis
Electronics & Computer Engineer
e-mail : stefos@msc.gr

Multimedia Systems Center S.A.
Kissamou 178
73100 Chania - Crete - Hellas
http://www.multimedia-sa.gr

Tel : +30 821 0 88447
Fax : +30 821 0 88427





Re: Problems with exact matches on non-tokenized fields...

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Not sure which FAQ entry you are referring to.
This one http://www.jguru.com/faq/view.jsp?EID=1006122 ?

Doesn't that one do just that - treats fields differently, based on
their name?

Otis

--- Stefanos Karasavvidis <st...@msc.gr> wrote:
> I came accross the same problem and I think that the faq entry you 
> (Otis) propose should get a better title so that users can find more 
> easily an answer to this problem.
> 
> Correct me if I'm wrong (and please forgive any wrong assumptions I
> may 
> have made), put the problem is on "how to query on a non tokenized
> field?"
> 
> Problem explanation:
> If a field is not tokenized than it is not passed through the
> analyzer, 
> independently of the used analyzer (that's what I understand by
> looking 
> into DocumentWriter.invertDocument()).
> If  you construct a query with a given analyzer  (for example with 
> QueryParser.parse(query, field, analyzer))  with this field, the 
> queryparser does not know that this field is not tokenized and passes
> it 
> through the analyzer. Ther analyzer may alter the query (for example
> if 
> the analyzer has a stemming algorithm) and the document is not
> matched 
> uppon the query.
> 
> The solution:
> The solution is to make sure that fields that aren't tokenized during
> 
> indexig, are not passed through the analyzer during searching. This
> can 
> be done in 2 ways, either by making an analyzer that takes care of
> this 
> according to the field,  or by constructing a TermQuery with this
> field 
> and adding it to the rest of the query
> 
> Example:
> put here the 2 examples from Doug
> 
> Stefanos 
> 
> 
> 
> Otis Gospodnetic wrote:
> 
> >Thanks, it's a FAQ entry now:
> >
> >How do I write my own Analyzer?
> >http://www.jguru.com/faq/view.jsp?EID=1006122
> >
> >Otis
> >
> >
> >--- Doug Cutting <cu...@lucene.com> wrote:
> >
> >>karl øie wrote:
> >>
> >>>I have a Lucene Document with a field named "element" which is stored
> >>>and indexed but not tokenized. The value of the field is "POST"
> >>>(uppercase). But the only way i can match the field is by entering
> >>>"element:POST?" or "element:POST*" in the QueryParser class.
> >>
> >>There are two ways to do this.
> >>
> >>If this must be entered by users in the query string, then you need to
> >>use a non-lowercasing analyzer for this field.  The way to do this if
> >>you're currently using StandardAnalyzer, is to do something like:
> >>
> >>   public class MyAnalyzer extends Analyzer {
> >>     private Analyzer standard = new StandardAnalyzer();
> >>     public TokenStream tokenStream(String field, final Reader reader) {
> >>       if ("element".equals(field)) {        // don't tokenize
> >>         return new CharTokenizer(reader) {
> >>           protected boolean isTokenChar(char c) { return true; }
> >>         };
> >>       } else {                              // use standard analyzer
> >>         return standard.tokenStream(field, reader);
> >>       }
> >>     }
> >>   }
> >>
> >>   Analyzer analyzer = new MyAnalyzer();
> >>   Query query = queryParser.parse("... +element:POST", analyzer);
> >>
> >>Alternately, if this query field is added by a program, then this can be
> >>done by bypassing the analyzer for this class, building this clause
> >>directly instead:
> >>
> >>   Analyzer analyzer = new StandardAnalyzer();
> >>   BooleanQuery query = (BooleanQuery)queryParser.parse("...", analyzer);
> >>
> >>   // now add the element clause
> >>   query.add(new TermQuery(new Term("element", "POST")), true, false);
> >>
> >>Perhaps this should become an FAQ...
> >>
> >>Doug


Re: Problems with exact matches on non-tokenized fields...

Posted by Stefanos Karasavvidis <st...@msc.gr>.
I came across the same problem, and I think that the FAQ entry you 
(Otis) propose should get a better title so that users can more easily 
find an answer to this problem.

Correct me if I'm wrong (and please forgive any wrong assumptions I may 
have made), but the problem is "how do I query on a non-tokenized field?"

Problem explanation:
If a field is not tokenized then it is not passed through the analyzer, 
regardless of which analyzer is used (that's what I understand by looking 
into DocumentWriter.invertDocument()).
If you construct a query for this field with a given analyzer (for example 
with QueryParser.parse(query, field, analyzer)), the QueryParser does not 
know that this field is not tokenized and passes it through the analyzer. 
The analyzer may alter the query (for example if the analyzer has a 
stemming algorithm) and the document is not matched by the query.

The solution:
The solution is to make sure that fields that aren't tokenized during 
indexing are not passed through the analyzer during searching. This can 
be done in two ways: either by making an analyzer that takes care of this 
according to the field, or by constructing a TermQuery with this field 
and adding it to the rest of the query.

Example:
put here the 2 examples from Doug
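
In the meantime, here is one possible self-contained sketch of the second of 
Doug's examples (the TermQuery route); it assumes the Lucene 1.2-era API used 
in this thread, and the index path and free-text part of the query are only 
placeholders:

   import org.apache.lucene.analysis.standard.StandardAnalyzer;
   import org.apache.lucene.index.Term;
   import org.apache.lucene.queryParser.QueryParser;
   import org.apache.lucene.search.BooleanQuery;
   import org.apache.lucene.search.Hits;
   import org.apache.lucene.search.IndexSearcher;
   import org.apache.lucene.search.TermQuery;

   public class KeywordFieldSearch {
     public static void main(String[] args) throws Exception {
       // Parse the user's free-text part with the normal analyzer.
       // (The cast assumes a multi-term query, as in Doug's mail.)
       BooleanQuery query = (BooleanQuery) QueryParser.parse(
           "some free text", "contents", new StandardAnalyzer());

       // Add the untokenized field as a required clause, bypassing the
       // analyzer so the term "POST" keeps its case.
       query.add(new TermQuery(new Term("element", "POST")), true, false);

       IndexSearcher searcher = new IndexSearcher("/path/to/index");  // placeholder
       Hits hits = searcher.search(query);
       System.out.println(hits.length() + " matching documents");
     }
   }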

Stefanos 



Otis Gospodnetic wrote:

>Thanks, it's a FAQ entry now:
>
>How do I write my own Analyzer?
>http://www.jguru.com/faq/view.jsp?EID=1006122
>
>Otis
>
>
>--- Doug Cutting <cu...@lucene.com> wrote:
>
>>karl øie wrote:
>>
>>>I have a Lucene Document with a field named "element" which is stored
>>>and indexed but not tokenized. The value of the field is "POST"
>>>(uppercase). But the only way i can match the field is by entering
>>>"element:POST?" or "element:POST*" in the QueryParser class.
>>
>>There are two ways to do this.
>>
>>If this must be entered by users in the query string, then you need to
>>use a non-lowercasing analyzer for this field.  The way to do this if
>>you're currently using StandardAnalyzer, is to do something like:
>>
>>   public class MyAnalyzer extends Analyzer {
>>     private Analyzer standard = new StandardAnalyzer();
>>     public TokenStream tokenStream(String field, final Reader reader) {
>>       if ("element".equals(field)) {        // don't tokenize
>>         return new CharTokenizer(reader) {
>>           protected boolean isTokenChar(char c) { return true; }
>>         };
>>       } else {                              // use standard analyzer
>>         return standard.tokenStream(field, reader);
>>       }
>>     }
>>   }
>>
>>   Analyzer analyzer = new MyAnalyzer();
>>   Query query = queryParser.parse("... +element:POST", analyzer);
>>
>>Alternately, if this query field is added by a program, then this can be
>>done by bypassing the analyzer for this class, building this clause
>>directly instead:
>>
>>   Analyzer analyzer = new StandardAnalyzer();
>>   BooleanQuery query = (BooleanQuery)queryParser.parse("...", analyzer);
>>
>>   // now add the element clause
>>   query.add(new TermQuery(new Term("element", "POST")), true, false);
>>
>>Perhaps this should become an FAQ...
>>
>>Doug


Re: Problems with exact matches on non-tokenized fields...

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Thanks, it's a FAQ entry now:

How do I write my own Analyzer?
http://www.jguru.com/faq/view.jsp?EID=1006122

Otis


--- Doug Cutting <cu...@lucene.com> wrote:
> karl øie wrote:
> > I have a Lucene Document with a field named "element" which is stored
> > and indexed but not tokenized. The value of the field is "POST"
> > (uppercase). But the only way i can match the field is by entering
> > "element:POST?" or "element:POST*" in the QueryParser class.
> 
> There are two ways to do this.
> 
> If this must be entered by users in the query string, then you need to
> use a non-lowercasing analyzer for this field.  The way to do this if
> you're currently using StandardAnalyzer, is to do something like:
> 
>    public class MyAnalyzer extends Analyzer {
>      private Analyzer standard = new StandardAnalyzer();
>      public TokenStream tokenStream(String field, final Reader reader) {
>        if ("element".equals(field)) {        // don't tokenize
>          return new CharTokenizer(reader) {
>            protected boolean isTokenChar(char c) { return true; }
>          };
>        } else {                              // use standard analyzer
>          return standard.tokenStream(field, reader);
>        }
>      }
>    }
> 
>    Analyzer analyzer = new MyAnalyzer();
>    Query query = queryParser.parse("... +element:POST", analyzer);
> 
> Alternately, if this query field is added by a program, then this can be
> done by bypassing the analyzer for this class, building this clause
> directly instead:
> 
>    Analyzer analyzer = new StandardAnalyzer();
>    BooleanQuery query = (BooleanQuery)queryParser.parse("...", analyzer);
> 
>    // now add the element clause
>    query.add(new TermQuery(new Term("element", "POST")), true, false);
> 
> Perhaps this should become an FAQ...
> 
> Doug


Re: Problems with exact matches on non-tokenized fields...

Posted by Doug Cutting <cu...@lucene.com>.
karl øie wrote:
> I have a Lucene Document with a field named "element" which is stored 
> and indexed but not tokenized. The value of the field is "POST" 
> (uppercase). But the only way i can match the field is by entering 
> "element:POST?" or "element:POST*" in the QueryParser class.

There are two ways to do this.

If this must be entered by users in the query string, then you need to 
use a non-lowercasing analyzer for this field.  The way to do this if 
you're currently using StandardAnalyzer, is to do something like:

   public class MyAnalyzer extends Analyzer {
     private Analyzer standard = new StandardAnalyzer();
     public TokenStream tokenStream(String field, final Reader reader) {
       if ("element".equals(field)) {        // don't tokenize
         return new CharTokenizer(reader) {
           protected boolean isTokenChar(char c) { return true; }
         };
       } else {                              // use standard analyzer
         return standard.tokenStream(field, reader);
       }
     }
   }

   Analyzer analyzer = new MyAnalyzer();
   Query query = queryParser.parse("... +element:POST", analyzer);

Alternately, if this query field is added by a program, then this can be 
done by bypassing the analyzer for this class, building this clause 
directly instead:

   Analyzer analyzer = new StandardAnalyzer();
   BooleanQuery query = (BooleanQuery)queryParser.parse("...", analyzer);

   // now add the element clause
   query.add(new TermQuery(new Term("element", "POST")), true, false);

Perhaps this should become an FAQ...

Doug




Re: Problems with exact matches on non-tokenized fields...

Posted by Dave Peixotto <pe...@geofolio.com>.
I have also observed this behavior.

----- Original Message -----
From: "karl øie" <ka...@gan.no>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Thursday, September 26, 2002 4:50 AM
Subject: Problems with exact matches on non-tokenized fields...


Hi, I have a problem getting an exact match on a non-tokenized
field.

I have a Lucene Document with a field named "element" which is stored
and indexed but not tokenized. The value of the field is "POST"
(uppercase). But the only way I can match the field is by entering
"element:POST?" or "element:POST*" in the QueryParser class.

Has anyone here run into this problem?

I am using the 1.2 release version of Lucene.

Best regards, Karl Øie







RE: Problems with exact matches on non-tokenized fields...

Posted by Alex Murzaku <li...@lissus.com>.
Sorry about that - it was early in the morning...
My guess is that the analyzer you are passing to QueryParser lowercases
"POST" but not "POST*" or "POST?". Could you try looking at the values
of your query when it goes to the searcher?
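
Something like this would show them (just a sketch against the Lucene 1.2 API;
"contents" as the default field is only a placeholder):

   import org.apache.lucene.analysis.Analyzer;
   import org.apache.lucene.analysis.standard.StandardAnalyzer;
   import org.apache.lucene.queryParser.QueryParser;
   import org.apache.lucene.search.Query;

   public class ShowParsedQuery {
     public static void main(String[] args) throws Exception {
       Analyzer analyzer = new StandardAnalyzer();
       Query q1 = QueryParser.parse("element:POST", "contents", analyzer);
       Query q2 = QueryParser.parse("element:POST*", "contents", analyzer);
       System.out.println(q1.toString("contents"));  // plain term goes through the analyzer
       System.out.println(q2.toString("contents"));  // wildcard term is left as typed
     }
   }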

-----Original Message-----
From: karl øie [mailto:karl@gan.no] 
Sent: Thursday, September 26, 2002 8:22 AM
To: Lucene Users List
Subject: Re: Problems with exact matches on non-tokenized fields...


Hmm, a misunderstanding: I don't create the field with the value 
"POST?"; I create it with "POST". "element:POST?" or "element:POST*" are 
the strings I send to the QueryParser for searching.

Best regards, Karl Øie

On Thursday, Sep 26, 2002, at 14:13 Europe/Oslo, Alex Murzaku wrote:

> But indeed "POST" does not match to "POST?". If you are not tokenizing

> the field, the character "?" remains there together with everything 
> else.
>
> -----Original Message-----
> From: karl øie [mailto:karl@gan.no]
> Sent: Thursday, September 26, 2002 7:50 AM
> To: Lucene Users List
> Subject: Problems with exact matces on non-tokenized fields...
>
>
> Hi, i have a problem with getting a exact match on a non-tokenized 
> field.
>
> I have a Lucene Document with a field named "element" which is stored 
> and indexed but not tokenized. The value of the field is "POST" 
> (uppercase). But the only way i can match the field is by entering 
> "element:POST?" or "element:POST*" in the QueryParser class.
>
> Have anyone here run into this problem?
>
> I am using the 1.2 release version of Lucene.
>
> Mvh Karl Øie
>
>




Re: Problems with exact matches on non-tokenized fields...

Posted by karl øie <ka...@gan.no>.
Hmm, a misunderstanding: I don't create the field with the value 
"POST?"; I create it with "POST". "element:POST?" or "element:POST*" are 
the strings I send to the QueryParser for searching.

Best regards, Karl Øie

On Thursday, Sep 26, 2002, at 14:13 Europe/Oslo, Alex Murzaku wrote:

> But indeed "POST" does not match to "POST?". If you are not tokenizing
> the field, the character "?" remains there together with everything
> else.
>
> -----Original Message-----
> From: karl øie [mailto:karl@gan.no]
> Sent: Thursday, September 26, 2002 7:50 AM
> To: Lucene Users List
> Subject: Problems with exact matces on non-tokenized fields...
>
>
> Hi, i have a problem with getting a exact match on a non-tokenized
> field.
>
> I have a Lucene Document with a field named "element" which is stored
> and indexed but not tokenized. The value of the field is "POST"
> (uppercase). But the only way i can match the field is by entering
> "element:POST?" or "element:POST*" in the QueryParser class.
>
> Have anyone here run into this problem?
>
> I am using the 1.2 release version of Lucene.
>
> Mvh Karl Øie
>
>




RE: Problems with exact matches on non-tokenized fields...

Posted by Alex Murzaku <li...@lissus.com>.
But indeed "POST" does not match to "POST?". If you are not tokenizing
the field, the character "?" remains there together with everything
else.

-----Original Message-----
From: karl øie [mailto:karl@gan.no] 
Sent: Thursday, September 26, 2002 7:50 AM
To: Lucene Users List
Subject: Problems with exact matches on non-tokenized fields...


Hi, I have a problem getting an exact match on a non-tokenized
field.

I have a Lucene Document with a field named "element" which is stored
and indexed but not tokenized. The value of the field is "POST"
(uppercase). But the only way I can match the field is by entering
"element:POST?" or "element:POST*" in the QueryParser class.

Has anyone here run into this problem?

I am using the 1.2 release version of Lucene.

Best regards, Karl Øie

