Posted to java-user@lucene.apache.org by Roberto Fonti <ro...@buongiorno.com> on 2007/04/06 11:59:19 UTC

UN_TOKENIZED and StandardAnalyzer

Hi All,
I'm indexing categories with this code:

for (Category category : item.getCategories()) {            
    lucene_doc.add(new Field(
        "CATEGORY",
        category.getName(),
        Field.Store.NO,
        Field.Index.UN_TOKENIZED));               
}

And searching using the query:

String query = "CATEGORY:("+category.getName()+")";

I've configured the StandardAnalyzer for both the IndexWriter and
the QueryParser.

Everything works fine EXCEPT with categories that contain whitespace (or
other characters that get tokenized).

* If category is "sport" - ok, I get the result from the search
* If category is "winter sport" - I get no result from search

I've tried a number of search syntaxes:
+CATEGORY:"winter sport"
+CATEGORY:winter +CATEGORY:sport
+CATEGORY:(winter sport)
and others...
but none of them work.

What's wrong with that?
By the way, using the KeywordAnalyzer it works, but it is not the 
correct analyzer for my application.
Shouldn't the Analyzer be ignored for a Field.Index.UN_TOKENIZED field?

Thanks,
Roberto

Re: UN_TOKENIZED and StandardAnalyzer

Posted by Roberto Fonti <ro...@buongiorno.com>.

Thanks Erick for your help.
Actually I was already using Luke!
The only thing I was missing was the possibility of using different 
Analyzers at the same time, with PerFieldAnalyzerWrapper.

Thank you again.
Best,
Roberto


-- 
Roberto Fonti
Technology Group
Buongiorno SpA

roberto.fonti@buongiorno.com
http://www.buongiorno.com
 
Il Girasole, Palazzo Marco Polo 324
20084 Lacchiarella, Milano ITALY

Re: UN_TOKENIZED and StandardAnalyzer

Posted by Erick Erickson <er...@gmail.com>.

Really, really, really get a copy of Luke. Really. Use it to open
your index and run experimental queries, especially to see
how they get rewritten (but be sure to pick the appropriate
analyzer).

Google "lucene luke". Really, really get a copy. It'll help you
make MUCH faster progress than waiting for someone to
get around to answering posts here <G>....

Then use query.toString() to see how things parse in your programs
(or use Luke's search tab to see something similar).

If you do, you'll see that parsing the string "winter sport" with
StandardAnalyzer gives
CATEGORY:winter CATEGORY:sport

With the default operator OR, this will not match an untokenized
field containing "winter sport". AND won't work either, because the
field is UN_TOKENIZED: it holds the single term "winter sport", not
the separate terms "winter" and "sport".

Using KeywordAnalyzer gives
CATEGORY:winter CATEGORY:sport
This does NOT match the document
I indexed as UN_TOKENIZED 'winter sport'. What the heck?

But using KeywordAnalyzer with
"\"winter sport\""

That is, where the double quotes are part of the query, using
KeywordAnalyzer gives
CATEGORY:winter sport
and produces the correct match on my UN_TOKENIZED field. What's
happening here is the double quotes cause the parser to pass both
words as a single token rather than two tokens.
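
To make the behaviour above concrete, here is a minimal toy model in plain Java. It does not use any Lucene classes. The class and method names (CategoryMatchSketch, splitIntoChunks, queryTerms, matches) are hypothetical, and the "standard-style" branch is only a crude stand-in for StandardAnalyzer (lowercase, split on non-alphanumerics). It assumes, as described above, that the parser splits unquoted text on whitespace before the analyzer sees it, and that an untokenized field holds each value as one exact term.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model only: no Lucene classes are used, and all names here are hypothetical.
final class CategoryMatchSketch {

    // Step 1 (parser): split the query text on whitespace, except inside double quotes.
    static List<String> splitIntoChunks(String queryText) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (char c : queryText.toCharArray()) {
            if (c == '"') {
                inQuotes = !inQuotes;
            } else if (Character.isWhitespace(c) && !inQuotes) {
                if (current.length() > 0) {
                    chunks.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }

    // Step 2 (analyzer): keep each chunk whole (keyword-style), or lowercase it
    // and split it on non-alphanumerics (a crude stand-in for StandardAnalyzer).
    static List<String> queryTerms(String queryText, boolean keywordStyle) {
        List<String> terms = new ArrayList<>();
        for (String chunk : splitIntoChunks(queryText)) {
            if (keywordStyle) {
                terms.add(chunk);
            } else {
                for (String piece : chunk.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                    if (!piece.isEmpty()) {
                        terms.add(piece);
                    }
                }
            }
        }
        return terms;
    }

    // Step 3 (search, OR semantics): a query term matches only if it equals
    // a whole indexed value, because the field was never tokenized.
    static boolean matches(Set<String> indexedTerms, String queryText, boolean keywordStyle) {
        for (String term : queryTerms(queryText, keywordStyle)) {
            if (indexedTerms.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> indexed = new HashSet<>();
        indexed.add("winter sport"); // the UN_TOKENIZED value, stored as a single term

        System.out.println(matches(indexed, "winter sport", false));     // false
        System.out.println(matches(indexed, "winter sport", true));      // false
        System.out.println(matches(indexed, "\"winter sport\"", false)); // false
        System.out.println(matches(indexed, "\"winter sport\"", true));  // true
    }
}
```

Only the last case, quoted text plus keyword-style analysis, produces the single term "winter sport" that the index actually contains. In Lucene itself, another way to get the same effect is probably to skip the QueryParser for this field and build a TermQuery on the exact category name.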

This has been a source of some confusion on the list; there
is a very long thread you'll find if you search the archives
for KeywordAnalyzer.

Best
Erick
