
Posted to solr-user@lucene.apache.org by Dirk Högemann <di...@googlemail.com> on 2012/12/17 11:59:17 UTC

Solr3.5 PatternTokenizer / Search Analyzer tokenizing always at whitespace?

Hi,

I am not sure whether I am missing something, or whether I do not fully
understand how the index and search analyzers are defined and executed.

I have a field definition like this:

    <fieldType name="cl2tokenized_string" class="solr.TextField"
               sortMissingLast="true" omitNorms="true">
      <analyzer type="index">
        <tokenizer class="solr.PatternTokenizerFactory" pattern="###" group="-1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="search">
        <tokenizer class="solr.PatternTokenizerFactory" pattern="###" group="-1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

Any field whose name starts with cl2 should be recognized as being of type
cl2tokenized_string:

    <dynamicField name="cl2*" type="cl2tokenized_string" indexed="true"
                  stored="true" />

When I search for one complete token of that kind, the query is split at
whitespace:

    <arr name="filter_queries">
      <str>{!q.op=AND df=cl2Categories_NACE}cl2Categories_NACE:08 Gewinnung von Steinen und Erden, sonstiger Bergbau</str>
    </arr>
    <arr name="parsed_filter_queries">
      <str>+cl2Categories_NACE:08
           +cl2Categories_NACE:gewinnung +cl2Categories_NACE:von
           +cl2Categories_NACE:steinen +cl2Categories_NACE:und
           +cl2Categories_NACE:erden, +cl2Categories_NACE:sonstiger
           +cl2Categories_NACE:bergbau</str>
    </arr>

I expected the query parser to tokenize ONLY at the pattern ### as well,
instead of splitting at whitespace.
Is it possible to define a filter query, without using phrases, that achieves
the desired behavior?
Maybe local parameters are not the way to go here?

Best
Dirk

Re: Solr3.5 PatternTokenizer / Search Analyzer tokenizing always at whitespace?

Posted by Lee Carroll <le...@googlemail.com>.

I use analyzer type="query". Can you really use "search"?


Re: Solr3.5 PatternTokenizer / Search Analyzer tokenizing always at whitespace?

Posted by Dirk Högemann <di...@googlemail.com>.

    <arr name="filter_queries">
      <str>{!q.op=AND df=cl2Categories_NACE}08 Gewinnung von Steinen und Erden, sonstiger Bergbau</str>
    </arr>
    <arr name="parsed_filter_queries">
      <str>+cl2Categories_NACE:08
           +cl2Categories_NACE:gewinnung +cl2Categories_NACE:von
           +cl2Categories_NACE:steinen +cl2Categories_NACE:und
           +cl2Categories_NACE:erden, +cl2Categories_NACE:sonstiger
           +cl2Categories_NACE:bergbau</str>
    </arr>

That is the relevant debug output from the query.


Re: Solr3.5 PatternTokenizer / Search Analyzer tokenizing always at whitespace?

Posted by Dirk Högemann <di...@googlemail.com>.

Ah, now I get it. My workaround was to use phrase queries, and now I know
why that works. Thanks!

2012/12/17 Jack Krupansky <ja...@basetechnology.com>

> No, the "query" analyzer tokenizer will simply be applied to each term or
> quoted string AFTER the query parser has already parsed it. You may have
> escaped or quoted characters which will then be seen by the analyzer
> tokenizer.
>
>
> -- Jack Krupansky

Re: Solr3.5 PatternTokenizer / Search Analyzer tokenizing always at whitespace?

Posted by Jack Krupansky <ja...@basetechnology.com>.

No, the "query" analyzer tokenizer will simply be applied to each term or 
quoted string AFTER the query parser has already parsed it. You may have 
escaped or quoted characters which will then be seen by the analyzer 
tokenizer.
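
Regarding the earlier question about avoiding phrases: one option that may fit is Solr's field query parser ({!field f=...}), which does not apply the Lucene query syntax and passes the whole value to the field's analyzer. The fragment below is only a sketch, written in the same XML notation as the debug output in this thread. The field name is taken from the original post, and the fragment is not output captured from Solr.

```xml
<!-- Illustrative request parameter, not captured Solr output.
     {!field} skips query-syntax parsing, so blanks are not treated as
     term separators; the value is analyzed as one string. -->
<arr name="fq">
  <str>{!field f=cl2Categories_NACE}08 Gewinnung von Steinen und Erden, sonstiger Bergbau</str>
</arr>
```

With the PatternTokenizerFactory definition from the original post, the analyzer should then split this value only at ###, so a value without ### would be looked up as a single lowercased term.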

-- Jack Krupansky


Re: Solr3.5 PatternTokenizer / Search Analyzer tokenizing always at whitespace?

Posted by Dirk Högemann <di...@googlemail.com>.

OK, right, I changed that. Nevertheless, I thought I should always use the
same analyzers for the query and the index sections to get consistent
results.
Does this mean that the tokenizer in the query section will always be
ignored by the given query parsers?


Re: Solr3.5 PatternTokenizer / Search Analyzer tokenizing always at whitespace?

Posted by Jack Krupansky <ja...@basetechnology.com>.

The query parsers normally tokenize on white space and query operators, but 
you can escape any white space with backslash or put the text in quotes and 
then it will be tokenized by the analyzer rather than the query parser.
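
As a rough illustration of the two forms just described, the filter query from the original post could be written in either of the following ways. This is a sketch in the XML notation of the debug output in this thread, not output captured from Solr, and the two entries are alternatives rather than filters to send together.

```xml
<!-- Illustrative request parameters only; use one form or the other. -->
<arr name="fq">
  <!-- Quoted: the parser hands the whole quoted string to the analyzer -->
  <str>cl2Categories_NACE:"08 Gewinnung von Steinen und Erden, sonstiger Bergbau"</str>
  <!-- Escaped: a backslash before each blank keeps the parser from splitting -->
  <str>cl2Categories_NACE:08\ Gewinnung\ von\ Steinen\ und\ Erden,\ sonstiger\ Bergbau</str>
</arr>
```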

Also, you have:

<analyzer type="search">

Change "search" to "query", but that won't change your problem since Solr 
defaults to using the "index" analyzer if it doesn't "see" a "query" 
analyzer.
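
For reference, a sketch of the field type from the original post with that one change applied. Everything except the type="query" attribute is copied from the post, and the comments are added for illustration.

```xml
<!-- Same field type as in the original post; only type="search"
     has been changed to type="query". -->
<fieldType name="cl2tokenized_string" class="solr.TextField"
           sortMissingLast="true" omitNorms="true">
  <analyzer type="index">
    <!-- group="-1" uses the pattern as a delimiter, so text is split at ### -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="###" group="-1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="###" group="-1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because both chains are identical here, a single analyzer element without a type attribute should behave the same way for indexing and querying.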

-- Jack Krupansky
