
Posted to solr-user@lucene.apache.org by Alexander Herzog <he...@ait.co.at> on 2009/08/20 08:14:10 UTC

Is wildcard search not correctly analyzed at query?

Hi all

sorry for the long post

We are switching from Index Data's Zebra to Solr for a new book
archival/preservation project with multiple languages, so expect more
questions soon (sorry for that).
The features of Solr are pretty cool and more or less overwhelming!

But there is one thing I found after a little test with wildcards.

I'm using the latest svn build and didn't change anything except
schema.xml:
Solr Specification Version: 1.3.0.2009.08.20.07.53.52
Solr Implementation Version: 1.4-dev 806060 - ait015 - 2009-08-20 07:53:52
Lucene Specification Version: 2.9-dev
Lucene Implementation Version: 2.9-dev 804692 - 2009-08-16 09:33:41

I have a text_ws field with this schema config:

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
      <charFilter class="solr.MappingCharFilterFactory"
                  mapping="mapping-ISOLatin1Accent.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
   </analyzer>
</fieldType>
...
and I added a catch-all dynamic field, since I'm not sure yet which fields
we will use...

<dynamicField name="*"  type="text_ws"    indexed="true"  stored="true"
              multiValued="true"/>
...


So I <add>ed this content:
...
<field name="PhysicalDescription">
   X, 143, XIV S.:
   124 feine Farbendrucktafeln mit über 600 Abbildungen;
   24,5 cm.
</field>
...

Since the text is German and I couldn't find a tokenizer for German
compound words (any help appreciated), I wanted to search for 'Farb*'.

The final row of the query analyzer in the admin section told me:
farb*
for the content:
x,	143,	xiv	s.:	124	feine	farbendrucktafeln	mit	uber	600	abbildungen;
24,5	cm.

So everything seems to be OK: everything is in lower case.

Now, for the REST service:
http://localhost:8983/solr/select/?q=PhysicalDescription:Farb*&debugQuery=true
<str name="rawquerystring">PhysicalDescription:Farb*</str>
<str name="querystring">PhysicalDescription:Farb*</str>
<str name="parsedquery">PhysicalDescription:Farb*</str>
<str name="parsedquery_toString">PhysicalDescription:Farb*</str>

Since Farb* has a capital letter, nothing is found.
When using farb* as query, I get the result.
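
For illustration, the mismatch can be reproduced outside Solr with a small
Python sketch. The analyze and prefix_match helpers are hypothetical
stand-ins, not Solr or Lucene APIs: analyze only approximates the text_ws
chain (accent folding via Unicode decomposition rather than
mapping-ISOLatin1Accent.txt), and prefix_match assumes the wildcard prefix
is compared against the indexed terms exactly as typed.

```python
import unicodedata


def analyze(text):
    """Rough stand-in for the text_ws index analyzer (an approximation):
    fold accents, lowercase, then split on whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return folded.lower().split()


def prefix_match(indexed_terms, prefix):
    """Case-sensitive prefix comparison against the indexed terms,
    assuming the wildcard prefix is used without any analysis."""
    return [term for term in indexed_terms if term.startswith(prefix)]


content = ("X, 143, XIV S.: 124 feine Farbendrucktafeln "
           "mit über 600 Abbildungen; 24,5 cm.")
terms = analyze(content)

print(prefix_match(terms, "Farb"))  # prefix as typed: no indexed term starts with it
print(prefix_match(terms, "farb"))  # lowercased prefix: matches farbendrucktafeln
```

The indexed terms are all lower case, so only the lowercased prefix can
match any of them.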

Where can I add/change a query analyzer that "lower cases" wildcard
searches?

thanks, best wishes,
Alexander

Re: Is wildcard search not correctly analyzed at query? [solved]

Posted by Alexander Herzog <he...@ait.co.at>.

Hi

Thanks for the info!

best,
Alexander


Re: Is wildcard search not correctly analyzed at query?

Posted by Avlesh Singh <av...@gmail.com>.

Wildcard queries are not analyzed by Lucene, hence the behavior. A
similar earlier thread:
http://www.lucidimagination.com/search/document/a6b9144ecab9d0ff/search_phrase_wildcard
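
Since the wildcard term reaches the index unanalyzed, one possible
workaround is to normalize it on the client before sending the request.
The following is a minimal Python sketch under stated assumptions:
normalize_wildcard_term and build_select_url are hypothetical helper names,
not part of Solr or Lucene, and the accent folding uses Unicode
decomposition, which only approximates mapping-ISOLatin1Accent.txt (for
example, it does not rewrite "ß").

```python
import unicodedata
from urllib.parse import urlencode


def normalize_wildcard_term(term):
    """Approximate the index-time analysis on the client (an assumption):
    fold accents and lowercase, leaving the wildcard characters alone."""
    decomposed = unicodedata.normalize("NFKD", term)
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return folded.lower()


def build_select_url(field, term,
                     base="http://localhost:8983/solr/select/"):
    """Build a select URL for a single fielded wildcard term.
    Only the term is normalized; the field name is passed through as is."""
    params = {
        "q": f"{field}:{normalize_wildcard_term(term)}",
        "debugQuery": "true",
    }
    return base + "?" + urlencode(params)


print(normalize_wildcard_term("Farb*"))
print(normalize_wildcard_term("nü*"))
print(build_select_url("PhysicalDescription", "Farb*"))
```

This only handles a single term. A full query string with operators such as
AND or OR should not be lowercased wholesale, since the operators are
case-sensitive.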

Cheers
Avlesh

On Thu, Aug 20, 2009 at 7:03 PM, Alexander Herzog <he...@ait.co.at> wrote:

>
> It seems like the analyzer/filter isn't affected at all, since the query
>
> http://localhost:8983/solr/select/?q=PhysicalDescription:nü*&debugQuery=true
>
> does not return a
> <str name="parsedquery">PhysicalDescription:nu*</str>
> as I would expect.
>
> So can I just have a "you're right, wildcard search is passed to lucene
> directly without any analyzing".
>
> If it is like this, I'm happy with that as well.
>
> best,
> Alexander

Re: Is wildcard search not correctly analyzed at query?

Posted by Alexander Herzog <he...@ait.co.at>.

It seems like the analyzer/filter isn't applied at all, since the query
http://localhost:8983/solr/select/?q=PhysicalDescription:nü*&debugQuery=true

does not return a
<str name="parsedquery">PhysicalDescription:nu*</str>
as I would expect.

So can I just get a "you're right, wildcard searches are passed to Lucene
directly without any analysis"?

If it is like this, I'm happy with that as well.

best,
Alexander

