Posted to solr-user@lucene.apache.org by Dmitry Kan <dm...@gmail.com> on 2011/11/18 12:23:13 UTC

wild card search and lower-casing

Hello,

Here is one puzzle I couldn't yet find a key for:

for the wild-card query:

*ocvd

SOLR 3.4 returns hits. But for

*OCVD

it doesn't

On the indexing side, the following tokenizer and filters are defined:

<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>

On the query side:
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>

The SOLR analysis tool shows that OCVD gets lower-cased to ocvd. Does SOLR
skip the lower-casing step when doing the actual wild-card search?

BTW, the same issue for a trailing wild-card:

mocv*

produces hits, while

MOCV*

doesn't. Appreciate any help or pointers.


-- 
Regards,

Dmitry Kan

Re: wild card search and lower-casing

Posted by Dmitry Kan <dm...@gmail.com>.

Yes, it should be OK, since we currently work only with English. If that's
beneficial for the effort, I could do a field test on 3.4 after you close
the JIRA.

Best,
Dmitry

On Wed, Nov 23, 2011 at 2:52 PM, Erick Erickson <er...@gmail.com> wrote:

> Ah, I see what you're doing, go for it.
>
> I intend to commit it today, but things happen.....
>
> About changing it to setLowercaseExpandedTerms(true): yes,
> that'll take care of this issue, although it has some
> locale-specific assumptions (i.e. String.toLowerCase() uses the
> default locale). That may not matter in your situation though.
>
> Best
> Erick

Re: wild card search and lower-casing

Posted by Erick Erickson <er...@gmail.com>.

Ah, I see what you're doing, go for it.

I intend to commit it today, but things happen.....

About changing it to setLowercaseExpandedTerms(true): yes,
that'll take care of this issue, although it has some
locale-specific assumptions (i.e. String.toLowerCase() uses the
default locale). That may not matter in your situation though.

Best
Erick

On Tue, Nov 22, 2011 at 10:46 AM, Dmitry Kan <dm...@gmail.com> wrote:
> Thanks, Erick. I was in fact reading the patch (the one attached as a
> file to the aforementioned jira) you updated sometime yesterday. I'll
> watch the issue, but as said the change of a hard-coded boolean to its
> opposite worked just fine for me.
>
> Best,
> Dmitry

Re: wild card search and lower-casing

Posted by Dmitry Kan <dm...@gmail.com>.

Thanks, Erick. I was in fact reading the patch (the one attached as a
file to the aforementioned jira) you updated sometime yesterday. I'll
watch the issue, but as said the change of a hard-coded boolean to its
opposite worked just fine for me.

Best,
Dmitry


On 11/22/11, Erick Erickson <er...@gmail.com> wrote:
> No, no, no.... That's something buried in Lucene, it has nothing to
> do with the patch! The patch has NOT yet been applied to any
> released code.
>
> You could pull the patch from the JIRA and apply it to trunk locally if
> you wanted. But there's no patch for 3.x, I'll probably put that up
> over the holiday.
>
> But things have changed a bit (one of the things I'll have to do is
> create some documentation). You *should* be able to specify
> just legacyMultiTerm="true" in your <fieldType> if you want to
> apply the 3.x patch to pre 3.6 code. It would be a good field test
> if that worked for you.
>
> But you can't do any of this until the JIRA (SOLR-2438) is
> marked "Resolution: Fixed".
>
> Don't be fooled by "Fix Version". "Fix Version" simply says
> that those are the earliest versions it *could* go in.
>
> Best
> Erick


-- 
Regards,

Dmitry Kan

Re: wild card search and lower-casing

Posted by Erick Erickson <er...@gmail.com>.

No, no, no.... That's something buried in Lucene, it has nothing to
do with the patch! The patch has NOT yet been applied to any
released code.

You could pull the patch from the JIRA and apply it to trunk locally if
you wanted. But there's no patch for 3.x, I'll probably put that up
over the holiday.

But things have changed a bit (one of the things I'll have to do is
create some documentation). You *should* be able to specify
just legacyMultiTerm="true" in your <fieldType> if you want to
apply the 3.x patch to pre 3.6 code. It would be a good field test
if that worked for you.

But you can't do any of this until the JIRA (SOLR-2438) is
marked "Resolution: Fixed".

Don't be fooled by "Fix Version". "Fix Version" simply says
that those are the earliest versions it *could* go in.

Best
Erick


On Tue, Nov 22, 2011 at 6:32 AM, Dmitry Kan <dm...@gmail.com> wrote:
> I guess, I have found your comment, thanks.
>
> For our current needs I have just set:
>
> setLowercaseExpandedTerms(true); // changed from default false
>
> in the SolrQueryParser's constructor and that seems to work so far.
>
> So as not to start a separate thread on wildcards: is it true that a
> trailing wildcard needs at least 2 preceding characters for a search to
> happen?
>
> Dmitry

Re: wild card search and lower-casing

Posted by Dmitry Kan <dm...@gmail.com>.

I guess, I have found your comment, thanks.

For our current needs I have just set:

setLowercaseExpandedTerms(true); // changed from default false

in the SolrQueryParser's constructor and that seems to work so far.
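
A rough way to picture what the flag changes, assuming the index terms were lower-cased at index time and the wildcard term bypasses the query analyzer. This is a simplified, self-contained Java model, not Lucene or Solr code: `ExpandedTermsSketch`, `countPrefixHits` and the sample terms are invented for illustration, it handles only a single trailing `*`, and it lower-cases with `Locale.ROOT` where the real parser is said in this thread to rely on the default locale.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Simplified model of wildcard matching against lower-cased index terms.
// It is not Lucene code; only trailing-'*' (prefix) patterns are handled.
final class ExpandedTermsSketch {

    // Counts index terms matching a trailing-wildcard pattern such as "mocv*".
    // Assumes the pattern is non-empty and ends with '*'.
    static long countPrefixHits(List<String> indexTerms, String pattern,
                                boolean lowercaseExpandedTerms) {
        String prefix = pattern.substring(0, pattern.length() - 1); // drop '*'
        if (lowercaseExpandedTerms) {
            prefix = prefix.toLowerCase(Locale.ROOT);
        }
        final String wanted = prefix;
        return indexTerms.stream().filter(t -> t.startsWith(wanted)).count();
    }

    public static void main(String[] args) {
        // Made-up terms, stored the way a lower-casing index filter would leave them.
        List<String> indexTerms = Arrays.asList("mocvd", "mocva", "other");
        System.out.println(countPrefixHits(indexTerms, "mocv*", false)); // 2
        System.out.println(countPrefixHits(indexTerms, "MOCV*", false)); // 0
        System.out.println(countPrefixHits(indexTerms, "MOCV*", true));  // 2
    }
}
```

With the flag off, the upper-case prefix is compared as typed and finds nothing, which matches the symptom reported at the start of the thread.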

So as not to start a separate thread on wildcards: is it true that a
trailing wildcard needs at least 2 preceding characters for a search to
happen?

Dmitry

On Mon, Nov 21, 2011 at 2:59 PM, Erick Erickson <er...@gmail.com> wrote:

> It may be. The tricky bit is that there is a constant governing the behavior of
> this that restricts it to 3.6 and above. You'll have to change it after applying
> the patch for this to work for you. Should be trivial, I'll leave a note in the
> code about this, look for SOLR-2438 in the 3x code line for the place
> to change.

Re: wild card search and lower-casing

Posted by Erick Erickson <er...@gmail.com>.

It may be. The tricky bit is that there is a constant governing the behavior of
this that restricts it to 3.6 and above. You'll have to change it after applying
the patch for this to work for you. Should be trivial, I'll leave a note in the
code about this, look for SOLR-2438 in the 3x code line for the place
to change.

On Mon, Nov 21, 2011 at 2:14 AM, Dmitry Kan <dm...@gmail.com> wrote:
> Thanks Erick.
>
> Do you think the patch you are working on will be applicable as well to 3.4?
>
> Best,
> Dmitry

Re: wild card search and lower-casing

Posted by Dmitry Kan <dm...@gmail.com>.

Thanks Erick.

Do you think the patch you are working on will be applicable as well to 3.4?

Best,
Dmitry

On Mon, Nov 21, 2011 at 5:06 AM, Erick Erickson <er...@gmail.com> wrote:

> As it happens I'm working on SOLR-2438 which should address this. This patch
> will provide two things:
>
> The ability to define a new analysis chain in your schema.xml, currently called
> "multiterm", that will be applied to queries of various sorts, including
> wildcard, prefix, range. This will be somewhat of an "expert" thing to make
> yourself...
>
> In the absence of an explicit definition it'll synthesize a multiterm analyzer
> out of the query analyzer, taking any char filters, and LowerCaseFilter
> (if present), and ASCIIFoldingFilter (if present) and putting them in the
> multiterm analyzer along with a (hardcoded) WhitespaceTokenizer.
>
> As of 3.6 and 4.0, this will be the default behavior, although you can
> explicitly define a field type parameter to specify the current behavior.
>
> The reason it is on 3.6 is that I want it to bake for a while before getting
> into the wild, so I have no intention of trying to get it into the 3.5 release.
>
> The patch is up for review now, I'd like another set of eyeballs or two on it
> before committing.
>
> The patch that's up there now is against trunk but I hope to have a 3x patch
> that I'll apply to the 3x code line after 3.5 RC1 is cut.
>
> Best
> Erick

Re: wild card search and lower-casing

Posted by Erick Erickson <er...@gmail.com>.

As it happens I'm working on SOLR-2438 which should address this. This patch
will provide two things:

The ability to define a new analysis chain in your schema.xml, currently called
"multiterm", that will be applied to queries of various sorts, including
wildcard, prefix, range. This will be somewhat of an "expert" thing to make
yourself...

In the absence of an explicit definition it'll synthesize a multiterm analyzer
out of the query analyzer, taking any char filters, and LowerCaseFilter
(if present), and ASCIIFoldingFilter (if present) and putting them in the
multiterm analyzer along with a (hardcoded) WhitespaceTokenizer.
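
The selection rule described above can be sketched in plain Java. This is a toy model, not the SOLR-2438 patch: components are represented only by their factory-name strings, and `MultiTermChainSketch`, `synthesize` and `KEPT_FILTERS` are names invented for the illustration. It passes every char filter through, adds a whitespace tokenizer, and keeps only the lower-casing and ASCII-folding filters from the query chain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of the synthesis rule; component names are plain strings.
final class MultiTermChainSketch {

    // Token filters this sketch carries over into the multiterm chain.
    private static final List<String> KEPT_FILTERS = Arrays.asList(
            "solr.LowerCaseFilterFactory",
            "solr.ASCIIFoldingFilterFactory");

    // Builds the component list of a synthesized multiterm chain:
    // all char filters, a whitespace tokenizer, then the kept token filters.
    static List<String> synthesize(List<String> charFilters, List<String> queryFilters) {
        List<String> chain = new ArrayList<>(charFilters);
        chain.add("solr.WhitespaceTokenizerFactory");
        for (String filter : queryFilters) {
            if (KEPT_FILTERS.contains(filter)) {
                chain.add(filter);
            }
        }
        return chain;
    }

    public static void main(String[] args) {
        // Query-side filters from the original post in this thread.
        List<String> queryFilters = Arrays.asList(
                "solr.StandardFilterFactory",
                "solr.LowerCaseFilterFactory");
        System.out.println(synthesize(new ArrayList<>(), queryFilters));
    }
}
```

For the query chain in the original post, the sketch drops StandardFilterFactory and keeps the lower-casing step, which is the part a wildcard term needs.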

As of 3.6 and 4.0, this will be the default behavior, although you can
explicitly define a field type parameter to specify the current behavior.

The reason it is on 3.6 is that I want it to bake for a while before getting
into the wild, so I have no intention of trying to get it into the 3.5 release.

The patch is up for review now, I'd like another set of eyeballs or two on it
before committing.

The patch that's up there now is against trunk but I hope to have a 3x patch
that I'll apply to the 3x code line after 3.5 RC1 is cut.

Best
Erick

On Fri, Nov 18, 2011 at 12:05 PM, Ahmet Arslan <io...@yahoo.com> wrote:
>
>> You're right:
>>
>> public SolrQueryParser(IndexSchema schema, String defaultField) {
>> ...
>> setLowercaseExpandedTerms(false);
>> ...
>> }
>
> Please note that lowercaseExpandedTerms uses String.toLowerCase() (which uses the default Locale), which is a Locale-sensitive operation.
>
> In Lucene, AnalyzingQueryParser exists for this purpose, but I am not sure if it has been ported to Solr.
>
>  http://lucene.apache.org/java/3_0_2/api/contrib-misc/org/apache/lucene/queryParser/analyzing/AnalyzingQueryParser.html
>

Re: wild card search and lower-casing

Posted by Ahmet Arslan <io...@yahoo.com>.

> You're right:
> 
> public SolrQueryParser(IndexSchema schema, String defaultField) {
> ...
> setLowercaseExpandedTerms(false);
> ...
> }

Please note that lowercaseExpandedTerms uses String.toLowerCase() (which uses the default Locale), which is a Locale-sensitive operation.
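
To make the locale caveat concrete, here is a minimal sketch that uses only the JDK. The class name `LocaleLowercaseDemo` and its `lower` helper are made up for this example and are not Lucene or Solr code. Under Turkish rules, upper-case `I` lower-cases to dotless `ı` (U+0131), so a term lower-cased on a JVM with a Turkish default locale may not match an index term that was lower-cased under locale-neutral rules.

```java
import java.util.Locale;

// Hypothetical demo class; it only exercises java.lang.String, not Lucene.
final class LocaleLowercaseDemo {

    // Lower-cases a term with an explicit locale instead of the JVM default.
    static String lower(String term, Locale locale) {
        return term.toLowerCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        // Locale-neutral rules keep the dotted 'i'.
        System.out.println(lower("TITLE", Locale.ROOT));
        // Turkish rules map 'I' to dotless 'ı' (U+0131).
        System.out.println(lower("TITLE", turkish));
        // The two results differ, so they would not match as index terms.
        System.out.println(lower("TITLE", Locale.ROOT).equals(lower("TITLE", turkish)));
    }
}
```

Terms without an `I`, such as the OCVD and MOCV examples in this thread, come out the same under both locales.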

In Lucene, AnalyzingQueryParser exists for this purpose, but I am not sure if it has been ported to Solr.

  http://lucene.apache.org/java/3_0_2/api/contrib-misc/org/apache/lucene/queryParser/analyzing/AnalyzingQueryParser.html

Re: wild card search and lower-casing

Posted by Dmitry Kan <dm...@gmail.com>.

You're right:

public SolrQueryParser(IndexSchema schema, String defaultField) {
...
setLowercaseExpandedTerms(false);
...
}

OK, thanks for pointing that out.

On Fri, Nov 18, 2011 at 4:12 PM, Ahmet Arslan <io...@yahoo.com> wrote:

> > Actually I have just checked the source code of Lucene's
> > QueryParser and
> > lowercaseExpandedTerms there is set to true by default
> > (version 3.4). The
> > code there does lower-casing by default. So in that sense I
> > don't need to
> > do anything in the client code. Is something wrong here?
>
> But SolrQueryParser extends that, and its default behavior may be different.
> For clarification, see the source code of SolrQueryParser.
>



-- 
Regards,

Dmitry Kan

Re: wild card search and lower-casing

Posted by Ahmet Arslan <io...@yahoo.com>.

> Actually I have just checked the source code of Lucene's
> QueryParser and
> lowercaseExpandedTerms there is set to true by default
> (version 3.4). The
> code there does lower-casing by default. So in that sense I
> don't need to
> do anything in the client code. Is something wrong here?

But SolrQueryParser extends that, and its default behavior may be different. For clarification, see the source code of SolrQueryParser.
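
A toy model of how a subclass can silently change an inherited default is sketched below. BaseParser and SubParser are hypothetical stand-ins written for this illustration; they are not the Lucene or Solr classes, and only the setter name is borrowed from the thread.

```java
import java.util.Locale;

// Hypothetical base parser: lower-cases wildcard terms by default.
class BaseParser {
    private boolean lowercaseExpandedTerms = true;

    final void setLowercaseExpandedTerms(boolean value) {
        lowercaseExpandedTerms = value;
    }

    // Returns the term as it would be used for a wildcard lookup.
    String prepareWildcard(String term) {
        return lowercaseExpandedTerms ? term.toLowerCase(Locale.ROOT) : term;
    }
}

// Hypothetical subclass: flips the inherited default in its constructor.
class SubParser extends BaseParser {
    SubParser() {
        setLowercaseExpandedTerms(false);
    }
}
```

Reading only the base class would suggest that *OCVD is lower-cased, while a caller that actually goes through the subclass gets the term back unchanged.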

Re: wild card search and lower-casing

Posted by Dmitry Kan <dm...@gmail.com>.

OK.

Actually I have just checked the source code of Lucene's QueryParser and
lowercaseExpandedTerms there is set to true by default (version 3.4). The
code there does lower-casing by default. So in that sense I don't need to
do anything in the client code. Is something wrong here?

On Fri, Nov 18, 2011 at 3:49 PM, Ahmet Arslan <io...@yahoo.com> wrote:

> > Hi Ahmet,
> >
> > Thanks for the link.
> >
> > I'm a bit puzzled with the explanation found there
> > regarding lower casing:
> >
> > These queries are case-insensitive anyway because
> > QueryParser makes them
> > lowercase.
> >
> > That's exactly what I want to achieve, but somehow the
> > queries *are*
> > case-sensitive. I should probably play around with the code of
> > a query parser.
>
> There is an effort for this:
> https://issues.apache.org/jira/browse/SOLR-218
> You can vote for this issue. For the time being, you can lowercase them on
> the client side.
>

-- 
Regards,

Dmitry Kan

Re: wild card search and lower-casing

Posted by Ahmet Arslan <io...@yahoo.com>.

> Hi Ahmet,
> 
> Thanks for the link.
> 
> I'm a bit puzzled with the explanation found there
> regarding lower casing:
> 
> These queries are case-insensitive anyway because
> QueryParser makes them
> lowercase.
> 
> That's exactly what I want to achieve, but somehow the
> queries *are*
> case-sensitive. I should probably play around with the code of
> a query parser.

There is an effort for this:
https://issues.apache.org/jira/browse/SOLR-218
You can vote for this issue. For the time being, you can lowercase them on the client side.
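
One way the client-side workaround could look is sketched below, assuming queries are simple space-separated terms. WildcardLowercaser and lowercaseWildcardTerms are hypothetical names; the sketch lower-cases only tokens that contain * or ?, keeps an optional field: prefix as-is, and does not handle quoted phrases or escaped wildcards.

```java
import java.util.Locale;

final class WildcardLowercaser {
    // Lower-cases only the tokens that contain a wildcard, so operators
    // such as AND/OR/NOT and ordinary terms are left untouched.
    static String lowercaseWildcardTerms(String query) {
        String[] tokens = query.split(" ", -1);
        for (int i = 0; i < tokens.length; i++) {
            String token = tokens[i];
            if (token.indexOf('*') >= 0 || token.indexOf('?') >= 0) {
                // Keep an optional "field:" prefix exactly as typed.
                int colon = token.indexOf(':');
                String field = colon >= 0 ? token.substring(0, colon + 1) : "";
                String term = token.substring(field.length());
                tokens[i] = field + term.toLowerCase(Locale.ROOT);
            }
        }
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        System.out.println(lowercaseWildcardTerms("*OCVD"));
        System.out.println(lowercaseWildcardTerms("MOCV* AND Title:FOO?"));
    }
}
```

Lower-casing the whole query string would be simpler, but it would also turn operators like AND into plain terms, which is why only the wildcard tokens are touched here.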

Re: wild card search and lower-casing

Posted by Dmitry Kan <dm...@gmail.com>.

Hi Ahmet,

Thanks for the link.

I'm a bit puzzled with the explanation found there regarding lower casing:

These queries are case-insensitive anyway because QueryParser makes them
lowercase.

That's exactly what I want to achieve, but somehow the queries *are*
case-sensitive. I should probably play around with the code of a query parser.

On Fri, Nov 18, 2011 at 2:50 PM, Ahmet Arslan <io...@yahoo.com> wrote:

> > Here is one puzzle I couldn't yet find a key for:
> >
> > for the wild-card query:
> >
> > *ocvd
> >
> > SOLR 3.4 returns hits. But for
> >
> > *OCVD
> >
> > it doesn't
>
> This is a FAQ. Please see
>
>
> http://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F
>

-- 
Regards,

Dmitry Kan

Re: wild card search and lower-casing

Posted by Ahmet Arslan <io...@yahoo.com>.

> Here is one puzzle I couldn't yet find a key for:
> 
> for the wild-card query:
> 
> *ocvd
> 
> SOLR 3.4 returns hits. But for
> 
> *OCVD
> 
> it doesn't

This is a FAQ. Please see 

http://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F