
Posted to dev@lucene.apache.org by Peter Karich <pe...@yahoo.de> on 2010/11/18 23:26:17 UTC

WordDelimiterFilter bug

  Hi,

I asked this on the user list, and I think I found a bug in
WordDelimiterFilterFactory for splitOnCaseChange="1" catenateAll="0"
preserveOriginal="1" (followed by a lowercase filter).
Add the following test* and append the field definition** to schema.xml,
and the test won't pass. Should I open a JIRA issue for this, or is this
not a bug and I missed something?
(The strange thing is that the admin GUI highlights it correctly.)

Regards,
Peter.

BTW: I just read the code of SpellCheckCollator because it didn't
compile. It currently reads:
} catch (Exception e) {
           Log.warn("Exception trying to re-query to check if a spell check possibility would return any hits.", e);
It should NOT use the Jetty Log class (so the Jetty dependency can be removed). It should be:
} catch (Exception e) {
           LOG.warn("Exception trying to re-query to check if a spell check possibility would return any hits.", e);

*
   @Test
   public void testCaseChangeAndPreserve() {
     assertU(adoc("id",  "1",
                  "subword_cc", "abcd"));
     assertU(adoc("id",  "2",
                  "subword_cc", "abCd.com"));
     assertU(commit());

     assertQ("simple - case change and preserve",
             req("subword_cc:(abcd)")
             ,"//result[@numFound=1]"
     );
     // currently returns only doc 2;
     // should also return doc 1 because abCd should be preserved and then
     // lowercased by the lowercase filter (for the query)
     assertQ("camel case query - case change and preserve",
             req("subword_cc:(abCd)")
             ,"//result[@numFound=2]"
     );
     // currently returns 0 docs;
     // should return doc 2 because abCd.com should be preserved and then
     // lowercased by the lowercase filter (for the index)
     assertQ("camel case domain - case change and preserve",
             req("subword_cc:(abcd.com)")
             ,"//result[@numFound=1]"
     );
     clearIndex();
   }

**
<fieldtype name="subword_cc" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1"
            catenateAll="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1"
            catenateAll="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>

<field name="subword_cc" type="subword_cc" indexed="true" stored="true"/>



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: WordDelimiterFilter bug

Posted by Peter Karich <pe...@yahoo.de>.

  Thanks for the explanation! That makes sense :-)

Regards,
Peter.




Re: WordDelimiterFilter bug

Posted by Robert Muir <rc...@gmail.com>.

On Fri, Nov 19, 2010 at 6:18 AM, Peter Karich <pe...@yahoo.de> wrote:
>  Hi Robert,
>
> thanks a lot! I will try a newer solr version for other reasons then I will
> try your suggested option too!
> (I will repost your solution to the user mailing list if that is ok for you
> ...)

yes, please do!

>
> Where can I find more info about phrasequeries? I only found*
> I mean, how does MultiPhraseQuery selects its documents for (tw:"(abc a)
> bc") ?

the multiphrasequery is just like a more general phrase query.

a phrase query for "abc bc" looks for "abc" in the document, followed by "bc"
a multiphrasequery for "(abc a) bc" looks for either "abc" or "a" in the
document, followed by "bc".

this is also the same way synonyms work with phrase queries.
imagine you have a synonyms file that looks like this:
dog => dog, dogs
food => food, chow

then if a user types "dog food", the resulting query is a
multiphrasequery of "(dog dogs) (food chow)"
this matches all 4 possibilities:
dog food
dogs food
dog chow
dogs chow

for more information, you can see the code to this query here:
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/search/MultiPhraseQuery.java
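
As a rough illustration of the synonym example above, here is a plain-Java
sketch that expands per-position alternatives into the concrete phrases they
cover. It is a toy model, not Lucene's actual MultiPhraseQuery implementation;
the class name MultiPhraseExpansion and the method expand are made up for
this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: NOT Lucene code. All names here are hypothetical.
public class MultiPhraseExpansion {

    /** Expands a list of per-position alternatives into every concrete phrase. */
    public static List<String> expand(List<List<String>> positions) {
        List<String> phrases = new ArrayList<>();
        phrases.add(""); // start with a single empty prefix
        for (List<String> alternatives : positions) {
            List<String> next = new ArrayList<>();
            for (String prefix : phrases) {
                for (String term : alternatives) {
                    // append each alternative at this position to each prefix
                    next.add(prefix.isEmpty() ? term : prefix + " " + term);
                }
            }
            phrases = next;
        }
        return phrases;
    }

    public static void main(String[] args) {
        // "(dog dogs) (food chow)": two positions, two alternatives each
        List<List<String>> query = Arrays.asList(
                Arrays.asList("dog", "dogs"),
                Arrays.asList("food", "chow"));
        for (String phrase : expand(query)) {
            System.out.println(phrase);
        }
    }
}
```

A real MultiPhraseQuery does not enumerate phrases like this; the expansion
only makes visible which phrases the query is equivalent to.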


Re: WordDelimiterFilter bug

Posted by Peter Karich <pe...@yahoo.de>.

  Hi Robert,

thanks a lot! I will try a newer solr version for other reasons anyway,
and then I will try your suggested option too!
(I will repost your solution to the user mailing list if that is ok for
you ...)

Where can I find more info about phrase queries? I only found the article
linked below*.
I mean, how does MultiPhraseQuery select its documents for
(tw:"(abc a) bc")?

Regards,
Peter.

*
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Optimizing-Findability-Lucene-and-Solr




Re: WordDelimiterFilter bug

Posted by Robert Muir <rc...@gmail.com>.

On Fri, Nov 19, 2010 at 5:12 AM, Peter Karich <pe...@yahoo.de> wrote:
>  Hi Robert,
>
>>  QueryGenerator^H^H^HParser
>
> Thanks for the hint. I should have done a "debugQuery=on" earlier ... sorry.
>
> But how can I get: <str name="parsedquery">tw:abc tw:a tw:bc
> instead of: "parsedquery">MultiPhraseQuery(tw:"(abc a) bc")
> for the query "aBc" ?
>

If you are using Solr branch_3x or trunk, you can turn this off, by
setting autoGeneratePhraseQueries to false in the fieldType.
<fieldType name="text" class="solr.TextField"
positionIncrementGap="100" autoGeneratePhraseQueries="false">
With this option set to false, phrase queries are only created by the
queryparser when you enclose terms in double quotes.

If you are using an older version of solr such as 1.4.x, then you can
only hack it, by adding a PositionFilterFactory to the end of your
query analyzer.
The downside to that approach (unfortunately the only approach, for
older versions) is that it completely disables phrasequeries across
the board for that field type.
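
To make the difference concrete, here is a plain-Java sketch that contrasts
phrase semantics with plain OR semantics for the two documents from the test
at the start of this thread. It is a toy model, not Solr or Lucene code. The
token positions are an assumption about what the analyzer chain produces
(the preserved original at the same position as the first subword), and all
class, method, and field names are made up for this example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: NOT Solr/Lucene code. Token positions are assumed.
public class PhraseVersusOr {

    /** Builds the set of terms that share one token position. */
    public static Set<String> at(String... terms) {
        return new HashSet<>(Arrays.asList(terms));
    }

    // Assumed index-time tokens, one set per position.
    public static final List<Set<String>> DOC1 = Arrays.asList(at("abcd"));
    public static final List<Set<String>> DOC2 =
            Arrays.asList(at("abcd.com", "ab"), at("cd"), at("com"));

    // Assumed query-time tokens for "abCd" and "abcd.com".
    public static final List<Set<String>> Q_CAMEL =
            Arrays.asList(at("abcd", "ab"), at("cd"));
    public static final List<Set<String>> Q_DOMAIN =
            Arrays.asList(at("abcd.com", "abcd"), at("com"));

    /** Phrase semantics: query positions must match consecutive document positions. */
    public static boolean matchesPhrase(List<Set<String>> doc, List<Set<String>> query) {
        for (int start = 0; start + query.size() <= doc.size(); start++) {
            boolean all = true;
            for (int i = 0; i < query.size() && all; i++) {
                all = !Collections.disjoint(doc.get(start + i), query.get(i));
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    /** OR semantics: a single shared term anywhere in the document is enough. */
    public static boolean matchesAny(List<Set<String>> doc, List<Set<String>> query) {
        for (Set<String> q : query) {
            for (Set<String> d : doc) {
                if (!Collections.disjoint(d, q)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("phrase abCd     doc1: " + matchesPhrase(DOC1, Q_CAMEL));
        System.out.println("phrase abCd     doc2: " + matchesPhrase(DOC2, Q_CAMEL));
        System.out.println("phrase abcd.com doc2: " + matchesPhrase(DOC2, Q_DOMAIN));
        System.out.println("or     abCd     doc1: " + matchesAny(DOC1, Q_CAMEL));
        System.out.println("or     abCd     doc2: " + matchesAny(DOC2, Q_CAMEL));
    }
}
```

Under these assumptions the phrase form matches only doc 2 for abCd and
nothing for abcd.com, which is what the test observed, while the OR form
matches both documents for abCd. OR semantics are broader in general, so
hit counts for other queries may change as well.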


Re: WordDelimiterFilter bug

Posted by Peter Karich <pe...@yahoo.de>.

  Hi Robert,

>  QueryGenerator^H^H^HParser

Thanks for the hint. I should have done a "debugQuery=on" earlier ... sorry.

But how can I get: <str name="parsedquery">tw:abc tw:a tw:bc</str>
instead of: <str name="parsedquery">MultiPhraseQuery(tw:"(abc a) bc")</str>
for the query "aBc"?

Regards,
Peter.




Re: WordDelimiterFilter bug

Posted by Robert Muir <rc...@gmail.com>.

On Thu, Nov 18, 2010 at 5:26 PM, Peter Karich <pe...@yahoo.de> wrote:
>  Hi,
>
> I asked this on the user list and I think I found a bug in
> ...
> (The strange thing is that the admin GUI will highlight it correctly)
>

because the admin gui highlights it, there's no bug.

the reason it doesn't match is that QueryGenerator^H^H^HParser
automatically generates phrase queries.
