Posted to solr-user@lucene.apache.org by Peter Karich <pe...@yahoo.de> on 2010/11/18 21:00:28 UTC

WordDelimiterFilterFactory + CamelCase query

  Hi,

I am going crazy: which config is necessary to include the missing doc 2?
I have:
doc1 tw:aBc
doc2 tw:abc

Now a query for "aBc" returns only doc 1, although when I try doc 2 in
admin/analysis.jsp the index term text 'abc' gets highlighted as intended.
I even indexed a simple example (no stopwords, no protwords, no
synonyms) via * and tried this with both the normal and the dismax
handler, but I cannot get it to work :-/

What have I misunderstood?

Regards,
Peter.


<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" protected="protwords.txt"
            generateWordParts="1" generateNumberParts="1"
            catenateAll="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" protected="protwords.txt"
            generateWordParts="1" generateNumberParts="1"
            catenateAll="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
</fieldType>
--
<field name="tw" type="text" indexed="true" stored="true"/>

*
books.csv:

id,tw
1,aBc
2,abc

curl 'http://localhost:8983/solr/update/csv?commit=true' --data-binary @books.csv -H 'Content-type:text/plain; charset=utf-8'


Re: WordDelimiterFilterFactory + CamelCase query

Posted by Peter Karich <pe...@yahoo.de>.
  Hi,

the final solution is explained here in context:
http://mail-archives.apache.org/mod_mbox/lucene-dev/201011.mbox/%3CAANLkTimaTGvpLPH_mGfbSUGhDOEDC8TC2bRRWxhiDO1K@mail.gmail.com%3E

"

If you are using Solr branch_3x or trunk, you can turn this off, by
setting autoGeneratePhraseQueries to false in the fieldType.
<fieldType name="text" class="solr.TextField"
positionIncrementGap="100" autoGeneratePhraseQueries="false">
By enabling this option, phrase queries are only created by the
queryparser when you enclose stuff in double quotes.

If you are using an older version of solr such as 1.4.x, then you can
only hack it, by adding a PositionFilterFactory to the end of your
query analyzer.
The downside to that approach (unfortunately the only approach, for
older versions) is that it completely disables phrasequeries across
the board for that field type.

"
So it is not a bug in WDF.
Thanks to Robert!

Regards,
Peter.
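
For anyone finding this thread later: the two options from the quoted explanation might look roughly like the following in schema.xml. This is only a sketch built on the fieldType from the original post. The type names "text_no_autophrase" and "text_14x" are made up for illustration, the <types> wrapper is assumed from the usual schema.xml layout, and only one of the two variants would apply to a given Solr version.

```xml
<types>
  <!-- Variant A (Solr branch_3x or trunk): analyzers as in the original post,
       with autoGeneratePhraseQueries="false" added on the fieldType.
       The type name is hypothetical. -->
  <fieldType name="text_no_autophrase" class="solr.TextField"
             positionIncrementGap="100" autoGeneratePhraseQueries="false">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.WordDelimiterFilterFactory" protected="protwords.txt"
              generateWordParts="1" generateNumberParts="1"
              catenateAll="0" preserveOriginal="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English"
              protected="protwords.txt"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="true"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true"
              words="stopwords.txt" enablePositionIncrements="true"/>
      <filter class="solr.WordDelimiterFilterFactory" protected="protwords.txt"
              generateWordParts="1" generateNumberParts="1"
              catenateAll="0" preserveOriginal="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English"
              protected="protwords.txt"/>
    </analyzer>
  </fieldType>

  <!-- Variant B (older versions such as Solr 1.4.x): same analyzers, with a
       PositionFilterFactory appended as the last filter of the query analyzer.
       The type name is hypothetical. -->
  <fieldType name="text_14x" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.WordDelimiterFilterFactory" protected="protwords.txt"
              generateWordParts="1" generateNumberParts="1"
              catenateAll="0" preserveOriginal="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English"
              protected="protwords.txt"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="true"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true"
              words="stopwords.txt" enablePositionIncrements="true"/>
      <filter class="solr.WordDelimiterFilterFactory" protected="protwords.txt"
              generateWordParts="1" generateNumberParts="1"
              catenateAll="0" preserveOriginal="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English"
              protected="protwords.txt"/>
      <!-- Last step: puts all query tokens at the same position, so no
           phrase query is generated from the WDF output. -->
      <filter class="solr.PositionFilterFactory"/>
    </analyzer>
  </fieldType>
</types>
```

As the quoted explanation says, variant B disables phrase queries for that field type altogether, so it is a trade-off rather than a clean fix.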



Re: WordDelimiterFilterFactory + CamelCase query

Posted by Peter Karich <pe...@yahoo.de>.
> Peter,
>
> I recently had this issue, and I had to set splitOnCaseChange="0" to
> keep the word delimiter filter from doing what you describe. Can you
> try that and see if it helps?
>
> - Ken
>

Hi Ken,

yes, this would solve my problem,
but then I would lose the match for 'SuperMario' when I query 'mario', right?

This is not an option for me at the moment. Maybe it's a bug in Solr?
Again, the admin page says "all is fine", but when I query via HTTP
(or SolrJ) it does not return doc2. Strange.

Regards,
Peter.
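
For reference, Ken's suggestion written out as a filter definition would presumably look like this. It is a sketch that reuses the attributes from the original post and only adds splitOnCaseChange="0"; it is assumed here that the same change would go into both the index and the query analyzer, followed by a reindex.

```xml
<!-- WDF with case-change splitting switched off: "aBc" stays one token,
     but "SuperMario" is then no longer split into "Super" and "Mario". -->
<filter class="solr.WordDelimiterFilterFactory" protected="protwords.txt"
        generateWordParts="1" generateNumberParts="1"
        catenateAll="0" preserveOriginal="1"
        splitOnCaseChange="0"/>
```

That lost split is exactly the trade-off raised above: a query for 'mario' would no longer find 'SuperMario'.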

Re: WordDelimiterFilterFactory + CamelCase query

Posted by Ken Stanley <do...@gmail.com>.
On Thu, Nov 18, 2010 at 3:22 PM, Peter Karich <pe...@yahoo.de> wrote:
>
>> Hi,
>>
>> Please add preserveOriginal="1"  to your WDF [1] definition and reindex
>> (or
>> just try with the analysis page).
>
> but it is already there!?
>
> <filter class="solr.WordDelimiterFilterFactory" protected="protwords.txt"
>                         generateWordParts="1" generateNumberParts="1"
> catenateAll="0" preserveOriginal="1"/>
>
>
> Regards,
> Peter.
>

Peter,

I recently had this issue, and I had to set splitOnCaseChange="0" to
keep the word delimiter filter from doing what you describe. Can you
try that and see if it helps?

- Ken

Re: WordDelimiterFilterFactory + CamelCase query

Posted by Peter Karich <pe...@yahoo.de>.
> Hi,
>
> Please add preserveOriginal="1"  to your WDF [1] definition and reindex (or
> just try with the analysis page).

but it is already there!?

<filter class="solr.WordDelimiterFilterFactory" protected="protwords.txt"
        generateWordParts="1" generateNumberParts="1"
        catenateAll="0" preserveOriginal="1"/>


Regards,
Peter.


Re: WordDelimiterFilterFactory + CamelCase query

Posted by Markus Jelsma <ma...@openindex.io>.
Hi,

Please add preserveOriginal="1" to your WDF [1] definition and reindex (or
just try with the analysis page).

This will make sure the original input token is preserved alongside the
newly generated tokens. If you then pass it all through a lowercase filter, it
should match both documents.

[1]: 
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory

Cheers,

