Posted to solr-user@lucene.apache.org by Björn Keil <de...@web.de> on 2015/05/20 12:33:38 UTC

[solr 5.1] Looking for full text + collation search field

Hello,

could anyone suggest a field type that supports both full-text search
(i.e. one that has an analyzer including a tokenizer) and a collation?

An example for what I want to do:
There is a field "composer" for which I passed the value "Dvořák, Antonín".

I want the following queries to match:
composer:(antonín dvořák)
composer:dvorak
composer:"dvorak, antonin"

The last case is possible using a solr.ICUCollationField, but that
type does not support an analyzer and consequently has no tokenizer,
so it does not help.

Unlike in earlier versions of Solr, there do not seem to be any
CollationKeyFilters that you can plug into the analyzer of a
solr.TextField, so I am at a bit of a loss as to how to get *both* a
tokenizer and a collation at the same time.

Thanks for help,
Björn

Re: [solr 5.1] Looking for full text + collation search field

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.

Hi Bjorn,

I am not 100% sure, but ICUFoldingFilter may suit you.
It also removes diacritics.
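
To illustrate the idea behind a folding filter in an analyzer, here is a minimal Python sketch. It is not Solr code and not what ICUFoldingFilter does internally: it assumes that lowercasing plus dropping combining marks after NFKD decomposition is an acceptable stand-in for the folding step. The function names (fold, analyze, matches_all, matches_phrase) are made up for this example.

```python
import re
import unicodedata


def fold(text):
    """Lowercase, decompose (NFKD) and drop combining marks such as accents."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text):
    """Fold the text, then split it into word tokens."""
    return re.findall(r"\w+", fold(text))


def matches_all(doc_tokens, query):
    """True if every query term occurs in the document (AND semantics)."""
    return all(term in doc_tokens for term in analyze(query))


def matches_phrase(doc_tokens, query):
    """True if the query terms occur as one consecutive run in the document."""
    terms = analyze(query)
    windows = range(len(doc_tokens) - len(terms) + 1)
    return any(doc_tokens[i:i + len(terms)] == terms for i in windows)


indexed = analyze("Dvořák, Antonín")
print(indexed)
print(matches_all(indexed, "antonín dvořák"))
print(matches_all(indexed, "dvorak"))
print(matches_phrase(indexed, "dvorak, antonin"))
```

Because the same folding is applied at index time and at query time, all three example queries from the original question match. Characters without a Unicode decomposition, such as "ø" or "Æ", pass through this sketch unchanged, so it covers less than a real folding filter would.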

ahmet

On Thursday, May 21, 2015 3:20 PM, Björn Keil <gr...@yahoo.de> wrote:
Thanks for the advice. I have tried the field type and it seems to do what it is supposed to in combination with a lower case filter.

However, that raises another slight problem:

German umlauts are supposed to be treated slightly differently for searching than for sorting. For sorting, a normal ICUCollationField with standard rules should suffice*. For searching, however, I cannot just replace an "ü" with a "u": "ü" is supposed to equal "ue", or, in terms of RuleBasedCollators, there is only a secondary difference between the two.

The rules for the collator include:

& ue , ü
& ae , ä
& oe , ö
& ss , ß

(Again, that applies to searching *only*; for sorting, the rule "& a , ä" would apply, which is implied by the default rules.)

I can of course write a filter that does these rudimentary replacements myself, ideally placed after the lower case filter but before the ASCIIFoldingFilter. I am just wondering whether there is some way to use collation keys for full-text search.
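
The ordering described above (lowercase, then German expansions, then generic folding) can be sketched in Python as follows. This is only an illustration of the idea, not a Solr filter: the table GERMAN_EXPANSIONS and the function search_key are hypothetical names, and stripping combining marks after NFKD decomposition is assumed here as a rough substitute for ASCIIFoldingFilter.

```python
import unicodedata

# Search-only expansions taken from the collator rules listed above.
GERMAN_EXPANSIONS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def search_key(token):
    """Return a search key: lowercase, expand umlauts, then strip accents."""
    # Compose first so that "u" + combining diaeresis is seen as "ü".
    lowered = unicodedata.normalize("NFC", token).lower()
    # The expansion must run before folding, or "ü" would collapse to "u".
    expanded = "".join(GERMAN_EXPANSIONS.get(ch, ch) for ch in lowered)
    decomposed = unicodedata.normalize("NFKD", expanded)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(search_key("Müller"), search_key("Mueller"), search_key("Muller"))
print(search_key("Dvořák"), search_key("Straße"))
```

With this order, "Müller" and "Mueller" produce the same key while "Muller" stays distinct, and other diacritics are still removed. One caveat of such a sketch is that it applies the German rule to every language, so a Turkish or Finnish "ü" would be expanded to "ue" as well.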

________________

* Even though Latin script, and German specifically, is my primary concern, I want some rudimentary support for all European languages, including those that use Cyrillic and Greek script, special characters in Icelandic that are not strictly Latin, and ligatures like "Æ", which collation keys could easily provide.

Ahmet Arslan <io...@yahoo.com.INVALID> wrote at 22:10 on Wednesday, 20 May 2015:
Hi Bjorn,

solr.ICUCollationField is useful for *sorting*, and you cannot sort on tokenized fields.

Your example looks like a diacritics-insensitive search.
Please see: ASCIIFoldingFilterFactory

Ahmet

Re: [solr 5.1] Looking for full text + collation search field

Posted by TK Solr <tk...@sonic.net>.

On 5/21/15, 5:19 AM, Björn Keil wrote:
> Thanks for the advice. I have tried the field type and it seems to do what it is supposed to in combination with a lower case filter.
>
> However, that raises another slight problem:
>
> German umlauts are supposed to be treated slightly different for the purpose of searching than for sorting. For sorting a normal ICUCollationField with standard rules should suffice*, for the purpose of searching I cannot just replace an "ü" with a "u", "ü" is supposed to equal "ue", or, in terms of RuleBasedCollators, there is a secondary difference.

I haven't used it personally, but GermanNormalizationFilter seems to do the job:
https://lucene.apache.org/core/5_1_0/analyzers-common/org/apache/lucene/analysis/de/GermanNormalizationFilter.html
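
For reference, the Javadoc linked above describes these rules: "ß" becomes "ss"; "ä", "ö", "ü" become "a", "o", "u"; "ae" and "oe" become "a" and "o"; and "ue" becomes "u" when it does not follow a vowel or "q". The Python sketch below is a rough approximation of those documented rules, not a port of the Lucene filter, and may differ from it on edge cases. It assumes lowercased, precomposed input; german_normalize is a made-up name.

```python
import re

# Replacement table for the rules described in the Javadoc.
_RULES = {"ß": "ss", "ä": "a", "ö": "o", "ü": "u",
          "ae": "a", "oe": "o", "ue": "u"}

# "ue" is only rewritten when it does not follow a vowel, an umlaut or "q".
_PATTERN = re.compile(r"ae|oe|(?<![aeiouyqäöü])ue|[äöüß]")


def german_normalize(token):
    """Apply the normalization rules in a single left-to-right pass."""
    return _PATTERN.sub(lambda m: _RULES[m.group(0)], token)


for word in ("müller", "mueller", "muller", "quelle", "straße"):
    print(word, "->", german_normalize(word))
```

Under these rules "müller", "mueller" and "muller" all normalize to the same form, so "ü" matches "ue" but also plain "u", which may be looser than the secondary-difference behaviour asked for in the quoted message.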

Re: [solr 5.1] Looking for full text + collation search field

Posted by Björn Keil <gr...@yahoo.de>.

Thanks for the advice. I have tried the field type and it seems to do what it is supposed to in combination with a lower case filter.

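However, that raises another slight problem:

German umlauts are supposed to be treated slightly differently for searching than for sorting. For sorting, a normal ICUCollationField with standard rules should suffice. For searching, however, I cannot just replace an "ü" with a "u": "ü" is supposed to equal "ue", or, in terms of RuleBasedCollators, there is only a secondary difference between the two.

The rules for the collator include: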

& ue , ü
& ae , ä
& oe , ö
& ss , ß

(Again, that applies to searching *only*; for sorting, the rule "& a , ä" would apply, which is implied by the default rules.)

Re: [solr 5.1] Looking for full text + collation search field

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.

Hi Bjorn,

solr.ICUCollationField is useful for *sorting*, and you cannot sort on tokenized fields.

Your example looks like a diacritics-insensitive search.
Please see: ASCIIFoldingFilterFactory
 
Ahmet

