Posted to solr-user@lucene.apache.org by Olala <ht...@gmail.com> on 2009/12/23 08:42:56 UTC

Search both diacritics and non-diacritics

Hi all!

I am developing a search engine with Solr, and I want to search both with
and without diacritics. For example, if I query kho, it should return kho,
khó, khò, ... But if I query khó, it should return only khó.

Does anyone have a solution? I have used <filter
class="solr.ISOLatin1AccentFilterFactory"/> but the results are not correct :(
-- 
View this message in context: http://old.nabble.com/Search-both-diacritics-and-non-diacritics-tp26897627p26897627.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Search both diacritics and non-diacritics

Posted by Yurish <yu...@inbox.lv>.


Olala wrote:
> 
> Hi all!
> 
> I am developing a search engine with Solr, and I want to search both
> with and without diacritics. For example, if I query kho, it should return
> kho, khó, khò, ... But if I query khó, it should return only khó.
> 
> Does anyone have a solution? I have used <filter
> class="solr.ISOLatin1AccentFilterFactory"/> but the results are not correct :(
> 

How about using <filter class="solr.PatternReplaceFilterFactory"/>? There
you can define a regexp that says: if a term has diacritics, convert it to
its non-diacritic form, then concatenate the original term to that
non-diacritic term.
Put this in the index analyzer. In the query analyzer, don't convert the
query this way. Then you should be able to search for kho and get results
both with and without diacritics, but when querying kho with diacritics,
get only results with diacritics.


Re: Search both diacritics and non-diacritics

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Sun, Jan 3, 2010 at 6:01 AM, Lance Norskog <go...@gmail.com> wrote:

> The ASCIIFoldingFilter is a superset of the ISOLatin1AccentFilter,
> which is deprecated.  Here's the Javadoc from ASCIIFoldingFilter.
> You did not mention which language you want to search.
>
> Unfortunately, the ASCIIFoldingFilter is not mentioned on the Solr wiki.
>
>
Thanks Lance. I've added it to the wiki at
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

-- 
Regards,
Shalin Shekhar Mangar.

Re: Search both diacritics and non-diacritics

Posted by Lance Norskog <go...@gmail.com>.
The ASCIIFoldingFilter is a superset of the ISOLatin1AccentFilter,
which is deprecated.  Here's the Javadoc from ASCIIFoldingFilter.
You did not mention which language you want to search.

Unfortunately, the ASCIIFoldingFilter is not mentioned on the Solr wiki.

http://www.lucidimagination.com/search/?q=ASCIIFoldingFilter+

http://lucene.apache.org/java/2_9_1/api/all/org/apache/lucene/analysis/ASCIIFoldingFilter.html

org.apache.lucene.analysis.ASCIIFoldingFilter
This class converts alphabetic, numeric, and symbolic Unicode
characters which are not in the first 127 ASCII characters (the "Basic
Latin" Unicode block) into their ASCII equivalents, if one exists.
Characters from the following Unicode blocks are converted; however,
only those characters with reasonable ASCII alternatives are
converted:

C1 Controls and Latin-1 Supplement: http://www.unicode.org/charts/PDF/U0080.pdf
Latin Extended-A: http://www.unicode.org/charts/PDF/U0100.pdf
Latin Extended-B: http://www.unicode.org/charts/PDF/U0180.pdf
Latin Extended Additional: http://www.unicode.org/charts/PDF/U1E00.pdf
Latin Extended-C: http://www.unicode.org/charts/PDF/U2C60.pdf
Latin Extended-D: http://www.unicode.org/charts/PDF/UA720.pdf
IPA Extensions: http://www.unicode.org/charts/PDF/U0250.pdf
Phonetic Extensions: http://www.unicode.org/charts/PDF/U1D00.pdf
Phonetic Extensions Supplement: http://www.unicode.org/charts/PDF/U1D80.pdf
General Punctuation: http://www.unicode.org/charts/PDF/U2000.pdf
Superscripts and Subscripts: http://www.unicode.org/charts/PDF/U2070.pdf
Enclosed Alphanumerics: http://www.unicode.org/charts/PDF/U2460.pdf
Dingbats: http://www.unicode.org/charts/PDF/U2700.pdf
Supplemental Punctuation: http://www.unicode.org/charts/PDF/U2E00.pdf
Alphabetic Presentation Forms: http://www.unicode.org/charts/PDF/UFB00.pdf
Halfwidth and Fullwidth Forms: http://www.unicode.org/charts/PDF/UFF00.pdf
See: http://en.wikipedia.org/wiki/Latin_characters_in_Unicode

The set of character conversions supported by this class is a superset of
those supported by Lucene's ISOLatin1AccentFilter which strips accents
from Latin1 characters. For example, 'à' will be replaced by 'a'.

RE: Search both diacritics and non-diacritics

Posted by Chris Hostetter <ho...@fucit.org>.
: I have followed that, but if I query with diacritics it returns only
: non-diacritic results. What I want is to query without diacritics and have
: Solr return results both with and without diacritics :(

What is "it" that you have done? ... can you show us your config?

The diacritic folding issue is essentially the same as lowercasing ... you
have a "strict" fieldtype where you don't lowercase/fold at index or query
time, and then you have a "flattened" field where you lowercase/fold at
index time but not at query time ... then you query both using dismax
(possibly with tie=0.0)

if the user searches w/diacritics, they will only get a match against the
strict field for documents that also had those exact diacritics (they won't
get any match on the flattened field).  if they query w/o diacritics they
will get a match on the strict field for any doc that didn't have
diacritics, and they will also get matches on the flattened field for docs
that did have diacritics (but still match because they were flattened)
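
As a rough sketch of the strict/flattened setup described above (not taken
from anyone's actual config), a schema.xml fragment could look something
like the following. The type and field names (text_strict, text_folded,
title, title_folded) and the tokenizer choice are assumptions made for
illustration. Lowercasing is kept in both types here on the assumption that
only diacritics should be treated strictly.

```xml
<!-- schema.xml fragment (other required elements omitted).
     All type and field names here are hypothetical. -->
<schema name="example" version="1.2">
  <types>
    <!-- "strict": diacritics are never folded, at index or query time -->
    <fieldType name="text_strict" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <!-- "flattened": diacritics are folded at index time only -->
    <fieldType name="text_folded" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>

  <fields>
    <field name="title" type="text_strict" indexed="true" stored="true"/>
    <field name="title_folded" type="text_folded" indexed="true" stored="false"/>
  </fields>

  <!-- index the same source text into the flattened field -->
  <copyField source="title" dest="title_folded"/>
</schema>
```

With a setup like this, a query for khó is not folded, so it cannot match
the folded term kho in title_folded and can only match khó in title. A query
for kho matches kho in title and every folded variant in title_folded.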


-Hoss


RE: Search both diacritics and non-diacritics

Posted by Olala <ht...@gmail.com>.
I have followed that, but if I query with diacritics it returns only
non-diacritic results. What I want is to query without diacritics and have
Solr return results both with and without diacritics :(


Steven A Rowe wrote:
> 
> Hi Olala,
> 
> You can get something similar to what you want by copying the original
> field to another one where, as Hoss suggests, you apply
> ASCIIFoldingFilterFactory, and then rewriting queries to match against both
> fields, with a higher boost given to the original field.
> 
> @Hoss: Olala would benefit from a feature that AFAICT Solr doesn't
> currently have: the ability to add synonyms based on arbitrary
> transforms.
> 
> Steve
> 



RE: Search both diacritics and non-diacritics

Posted by Steven A Rowe <sa...@syr.edu>.
Hi Olala,

You can get something similar to what you want by copying the original field to another one where, as Hoss suggests, you apply ASCIIFoldingFilterFactory, and then rewriting queries to match against both fields, with a higher boost given to the original field.
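
One way the "match against both fields, with a higher boost on the original" part might be configured is a dismax request handler in solrconfig.xml. This is only a sketch: the handler name, the field names title and title_folded (the original field and its folded copy), and the boost value 2.0 are all assumptions for illustration, not values from this thread.

```xml
<!-- solrconfig.xml fragment. The handler name and the field names
     title / title_folded are hypothetical. -->
<requestHandler name="/diacritics" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- the original field is boosted above its folded copy -->
    <str name="qf">title^2.0 title_folded</str>
    <!-- tie=0.0: only the best-scoring field counts toward the score -->
    <float name="tie">0.0</float>
  </lst>
</requestHandler>
```

Whether a query containing diacritics also matches unaccented documents depends on the folded field's query analyzer: if folding is applied only at index time, such a query can match only through the original field.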

@Hoss: Olala would benefit from a feature that AFAICT Solr doesn't currently have: the ability to add synonyms based on arbitrary transforms.

Steve

On 12/28/2009 at 5:33 AM, Olala wrote:
> 
> I tried, but it is still not correct :(
> 
> hossman wrote:
> > > I am developing a search engine with Solr, and I want to search
> > > both with and without diacritics. For example, if I query kho, it
> > > should return kho, khó, khò, ... But if I query khó, it should
> > > return only khó.
> > > 
> > > Does anyone have a solution? I have used <filter
> > > class="solr.ISOLatin1AccentFilterFactory"/> but the results are not
> > > correct :(
> > 
> > try ASCIIFoldingFilterFactory instead.
> > 
> > -Hoss


Re: Search both diacritics and non-diacritics

Posted by Olala <ht...@gmail.com>.
I tried, but it is still not correct :(


hossman wrote:
> 
> 
> : I am developing a search engine with Solr, and I want to search both
> : with and without diacritics. For example, if I query kho, it should
> : return kho, khó, khò, ... But if I query khó, it should return only khó.
> : 
> : Does anyone have a solution? I have used <filter
> : class="solr.ISOLatin1AccentFilterFactory"/> but the results are not correct :(
> 
> try ASCIIFoldingFilterFactory instead.
> 
> 
> -Hoss
> 
> 



Re: Search both diacritics and non-diacritics

Posted by Chris Hostetter <ho...@fucit.org>.
: I am developing a search engine with Solr, and I want to search both with
: and without diacritics. For example, if I query kho, it should return kho,
: khó, khò, ... But if I query khó, it should return only khó.
: 
: Does anyone have a solution? I have used <filter
: class="solr.ISOLatin1AccentFilterFactory"/> but the results are not correct :(

try ASCIIFoldingFilterFactory instead.


-Hoss