Posted to java-user@lucene.apache.org by Karl Wettin <ka...@kodapan.se> on 2013/05/22 14:37:46 UTC

Blåbærsyltetøy v.s. Räksmörgås

This is a question (or perhaps a line of thought) regarding the mutually intelligible Scandinavian languages Danish, Norwegian and Swedish.

The Swedish letters åäö are in fact the same letters as the Danish/Norwegian åæø. A Norwegian writing about the Swedish city of Göteborg writes Gøteborg, and a Swede writing about Svolvær will write Svolvär. This is easy to fix: I can just index synonyms where äö is replaced by æø and vice versa.
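As a rough illustration of the synonym idea above, here is a minimal sketch in plain Java (no Lucene dependency). The class and method names are hypothetical and only show how the ä/ö and æ/ø spellings of a term could be generated before indexing; wiring this into an actual synonym filter is left out.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
final class ScandinavianVariants {

    // Swedish ä/ö rewritten to their Danish/Norwegian counterparts æ/ø.
    static String toDanishNorwegian(String term) {
        return term.replace('ä', 'æ').replace('ö', 'ø')
                   .replace('Ä', 'Æ').replace('Ö', 'Ø');
    }

    // Danish/Norwegian æ/ø rewritten to their Swedish counterparts ä/ö.
    static String toSwedish(String term) {
        return term.replace('æ', 'ä').replace('ø', 'ö')
                   .replace('Æ', 'Ä').replace('Ø', 'Ö');
    }

    // The original term plus its spelling in the other orthography, without duplicates.
    static Set<String> variants(String term) {
        Set<String> result = new LinkedHashSet<>();
        result.add(term);
        result.add(toDanishNorwegian(term));
        result.add(toSwedish(term));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(variants("Göteborg")); // [Göteborg, Gøteborg]
        System.out.println(variants("Svolvær"));  // [Svolvær, Svolvär]
    }
}
```

The letter å is shared by all three languages, so it needs no variant here.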

More problematic, at least in my head, is ASCII-folding.

When Swedes lack these characters on the keyboard, they consistently type a, a, o instead of å, ä, ö. Foreigners also tend to use a, a, o.

In Norway people tend to type aa, ae and oe instead of å, æ and ø. Some use a, a, o. I've also seen oo, ao, etc., and permutations of these. I'm not sure about Denmark, but the pattern is probably the same. I have no idea what letters foreigners might replace them with.

There's a lot of mismatch here. For a start, ASCIIFoldingFilter translates 'ä' to 'a' and 'æ' to 'ae'. The rest does not match what people actually type either: for example, it maps 'ø' to 'o' rather than the more common 'oe'.

I'm considering:

* Forking ASCIIFoldingFilter with a bunch of strategies and indexing permutations of synonyms.
or
* Using a filter after ASCIIFoldingFilter that discriminates all uses of ae, oe, oo, and other combinations of double vowels, keeping just the first one.
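
To make the first option more concrete, here is a minimal sketch in plain Java of what "index permutations" could mean. The class name and the replacement table are assumptions: the table only lists the spellings explicitly mentioned in this post (a/aa for å, a/ae for ä and æ, o/oe for ö and ø) and is not exhaustive, and nothing here uses the Lucene API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
final class AsciiSpellingPermutations {

    // Illustrative table built from the spellings mentioned in the post.
    static final Map<Character, List<String>> REPLACEMENTS = Map.of(
        'å', List.of("a", "aa"),
        'ä', List.of("a", "ae"),
        'æ', List.of("a", "ae"),
        'ö', List.of("o", "oe"),
        'ø', List.of("o", "oe"));

    // Expands a lowercase term into every combination of ASCII replacements.
    static List<String> expand(String term) {
        List<String> results = new ArrayList<>();
        results.add("");
        for (char c : term.toCharArray()) {
            // Characters without an entry are kept unchanged.
            List<String> options = REPLACEMENTS.getOrDefault(c, List.of(String.valueOf(c)));
            List<String> next = new ArrayList<>();
            for (String prefix : results) {
                for (String option : options) {
                    next.add(prefix + option);
                }
            }
            results = next;
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(expand("tøy"));    // [toy, toey]
        System.out.println(expand("blåbær")); // [blabar, blabaer, blaabar, blaabaer]
    }
}
```

The number of variants doubles with every special letter in the term, which is one reason the second option may be more attractive.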

Has anyone else thought about this?

karl
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Blåbærsyltetøy v.s. Räksmörgås

Posted by Karl Wettin <ka...@kodapan.se>.

On 22 May 2013 at 20:29, Petite Abeille wrote:

> 
> On May 22, 2013, at 7:08 PM, Karl Wettin <ka...@kodapan.se> wrote:
> 
>>> * Use a filter after ASCIIFoldingFilter that discriminate all use of ae, oe, oo, and other combination of double vowels, just keeping the first one.
>> 
>> I ended up with that solution.
>> 
>> https://issues.apache.org/jira/browse/LUCENE-5013
> 
> Interesting problem… perhaps you could generalize your solution a bit… for example, in, say, German, one could substitute 'ue' for 'ü', etc… so it looks like what you are after is folding double vowels… irrespectively of how they got there…
> 
> So… assuming something along the lines of Sean M. Burke Unidecode [1] for the purpose of ASCII transliteration, what's left is simply to fold double vowels, e.g.:

I pasted your reply as a comment in the JIRA issue.

Hmm, interesting thought though. I have to consider whether it makes sense to make it this generic. I think it might be problematic for some languages, especially Dutch.
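
A small sketch of why Dutch could be a problem, assuming the fully generic rule "of two adjacent vowels, keep the first". The class below is hypothetical plain Java and is not the filter from the JIRA issue. In Dutch, vowel length is spelled with double vowels, so distinct words would collapse into one term:

```java
// Hypothetical helper, not part of Lucene.
final class DoubleVowelCollision {

    // Keeps the first vowel of every pair of adjacent vowels.
    static String collapse(String term) {
        return term.replaceAll("([aeiou])[aeiou]", "$1");
    }

    public static void main(String[] args) {
        // "maan" (moon) and "man" (man) are different Dutch words.
        System.out.println(collapse("maan")); // man
        System.out.println(collapse("man"));  // man
    }
}
```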



			karl



Re: Blåbærsyltetøy v.s. Räksmörgås

Posted by Petite Abeille <pe...@mac.com>.

On May 22, 2013, at 7:08 PM, Karl Wettin <ka...@kodapan.se> wrote:

>> * Use a filter after ASCIIFoldingFilter that discriminate all use of ae, oe, oo, and other combination of double vowels, just keeping the first one.
> 
> I ended up with that solution.
> 
> https://issues.apache.org/jira/browse/LUCENE-5013

Interesting problem… perhaps you could generalize your solution a bit… for example, in, say, German, one could substitute 'ue' for 'ü', etc… so it looks like what you are after is folding double vowels… irrespective of how they got there…

So… assuming something along the lines of Sean M. Burke's Unidecode [1] for the purpose of ASCII transliteration, what's left is simply to fold double vowels, e.g.:

-- Unidecode is assumed to be an ASCII transliteration function along the
-- lines of Text::Unidecode [1]; it is not part of standard Lua.
local words = {
  'blåbærsyltetøj', 'blåbärsyltetöj', 'blaabaarsyltetoej', 'blabarsyltetoj',
  'Räksmörgås', 'Göteborg', 'Gøteborg', 'Über', 'ueber', 'uber', 'uuber',
}

for i, word in ipairs( words ) do
  -- Of two adjacent vowels keep only the first. The outer parentheses drop
  -- gsub's second return value (the match count) so only the string is printed.
  print( i, ( Unidecode( word ):lower():gsub( '([aeiou]?)([aeiou]?)', '%1' ) ) )
end

> 1	blabarsyltetoj
> 2	blabarsyltetoj
> 3	blabarsyltetoj
> 4	blabarsyltetoj
> 5	raksmorgas
> 6	goteborg
> 7	goteborg	
> 8	uber	
> 9	uber	
> 10	uber	
> 11	uber	
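
For readers following along in Java, the Lua lines above can be approximated with the standard library alone. This is only a sketch under stated assumptions: the class and method names are hypothetical, and toAscii is a rough stand-in for Unidecode that handles just the letters used in this thread (æ and ø are mapped explicitly because Unicode normalization does not decompose them).

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper, not part of Lucene and not a full Unidecode port.
final class DoubleVowelFolding {

    // Rough ASCII transliteration for the letters discussed in this thread.
    static String toAscii(String term) {
        String replaced = term.replace("æ", "ae").replace("Æ", "AE")
                              .replace("ø", "o").replace("Ø", "O");
        // NFD splits å, ä, ö, ü into base letter plus combining mark; drop the marks.
        return Normalizer.normalize(replaced, Normalizer.Form.NFD)
                         .replaceAll("\\p{M}", "");
    }

    // Transliterate, lowercase, then keep the first of two adjacent vowels.
    static String fold(String term) {
        return toAscii(term).toLowerCase(Locale.ROOT)
                            .replaceAll("([aeiou])[aeiou]", "$1");
    }

    public static void main(String[] args) {
        String[] words = {
            "blåbærsyltetøj", "blaabaarsyltetoej", "Räksmörgås",
            "Gøteborg", "Über", "ueber"
        };
        for (String word : words) {
            System.out.println(word + " -> " + fold(word));
        }
    }
}
```

With these inputs the sketch gives the same folded forms as the output listed above, for example blabarsyltetoj for both spellings of the first word and uber for both Über and ueber.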



[1] http://search.cpan.org/~sburke/Text-Unidecode-0.04/lib/Text/Unidecode.pm



Re: Blåbærsyltetøy v.s. Räksmörgås

Posted by Karl Wettin <ka...@kodapan.se>.

On 22 May 2013 at 14:37, Karl Wettin wrote:

> * Use a filter after ASCIIFoldingFilter that discriminate all use of ae, oe, oo, and other combination of double vowels, just keeping the first one.

I ended up with that solution.

https://issues.apache.org/jira/browse/LUCENE-5013



			karl