Posted to dev@lucene.apache.org by bu...@apache.org on 2004/03/20 16:44:39 UTC

DO NOT REPLY [Bug 26763] - [PATCH] Language guesser contribution

http://issues.apache.org/bugzilla/show_bug.cgi?id=26763

------- Additional Comments From halleux.jf@skynet.be  2004-03-20 15:44 -------
For those interested, here is a new iteration of this small guesser.

Changes:
- store all reference files in a single Jar file
- stream-based reading of the Jar should allow it to be used in unpacked WAR files
- added one or two languages
- much faster
- made thread-safe
- added the possibility to restrict the guessing to a subset of the recognized 
languages
- fixed the handling of Windows-style paths
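
To illustrate the stream-based reading mentioned above, here is a minimal sketch, not the code from the patch. It assumes each reference file is a plain text resource holding one trigram per line; the class name ReferenceLoader, the method names and that file layout are all hypothetical. Reading through getResourceAsStream rather than through a file path is what lets the same code work whether the resource sits in a Jar or in a directory on the classpath.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: names and file layout are assumptions, not the patch's API.
public class ReferenceLoader {

    // Reads one trigram per line from any stream. Lines are not trimmed,
    // because a trigram may legitimately start or end with a space.
    static List<String> readTrigrams(InputStream in) {
        List<String> trigrams = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.isEmpty()) {
                    trigrams.add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return trigrams;
    }

    // Opens the resource as a stream from the classpath, so no file system
    // path is ever built. The resource name is supplied by the caller.
    static List<String> loadFromClasspath(String resourceName) {
        InputStream in = ReferenceLoader.class.getResourceAsStream(resourceName);
        if (in == null) {
            throw new IllegalArgumentException("Reference not found: " + resourceName);
        }
        return readTrigrams(in);
    }
}
```

Because no path is assembled by hand, this style also sidesteps platform-specific path separators.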

Here are some results for the currently recognized languages (each language is 
followed by the percentage of correct guesses when reading n trigrams from the 
input, with n going from 30 down to 3)

fr|99|99|99|98|99|100|98|99|98|97|96|99|100|96|93|94|96|90|88|90|89|91|87|78|78|74|47|45

en|99|98|100|100|98|99|99|98|99|97|96|96|99|98|99|96|99|83|87|95|83|82|84|80|66|65|50|47

da|100|99|100|100|98|100|99|100|98|100|97|99|97|98|99|96|96|96|96|92|90|94|89|78|75|66|68|46

de|99|100|99|99|100|99|98|97|99|99|98|97|94|95|98|97|95|94|93|93|93|89|89|88|83|84|67|64

sv|99|100|99|99|97|96|97|99|99|95|92|93|96|91|92|91|93|90|81|81|77|79|82|72|64|61|48|36

it|95|98|100|98|98|96|95|100|98|98|97|98|95|97|99|98|95|96|96|94|91|89|84|86|76|72|62|55
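
For readers unfamiliar with the idea behind these figures, here is a rough sketch of guessing a language from only the first n trigrams of the input, with an optional restriction to a subset of languages. It is an illustration under stated assumptions, not the algorithm used in the patch: the class TrigramGuessSketch, its methods, the simple "count the trigrams found in the reference set" scoring, and the tiny toy profiles are all made up for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: scoring rule and profiles are assumptions for illustration.
public class TrigramGuessSketch {

    // Returns at most n overlapping 3-character windows from the start of the text.
    static List<String> firstTrigrams(String text, int n) {
        List<String> out = new ArrayList<>();
        String s = text.toLowerCase(Locale.ROOT);
        for (int i = 0; i + 3 <= s.length() && out.size() < n; i++) {
            out.add(s.substring(i, i + 3));
        }
        return out;
    }

    // Scores each candidate language by how many of the first n input trigrams
    // appear in its reference set, and returns the best one. If allowed is
    // non-null, only the languages it contains are considered. On a tie the
    // language that comes first in the profile map wins.
    static String guess(String text, int n,
                        Map<String, Set<String>> profiles, Set<String> allowed) {
        List<String> trigrams = firstTrigrams(text, n);
        String best = null;
        int bestScore = -1;
        for (Map.Entry<String, Set<String>> e : profiles.entrySet()) {
            if (allowed != null && !allowed.contains(e.getKey())) {
                continue;
            }
            int score = 0;
            for (String t : trigrams) {
                if (e.getValue().contains(t)) {
                    score++;
                }
            }
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    // Toy reference sets, far too small for real use.
    static Map<String, Set<String>> toyProfiles() {
        Map<String, Set<String>> p = new LinkedHashMap<>();
        p.put("en", Set.of("the", "he ", " th", "and", "ing"));
        p.put("fr", Set.of("les", "es ", " le", "ent", "que"));
        return p;
    }
}
```

With fewer trigrams there is less evidence to separate the candidates, which is consistent with the accuracy dropping as n approaches 3.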


Have fun,

Jean-Francois Halleux
