Posted to issues@commons.apache.org by "Marc Pompl (JIRA)" <ji...@apache.org> on 2010/12/11 01:27:02 UTC
[jira] Updated: (CODEC-107) Enhance documentation for Language Encoders
[ https://issues.apache.org/jira/browse/CODEC-107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Marc Pompl updated CODEC-107:
-----------------------------
Description:
The current user guide (http://commons.apache.org/codec/userguide.html) lists only four Language Encoders, but there are currently five. CODEC-106 implements a sixth.
It would be a good idea to complete the documentation.
Additionally, I suggest extending the user guide to show a simple performance measurement:
_SNAP_
Metaphone encodings per sec: 32258
DoubleMetaphone encodings per sec: 31250
Soundex encodings per sec: 35714
RefinedSoundex encodings per sec: 34482
Caverphone encodings per sec: 5813
ColognePhonetic encodings per sec: 33333
So Caverphone is much slower than any of the other algorithms, at roughly a fifth to a sixth of their throughput. All the others perform at nearly the same level.
Checked with the following code:
{code:java}
// REPEATS and assertAlgorithm(...) come from the surrounding test class (not shown here).
public void checkSpeed() throws Exception {
    checkSpeedEncoding("Metaphone", "easgasg", "ESKS");
    checkSpeedEncoding("DoubleMetaphone", "easgasg", "ASKS");
    checkSpeedEncoding("Soundex", "easgasg", "E220");
    checkSpeedEncoding("RefinedSoundex", "easgasg", "E034034");
    checkSpeedEncoding("Caverphone", "Carlene", "KLN1111111");
    checkSpeedEncoding("ColognePhonetic", "Schmitt", "862");
}

private void checkSpeedEncoding(String encoder, String toBeEncoded, String expected) throws Exception {
    long start = System.currentTimeMillis();
    for (int i = 0; i < REPEATS; i++) {
        assertAlgorithm(encoder, "false", toBeEncoded, new String[] { expected });
    }
    // Clamp to 1 ms so that a very fast run cannot cause a division by zero.
    long duration = Math.max(1L, System.currentTimeMillis() - start);
    // Multiply before dividing so integer division does not discard the sub-second part.
    System.out.println(encoder + " encodings per sec: " + (REPEATS * 1000L / duration));
}
{code}
_SNAP_
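A note on the throughput arithmetic, as a minimal sketch rather than part of Commons Codec: dividing the duration by 1000 first truncates it to whole seconds, so the rate is only as precise as a one-second clock and the division fails outright for runs shorter than a second. The reported figures are consistent with this (for example 1000000/31 = 32258 and 1000000/172 = 5813), assuming REPEATS was 1,000,000, which the issue does not state. The class name ThroughputSketch, its methods, and the stand-in encoder are hypothetical; a real measurement would plug in the actual encoder call.

```java
import java.util.function.UnaryOperator;

// Hypothetical helper, not part of Commons Codec or its tests.
class ThroughputSketch {

    /** Rate computed from whole seconds only; integer division drops the sub-second part. */
    public static long perSecondTruncated(long repeats, long durationMillis) {
        // Throws ArithmeticException when durationMillis < 1000 (division by zero).
        return repeats / (durationMillis / 1000);
    }

    /** Same rate, multiplying first so the millisecond precision is kept. */
    public static long perSecond(long repeats, long durationMillis) {
        // Clamp to 1 ms so an extremely fast run cannot divide by zero.
        return repeats * 1000L / Math.max(1L, durationMillis);
    }

    /** Runs the encoder `repeats` times and returns the elapsed time in milliseconds. */
    public static long timeMillis(UnaryOperator<String> encoder, String input, int repeats) {
        int sink = 0; // consume the results so the loop is not optimized away
        long start = System.nanoTime();
        for (int i = 0; i < repeats; i++) {
            sink += encoder.apply(input).length();
        }
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;
        if (sink < 0) {
            System.out.println(sink); // never expected; only keeps `sink` live
        }
        return elapsedMillis;
    }

    public static void main(String[] args) {
        // Stand-in for a real encoder call; an assumption for illustration only.
        UnaryOperator<String> standIn = s -> s.toUpperCase();
        int repeats = 1_000_000; // assumed value, REPEATS is not given in the issue

        timeMillis(standIn, "Schmitt", repeats); // warm-up pass, result discarded
        long millis = timeMillis(standIn, "Schmitt", repeats);
        System.out.println("stand-in encodings per sec: " + perSecond(repeats, millis));
    }
}
```

With a duration of 31500 ms, the truncating version treats the run as 31 s, while the second version uses the full 31.5 s. For figures meant for the user guide, a dedicated benchmark harness would probably be more reliable than a hand-rolled loop.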
was:
The current user guide (http://commons.apache.org/codec/userguide.html) lists only four Language Encoders, but there are currently five. CODEC-106 implements a sixth.
It would be a good idea to complete the documentation.
Additionally, I suggest extending the wiki (http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PhoneticFilterFactory) to show a simple performance measurement, with the same measurements and code as in the description above.
> Enhance documentation for Language Encoders
> -------------------------------------------
>
> Key: CODEC-107
> URL: https://issues.apache.org/jira/browse/CODEC-107
> Project: Commons Codec
> Issue Type: Improvement
> Affects Versions: 1.4
> Reporter: Marc Pompl
> Priority: Minor
> Fix For: 1.5
>
> Original Estimate: 1h
> Remaining Estimate: 1h