You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@opennlp.apache.org by co...@apache.org on 2011/07/12 15:11:33 UTC

svn commit: r1145578 - in /incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline: BasicEvaluationParameters.java sentdetect/SentenceDetectorEvaluatorTool.java tokenizer/TokenizerMEEvaluatorTool.java

Author: colen
Date: Tue Jul 12 13:11:32 2011
New Revision: 1145578

URL: http://svn.apache.org/viewvc?rev=1145578&view=rev
Log:
OPENNLP-221 Created a BasicEvaluationParameters interface. 
Don't need to check encoding == null because it was validated by  ArgumentParser.

Added:
    incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/BasicEvaluationParameters.java   (with props)
Modified:
    incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/sentdetect/SentenceDetectorEvaluatorTool.java
    incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/tokenizer/TokenizerMEEvaluatorTool.java

Added: incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/BasicEvaluationParameters.java
URL: http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/BasicEvaluationParameters.java?rev=1145578&view=auto
==============================================================================
--- incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/BasicEvaluationParameters.java (added)
+++ incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/BasicEvaluationParameters.java Tue Jul 12 13:11:32 2011
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.tools.cmdline;
+
+import java.io.File;
+import java.nio.charset.Charset;
+
+import opennlp.tools.cmdline.ArgumentParser.OptionalParameter;
+import opennlp.tools.cmdline.ArgumentParser.ParameterDescription;
+
+// TODO: remove the old BasicTrainingParameters and rename this class to BasicTrainingParameters
+
+/**
+ * Common evaluation parameters.
+ * 
+ * Note: Do not use this class, internal use only!
+ */
+public interface BasicEvaluationParameters {
+
+  @ParameterDescription(valueName = "charsetName", description = "specifies the encoding which should be used for reading and writing text")
+  @OptionalParameter(defaultValue="UTF-8")
+  Charset getEncoding();
+  
+  @ParameterDescription(valueName = "model", description = "the model file to be evaluated")
+  File getModel();
+  
+  @ParameterDescription(valueName = "testData", description = "the data to be used during evaluation")
+  File getData();
+  
+}

Propchange: incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/BasicEvaluationParameters.java
------------------------------------------------------------------------------
    svn:mime-type = text/plain

Modified: incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/sentdetect/SentenceDetectorEvaluatorTool.java
URL: http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/sentdetect/SentenceDetectorEvaluatorTool.java?rev=1145578&r1=1145577&r2=1145578&view=diff
==============================================================================
--- incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/sentdetect/SentenceDetectorEvaluatorTool.java (original)
+++ incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/sentdetect/SentenceDetectorEvaluatorTool.java Tue Jul 12 13:11:32 2011
@@ -22,8 +22,7 @@ import java.io.IOException;
 import java.nio.charset.Charset;
 
 import opennlp.tools.cmdline.ArgumentParser;
-import opennlp.tools.cmdline.ArgumentParser.OptionalParameter;
-import opennlp.tools.cmdline.ArgumentParser.ParameterDescription;
+import opennlp.tools.cmdline.BasicEvaluationParameters;
 import opennlp.tools.cmdline.CLI;
 import opennlp.tools.cmdline.CmdLineTool;
 import opennlp.tools.cmdline.CmdLineUtil;
@@ -34,22 +33,6 @@ import opennlp.tools.sentdetect.Sentence
 import opennlp.tools.util.ObjectStream;
 
 public final class SentenceDetectorEvaluatorTool implements CmdLineTool {
-  
-  /**
-   * Create a list of expected parameters.
-   */
-  interface Parameters {
-    
-    @ParameterDescription(valueName = "charsetName", description = "specifies the encoding which should be used for reading and writing text")
-    @OptionalParameter(defaultValue="UTF-8")
-    Charset getEncoding();
-    
-    @ParameterDescription(valueName = "model")
-    File getModel();
-    
-    @ParameterDescription(valueName = "data")
-    File getData();
-  }
 
   public String getName() {
     return "SentenceDetectorEvaluator";
@@ -60,25 +43,20 @@ public final class SentenceDetectorEvalu
   }
   
   public String getHelp() {
-    return "Usage: " + CLI.CMD + " " + getName() + " " + ArgumentParser.createUsage(Parameters.class);
+    return "Usage: " + CLI.CMD + " " + getName() + " " + ArgumentParser.createUsage(BasicEvaluationParameters.class);
   }
 
   public void run(String[] args) {
     
-    if (!ArgumentParser.validateArguments(args, Parameters.class)) {
+    if (!ArgumentParser.validateArguments(args, BasicEvaluationParameters.class)) {
       System.err.println(getHelp());
       throw new TerminateToolException(1);
     }
     
-    Parameters params = ArgumentParser.parse(args, Parameters.class);
+    BasicEvaluationParameters params = ArgumentParser.parse(args, BasicEvaluationParameters.class);
     
     Charset encoding = params.getEncoding();
     
-    if (encoding == null) {
-      System.out.println(getHelp());
-      throw new TerminateToolException(1);
-    }
-    
     SentenceModel model = new SentenceModelLoader().load(params.getModel());
     
     File trainingDataInFile = params.getData();

Modified: incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/tokenizer/TokenizerMEEvaluatorTool.java
URL: http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/tokenizer/TokenizerMEEvaluatorTool.java?rev=1145578&r1=1145577&r2=1145578&view=diff
==============================================================================
--- incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/tokenizer/TokenizerMEEvaluatorTool.java (original)
+++ incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/tokenizer/TokenizerMEEvaluatorTool.java Tue Jul 12 13:11:32 2011
@@ -17,13 +17,13 @@
 
 package opennlp.tools.cmdline.tokenizer;
 
-import java.io.File;
 import java.io.IOException;
 import java.nio.charset.Charset;
 
+import opennlp.tools.cmdline.ArgumentParser;
+import opennlp.tools.cmdline.BasicEvaluationParameters;
 import opennlp.tools.cmdline.CLI;
 import opennlp.tools.cmdline.CmdLineTool;
-import opennlp.tools.cmdline.CmdLineUtil;
 import opennlp.tools.cmdline.TerminateToolException;
 import opennlp.tools.tokenize.TokenSample;
 import opennlp.tools.tokenize.TokenizerEvaluator;
@@ -41,32 +41,30 @@ public final class TokenizerMEEvaluatorT
   }
   
   public String getHelp() {
-    return "Usage: " + CLI.CMD + " " + getName() + "-encoding charset -model model -data testData";
+    return "Usage: " + CLI.CMD + " " + getName() + " " + ArgumentParser.createUsage(BasicEvaluationParameters.class);
   }
 
   public void run(String[] args) {
-    if (args.length != 6) {
-      System.out.println(getHelp());
+    if (!ArgumentParser
+        .validateArguments(args, BasicEvaluationParameters.class)) {
+      System.err.println(getHelp());
       throw new TerminateToolException(1);
     }
 
-    Charset encoding = CmdLineUtil.getEncodingParameter(args);
+    BasicEvaluationParameters params = ArgumentParser.parse(args,
+        BasicEvaluationParameters.class);
 
-    if (encoding == null) {
-      System.out.println(getHelp());
-      throw new TerminateToolException(1);
-    }
+    Charset encoding = params.getEncoding();
 
-    TokenizerModel model = new TokenizerModelLoader().load(
-        new File(CmdLineUtil.getParameter("-model", args)));
+    TokenizerModel model = new TokenizerModelLoader().load(params.getModel());
 
     TokenizerEvaluator evaluator = new TokenizerEvaluator(
         new opennlp.tools.tokenize.TokenizerME(model));
 
     System.out.print("Evaluating ... ");
 
-    ObjectStream<TokenSample> sampleStream = TokenizerTrainerTool.openSampleData(
-        "Test", new File(CmdLineUtil.getParameter("-data", args)), encoding);
+    ObjectStream<TokenSample> sampleStream = TokenizerTrainerTool
+        .openSampleData("Test", params.getData(), encoding);
 
     try {
       evaluator.evaluate(sampleStream);

Re: svn commit: r1145578 - in /incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline: BasicEvaluationParameters.java sentdetect/SentenceDetectorEvaluatorTool.java tokenizer/TokenizerMEEvaluatorTool.java

Posted by Jörn Kottmann <ko...@gmail.com>.

On 7/12/11 3:34 PM, Jörn Kottmann wrote:
> On 7/12/11 3:11 PM, colen@apache.org wrote:
>> Added: 
>> incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/BasicEvaluationParameters.java
>> URL:http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/BasicEvaluationParameters.java?rev=1145578&view=auto 
>>
>> ============================================================================== 
>>
>> --- 
>> incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/BasicEvaluationParameters.java 
>> (added)
>> +++ 
>> incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/BasicEvaluationParameters.java 
>> Tue Jul
> ...
>> +
>> +  @ParameterDescription(valueName = "charsetName", description = 
>> "specifies the encoding which should be used for reading and writing 
>> text")
>> +  @OptionalParameter(defaultValue="UTF-8")
>> +  Charset getEncoding();
>
> We should decide how we handle this, and do it consistently. 

I guess we will decide to do it consistently.

We can define a new interface EncodingParameter, and the evaluators and 
training interfaces
can extend it, this way we only need to define the parameter once.

Jörn

Re: svn commit: r1145578 - in /incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline: BasicEvaluationParameters.java sentdetect/SentenceDetectorEvaluatorTool.java tokenizer/TokenizerMEEvaluatorTool.java

Posted by "william.colen@gmail.com" <wi...@gmail.com>.

On Tue, Jul 12, 2011 at 1:19 PM, Jörn Kottmann <ko...@gmail.com> wrote:

> On 7/12/11 5:09 PM, Jörn Kottmann wrote:
>
>> Will we need to do some special handling in ArgumentParser for that? Maybe
>>> setting a constant "DEFAULT_CHARSET" and handle it at ArgumentParse.Parse
>>> ?
>>>
>>
>> +1 in the charset parsing code we can simply have an if and return the
>> default one.
>>
>
> Maybe we should add the constant for this to the OptionalParameters
> annotation,
> what do you think?
>
> Then we need an if-statement in ArgumentParser.**CharsetArgumentFactory,
> and then it can
> be declared as a default in the interfaces which wish to use the platform
> default as default
> for an encoding.
>


OK, I will do it.

Re: svn commit: r1145578 - in /incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline: BasicEvaluationParameters.java sentdetect/SentenceDetectorEvaluatorTool.java tokenizer/TokenizerMEEvaluatorTool.java

Posted by Jörn Kottmann <ko...@gmail.com>.

On 7/12/11 5:09 PM, Jörn Kottmann wrote:
>> Will we need to do some special handling in ArgumentParser for that? 
>> Maybe
>> setting a constant "DEFAULT_CHARSET" and handle it at 
>> ArgumentParse.Parse ?
>
> +1 in the charset parsing code we can simply have an if and return the 
> default one. 

Maybe we should add the constant for this to the OptionalParameters 
annotation,
what do you think?

Then we need an if-statement in ArgumentParser.CharsetArgumentFactory, 
and then it can
be declared as a default in the interfaces which wish to use the 
platform default as default
for an encoding.

Jörn

Re: svn commit: r1145578 - in /incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline: BasicEvaluationParameters.java sentdetect/SentenceDetectorEvaluatorTool.java tokenizer/TokenizerMEEvaluatorTool.java

Posted by Jörn Kottmann <ko...@gmail.com>.

On 7/12/11 4:49 PM, william.colen@gmail.com wrote:
> Will we need to do some special handling in ArgumentParser for that? Maybe
> setting a constant "DEFAULT_CHARSET" and handle it at ArgumentParse.Parse ?

+1 in the charset parsing code we can simply have an if and return the 
default one.

Jörn

Re: svn commit: r1145578 - in /incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline: BasicEvaluationParameters.java sentdetect/SentenceDetectorEvaluatorTool.java tokenizer/TokenizerMEEvaluatorTool.java

Posted by "william.colen@gmail.com" <wi...@gmail.com>.

Jörn,

I'm wondering how to implement the EncodingParameter interface.

This is not allowed:

  @ParameterDescription(valueName = "charsetName", description = "specifies
the encoding which should be used for reading and writing text")
  @OptionalParameter(defaultValue=Charset.defaultCharset().name())
  Charset getEncoding();

Will we need to do some special handling in ArgumentParser for that? Maybe
setting a constant "DEFAULT_CHARSET" and handle it at ArgumentParse.Parse ?


On Tue, Jul 12, 2011 at 10:55 AM, Jörn Kottmann <ko...@gmail.com> wrote:

> On 7/12/11 3:45 PM, william.colen@gmail.com wrote:
>
>> Yes, Jörn. I don't think UTF-8 is a good choice for the default. None of
>> the
>> data I use for Portuguese takes advantage of UTF-8 been the default
>> because
>> all corpus I have are Latin1 and my system default is neither UTF-8 or
>> Latin1.
>>
>> Using the system default looks nice because often we have to use the
>> converter tools, and that outputs the system default. If we convert, train
>> and evaluate in the same system we would need to set the encoding
>> parameter
>> only once.
>>
>
> This is actually a weakness. I have a macbook, and my default encoding
> is MacRoman. I once tried to write japanese text to stdout, but that didn't
> work
> with MacRoman and more or less all chars have been replaced with a question
> mark
> (if I remember correctly).
>
> We might need to change that one day, so the output is always written to a
> file.
>
> I don't really know which of the both ways is better, always specify the
> encoding
> or use the default, anyway I am +1 for both. If you think we should go the
> more standard
> way and use the default encoding, then lets do that.
>
> Jörn
>

Re: svn commit: r1145578 - in /incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline: BasicEvaluationParameters.java sentdetect/SentenceDetectorEvaluatorTool.java tokenizer/TokenizerMEEvaluatorTool.java

Posted by Jörn Kottmann <ko...@gmail.com>.

On 7/12/11 3:45 PM, william.colen@gmail.com wrote:
> Yes, Jörn. I don't think UTF-8 is a good choice for the default. None of the
> data I use for Portuguese takes advantage of UTF-8 been the default because
> all corpus I have are Latin1 and my system default is neither UTF-8 or
> Latin1.
>
> Using the system default looks nice because often we have to use the
> converter tools, and that outputs the system default. If we convert, train
> and evaluate in the same system we would need to set the encoding parameter
> only once.

This is actually a weakness. I have a macbook, and my default encoding
is MacRoman. I once tried to write japanese text to stdout, but that 
didn't work
with MacRoman and more or less all chars have been replaced with a 
question mark
(if I remember correctly).

We might need to change that one day, so the output is always written to 
a file.

I don't really know which of the both ways is better, always specify the 
encoding
or use the default, anyway I am +1 for both. If you think we should go 
the more standard
way and use the default encoding, then lets do that.

Jörn

Re: svn commit: r1145578 - in /incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline: BasicEvaluationParameters.java sentdetect/SentenceDetectorEvaluatorTool.java tokenizer/TokenizerMEEvaluatorTool.java

Posted by "william.colen@gmail.com" <wi...@gmail.com>.

William Colen


On Tue, Jul 12, 2011 at 10:34 AM, Jörn Kottmann <ko...@gmail.com> wrote:

> On 7/12/11 3:11 PM, colen@apache.org wrote:
>
>> Added: incubator/opennlp/trunk/**opennlp-tools/src/main/java/**
>> opennlp/tools/cmdline/**BasicEvaluationParameters.java
>> URL:http://svn.apache.org/**viewvc/incubator/opennlp/**
>> trunk/opennlp-tools/src/main/**java/opennlp/tools/cmdline/**
>> BasicEvaluationParameters.**java?rev=1145578&view=auto<http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/BasicEvaluationParameters.java?rev=1145578&view=auto>
>> ==============================**==============================**
>> ==================
>> --- incubator/opennlp/trunk/**opennlp-tools/src/main/java/**
>> opennlp/tools/cmdline/**BasicEvaluationParameters.java (added)
>> +++ incubator/opennlp/trunk/**opennlp-tools/src/main/java/**
>> opennlp/tools/cmdline/**BasicEvaluationParameters.java Tue Jul
>>
> ...
>
>  +
>> +  @ParameterDescription(**valueName = "charsetName", description =
>> "specifies the encoding which should be used for reading and writing text")
>> +  @OptionalParameter(**defaultValue="UTF-8")
>> +  Charset getEncoding();
>>
>
> We should decide how we handle this, and do it consistently.
> The trainers declare it as a mandatory parameter, the evaluators declare
> it as optional now and take UTF-8 as default.
>
> In my opinion we should either force the user to specify it, then he
> needs to think about the encoding. Or we use the platform default encoding,
> because
> that is the default a user would expect by convention since all software
> tools usually
> operate with the platform default encoding.
>
> Or is there a good reason to use UTF-8 as a default?
>
> I know that this is a decision which is difficult to get right,
> as far as I know we have been criticized for the current way of doing
> it because people don't want to pass the encoding parameter all the time.
>

Yes, Jörn. I don't think UTF-8 is a good choice for the default. None of the
data I use for Portuguese takes advantage of UTF-8 been the default because
all corpus I have are Latin1 and my system default is neither UTF-8 or
Latin1.

Using the system default looks nice because often we have to use the
converter tools, and that outputs the system default. If we convert, train
and evaluate in the same system we would need to set the encoding parameter
only once.

Re: svn commit: r1145578 - in /incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline: BasicEvaluationParameters.java sentdetect/SentenceDetectorEvaluatorTool.java tokenizer/TokenizerMEEvaluatorTool.java

Posted by Jörn Kottmann <ko...@gmail.com>.

On 7/12/11 3:11 PM, colen@apache.org wrote:
> Added: incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/BasicEvaluationParameters.java
> URL:http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/BasicEvaluationParameters.java?rev=1145578&view=auto
> ==============================================================================
> --- incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/BasicEvaluationParameters.java (added)
> +++ incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/BasicEvaluationParameters.java Tue Jul
...
> +
> +  @ParameterDescription(valueName = "charsetName", description = "specifies the encoding which should be used for reading and writing text")
> +  @OptionalParameter(defaultValue="UTF-8")
> +  Charset getEncoding();

We should decide how we handle this, and do it consistently.
The trainers declare it as a mandatory parameter, the evaluators declare
it as optional now and take UTF-8 as default.

In my opinion we should either force the user to specify it, then he
needs to think about the encoding. Or we use the platform default 
encoding, because
that is the default a user would expect by convention since all software 
tools usually
operate with the platform default encoding.

Or is there a good reason to use UTF-8 as a default?

I know that this is a decision which is difficult to get right,
as far as I know we have been criticized for the current way of doing
it because people don't want to pass the encoding parameter all the time.

Jörn