Posted to dev@opennlp.apache.org by GitBox <gi...@apache.org> on 2022/12/04 13:02:19 UTC
[GitHub] [opennlp] kinow commented on a diff in pull request #445: OPENNLP-1403 Enhance JavaDoc in opennlp.tools.langdetect and opennlp.tools.languagemodel packages

kinow commented on code in PR #445:
URL: https://github.com/apache/opennlp/pull/445#discussion_r1038965583


##########
opennlp-tools/src/main/java/opennlp/tools/langdetect/LanguageDetector.java:
##########
@@ -20,14 +20,29 @@
 import java.io.Serializable;
 
 /**
- * The interface for LanguageDetector which provide the @{@link Language} according to the context.
+ * The interface for {@link LanguageDetector} which predicts the {@link Language} for a context.
  */
 public interface LanguageDetector extends Serializable {
 
+  /**
+   * Predicts the {@link Language languages} for the full {@code content} length.
+   *
+   * @param content The textual content to detect potential {@link Language languages} from.
+   * @return the predicted languages
+   */
   Language[] predictLanguages(CharSequence content);
 
+  /**
+   * Predicts the {@link Language} for the full {@code content} length.
+   *
+   * @param content The textual content to detect potential {@link Language languages} from.
+   * @return the language with the highest confidence
+   */
   Language predictLanguage(CharSequence content);
 
+  /**
+   * @return Retrieves an array of language (codes) that are supported by a {@link LanguageDetector}.

Review Comment:
   Some `@return` descriptions start with upper case, others don't. It doesn't really bother me, but I'm mentioning it in case there is a convention or an intention to standardize it 👍 (no need to change anything if you don't want to, really).



##########
opennlp-tools/src/main/java/opennlp/tools/languagemodel/NGramLanguageModel.java:
##########
@@ -114,16 +135,30 @@ public StringList predictNextTokens(StringList tokens) {
     return token;
   }
 
+  private double calculateProbability(StringList tokens) {
+    double probability = 0d;
+    if (size() > 0) {
+      for (StringList ngram : NGramUtils.getNGrams(tokens, n)) {
+        double score = stupidBackoff(ngram);
+        probability += StrictMath.log(score);
+        if (Double.isNaN(probability)) {
+          probability = 0d;
+          break;
+        }
+      }
+      probability = StrictMath.exp(probability);
+    }
+    return probability;
+  }
+
   @Override
   public String[] predictNextTokens(String... tokens) {
     double maxProb = Double.NEGATIVE_INFINITY;
     String[] token = null;
 
     for (StringList ngram : this) {
       String[] sequence = new String[ngram.size() + tokens.length];
-      for (int i = 0; i < tokens.length; i++) {
-        sequence[i] = tokens[i];
-      }
+      System.arraycopy(tokens, 0, sequence, 0, tokens.length);

Review Comment:
   👏



##########
opennlp-tools/src/main/java/opennlp/tools/languagemodel/NGramLanguageModel.java:
##########
@@ -91,6 +111,7 @@ public double calculateProbability(String... tokens) {
   }
 
   @Override
+  @Deprecated

Review Comment:
   Is there a recommendation on what to do when calling this deprecated method? What should be used instead?
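
   For illustration, a minimal sketch of how a deprecated method could point callers to its replacement through the Javadoc `@deprecated` tag. The class and method names (`DeprecationSketch`, `score`) are hypothetical and are not the actual `NGramLanguageModel` API; the scoring logic is a placeholder only:

```java
import java.util.List;

/** Hypothetical example class, not part of OpenNLP. */
public class DeprecationSketch {

  /**
   * Scores a list of tokens.
   *
   * @param tokens The tokens to score.
   * @return the score of the given tokens
   * @deprecated Use {@link #score(String...)} instead.
   */
  @Deprecated
  public static double score(List<String> tokens) {
    // Delegate to the replacement so both variants stay consistent.
    return score(tokens.toArray(new String[0]));
  }

  /**
   * Scores the given tokens.
   *
   * @param tokens The tokens to score.
   * @return the score of the given tokens
   */
  public static double score(String... tokens) {
    // Placeholder logic for the sketch: the score is just the token count.
    return tokens.length;
  }
}
```

   With a tag like this, the generated JavaDoc and most IDEs show the suggested replacement right next to the deprecation warning.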



##########
opennlp-tools/src/main/java/opennlp/tools/langdetect/DefaultLanguageDetectorContextGenerator.java:
##########
@@ -34,11 +34,12 @@ public class DefaultLanguageDetectorContextGenerator implements LanguageDetector
   protected final CharSequenceNormalizer normalizer;
 
   /**
-   * Creates a customizable @{@link DefaultLanguageDetectorContextGenerator} that computes ngrams from text
-   * @param minLength min ngrams chars
-   * @param maxLength max ngrams chars
-   * @param normalizers zero or more normalizers to
-   *                    be applied in to the text before extracting ngrams
+   * Creates a customizable {@link DefaultLanguageDetectorContextGenerator} that computes ngrams from text.
+   *
+   * @param minLength The min number of ngrams characters. Must be greater than {@code 0}.
+   * @param maxLength The max number of ngrams characters. Must be greater than {@code 0}
+   *                  and must be greater than {@code minLength}.
+   * @param normalizers zero or more normalizers to be applied in to the text before extracting ngrams.

Review Comment:
   Upper case Z? The other `@param` descriptions in this block start with a capital letter.


