You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@opennlp.apache.org by GitBox <gi...@apache.org> on 2022/12/04 11:55:18 UTC

[GitHub] [opennlp] mawiesne opened a new pull request, #445: OPENNLP-1403 Enhance JavaDoc in opennlp.tools.langdetect and opennlp.tools.languagemodel packages

mawiesne opened a new pull request, #445:
URL: https://github.com/apache/opennlp/pull/445

   Change
   -
   
   - adds missing JavaDoc
   - improves existing documentation for clarity
   - removes superfluous text
   - adds package-info.java for opennlp.tools.langdetect package
   - adds 'final' modifier where useful and applicable
   - adds 'Override' annotation where useful and applicable
   - removes deprecated `calculateProbability(StringList tokens)` method in LanguageModel and privatized related impl in NGramLanguageModel
   - fixes some typos
   
   Tasks
   -
   Thank you for contributing to Apache OpenNLP.
   
   In order to streamline the review of the contribution we ask you
   to ensure the following steps have been taken:
   
   ### For all changes:
   - [x] Is there a JIRA ticket associated with this PR? Is it referenced 
        in the commit message?
   
   - [x] Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
   
   - [x] Has your PR been rebased against the latest commit within the target branch (typically master)?
   
   - [x] Is your initial contribution a single, squashed commit?
   
   ### For code changes:
   - [x] Have you ensured that the full suite of tests is executed via mvn clean install at the root opennlp folder?
   - [ ] Have you written or updated unit tests to verify your changes?
   - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? 
   - [ ] If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder?
   - [ ] If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder?
   
   ### For documentation related changes:
   - [x] Have you ensured that format looks appropriate for the output in which it is rendered?
   
   ### Note:
   Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@opennlp.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [opennlp] kinow commented on a diff in pull request #445: OPENNLP-1403 Enhance JavaDoc in opennlp.tools.langdetect and opennlp.tools.languagemodel packages

Posted by GitBox <gi...@apache.org>.

kinow commented on code in PR #445:
URL: https://github.com/apache/opennlp/pull/445#discussion_r1038969216


##########
opennlp-tools/src/main/java/opennlp/tools/langdetect/LanguageDetector.java:
##########
@@ -20,14 +20,29 @@
 import java.io.Serializable;
 
 /**
- * The interface for LanguageDetector which provide the @{@link Language} according to the context.
+ * The interface for {@link LanguageDetector} which predicts the {@link Language} for a context.
  */
 public interface LanguageDetector extends Serializable {
 
+  /**
+   * Predicts the {@link Language languages} for the full {@code content} length.
+   *
+   * @param content The textual content to detect potential {@link Language languages} from.
+   * @return the predicted languages
+   */
   Language[] predictLanguages(CharSequence content);
 
+  /**
+   * Predicts the {@link Language} for the full {@code content} length.
+   *
+   * @param content The textual content to detect potential {@link Language languages} from.
+   * @return the language with the highest confidence
+   */
   Language predictLanguage(CharSequence content);
 
+  /**
+   * @return Retrieves an array of language (codes) that are supported by a {@link LanguageDetector}.

Review Comment:
   +1, we can improve with each PR/iteration. Thanks again for these enhancements!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@opennlp.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [opennlp] mawiesne commented on a diff in pull request #445: OPENNLP-1403 Enhance JavaDoc in opennlp.tools.langdetect and opennlp.tools.languagemodel packages

Posted by GitBox <gi...@apache.org>.

mawiesne commented on code in PR #445:
URL: https://github.com/apache/opennlp/pull/445#discussion_r1038968895


##########
opennlp-tools/src/main/java/opennlp/tools/langdetect/DefaultLanguageDetectorContextGenerator.java:
##########
@@ -34,11 +34,12 @@ public class DefaultLanguageDetectorContextGenerator implements LanguageDetector
   protected final CharSequenceNormalizer normalizer;
 
   /**
-   * Creates a customizable @{@link DefaultLanguageDetectorContextGenerator} that computes ngrams from text
-   * @param minLength min ngrams chars
-   * @param maxLength max ngrams chars
-   * @param normalizers zero or more normalizers to
-   *                    be applied in to the text before extracting ngrams
+   * Creates a customizable {@link DefaultLanguageDetectorContextGenerator} that computes ngrams from text.
+   *
+   * @param minLength The min number of ngrams characters. Must be greater than {@code 0}.
+   * @param maxLength The max number of ngrams characters. Must be greater than {@code 0}
+   *                  and must be greater than {@code minLength}.
+   * @param normalizers zero or more normalizers to be applied in to the text before extracting ngrams.

Review Comment:
   Ya, should have written Z here. Will take note for the next PR (postag package) and make it pretty there.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@opennlp.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [opennlp] mawiesne commented on pull request #445: OPENNLP-1403 Enhance JavaDoc in opennlp.tools.langdetect and opennlp.tools.languagemodel packages

Posted by GitBox <gi...@apache.org>.

mawiesne commented on PR #445:
URL: https://github.com/apache/opennlp/pull/445#issuecomment-1336411508

   @kinow next PR is just being created for `opennlp.tools.postag` package right now. Hope the "coin laundry machine" will terminate in time ;)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@opennlp.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [opennlp] kinow commented on pull request #445: OPENNLP-1403 Enhance JavaDoc in opennlp.tools.langdetect and opennlp.tools.languagemodel packages

Posted by GitBox <gi...@apache.org>.

kinow commented on PR #445:
URL: https://github.com/apache/opennlp/pull/445#issuecomment-1336412280

   Ha! Done!! @mawiesne :+1: 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@opennlp.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [opennlp] mawiesne commented on a diff in pull request #445: OPENNLP-1403 Enhance JavaDoc in opennlp.tools.langdetect and opennlp.tools.languagemodel packages

Posted by GitBox <gi...@apache.org>.

mawiesne commented on code in PR #445:
URL: https://github.com/apache/opennlp/pull/445#discussion_r1038969772


##########
opennlp-tools/src/main/java/opennlp/tools/languagemodel/NGramLanguageModel.java:
##########
@@ -91,6 +111,7 @@ public double calculateProbability(String... tokens) {
   }
 
   @Override
+  @Deprecated

Review Comment:
   The recommendation is found in `opennlp.tools.languagemodel.LanguageModel`. Did not want to cnp that over. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@opennlp.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [opennlp] kinow commented on pull request #445: OPENNLP-1403 Enhance JavaDoc in opennlp.tools.langdetect and opennlp.tools.languagemodel packages

Posted by GitBox <gi...@apache.org>.

kinow commented on PR #445:
URL: https://github.com/apache/opennlp/pull/445#issuecomment-1336412021

   > @kinow next PR is just being created for `opennlp.tools.postag` package right now. Hope the "coin laundry machine" will terminate in time ;)
   
   I'm trying to log in to JIRA from here to update both PR & JIRA. You might be able to rebase your work in a couple of minutes I think :slightly_smiling_face: 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@opennlp.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [opennlp] kinow commented on a diff in pull request #445: OPENNLP-1403 Enhance JavaDoc in opennlp.tools.langdetect and opennlp.tools.languagemodel packages

Posted by GitBox <gi...@apache.org>.

kinow commented on code in PR #445:
URL: https://github.com/apache/opennlp/pull/445#discussion_r1038965583


##########
opennlp-tools/src/main/java/opennlp/tools/langdetect/LanguageDetector.java:
##########
@@ -20,14 +20,29 @@
 import java.io.Serializable;
 
 /**
- * The interface for LanguageDetector which provide the @{@link Language} according to the context.
+ * The interface for {@link LanguageDetector} which predicts the {@link Language} for a context.
  */
 public interface LanguageDetector extends Serializable {
 
+  /**
+   * Predicts the {@link Language languages} for the full {@code content} length.
+   *
+   * @param content The textual content to detect potential {@link Language languages} from.
+   * @return the predicted languages
+   */
   Language[] predictLanguages(CharSequence content);
 
+  /**
+   * Predicts the {@link Language} for the full {@code content} length.
+   *
+   * @param content The textual content to detect potential {@link Language languages} from.
+   * @return the language with the highest confidence
+   */
   Language predictLanguage(CharSequence content);
 
+  /**
+   * @return Retrieves an array of language (codes) that are supported by a {@link LanguageDetector}.

Review Comment:
   Some @return start with upper case, others don't. Don't really bother me, but just in case there is a convention or intention to stardardize it 👍 (no need to change anything if you don't want too, really).



##########
opennlp-tools/src/main/java/opennlp/tools/languagemodel/NGramLanguageModel.java:
##########
@@ -114,16 +135,30 @@ public StringList predictNextTokens(StringList tokens) {
     return token;
   }
 
+  private double calculateProbability(StringList tokens) {
+    double probability = 0d;
+    if (size() > 0) {
+      for (StringList ngram : NGramUtils.getNGrams(tokens, n)) {
+        double score = stupidBackoff(ngram);
+        probability += StrictMath.log(score);
+        if (Double.isNaN(probability)) {
+          probability = 0d;
+          break;
+        }
+      }
+      probability = StrictMath.exp(probability);
+    }
+    return probability;
+  }
+
   @Override
   public String[] predictNextTokens(String... tokens) {
     double maxProb = Double.NEGATIVE_INFINITY;
     String[] token = null;
 
     for (StringList ngram : this) {
       String[] sequence = new String[ngram.size() + tokens.length];
-      for (int i = 0; i < tokens.length; i++) {
-        sequence[i] = tokens[i];
-      }
+      System.arraycopy(tokens, 0, sequence, 0, tokens.length);

Review Comment:
   👏



##########
opennlp-tools/src/main/java/opennlp/tools/languagemodel/NGramLanguageModel.java:
##########
@@ -91,6 +111,7 @@ public double calculateProbability(String... tokens) {
   }
 
   @Override
+  @Deprecated

Review Comment:
   Is there a recommendation on what to do if calling this deprecated method?  What to use instead?



##########
opennlp-tools/src/main/java/opennlp/tools/langdetect/DefaultLanguageDetectorContextGenerator.java:
##########
@@ -34,11 +34,12 @@ public class DefaultLanguageDetectorContextGenerator implements LanguageDetector
   protected final CharSequenceNormalizer normalizer;
 
   /**
-   * Creates a customizable @{@link DefaultLanguageDetectorContextGenerator} that computes ngrams from text
-   * @param minLength min ngrams chars
-   * @param maxLength max ngrams chars
-   * @param normalizers zero or more normalizers to
-   *                    be applied in to the text before extracting ngrams
+   * Creates a customizable {@link DefaultLanguageDetectorContextGenerator} that computes ngrams from text.
+   *
+   * @param minLength The min number of ngrams characters. Must be greater than {@code 0}.
+   * @param maxLength The max number of ngrams characters. Must be greater than {@code 0}
+   *                  and must be greater than {@code minLength}.
+   * @param normalizers zero or more normalizers to be applied in to the text before extracting ngrams.

Review Comment:
   Upper case Z?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@opennlp.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [opennlp] mawiesne commented on a diff in pull request #445: OPENNLP-1403 Enhance JavaDoc in opennlp.tools.langdetect and opennlp.tools.languagemodel packages

Posted by GitBox <gi...@apache.org>.

mawiesne commented on code in PR #445:
URL: https://github.com/apache/opennlp/pull/445#discussion_r1038968731


##########
opennlp-tools/src/main/java/opennlp/tools/langdetect/LanguageDetector.java:
##########
@@ -20,14 +20,29 @@
 import java.io.Serializable;
 
 /**
- * The interface for LanguageDetector which provide the @{@link Language} according to the context.
+ * The interface for {@link LanguageDetector} which predicts the {@link Language} for a context.
  */
 public interface LanguageDetector extends Serializable {
 
+  /**
+   * Predicts the {@link Language languages} for the full {@code content} length.
+   *
+   * @param content The textual content to detect potential {@link Language languages} from.
+   * @return the predicted languages
+   */
   Language[] predictLanguages(CharSequence content);
 
+  /**
+   * Predicts the {@link Language} for the full {@code content} length.
+   *
+   * @param content The textual content to detect potential {@link Language languages} from.
+   * @return the language with the highest confidence
+   */
   Language predictLanguage(CharSequence content);
 
+  /**
+   * @return Retrieves an array of language (codes) that are supported by a {@link LanguageDetector}.

Review Comment:
   Ya, noticed this already. I'd prefer upper-cased start of sentences. Found both variants all over the place, so there seems to be no real convention in place. C/Should be sanitized in future changes, I guess.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@opennlp.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [opennlp] kinow merged pull request #445: OPENNLP-1403 Enhance JavaDoc in opennlp.tools.langdetect and opennlp.tools.languagemodel packages

Posted by GitBox <gi...@apache.org>.

kinow merged PR #445:
URL: https://github.com/apache/opennlp/pull/445


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@opennlp.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org