You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/02/22 12:30:45 UTC

[GitHub] [lucene-solr] donnerpeter opened a new pull request #2414: LUCENE-9800: Hunspell: put a time limit on suggestion calculation

donnerpeter opened a new pull request #2414:
URL: https://github.com/apache/lucene-solr/pull/2414


   <!--
   _(If you are a project committer then you may remove some/all of the following template.)_
   
   Before creating a pull request, please file an issue in the ASF Jira system for Lucene or Solr:
   
   * https://issues.apache.org/jira/projects/LUCENE
   * https://issues.apache.org/jira/projects/SOLR
   
   You will need to create an account in Jira in order to create an issue.
   
   The title of the PR should reference the Jira issue number in the form:
   
   * LUCENE-####: <short description of problem or changes>
   * SOLR-####: <short description of problem or changes>
   
   LUCENE and SOLR must be fully capitalized. A short description helps people scanning pull requests for items they can work on.
   
   Properly referencing the issue in the title ensures that Jira is correctly updated with code review comments and commits. -->
   
   
   # Description
   
   Hunspell has a number of time limits to ensure it doesn't take forever applying potentially exponential rules
   
   # Solution
   
   Add an equivalent of its `#define TIMELIMIT_GLOBAL (CLOCKS_PER_SEC / 4)` for suggestions. Later we'll see if we need other time limits.
   
   # Tests
   
   `TestHunspell.testSuggestionTimeLimit`. It also became possible to raise the number of checked words in `TestPerformance.fr`: before even the second misspelled word took 15 seconds to `suggest`.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request title.
   - [x] I have given Solr maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `master` branch.
   - [x] I have run `./gradlew check`.
   - [x] I have added tests for my changes.
   - [ ] I have added documentation for the [Ref Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) (for Solr changes only).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2414: LUCENE-9800: Hunspell: put a time limit on suggestion calculation

Posted by GitBox <gi...@apache.org>.
donnerpeter commented on a change in pull request #2414:
URL: https://github.com/apache/lucene-solr/pull/2414#discussion_r580258222



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Hunspell.java
##########
@@ -540,7 +566,36 @@ boolean acceptsStem(int formID) {
     if (word.contains("-") && suggestions.stream().noneMatch(s -> s.contains("-"))) {
       suggestions.addAll(modifyChunksBetweenDashes(word));
     }
+  }
+
+  private Runnable checkTimeLimit(String word, WordCase wordCase, Set<String> suggestions) {
+    return new Runnable() {
+      final long limit = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(SUGGEST_TIME_LIMIT);

Review comment:
       TIL, thanks!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] rmuir commented on a change in pull request #2414: LUCENE-9800: Hunspell: put a time limit on suggestion calculation

Posted by GitBox <gi...@apache.org>.
rmuir commented on a change in pull request #2414:
URL: https://github.com/apache/lucene-solr/pull/2414#discussion_r580272140



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Hunspell.java
##########
@@ -49,20 +51,25 @@
  * shared by multiple spell-checkers in different threads).
  */
 public class Hunspell {
+  static final long SUGGEST_TIME_LIMIT = 250;

Review comment:
       for the case 2, I just want to be extra paranoid and avoid any time-dependent tests. We've got battle scars here, we've had to wrestle with bugs in mac OS causing nanotime to go backwards. At my day job I recently spent a lot of time debugging such issues with hypervisors in jenkins (crazy jumps/stalls etc). It is very annoying to debug these issues even when code is doing the correct thing and using monotonic clock (like this PR).
   
   So my point is the machine doesn't necessarily have to be crazy "fast" or "slow", especially with virtualization. Crazy stuff can and will happen, so if we can add extra caution for the tests it will be helpful in preventing flaky failures




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] rmuir merged pull request #2414: LUCENE-9800: Hunspell: put a time limit on suggestion calculation

Posted by GitBox <gi...@apache.org>.
rmuir merged pull request #2414:
URL: https://github.com/apache/lucene-solr/pull/2414


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2414: LUCENE-9800: Hunspell: put a time limit on suggestion calculation

Posted by GitBox <gi...@apache.org>.
donnerpeter commented on a change in pull request #2414:
URL: https://github.com/apache/lucene-solr/pull/2414#discussion_r580215786



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Hunspell.java
##########
@@ -540,7 +566,36 @@ boolean acceptsStem(int formID) {
     if (word.contains("-") && suggestions.stream().noneMatch(s -> s.contains("-"))) {
       suggestions.addAll(modifyChunksBetweenDashes(word));
     }
+  }
+
+  private Runnable checkTimeLimit(String word, WordCase wordCase, Set<String> suggestions) {
+    return new Runnable() {
+      final long limit = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(SUGGEST_TIME_LIMIT);
+      int invocationCounter = 100;
+
+      @Override
+      public void run() {
+        checkCanceled.run();
+        if (--invocationCounter <= 0) {
+          if (System.nanoTime() > limit) {

Review comment:
       We get system time only on each 100th call, like Hunspell does




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2414: LUCENE-9800: Hunspell: put a time limit on suggestion calculation

Posted by GitBox <gi...@apache.org>.
donnerpeter commented on a change in pull request #2414:
URL: https://github.com/apache/lucene-solr/pull/2414#discussion_r580258132



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Hunspell.java
##########
@@ -49,20 +51,25 @@
  * shared by multiple spell-checkers in different threads).
  */
 public class Hunspell {
+  static final long SUGGEST_TIME_LIMIT = 250;

Review comment:
       For case 1, NO_TIMEOUT looks like a great idea, thanks!
   For case 2, maybe it can wait until so fast computers really appear?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2414: LUCENE-9800: Hunspell: put a time limit on suggestion calculation

Posted by GitBox <gi...@apache.org>.
donnerpeter commented on a change in pull request #2414:
URL: https://github.com/apache/lucene-solr/pull/2414#discussion_r580216474



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/ModifyingSuggester.java
##########
@@ -68,12 +69,11 @@
         tryVariationsOf(speller.dictionary.toTitleCase(low));
       }
 
-      return result.stream()
-          .map(s -> capitalizeAfterSpace(low, s))
-          .collect(Collectors.toCollection(LinkedHashSet::new));
+      List<String> adjusted =
+          result.stream().map(s -> capitalizeAfterSpace(low, s)).collect(Collectors.toList());
+      result.clear();
+      result.addAll(adjusted);

Review comment:
       We now modify the passed (partial) collection instead of returning




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] rmuir commented on a change in pull request #2414: LUCENE-9800: Hunspell: put a time limit on suggestion calculation

Posted by GitBox <gi...@apache.org>.
rmuir commented on a change in pull request #2414:
URL: https://github.com/apache/lucene-solr/pull/2414#discussion_r580238380



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Hunspell.java
##########
@@ -540,7 +566,36 @@ boolean acceptsStem(int formID) {
     if (word.contains("-") && suggestions.stream().noneMatch(s -> s.contains("-"))) {
       suggestions.addAll(modifyChunksBetweenDashes(word));
     }
+  }
+
+  private Runnable checkTimeLimit(String word, WordCase wordCase, Set<String> suggestions) {
+    return new Runnable() {
+      final long limit = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(SUGGEST_TIME_LIMIT);

Review comment:
       What happens in the case of overflow here, the javadocs for `nanotime()` state:
   > one should use t1 - t0 < 0, not t1 < t0, because of the possibility of numerical overflow.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2414: LUCENE-9800: Hunspell: put a time limit on suggestion calculation

Posted by GitBox <gi...@apache.org>.
donnerpeter commented on a change in pull request #2414:
URL: https://github.com/apache/lucene-solr/pull/2414#discussion_r580215786



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Hunspell.java
##########
@@ -540,7 +566,36 @@ boolean acceptsStem(int formID) {
     if (word.contains("-") && suggestions.stream().noneMatch(s -> s.contains("-"))) {
       suggestions.addAll(modifyChunksBetweenDashes(word));
     }
+  }
+
+  private Runnable checkTimeLimit(String word, WordCase wordCase, Set<String> suggestions) {
+    return new Runnable() {
+      final long limit = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(SUGGEST_TIME_LIMIT);
+      int invocationCounter = 100;
+
+      @Override
+      public void run() {
+        checkCanceled.run();
+        if (--invocationCounter <= 0) {
+          if (System.nanoTime() > limit) {

Review comment:
       We get system time only on each 100th call, like Hunspell does, since that isn't exactly free




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] rmuir commented on a change in pull request #2414: LUCENE-9800: Hunspell: put a time limit on suggestion calculation

Posted by GitBox <gi...@apache.org>.
rmuir commented on a change in pull request #2414:
URL: https://github.com/apache/lucene-solr/pull/2414#discussion_r580231403



##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Hunspell.java
##########
@@ -49,20 +51,25 @@
  * shared by multiple spell-checkers in different threads).
  */
 public class Hunspell {
+  static final long SUGGEST_TIME_LIMIT = 250;

Review comment:
       I don't want to make this complicated, but for test purposes can we make this tweakable (even via package private method?)
   
   I'm concerned about unit tests here being dependent on how fast the underlying system is, we don't want false failures. I have two concerns:
   1) suggester tests unrelated (to this feature) that might sporadically fail because the system happens to be slow. 
   2) suggester tests related to this feature that might sporadically fail because system is too fast.
   
   There's a number of ways to ensure the tests stay robust. For example, in case 1, if we could set a TimeoutPolicy.NO_TIMEOUT (or similar name), we'd never hit false failures, even if say jenkins machine was running with load average in the hundreds. 
   
   For case 2, where we really want to be 100% sure the timeout happens, even if it is on @mikemccand super-fast machine, it would be nice if we could set the limit to 0 (even if its a package-private setter just for testing or something). 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org