You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2020/02/03 21:28:18 UTC

[GitHub] [lucene-solr] msokolov opened a new pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves

msokolov opened a new pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves
URL: https://github.com/apache/lucene-solr/pull/1235
 
 
   # Description
   Briefly, introduces a new object for sharing scores and counts in `TopFieldCollector` in order to get earlier, but still accurate, termination. See the Jira for more dicsussion

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] msokolov commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves

Posted by GitBox <gi...@apache.org>.
msokolov commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves
URL: https://github.com/apache/lucene-solr/pull/1235#discussion_r374964273
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/search/MaxScoreTerminator.java
 ##########
 @@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * MaxScoreTerminator is notified periodically by leaf collectors calling {@link #update}
+ * with their worst (ie maximum) score
+ * and how many hits they have collected.  When enough hits are collected, Scoreboard notifies
+ * noncompetitive leaf collectors when they can stop (early terminate) by returning true from
+ * its {@link #update} method.
+ * 
+ * <p>At any moment, N leaves have reported their counts of documents collected; documents are
+ * collected in score order, so these counts represent the best for each leaf. And we also track
+ * the scores of the lowest-scoring (most recently collected) document in each leaf.</p>
+ *
+ * <p>Once the total number of documents collected reaches the requested total (numHits),
+ * the worst-scoring leaf can no longer contribute any documents to the results, so it can be
+ * terminated, and any leaves whose scores rise above that worst score is no longer competitive and
+ * can also be terminated. If we kept a global priority queue we could update the global maximum
+ * competitive score, and use that as a termination threshold, but assuming this to be too costly
+ * due to thread contention, we seek to more cheaply update an upper bound on the worst score.
+ * Specifically, when a leaf is terminated, if the remaining leaves together have collected >= numHits,
+ * then we can update the maximum to the max of *their* max scores, excluding the terminated leaf's max
+ * from consideration.</p>
+ *
+ * <p> In practice this leads to a good bound on the number of documents collected, which tend to
+ * exceed numHits by a small factor.  When the documents are evenly distributed among N segments,
+ * we expect to collect approximately (N+1/N) * numHits documents. In a worst case, where *all*
+ * the best documents are in a single segment, we expect to collect something O(log N) ie
+ * (1/N + 1/N-1 + ... + 1) * numHits documents, which is still much better than
+ * the N * numHits we would collect with a naive strategy.
+ * </p>
+ */
+class MaxScoreTerminator {
+
+  // we use 2^5-1 to check the remainder with a bitwise operation
+  // private static final int DEFAULT_INTERVAL_BITS = 10;
+  private static final int DEFAULT_INTERVAL_BITS = 2;
+  private int interval;
+  int intervalMask;
+
+  /** The worst score for each leaf */
+  private final List<LeafState> leafStates;
+
+  /** The total number of docs to collect: from the Collector's numHits */
+  final int totalToCollect;
+
+  /** An upper bound on the number of docs "excluded" from max-score accounting due to early termination. */
+  private int numExcludedBound;
+
+  /** A lower bound on the total hits collected by all leaves */
+  int totalCollected;
 
 Review comment:
   It's only ever written or read (except in a nonthreaded unit test) in `updateWorstHit`, called by `updateLeafState`, which is synchronized, so I think it's correct?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] mikemccand commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves

Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves
URL: https://github.com/apache/lucene-solr/pull/1235#discussion_r374715741
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/search/MaxScoreTerminator.java
 ##########
 @@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * MaxScoreTerminator is notified periodically by leaf collectors calling {@link #update}
+ * with their worst (ie maximum) score
+ * and how many hits they have collected.  When enough hits are collected, Scoreboard notifies
+ * noncompetitive leaf collectors when they can stop (early terminate) by returning true from
+ * its {@link #update} method.
+ * 
+ * <p>At any moment, N leaves have reported their counts of documents collected; documents are
+ * collected in score order, so these counts represent the best for each leaf. And we also track
+ * the scores of the lowest-scoring (most recently collected) document in each leaf.</p>
+ *
+ * <p>Once the total number of documents collected reaches the requested total (numHits),
+ * the worst-scoring leaf can no longer contribute any documents to the results, so it can be
+ * terminated, and any leaves whose scores rise above that worst score is no longer competitive and
+ * can also be terminated. If we kept a global priority queue we could update the global maximum
+ * competitive score, and use that as a termination threshold, but assuming this to be too costly
+ * due to thread contention, we seek to more cheaply update an upper bound on the worst score.
+ * Specifically, when a leaf is terminated, if the remaining leaves together have collected >= numHits,
+ * then we can update the maximum to the max of *their* max scores, excluding the terminated leaf's max
+ * from consideration.</p>
+ *
+ * <p> In practice this leads to a good bound on the number of documents collected, which tend to
+ * exceed numHits by a small factor.  When the documents are evenly distributed among N segments,
+ * we expect to collect approximately (N+1/N) * numHits documents. In a worst case, where *all*
+ * the best documents are in a single segment, we expect to collect something O(log N) ie
+ * (1/N + 1/N-1 + ... + 1) * numHits documents, which is still much better than
+ * the N * numHits we would collect with a naive strategy.
+ * </p>
+ */
+class MaxScoreTerminator {
+
+  // we use 2^5-1 to check the remainder with a bitwise operation
+  // private static final int DEFAULT_INTERVAL_BITS = 10;
+  private static final int DEFAULT_INTERVAL_BITS = 2;
+  private int interval;
+  int intervalMask;
+
+  /** The worst score for each leaf */
+  private final List<LeafState> leafStates;
+
+  /** The total number of docs to collect: from the Collector's numHits */
+  final int totalToCollect;
+
+  /** An upper bound on the number of docs "excluded" from max-score accounting due to early termination. */
+  private int numExcludedBound;
+
+  /** A lower bound on the total hits collected by all leaves */
+  int totalCollected;
+
+  /** the worst hit over all */
+  LeafState leafState;
 
 Review comment:
   Maybe rename to `worstLeafState`?  Can/should this also be `private`?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] msokolov commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves

Posted by GitBox <gi...@apache.org>.
msokolov commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves
URL: https://github.com/apache/lucene-solr/pull/1235#discussion_r374954900
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/search/MaxScoreTerminator.java
 ##########
 @@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * MaxScoreTerminator is notified periodically by leaf collectors calling {@link #update}
+ * with their worst (ie maximum) score
+ * and how many hits they have collected.  When enough hits are collected, Scoreboard notifies
+ * noncompetitive leaf collectors when they can stop (early terminate) by returning true from
+ * its {@link #update} method.
+ * 
+ * <p>At any moment, N leaves have reported their counts of documents collected; documents are
+ * collected in score order, so these counts represent the best for each leaf. And we also track
+ * the scores of the lowest-scoring (most recently collected) document in each leaf.</p>
+ *
+ * <p>Once the total number of documents collected reaches the requested total (numHits),
+ * the worst-scoring leaf can no longer contribute any documents to the results, so it can be
+ * terminated, and any leaves whose scores rise above that worst score is no longer competitive and
+ * can also be terminated. If we kept a global priority queue we could update the global maximum
+ * competitive score, and use that as a termination threshold, but assuming this to be too costly
+ * due to thread contention, we seek to more cheaply update an upper bound on the worst score.
+ * Specifically, when a leaf is terminated, if the remaining leaves together have collected >= numHits,
+ * then we can update the maximum to the max of *their* max scores, excluding the terminated leaf's max
+ * from consideration.</p>
+ *
+ * <p> In practice this leads to a good bound on the number of documents collected, which tend to
+ * exceed numHits by a small factor.  When the documents are evenly distributed among N segments,
+ * we expect to collect approximately (N+1/N) * numHits documents. In a worst case, where *all*
+ * the best documents are in a single segment, we expect to collect something O(log N) ie
+ * (1/N + 1/N-1 + ... + 1) * numHits documents, which is still much better than
+ * the N * numHits we would collect with a naive strategy.
+ * </p>
+ */
+class MaxScoreTerminator {
+
+  // we use 2^5-1 to check the remainder with a bitwise operation
 
 Review comment:
   Yes, that's right. I'm changing to 5 (leaf collectors should call for every 32d collected hit), but this nominal value is never used in practice.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] msokolov commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves

Posted by GitBox <gi...@apache.org>.
msokolov commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves
URL: https://github.com/apache/lucene-solr/pull/1235#discussion_r375474191
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/search/MaxScoreTerminator.java
 ##########
 @@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * MaxScoreTerminator is notified periodically by leaf collectors calling {@link #update}
+ * with their worst (ie maximum) score
+ * and how many hits they have collected.  When enough hits are collected, Scoreboard notifies
+ * noncompetitive leaf collectors when they can stop (early terminate) by returning true from
+ * its {@link #update} method.
+ * 
+ * <p>At any moment, N leaves have reported their counts of documents collected; documents are
+ * collected in score order, so these counts represent the best for each leaf. And we also track
+ * the scores of the lowest-scoring (most recently collected) document in each leaf.</p>
+ *
+ * <p>Once the total number of documents collected reaches the requested total (numHits),
+ * the worst-scoring leaf can no longer contribute any documents to the results, so it can be
+ * terminated, and any leaves whose scores rise above that worst score is no longer competitive and
+ * can also be terminated. If we kept a global priority queue we could update the global maximum
+ * competitive score, and use that as a termination threshold, but assuming this to be too costly
+ * due to thread contention, we seek to more cheaply update an upper bound on the worst score.
+ * Specifically, when a leaf is terminated, if the remaining leaves together have collected >= numHits,
+ * then we can update the maximum to the max of *their* max scores, excluding the terminated leaf's max
+ * from consideration.</p>
+ *
+ * <p> In practice this leads to a good bound on the number of documents collected, which tend to
+ * exceed numHits by a small factor.  When the documents are evenly distributed among N segments,
+ * we expect to collect approximately (N+1/N) * numHits documents. In a worst case, where *all*
+ * the best documents are in a single segment, we expect to collect something O(log N) ie
+ * (1/N + 1/N-1 + ... + 1) * numHits documents, which is still much better than
+ * the N * numHits we would collect with a naive strategy.
+ * </p>
+ */
+class MaxScoreTerminator {
+
+  // we use 2^5-1 to check the remainder with a bitwise operation
+  // private static final int DEFAULT_INTERVAL_BITS = 10;
+  private static final int DEFAULT_INTERVAL_BITS = 2;
+  private int interval;
+  int intervalMask;
+
+  /** The worst score for each leaf */
+  private final List<LeafState> leafStates;
+
+  /** The total number of docs to collect: from the Collector's numHits */
+  final int totalToCollect;
+
+  /** An upper bound on the number of docs "excluded" from max-score accounting due to early termination. */
+  private int numExcludedBound;
+
+  /** A lower bound on the total hits collected by all leaves */
+  int totalCollected;
+
+  /** the worst hit over all */
+  LeafState leafState;
+
+  MaxScoreTerminator(int totalToCollect) {
+    leafStates = new ArrayList<>();
+    this.totalToCollect = totalToCollect;
+    setIntervalBits(DEFAULT_INTERVAL_BITS);
+  }
+
+  synchronized LeafState add() {
+    LeafState newLeafState = new LeafState();
+    leafStates.add(newLeafState);
+    if (leafState == null) {
+      leafState = newLeafState;
+    }
+    return newLeafState;
+  }
+
+  // for testing
+  void setIntervalBits(int bitCount) {
+    interval = 1 << bitCount;
+    intervalMask = interval - 1;
+  }
+
+  /**
+   * Called by leaf collectors periodically to update their progress.
+   * @param newLeafState the leaf collector's current lowest score
+   * @return whether the collector should terminate
+   */
+  synchronized boolean update(LeafState newLeafState) {
+    totalCollected += interval;
+    //System.out.println(" scoreboard totalCollected = " + totalCollected + "/" + totalToCollect + " "
+    //  + newLeafState + " ? " + leafState + ":" + newLeafState.compareTo(leafState));
+    if (newLeafState.compareTo(leafState) >= 0 || newLeafState.resultCount >= totalToCollect) {
+      leafState = newLeafState;
+      if (totalCollected >= totalToCollect) {
+        // in this case, we have enough scores: remove the worst leaves
+        updateWorstHit();
+        // and tell the current leaf collector to terminate since it can
+        // no longer contribute any top hits
+        return true;
+      }
+    }
+    return false;
+  }
+
+  private void updateWorstHit() {
+    //System.out.println("scoreboard updateWorstHit");
+    if (leafStates.size() > 1) {
+      Collections.sort(leafStates);
+      while (leafStates.size() > 1) {
+        LeafState worst = leafStates.get(leafStates.size() - 1);
+        if (totalCollected - numExcludedBound - worst.resultCount >= totalToCollect) {
+          //System.out.println(" and remove");
+          // We don't need the worst leaf in order to get enough results, so
+          // remember how many results (upper bound) it accounted for
+          numExcludedBound += worst.resultCount;
+          // and remove it from the list of worstHits
+          leafStates.remove(leafStates.size() - 1);
+        } else {
+          break;
+        }
+      }
+      // and update the worstHit with the worst one that remains after exclusion
+      leafState = leafStates.get(leafStates.size() - 1);
+    }
+  }
+
+  /**
+   * For tracking the worst scoring hit and total number of hits collected by each leaf
+   */
+  class LeafState implements Comparable<LeafState> {
+
+    private float worstScore;
+    int resultCount;
+
+    LeafState() {
+      this.worstScore = -1;
+    }
+
+    void update(float score) {
+      assert score >= this.worstScore;
+      this.worstScore = score;
+      ++this.resultCount;
+    }
+
+    @Override
+    public int compareTo(LeafState o) {
+      return Float.compare(worstScore, o.worstScore);
 
 Review comment:
   OK multiple fields is not a problem. After a long detour (where I learned how our test seeds work), I found and fixed a small bug when we use multiple sort keys with inverse ordering. Also, I am posting a new commit that switches to using double for scores.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] msokolov commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves

Posted by GitBox <gi...@apache.org>.
msokolov commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves
URL: https://github.com/apache/lucene-solr/pull/1235#discussion_r374966534
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/search/MaxScoreTerminator.java
 ##########
 @@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * MaxScoreTerminator is notified periodically by leaf collectors calling {@link #update}
+ * with their worst (ie maximum) score
+ * and how many hits they have collected.  When enough hits are collected, Scoreboard notifies
+ * noncompetitive leaf collectors when they can stop (early terminate) by returning true from
+ * its {@link #update} method.
+ * 
+ * <p>At any moment, N leaves have reported their counts of documents collected; documents are
+ * collected in score order, so these counts represent the best for each leaf. And we also track
+ * the scores of the lowest-scoring (most recently collected) document in each leaf.</p>
+ *
+ * <p>Once the total number of documents collected reaches the requested total (numHits),
+ * the worst-scoring leaf can no longer contribute any documents to the results, so it can be
+ * terminated, and any leaves whose scores rise above that worst score is no longer competitive and
+ * can also be terminated. If we kept a global priority queue we could update the global maximum
+ * competitive score, and use that as a termination threshold, but assuming this to be too costly
+ * due to thread contention, we seek to more cheaply update an upper bound on the worst score.
+ * Specifically, when a leaf is terminated, if the remaining leaves together have collected >= numHits,
+ * then we can update the maximum to the max of *their* max scores, excluding the terminated leaf's max
+ * from consideration.</p>
+ *
+ * <p> In practice this leads to a good bound on the number of documents collected, which tend to
+ * exceed numHits by a small factor.  When the documents are evenly distributed among N segments,
+ * we expect to collect approximately (N+1/N) * numHits documents. In a worst case, where *all*
+ * the best documents are in a single segment, we expect to collect something O(log N) ie
+ * (1/N + 1/N-1 + ... + 1) * numHits documents, which is still much better than
+ * the N * numHits we would collect with a naive strategy.
+ * </p>
+ */
+class MaxScoreTerminator {
+
+  // we use 2^5-1 to check the remainder with a bitwise operation
+  // private static final int DEFAULT_INTERVAL_BITS = 10;
+  private static final int DEFAULT_INTERVAL_BITS = 2;
+  private int interval;
+  int intervalMask;
+
+  /** The worst score for each leaf */
+  private final List<LeafState> leafStates;
+
+  /** The total number of docs to collect: from the Collector's numHits */
+  final int totalToCollect;
+
+  /** An upper bound on the number of docs "excluded" from max-score accounting due to early termination. */
+  private int numExcludedBound;
+
+  /** A lower bound on the total hits collected by all leaves */
+  int totalCollected;
+
+  /** the worst hit over all */
+  LeafState leafState;
+
+  MaxScoreTerminator(int totalToCollect) {
+    leafStates = new ArrayList<>();
+    this.totalToCollect = totalToCollect;
+    setIntervalBits(DEFAULT_INTERVAL_BITS);
+  }
+
+  synchronized LeafState add() {
+    LeafState newLeafState = new LeafState();
+    leafStates.add(newLeafState);
+    if (leafState == null) {
+      leafState = newLeafState;
+    }
+    return newLeafState;
+  }
+
+  // for testing
+  void setIntervalBits(int bitCount) {
+    interval = 1 << bitCount;
+    intervalMask = interval - 1;
+  }
+
+  /**
+   * Called by leaf collectors periodically to update their progress.
+   * @param newLeafState the leaf collector's current lowest score
+   * @return whether the collector should terminate
+   */
+  synchronized boolean update(LeafState newLeafState) {
+    totalCollected += interval;
 
 Review comment:
   Yes it's true that calling threads must abide by the calling frequency, but the distinction between competitive and noncompetitive hits is moot since it will never be called in the non-competitive case: if the leaf knows its hit is noncompetitive, it is expected to terminate. I'll beef up the javadocs here

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] mikemccand commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves

Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves
URL: https://github.com/apache/lucene-solr/pull/1235#discussion_r374712567
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/search/MaxScoreTerminator.java
 ##########
 @@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * MaxScoreTerminator is notified periodically by leaf collectors calling {@link #update}
+ * with their worst (ie maximum) score
+ * and how many hits they have collected.  When enough hits are collected, Scoreboard notifies
+ * noncompetitive leaf collectors when they can stop (early terminate) by returning true from
+ * its {@link #update} method.
+ * 
+ * <p>At any moment, N leaves have reported their counts of documents collected; documents are
+ * collected in score order, so these counts represent the best for each leaf. And we also track
+ * the scores of the lowest-scoring (most recently collected) document in each leaf.</p>
+ *
+ * <p>Once the total number of documents collected reaches the requested total (numHits),
+ * the worst-scoring leaf can no longer contribute any documents to the results, so it can be
+ * terminated, and any leaves whose scores rise above that worst score is no longer competitive and
+ * can also be terminated. If we kept a global priority queue we could update the global maximum
+ * competitive score, and use that as a termination threshold, but assuming this to be too costly
+ * due to thread contention, we seek to more cheaply update an upper bound on the worst score.
+ * Specifically, when a leaf is terminated, if the remaining leaves together have collected >= numHits,
+ * then we can update the maximum to the max of *their* max scores, excluding the terminated leaf's max
+ * from consideration.</p>
+ *
+ * <p> In practice this leads to a good bound on the number of documents collected, which tend to
+ * exceed numHits by a small factor.  When the documents are evenly distributed among N segments,
+ * we expect to collect approximately (N+1/N) * numHits documents. In a worst case, where *all*
+ * the best documents are in a single segment, we expect to collect something O(log N) ie
+ * (1/N + 1/N-1 + ... + 1) * numHits documents, which is still much better than
+ * the N * numHits we would collect with a naive strategy.
+ * </p>
+ */
+class MaxScoreTerminator {
+
+  // we use 2^5-1 to check the remainder with a bitwise operation
+  // private static final int DEFAULT_INTERVAL_BITS = 10;
+  private static final int DEFAULT_INTERVAL_BITS = 2;
+  private int interval;
+  int intervalMask;
+
+  /** The worst score for each leaf */
+  private final List<LeafState> leafStates;
+
+  /** The total number of docs to collect: from the Collector's numHits */
+  final int totalToCollect;
+
+  /** An upper bound on the number of docs "excluded" from max-score accounting due to early termination. */
+  private int numExcludedBound;
+
+  /** A lower bound on the total hits collected by all leaves */
+  int totalCollected;
+
+  /** the worst hit over all */
+  LeafState leafState;
+
+  MaxScoreTerminator(int totalToCollect) {
+    leafStates = new ArrayList<>();
+    this.totalToCollect = totalToCollect;
+    setIntervalBits(DEFAULT_INTERVAL_BITS);
+  }
+
+  synchronized LeafState add() {
+    LeafState newLeafState = new LeafState();
+    leafStates.add(newLeafState);
+    if (leafState == null) {
+      leafState = newLeafState;
+    }
+    return newLeafState;
+  }
+
+  // for testing
 
 Review comment:
   Hmm, but we also call this from `IndexSearcher` "for real"?  Maybe remove this comment?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] mikemccand commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves

Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves
URL: https://github.com/apache/lucene-solr/pull/1235#discussion_r374728166
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/search/MaxScoreTerminator.java
 ##########
 @@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * MaxScoreTerminator is notified periodically by leaf collectors calling {@link #update}
+ * with their worst (ie maximum) score
+ * and how many hits they have collected.  When enough hits are collected, Scoreboard notifies
+ * noncompetitive leaf collectors when they can stop (early terminate) by returning true from
+ * its {@link #update} method.
+ * 
+ * <p>At any moment, N leaves have reported their counts of documents collected; documents are
+ * collected in score order, so these counts represent the best for each leaf. And we also track
+ * the scores of the lowest-scoring (most recently collected) document in each leaf.</p>
+ *
+ * <p>Once the total number of documents collected reaches the requested total (numHits),
+ * the worst-scoring leaf can no longer contribute any documents to the results, so it can be
+ * terminated, and any leaves whose scores rise above that worst score is no longer competitive and
+ * can also be terminated. If we kept a global priority queue we could update the global maximum
+ * competitive score, and use that as a termination threshold, but assuming this to be too costly
+ * due to thread contention, we seek to more cheaply update an upper bound on the worst score.
+ * Specifically, when a leaf is terminated, if the remaining leaves together have collected >= numHits,
+ * then we can update the maximum to the max of *their* max scores, excluding the terminated leaf's max
+ * from consideration.</p>
+ *
+ * <p> In practice this leads to a good bound on the number of documents collected, which tend to
+ * exceed numHits by a small factor.  When the documents are evenly distributed among N segments,
+ * we expect to collect approximately (N+1/N) * numHits documents. In a worst case, where *all*
+ * the best documents are in a single segment, we expect to collect something O(log N) ie
+ * (1/N + 1/N-1 + ... + 1) * numHits documents, which is still much better than
+ * the N * numHits we would collect with a naive strategy.
+ * </p>
+ */
+class MaxScoreTerminator {
+
+  // we use 2^5-1 to check the remainder with a bitwise operation
+  // private static final int DEFAULT_INTERVAL_BITS = 10;
+  private static final int DEFAULT_INTERVAL_BITS = 2;
+  private int interval;
+  int intervalMask;
+
+  /** The worst score for each leaf */
+  private final List<LeafState> leafStates;
+
+  /** The total number of docs to collect: from the Collector's numHits */
+  final int totalToCollect;
+
+  /** An upper bound on the number of docs "excluded" from max-score accounting due to early termination. */
+  private int numExcludedBound;
+
+  /** A lower bound on the total hits collected by all leaves */
+  int totalCollected;
+
+  /** the worst hit over all */
+  LeafState leafState;
+
+  MaxScoreTerminator(int totalToCollect) {
+    leafStates = new ArrayList<>();
+    this.totalToCollect = totalToCollect;
+    setIntervalBits(DEFAULT_INTERVAL_BITS);
+  }
+
+  synchronized LeafState add() {
+    LeafState newLeafState = new LeafState();
+    leafStates.add(newLeafState);
+    if (leafState == null) {
+      leafState = newLeafState;
+    }
+    return newLeafState;
+  }
+
+  // for testing
+  void setIntervalBits(int bitCount) {
+    interval = 1 << bitCount;
+    intervalMask = interval - 1;
+  }
+
+  /**
+   * Called by leaf collectors periodically to update their progress.
+   * @param newLeafState the leaf collector's current lowest score
+   * @return whether the collector should terminate
+   */
+  synchronized boolean update(LeafState newLeafState) {
+    totalCollected += interval;
+    //System.out.println(" scoreboard totalCollected = " + totalCollected + "/" + totalToCollect + " "
+    //  + newLeafState + " ? " + leafState + ":" + newLeafState.compareTo(leafState));
+    if (newLeafState.compareTo(leafState) >= 0 || newLeafState.resultCount >= totalToCollect) {
+      leafState = newLeafState;
+      if (totalCollected >= totalToCollect) {
+        // in this case, we have enough scores: remove the worst leaves
+        updateWorstHit();
+        // and tell the current leaf collector to terminate since it can
+        // no longer contribute any top hits
+        return true;
+      }
+    }
+    return false;
+  }
+
+  private void updateWorstHit() {
+    //System.out.println("scoreboard updateWorstHit");
+    if (leafStates.size() > 1) {
+      Collections.sort(leafStates);
 
 Review comment:
   A priority queue should be better here, in general?  It can insert and return the worst `LeafState` in `O(log(N))` time, amortized, whereas this sort is `O(N * log(N))` cost.  But maybe since the number of leaves is typically "smallish", it would not matter much in practice (the constant factors hidden in those big-Os would offset any gains)?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] mikemccand commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves

Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves
URL: https://github.com/apache/lucene-solr/pull/1235#discussion_r374732814
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java
 ##########
 @@ -310,14 +338,18 @@ public void collect(int doc) throws IOException {
 
   final int numHits;
   final HitsThresholdChecker hitsThresholdChecker;
-  final FieldComparator.RelevanceComparator firstComparator;
+  final FieldComparator.RelevanceComparator relevanceComparator;
   final boolean canSetMinScore;
+  final FieldComparator<?> firstComparator;
 
-  // an accumulator that maintains the maximum of the segment's minimum competitive scores
+  // an accumulator that maintains the maximum of the segments' minimum competitive scores
 
 Review comment:
   ;)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] mikemccand commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves

Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves
URL: https://github.com/apache/lucene-solr/pull/1235#discussion_r374712903
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/search/MaxScoreTerminator.java
 ##########
 @@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * MaxScoreTerminator is notified periodically by leaf collectors calling {@link #update}
+ * with their worst (ie maximum) score
+ * and how many hits they have collected.  When enough hits are collected, Scoreboard notifies
+ * noncompetitive leaf collectors when they can stop (early terminate) by returning true from
+ * its {@link #update} method.
+ * 
+ * <p>At any moment, N leaves have reported their counts of documents collected; documents are
+ * collected in score order, so these counts represent the best for each leaf. And we also track
+ * the scores of the lowest-scoring (most recently collected) document in each leaf.</p>
+ *
+ * <p>Once the total number of documents collected reaches the requested total (numHits),
+ * the worst-scoring leaf can no longer contribute any documents to the results, so it can be
+ * terminated, and any leaves whose scores rise above that worst score is no longer competitive and
+ * can also be terminated. If we kept a global priority queue we could update the global maximum
+ * competitive score, and use that as a termination threshold, but assuming this to be too costly
+ * due to thread contention, we seek to more cheaply update an upper bound on the worst score.
+ * Specifically, when a leaf is terminated, if the remaining leaves together have collected >= numHits,
+ * then we can update the maximum to the max of *their* max scores, excluding the terminated leaf's max
+ * from consideration.</p>
+ *
+ * <p> In practice this leads to a good bound on the number of documents collected, which tend to
+ * exceed numHits by a small factor.  When the documents are evenly distributed among N segments,
+ * we expect to collect approximately (N+1/N) * numHits documents. In a worst case, where *all*
+ * the best documents are in a single segment, we expect to collect something O(log N) ie
+ * (1/N + 1/N-1 + ... + 1) * numHits documents, which is still much better than
+ * the N * numHits we would collect with a naive strategy.
+ * </p>
+ */
+class MaxScoreTerminator {
+
+  // we use 2^5-1 to check the remainder with a bitwise operation
 
 Review comment:
   Comment is stale?  Should it say `by default, we use 2^2-1 to check the remainder with a bitwise operation`?
   
   Does this default mean that every thread will check in ever 4 competitive hits (hits that it retained, e.g. in its leaf-private PriorityQueue) it collected?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] msokolov commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves

Posted by GitBox <gi...@apache.org>.
msokolov commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves
URL: https://github.com/apache/lucene-solr/pull/1235#discussion_r374974584
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/search/MaxScoreTerminator.java
 ##########
 @@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * MaxScoreTerminator is notified periodically by leaf collectors calling {@link #update}
+ * with their worst (ie maximum) score
+ * and how many hits they have collected.  When enough hits are collected, Scoreboard notifies
+ * noncompetitive leaf collectors when they can stop (early terminate) by returning true from
+ * its {@link #update} method.
+ * 
+ * <p>At any moment, N leaves have reported their counts of documents collected; documents are
+ * collected in score order, so these counts represent the best for each leaf. And we also track
+ * the scores of the lowest-scoring (most recently collected) document in each leaf.</p>
+ *
+ * <p>Once the total number of documents collected reaches the requested total (numHits),
+ * the worst-scoring leaf can no longer contribute any documents to the results, so it can be
+ * terminated, and any leaves whose scores rise above that worst score is no longer competitive and
+ * can also be terminated. If we kept a global priority queue we could update the global maximum
+ * competitive score, and use that as a termination threshold, but assuming this to be too costly
+ * due to thread contention, we seek to more cheaply update an upper bound on the worst score.
+ * Specifically, when a leaf is terminated, if the remaining leaves together have collected >= numHits,
+ * then we can update the maximum to the max of *their* max scores, excluding the terminated leaf's max
+ * from consideration.</p>
+ *
+ * <p> In practice this leads to a good bound on the number of documents collected, which tend to
+ * exceed numHits by a small factor.  When the documents are evenly distributed among N segments,
+ * we expect to collect approximately (N+1/N) * numHits documents. In a worst case, where *all*
+ * the best documents are in a single segment, we expect to collect something O(log N) ie
+ * (1/N + 1/N-1 + ... + 1) * numHits documents, which is still much better than
+ * the N * numHits we would collect with a naive strategy.
+ * </p>
+ */
+class MaxScoreTerminator {
+
+  // we use 2^5-1 to check the remainder with a bitwise operation
+  // private static final int DEFAULT_INTERVAL_BITS = 10;
+  private static final int DEFAULT_INTERVAL_BITS = 2;
+  private int interval;
+  int intervalMask;
+
+  /** The worst score for each leaf */
+  private final List<LeafState> leafStates;
+
+  /** The total number of docs to collect: from the Collector's numHits */
+  final int totalToCollect;
+
+  /** An upper bound on the number of docs "excluded" from max-score accounting due to early termination. */
+  private int numExcludedBound;
+
+  /** A lower bound on the total hits collected by all leaves */
+  int totalCollected;
+
+  /** the worst hit over all */
+  LeafState leafState;
+
+  MaxScoreTerminator(int totalToCollect) {
+    leafStates = new ArrayList<>();
+    this.totalToCollect = totalToCollect;
+    setIntervalBits(DEFAULT_INTERVAL_BITS);
+  }
+
+  synchronized LeafState add() {
+    LeafState newLeafState = new LeafState();
+    leafStates.add(newLeafState);
+    if (leafState == null) {
+      leafState = newLeafState;
+    }
+    return newLeafState;
+  }
+
+  // for testing
+  void setIntervalBits(int bitCount) {
+    interval = 1 << bitCount;
+    intervalMask = interval - 1;
+  }
+
+  /**
+   * Called by leaf collectors periodically to update their progress.
+   * @param newLeafState the leaf collector's current lowest score
+   * @return whether the collector should terminate
+   */
+  synchronized boolean update(LeafState newLeafState) {
+    totalCollected += interval;
+    //System.out.println(" scoreboard totalCollected = " + totalCollected + "/" + totalToCollect + " "
+    //  + newLeafState + " ? " + leafState + ":" + newLeafState.compareTo(leafState));
+    if (newLeafState.compareTo(leafState) >= 0 || newLeafState.resultCount >= totalToCollect) {
+      leafState = newLeafState;
+      if (totalCollected >= totalToCollect) {
+        // in this case, we have enough scores: remove the worst leaves
+        updateWorstHit();
+        // and tell the current leaf collector to terminate since it can
+        // no longer contribute any top hits
+        return true;
+      }
+    }
+    return false;
+  }
+
+  private void updateWorstHit() {
+    //System.out.println("scoreboard updateWorstHit");
+    if (leafStates.size() > 1) {
+      Collections.sort(leafStates);
 
 Review comment:
   I suppose we could re-insert all items into a PQ here instead of strictly sorting them though, yeah might be worth it

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] mikemccand commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves

Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves
URL: https://github.com/apache/lucene-solr/pull/1235#discussion_r374731757
 
 

 ##########
 File path: lucene/core/src/test/org/apache/lucene/search/TestTopFieldCollector.java
 ##########
 @@ -690,4 +697,94 @@ public void testRandomMinCompetitiveScore() throws Exception {
     dir.close();
   }
 
+  public void testConcurrentScoreboardTermination() throws Exception {
+    try(Directory dir = newDirectory();
+        IndexWriter w = new IndexWriter(dir, newIndexWriterConfig().setMergePolicy(NoMergePolicy.INSTANCE))) {
+      Document doc = new Document();
+      w.addDocuments(Arrays.asList(doc, doc, doc, doc, doc));
+      w.flush();
+      w.addDocuments(Arrays.asList(doc, doc, doc, doc, doc, doc));
+      w.flush();
+      w.addDocuments(Arrays.asList(doc, doc));
+      w.flush();
+      try (IndexReader reader = DirectoryReader.open(w)) {
+        assertEquals(3, reader.leaves().size());
+
+        Sort sort = new Sort(SortField.FIELD_DOC);
+        CollectorManager<TopFieldCollector, TopFieldDocs> manager =
+            TopFieldCollector.createSharedManager(sort, 2, null, 0);
+        TopFieldCollector collector = manager.newCollector();
+        TopFieldCollector collector2 = manager.newCollector();
+        assertNotNull(collector.maxScoreTerminator);
+        assertSame(collector.maxScoreTerminator, collector2.maxScoreTerminator);
+        assertEquals(0, collector.maxScoreTerminator.totalCollected);
+        assertEquals(2, collector.maxScoreTerminator.totalToCollect);
+        collector.maxScoreTerminator.setIntervalBits(0);
+
+        LeafCollector leafCollector0 = collector.getLeafCollector(reader.leaves().get(0));
+        LeafCollector leafCollector1 = collector2.getLeafCollector(reader.leaves().get(1));
+        LeafCollector leafCollector2 = collector2.getLeafCollector(reader.leaves().get(2));
+
+        leafCollector1.collect(0);
+        leafCollector0.collect(0); // we have collected numHits=2 docs, but there are better ones remaining
+        expectThrows(CollectionTerminatedException.class, () -> leafCollector1.collect(1));
+        expectThrows(CollectionTerminatedException.class, () -> leafCollector0.collect(1)); // now we have all the top docs
+        expectThrows(CollectionTerminatedException.class, () -> leafCollector2.collect(0));
+      }
+    }
+  }
+
+  public void testRandomMaxScoreTermination() throws Exception {
 
 Review comment:
   Have you tried beasting this test?  If there are any bugs in the termination logic it seems like this test (given enough iterations) might uncover them ... `newIndexWriterConfig` and `RandomIndexWriter` are especially fun since they so heavily randomize how the docs are written to segments ...

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] mikemccand commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves

Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves
URL: https://github.com/apache/lucene-solr/pull/1235#discussion_r374717026
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/search/MaxScoreTerminator.java
 ##########
 @@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * MaxScoreTerminator is notified periodically by leaf collectors calling {@link #update}
+ * with their worst (ie maximum) score
+ * and how many hits they have collected.  When enough hits are collected, Scoreboard notifies
+ * noncompetitive leaf collectors when they can stop (early terminate) by returning true from
+ * its {@link #update} method.
+ * 
+ * <p>At any moment, N leaves have reported their counts of documents collected; documents are
+ * collected in score order, so these counts represent the best for each leaf. And we also track
+ * the scores of the lowest-scoring (most recently collected) document in each leaf.</p>
+ *
+ * <p>Once the total number of documents collected reaches the requested total (numHits),
+ * the worst-scoring leaf can no longer contribute any documents to the results, so it can be
+ * terminated, and any leaves whose scores rise above that worst score is no longer competitive and
+ * can also be terminated. If we kept a global priority queue we could update the global maximum
+ * competitive score, and use that as a termination threshold, but assuming this to be too costly
+ * due to thread contention, we seek to more cheaply update an upper bound on the worst score.
+ * Specifically, when a leaf is terminated, if the remaining leaves together have collected >= numHits,
+ * then we can update the maximum to the max of *their* max scores, excluding the terminated leaf's max
+ * from consideration.</p>
+ *
+ * <p> In practice this leads to a good bound on the number of documents collected, which tend to
+ * exceed numHits by a small factor.  When the documents are evenly distributed among N segments,
+ * we expect to collect approximately (N+1/N) * numHits documents. In a worst case, where *all*
+ * the best documents are in a single segment, we expect to collect something O(log N) ie
+ * (1/N + 1/N-1 + ... + 1) * numHits documents, which is still much better than
+ * the N * numHits we would collect with a naive strategy.
+ * </p>
+ */
+class MaxScoreTerminator {
+
+  // we use 2^5-1 to check the remainder with a bitwise operation
+  // private static final int DEFAULT_INTERVAL_BITS = 10;
+  private static final int DEFAULT_INTERVAL_BITS = 2;
+  private int interval;
+  int intervalMask;
+
+  /** The worst score for each leaf */
+  private final List<LeafState> leafStates;
+
+  /** The total number of docs to collect: from the Collector's numHits */
+  final int totalToCollect;
+
+  /** An upper bound on the number of docs "excluded" from max-score accounting due to early termination. */
+  private int numExcludedBound;
+
+  /** A lower bound on the total hits collected by all leaves */
+  int totalCollected;
+
+  /** the worst hit over all */
+  LeafState leafState;
+
+  MaxScoreTerminator(int totalToCollect) {
+    leafStates = new ArrayList<>();
+    this.totalToCollect = totalToCollect;
+    setIntervalBits(DEFAULT_INTERVAL_BITS);
+  }
+
+  synchronized LeafState add() {
+    LeafState newLeafState = new LeafState();
+    leafStates.add(newLeafState);
+    if (leafState == null) {
+      leafState = newLeafState;
+    }
+    return newLeafState;
+  }
+
+  // for testing
+  void setIntervalBits(int bitCount) {
+    interval = 1 << bitCount;
+    intervalMask = interval - 1;
+  }
+
+  /**
+   * Called by leaf collectors periodically to update their progress.
+   * @param newLeafState the leaf collector's current lowest score
+   * @return whether the collector should terminate
+   */
+  synchronized boolean update(LeafState newLeafState) {
+    totalCollected += interval;
 
 Review comment:
   So this means every thread must check in on precisely each `interval` hits, and it looks like it's every `interval` hits, not just the hits that were competitive?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] mikemccand commented on issue #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves

Posted by GitBox <gi...@apache.org>.
mikemccand commented on issue #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves
URL: https://github.com/apache/lucene-solr/pull/1235#issuecomment-581942691
 
 
   Hmm the strange looking empty file `lucene/core/.attach_pid19164` was added here.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] msokolov commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves

Posted by GitBox <gi...@apache.org>.
msokolov commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves
URL: https://github.com/apache/lucene-solr/pull/1235#discussion_r374961543
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/search/MaxScoreTerminator.java
 ##########
 @@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * MaxScoreTerminator is notified periodically by leaf collectors calling {@link #update}
+ * with their worst (ie maximum) score
+ * and how many hits they have collected.  When enough hits are collected, Scoreboard notifies
+ * noncompetitive leaf collectors when they can stop (early terminate) by returning true from
+ * its {@link #update} method.
+ * 
+ * <p>At any moment, N leaves have reported their counts of documents collected; documents are
+ * collected in score order, so these counts represent the best for each leaf. And we also track
+ * the scores of the lowest-scoring (most recently collected) document in each leaf.</p>
+ *
+ * <p>Once the total number of documents collected reaches the requested total (numHits),
+ * the worst-scoring leaf can no longer contribute any documents to the results, so it can be
+ * terminated, and any leaves whose scores rise above that worst score is no longer competitive and
+ * can also be terminated. If we kept a global priority queue we could update the global maximum
+ * competitive score, and use that as a termination threshold, but assuming this to be too costly
+ * due to thread contention, we seek to more cheaply update an upper bound on the worst score.
+ * Specifically, when a leaf is terminated, if the remaining leaves together have collected >= numHits,
+ * then we can update the maximum to the max of *their* max scores, excluding the terminated leaf's max
+ * from consideration.</p>
+ *
+ * <p> In practice this leads to a good bound on the number of documents collected, which tend to
+ * exceed numHits by a small factor.  When the documents are evenly distributed among N segments,
+ * we expect to collect approximately (N+1/N) * numHits documents. In a worst case, where *all*
+ * the best documents are in a single segment, we expect to collect something O(log N) ie
+ * (1/N + 1/N-1 + ... + 1) * numHits documents, which is still much better than
+ * the N * numHits we would collect with a naive strategy.
+ * </p>
+ */
+class MaxScoreTerminator {
+
+  // we use 2^5-1 to check the remainder with a bitwise operation
+  // private static final int DEFAULT_INTERVAL_BITS = 10;
+  private static final int DEFAULT_INTERVAL_BITS = 2;
+  private int interval;
+  int intervalMask;
+
+  /** The worst score for each leaf */
+  private final List<LeafState> leafStates;
+
+  /** The total number of docs to collect: from the Collector's numHits */
+  final int totalToCollect;
+
+  /** An upper bound on the number of docs "excluded" from max-score accounting due to early termination. */
+  private int numExcludedBound;
+
+  /** A lower bound on the total hits collected by all leaves */
+  int totalCollected;
+
+  /** the worst hit over all */
+  LeafState leafState;
+
+  MaxScoreTerminator(int totalToCollect) {
+    leafStates = new ArrayList<>();
+    this.totalToCollect = totalToCollect;
+    setIntervalBits(DEFAULT_INTERVAL_BITS);
+  }
+
+  synchronized LeafState add() {
+    LeafState newLeafState = new LeafState();
+    leafStates.add(newLeafState);
+    if (leafState == null) {
+      leafState = newLeafState;
+    }
+    return newLeafState;
+  }
+
+  // for testing
+  void setIntervalBits(int bitCount) {
+    interval = 1 << bitCount;
+    intervalMask = interval - 1;
+  }
+
+  /**
+   * Called by leaf collectors periodically to update their progress.
+   * @param newLeafState the leaf collector's current lowest score
+   * @return whether the collector should terminate
+   */
+  synchronized boolean update(LeafState newLeafState) {
+    totalCollected += interval;
+    //System.out.println(" scoreboard totalCollected = " + totalCollected + "/" + totalToCollect + " "
+    //  + newLeafState + " ? " + leafState + ":" + newLeafState.compareTo(leafState));
+    if (newLeafState.compareTo(leafState) >= 0 || newLeafState.resultCount >= totalToCollect) {
+      leafState = newLeafState;
+      if (totalCollected >= totalToCollect) {
+        // in this case, we have enough scores: remove the worst leaves
+        updateWorstHit();
+        // and tell the current leaf collector to terminate since it can
+        // no longer contribute any top hits
+        return true;
+      }
+    }
+    return false;
+  }
+
+  private void updateWorstHit() {
+    //System.out.println("scoreboard updateWorstHit");
+    if (leafStates.size() > 1) {
+      Collections.sort(leafStates);
+      while (leafStates.size() > 1) {
+        LeafState worst = leafStates.get(leafStates.size() - 1);
+        if (totalCollected - numExcludedBound - worst.resultCount >= totalToCollect) {
+          //System.out.println(" and remove");
+          // We don't need the worst leaf in order to get enough results, so
+          // remember how many results (upper bound) it accounted for
+          numExcludedBound += worst.resultCount;
+          // and remove it from the list of worstHits
+          leafStates.remove(leafStates.size() - 1);
+        } else {
+          break;
+        }
+      }
+      // and update the worstHit with the worst one that remains after exclusion
+      leafState = leafStates.get(leafStates.size() - 1);
+    }
+  }
+
+  /**
+   * For tracking the worst scoring hit and total number of hits collected by each leaf
+   */
+  class LeafState implements Comparable<LeafState> {
+
+    private float worstScore;
+    int resultCount;
+
+    LeafState() {
+      this.worstScore = -1;
+    }
+
+    void update(float score) {
+      assert score >= this.worstScore;
+      this.worstScore = score;
+      ++this.resultCount;
+    }
+
+    @Override
+    public int compareTo(LeafState o) {
+      return Float.compare(worstScore, o.worstScore);
 
 Review comment:
   We only terminate based on the first component of the score; we basically behave as if subsequent components were all equal. Hmm could be a lurking bug there. We might have to limit this optimization to Sort where this only a single field. Given that, is float sufficient? I guess we could be sorting on a double-valued field, so we should switch to double.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] mikemccand commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves

Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves
URL: https://github.com/apache/lucene-solr/pull/1235#discussion_r374716208
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/search/MaxScoreTerminator.java
 ##########
 @@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * MaxScoreTerminator is notified periodically by leaf collectors calling {@link #update}
+ * with their worst (ie maximum) score
+ * and how many hits they have collected.  When enough hits are collected, Scoreboard notifies
+ * noncompetitive leaf collectors when they can stop (early terminate) by returning true from
+ * its {@link #update} method.
+ * 
+ * <p>At any moment, N leaves have reported their counts of documents collected; documents are
+ * collected in score order, so these counts represent the best for each leaf. And we also track
+ * the scores of the lowest-scoring (most recently collected) document in each leaf.</p>
+ *
+ * <p>Once the total number of documents collected reaches the requested total (numHits),
+ * the worst-scoring leaf can no longer contribute any documents to the results, so it can be
+ * terminated, and any leaves whose scores rise above that worst score is no longer competitive and
+ * can also be terminated. If we kept a global priority queue we could update the global maximum
+ * competitive score, and use that as a termination threshold, but assuming this to be too costly
+ * due to thread contention, we seek to more cheaply update an upper bound on the worst score.
+ * Specifically, when a leaf is terminated, if the remaining leaves together have collected >= numHits,
+ * then we can update the maximum to the max of *their* max scores, excluding the terminated leaf's max
+ * from consideration.</p>
+ *
+ * <p> In practice this leads to a good bound on the number of documents collected, which tend to
+ * exceed numHits by a small factor.  When the documents are evenly distributed among N segments,
+ * we expect to collect approximately (N+1/N) * numHits documents. In a worst case, where *all*
+ * the best documents are in a single segment, we expect to collect something O(log N) ie
+ * (1/N + 1/N-1 + ... + 1) * numHits documents, which is still much better than
+ * the N * numHits we would collect with a naive strategy.
+ * </p>
+ */
+class MaxScoreTerminator {
+
+  // we use 2^5-1 to check the remainder with a bitwise operation
+  // private static final int DEFAULT_INTERVAL_BITS = 10;
+  private static final int DEFAULT_INTERVAL_BITS = 2;
+  private int interval;
+  int intervalMask;
+
+  /** The worst score for each leaf */
+  private final List<LeafState> leafStates;
+
+  /** The total number of docs to collect: from the Collector's numHits */
+  final int totalToCollect;
+
+  /** An upper bound on the number of docs "excluded" from max-score accounting due to early termination. */
+  private int numExcludedBound;
+
+  /** A lower bound on the total hits collected by all leaves */
+  int totalCollected;
 
 Review comment:
   Hmm why not also `private`?  Should it be `volatile` (if it will be read/written across threads w/o `synchronized`)?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] mikemccand commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves

Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves
URL: https://github.com/apache/lucene-solr/pull/1235#discussion_r374739163
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/search/MaxScoreTerminator.java
 ##########
 @@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * MaxScoreTerminator is notified periodically by leaf collectors calling {@link #update}
+ * with their worst (ie maximum) score
+ * and how many hits they have collected.  When enough hits are collected, Scoreboard notifies
+ * noncompetitive leaf collectors when they can stop (early terminate) by returning true from
+ * its {@link #update} method.
+ * 
+ * <p>At any moment, N leaves have reported their counts of documents collected; documents are
+ * collected in score order, so these counts represent the best for each leaf. And we also track
+ * the scores of the lowest-scoring (most recently collected) document in each leaf.</p>
+ *
+ * <p>Once the total number of documents collected reaches the requested total (numHits),
+ * the worst-scoring leaf can no longer contribute any documents to the results, so it can be
+ * terminated, and any leaves whose scores rise above that worst score is no longer competitive and
+ * can also be terminated. If we kept a global priority queue we could update the global maximum
+ * competitive score, and use that as a termination threshold, but assuming this to be too costly
+ * due to thread contention, we seek to more cheaply update an upper bound on the worst score.
+ * Specifically, when a leaf is terminated, if the remaining leaves together have collected >= numHits,
+ * then we can update the maximum to the max of *their* max scores, excluding the terminated leaf's max
+ * from consideration.</p>
+ *
+ * <p> In practice this leads to a good bound on the number of documents collected, which tend to
+ * exceed numHits by a small factor.  When the documents are evenly distributed among N segments,
+ * we expect to collect approximately (N+1/N) * numHits documents. In a worst case, where *all*
+ * the best documents are in a single segment, we expect to collect something O(log N) ie
+ * (1/N + 1/N-1 + ... + 1) * numHits documents, which is still much better than
+ * the N * numHits we would collect with a naive strategy.
+ * </p>
+ */
+class MaxScoreTerminator {
+
+  // we use 2^5-1 to check the remainder with a bitwise operation
+  // private static final int DEFAULT_INTERVAL_BITS = 10;
+  private static final int DEFAULT_INTERVAL_BITS = 2;
+  private int interval;
+  int intervalMask;
+
+  /** The worst score for each leaf */
+  private final List<LeafState> leafStates;
+
+  /** The total number of docs to collect: from the Collector's numHits */
+  final int totalToCollect;
+
+  /** An upper bound on the number of docs "excluded" from max-score accounting due to early termination. */
+  private int numExcludedBound;
+
+  /** A lower bound on the total hits collected by all leaves */
+  int totalCollected;
+
+  /** the worst hit over all */
+  LeafState leafState;
+
+  MaxScoreTerminator(int totalToCollect) {
+    leafStates = new ArrayList<>();
+    this.totalToCollect = totalToCollect;
+    setIntervalBits(DEFAULT_INTERVAL_BITS);
+  }
+
+  synchronized LeafState add() {
+    LeafState newLeafState = new LeafState();
+    leafStates.add(newLeafState);
+    if (leafState == null) {
+      leafState = newLeafState;
+    }
+    return newLeafState;
+  }
+
+  // for testing
+  void setIntervalBits(int bitCount) {
+    interval = 1 << bitCount;
+    intervalMask = interval - 1;
+  }
+
+  /**
+   * Called by leaf collectors periodically to update their progress.
+   * @param newLeafState the leaf collector's current lowest score
+   * @return whether the collector should terminate
+   */
+  synchronized boolean update(LeafState newLeafState) {
+    totalCollected += interval;
+    //System.out.println(" scoreboard totalCollected = " + totalCollected + "/" + totalToCollect + " "
+    //  + newLeafState + " ? " + leafState + ":" + newLeafState.compareTo(leafState));
+    if (newLeafState.compareTo(leafState) >= 0 || newLeafState.resultCount >= totalToCollect) {
+      leafState = newLeafState;
+      if (totalCollected >= totalToCollect) {
+        // in this case, we have enough scores: remove the worst leaves
+        updateWorstHit();
+        // and tell the current leaf collector to terminate since it can
+        // no longer contribute any top hits
+        return true;
+      }
+    }
+    return false;
+  }
+
+  private void updateWorstHit() {
+    //System.out.println("scoreboard updateWorstHit");
+    if (leafStates.size() > 1) {
+      Collections.sort(leafStates);
+      while (leafStates.size() > 1) {
+        LeafState worst = leafStates.get(leafStates.size() - 1);
+        if (totalCollected - numExcludedBound - worst.resultCount >= totalToCollect) {
+          //System.out.println(" and remove");
+          // We don't need the worst leaf in order to get enough results, so
+          // remember how many results (upper bound) it accounted for
+          numExcludedBound += worst.resultCount;
+          // and remove it from the list of worstHits
+          leafStates.remove(leafStates.size() - 1);
+        } else {
+          break;
+        }
+      }
+      // and update the worstHit with the worst one that remains after exclusion
+      leafState = leafStates.get(leafStates.size() - 1);
+    }
+  }
+
+  /**
+   * For tracking the worst scoring hit and total number of hits collected by each leaf
+   */
+  class LeafState implements Comparable<LeafState> {
+
+    private float worstScore;
+    int resultCount;
+
+    LeafState() {
+      this.worstScore = -1;
+    }
+
+    void update(float score) {
+      assert score >= this.worstScore;
+      this.worstScore = score;
+      ++this.resultCount;
+    }
+
+    @Override
+    public int compareTo(LeafState o) {
+      return Float.compare(worstScore, o.worstScore);
 
 Review comment:
   Hmm why `Float` and not `Double`?  What if the sort was a compound sort (sorting on multiple fields)?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] mikemccand commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves

Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves
URL: https://github.com/apache/lucene-solr/pull/1235#discussion_r374736845
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/search/MaxScoreTerminator.java
 ##########
 @@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * MaxScoreTerminator is notified periodically by leaf collectors calling {@link #update}
 
 Review comment:
   Maybe state up front that this only applies in the static (index time) sort case, when the query's requested sort order is congruent (also sorting by that index time sort)?
   
   And also (ideally?) applies to the concurrent case, where separate threads are searching each segment.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] mikemccand commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves

Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves
URL: https://github.com/apache/lucene-solr/pull/1235#discussion_r374713297
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/search/MaxScoreTerminator.java
 ##########
 @@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * MaxScoreTerminator is notified periodically by leaf collectors calling {@link #update}
+ * with their worst (ie maximum) score
+ * and how many hits they have collected.  When enough hits are collected, Scoreboard notifies
+ * noncompetitive leaf collectors when they can stop (early terminate) by returning true from
+ * its {@link #update} method.
+ * 
+ * <p>At any moment, N leaves have reported their counts of documents collected; documents are
+ * collected in score order, so these counts represent the best for each leaf. And we also track
+ * the scores of the lowest-scoring (most recently collected) document in each leaf.</p>
+ *
+ * <p>Once the total number of documents collected reaches the requested total (numHits),
+ * the worst-scoring leaf can no longer contribute any documents to the results, so it can be
+ * terminated, and any leaves whose scores rise above that worst score is no longer competitive and
+ * can also be terminated. If we kept a global priority queue we could update the global maximum
+ * competitive score, and use that as a termination threshold, but assuming this to be too costly
+ * due to thread contention, we seek to more cheaply update an upper bound on the worst score.
+ * Specifically, when a leaf is terminated, if the remaining leaves together have collected >= numHits,
+ * then we can update the maximum to the max of *their* max scores, excluding the terminated leaf's max
+ * from consideration.</p>
+ *
+ * <p> In practice this leads to a good bound on the number of documents collected, which tend to
+ * exceed numHits by a small factor.  When the documents are evenly distributed among N segments,
+ * we expect to collect approximately (N+1/N) * numHits documents. In a worst case, where *all*
+ * the best documents are in a single segment, we expect to collect something O(log N) ie
+ * (1/N + 1/N-1 + ... + 1) * numHits documents, which is still much better than
+ * the N * numHits we would collect with a naive strategy.
+ * </p>
+ */
+class MaxScoreTerminator {
+
+  // we use 2^5-1 to check the remainder with a bitwise operation
+  // private static final int DEFAULT_INTERVAL_BITS = 10;
+  private static final int DEFAULT_INTERVAL_BITS = 2;
+  private int interval;
+  int intervalMask;
+
+  /** The worst score for each leaf */
+  private final List<LeafState> leafStates;
+
+  /** The total number of docs to collect: from the Collector's numHits */
+  final int totalToCollect;
+
+  /** An upper bound on the number of docs "excluded" from max-score accounting due to early termination. */
+  private int numExcludedBound;
+
+  /** A lower bound on the total hits collected by all leaves */
+  int totalCollected;
+
+  /** the worst hit over all */
+  LeafState leafState;
+
+  MaxScoreTerminator(int totalToCollect) {
+    leafStates = new ArrayList<>();
+    this.totalToCollect = totalToCollect;
+    setIntervalBits(DEFAULT_INTERVAL_BITS);
+  }
+
+  synchronized LeafState add() {
 
 Review comment:
   Maybe name `addLeafState` or `addLeaf`?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] mikemccand commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves

Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves
URL: https://github.com/apache/lucene-solr/pull/1235#discussion_r374734594
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java
 ##########
 @@ -69,6 +69,107 @@ public void setScorer(Scorable scorer) throws IOException {
     }
   }
 
+  private abstract class TopFieldLeafCollector extends MultiComparatorLeafCollector {
 
 Review comment:
   This source (`TopFieldCollector.java`) has really become too crazy ... we should at some point pull all these inner classes out to their own source files ... thank you for opening https://issues.apache.org/jira/browse/LUCENE-9202

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] msokolov commented on issue #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves

Posted by GitBox <gi...@apache.org>.
msokolov commented on issue #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves
URL: https://github.com/apache/lucene-solr/pull/1235#issuecomment-594828845
 
 
   This PR can no longer be rebased cleanl and I can't see what combination of git magic to use to preserve history here while rebasing; also, since it uses the all-caps label, it generates a lot of email noise. I'll open a new PR with the latest so this history can be preserved 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] msokolov commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves

Posted by GitBox <gi...@apache.org>.
msokolov commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves
URL: https://github.com/apache/lucene-solr/pull/1235#discussion_r374969973
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java
 ##########
 @@ -69,6 +69,107 @@ public void setScorer(Scorable scorer) throws IOException {
     }
   }
 
+  private abstract class TopFieldLeafCollector extends MultiComparatorLeafCollector {
 
 Review comment:
   hmm maybe we can do more there, but given how this is stacked on the refactoring PR in that issue I'd rather avoid to much more refactoring rn.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] mikemccand commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves

Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves
URL: https://github.com/apache/lucene-solr/pull/1235#discussion_r374733272
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java
 ##########
 @@ -310,14 +338,18 @@ public void collect(int doc) throws IOException {
 
   final int numHits;
   final HitsThresholdChecker hitsThresholdChecker;
-  final FieldComparator.RelevanceComparator firstComparator;
+  final FieldComparator.RelevanceComparator relevanceComparator;
   final boolean canSetMinScore;
+  final FieldComparator<?> firstComparator;
 
-  // an accumulator that maintains the maximum of the segment's minimum competitive scores
+  // an accumulator that maintains the maximum of the segments' minimum competitive scores
   final MaxScoreAccumulator minScoreAcc;
   // the current local minimum competitive score already propagated to the underlying scorer
   float minCompetitiveScore;
 
+  // Enables global early termination using minimum competitive scores and collected counts of all segments
 
 Review comment:
   Maybe also say "with concurrent search threads"?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] msokolov commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves

Posted by GitBox <gi...@apache.org>.
msokolov commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves
URL: https://github.com/apache/lucene-solr/pull/1235#discussion_r374967483
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/search/MaxScoreTerminator.java
 ##########
 @@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * MaxScoreTerminator is notified periodically by leaf collectors calling {@link #update}
+ * with their worst (ie maximum) score
+ * and how many hits they have collected.  When enough hits are collected, Scoreboard notifies
+ * noncompetitive leaf collectors when they can stop (early terminate) by returning true from
+ * its {@link #update} method.
+ * 
+ * <p>At any moment, N leaves have reported their counts of documents collected; documents are
+ * collected in score order, so these counts represent the best for each leaf. And we also track
+ * the scores of the lowest-scoring (most recently collected) document in each leaf.</p>
+ *
+ * <p>Once the total number of documents collected reaches the requested total (numHits),
+ * the worst-scoring leaf can no longer contribute any documents to the results, so it can be
+ * terminated, and any leaves whose scores rise above that worst score is no longer competitive and
+ * can also be terminated. If we kept a global priority queue we could update the global maximum
+ * competitive score, and use that as a termination threshold, but assuming this to be too costly
+ * due to thread contention, we seek to more cheaply update an upper bound on the worst score.
+ * Specifically, when a leaf is terminated, if the remaining leaves together have collected >= numHits,
+ * then we can update the maximum to the max of *their* max scores, excluding the terminated leaf's max
+ * from consideration.</p>
+ *
+ * <p> In practice this leads to a good bound on the number of documents collected, which tend to
+ * exceed numHits by a small factor.  When the documents are evenly distributed among N segments,
+ * we expect to collect approximately (N+1/N) * numHits documents. In a worst case, where *all*
+ * the best documents are in a single segment, we expect to collect something O(log N) ie
+ * (1/N + 1/N-1 + ... + 1) * numHits documents, which is still much better than
+ * the N * numHits we would collect with a naive strategy.
+ * </p>
+ */
+class MaxScoreTerminator {
+
+  // we use 2^5-1 to check the remainder with a bitwise operation
+  // private static final int DEFAULT_INTERVAL_BITS = 10;
+  private static final int DEFAULT_INTERVAL_BITS = 2;
+  private int interval;
+  int intervalMask;
+
+  /** The worst score for each leaf */
+  private final List<LeafState> leafStates;
+
+  /** The total number of docs to collect: from the Collector's numHits */
+  final int totalToCollect;
+
+  /** An upper bound on the number of docs "excluded" from max-score accounting due to early termination. */
+  private int numExcludedBound;
+
+  /** A lower bound on the total hits collected by all leaves */
+  int totalCollected;
+
+  /** the worst hit over all */
+  LeafState leafState;
+
+  MaxScoreTerminator(int totalToCollect) {
+    leafStates = new ArrayList<>();
+    this.totalToCollect = totalToCollect;
+    setIntervalBits(DEFAULT_INTERVAL_BITS);
+  }
+
+  synchronized LeafState add() {
+    LeafState newLeafState = new LeafState();
+    leafStates.add(newLeafState);
+    if (leafState == null) {
+      leafState = newLeafState;
+    }
+    return newLeafState;
+  }
+
+  // for testing
+  void setIntervalBits(int bitCount) {
+    interval = 1 << bitCount;
+    intervalMask = interval - 1;
+  }
+
+  /**
+   * Called by leaf collectors periodically to update their progress.
+   * @param newLeafState the leaf collector's current lowest score
+   * @return whether the collector should terminate
+   */
+  synchronized boolean update(LeafState newLeafState) {
+    totalCollected += interval;
+    //System.out.println(" scoreboard totalCollected = " + totalCollected + "/" + totalToCollect + " "
+    //  + newLeafState + " ? " + leafState + ":" + newLeafState.compareTo(leafState));
+    if (newLeafState.compareTo(leafState) >= 0 || newLeafState.resultCount >= totalToCollect) {
+      leafState = newLeafState;
+      if (totalCollected >= totalToCollect) {
+        // in this case, we have enough scores: remove the worst leaves
+        updateWorstHit();
+        // and tell the current leaf collector to terminate since it can
+        // no longer contribute any top hits
+        return true;
+      }
+    }
+    return false;
+  }
+
+  private void updateWorstHit() {
+    //System.out.println("scoreboard updateWorstHit");
+    if (leafStates.size() > 1) {
 
 Review comment:
   IndexSearcher bypasses this `if (executor == null || leafSlices.length <= 1) {` . We update leafStates by removing them as we terminate the leaves though, so this will eventually drop to 1.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] msokolov closed pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves

Posted by GitBox <gi...@apache.org>.
msokolov closed pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves
URL: https://github.com/apache/lucene-solr/pull/1235
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] msokolov commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves

Posted by GitBox <gi...@apache.org>.
msokolov commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves
URL: https://github.com/apache/lucene-solr/pull/1235#discussion_r375006270
 
 

 ##########
 File path: lucene/core/src/test/org/apache/lucene/search/TestTopFieldCollector.java
 ##########
 @@ -690,4 +697,94 @@ public void testRandomMinCompetitiveScore() throws Exception {
     dir.close();
   }
 
+  public void testConcurrentScoreboardTermination() throws Exception {
+    try(Directory dir = newDirectory();
+        IndexWriter w = new IndexWriter(dir, newIndexWriterConfig().setMergePolicy(NoMergePolicy.INSTANCE))) {
+      Document doc = new Document();
+      w.addDocuments(Arrays.asList(doc, doc, doc, doc, doc));
+      w.flush();
+      w.addDocuments(Arrays.asList(doc, doc, doc, doc, doc, doc));
+      w.flush();
+      w.addDocuments(Arrays.asList(doc, doc));
+      w.flush();
+      try (IndexReader reader = DirectoryReader.open(w)) {
+        assertEquals(3, reader.leaves().size());
+
+        Sort sort = new Sort(SortField.FIELD_DOC);
+        CollectorManager<TopFieldCollector, TopFieldDocs> manager =
+            TopFieldCollector.createSharedManager(sort, 2, null, 0);
+        TopFieldCollector collector = manager.newCollector();
+        TopFieldCollector collector2 = manager.newCollector();
+        assertNotNull(collector.maxScoreTerminator);
+        assertSame(collector.maxScoreTerminator, collector2.maxScoreTerminator);
+        assertEquals(0, collector.maxScoreTerminator.totalCollected);
+        assertEquals(2, collector.maxScoreTerminator.totalToCollect);
+        collector.maxScoreTerminator.setIntervalBits(0);
+
+        LeafCollector leafCollector0 = collector.getLeafCollector(reader.leaves().get(0));
+        LeafCollector leafCollector1 = collector2.getLeafCollector(reader.leaves().get(1));
+        LeafCollector leafCollector2 = collector2.getLeafCollector(reader.leaves().get(2));
+
+        leafCollector1.collect(0);
+        leafCollector0.collect(0); // we have collected numHits=2 docs, but there are better ones remaining
+        expectThrows(CollectionTerminatedException.class, () -> leafCollector1.collect(1));
+        expectThrows(CollectionTerminatedException.class, () -> leafCollector0.collect(1)); // now we have all the top docs
+        expectThrows(CollectionTerminatedException.class, () -> leafCollector2.collect(0));
+      }
+    }
+  }
+
+  public void testRandomMaxScoreTermination() throws Exception {
 
 Review comment:
   OK I beasted at 1000 times, all good, but then I added extra SortField to the Sort and it hit issues, so I'll need to address.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] msokolov commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves

Posted by GitBox <gi...@apache.org>.
msokolov commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves
URL: https://github.com/apache/lucene-solr/pull/1235#discussion_r374969554
 
 

 ##########
 File path: lucene/core/src/test/org/apache/lucene/search/TestTopFieldCollector.java
 ##########
 @@ -690,4 +697,94 @@ public void testRandomMinCompetitiveScore() throws Exception {
     dir.close();
   }
 
+  public void testConcurrentScoreboardTermination() throws Exception {
+    try(Directory dir = newDirectory();
+        IndexWriter w = new IndexWriter(dir, newIndexWriterConfig().setMergePolicy(NoMergePolicy.INSTANCE))) {
+      Document doc = new Document();
+      w.addDocuments(Arrays.asList(doc, doc, doc, doc, doc));
+      w.flush();
+      w.addDocuments(Arrays.asList(doc, doc, doc, doc, doc, doc));
+      w.flush();
+      w.addDocuments(Arrays.asList(doc, doc));
+      w.flush();
+      try (IndexReader reader = DirectoryReader.open(w)) {
+        assertEquals(3, reader.leaves().size());
+
+        Sort sort = new Sort(SortField.FIELD_DOC);
+        CollectorManager<TopFieldCollector, TopFieldDocs> manager =
+            TopFieldCollector.createSharedManager(sort, 2, null, 0);
+        TopFieldCollector collector = manager.newCollector();
+        TopFieldCollector collector2 = manager.newCollector();
+        assertNotNull(collector.maxScoreTerminator);
+        assertSame(collector.maxScoreTerminator, collector2.maxScoreTerminator);
+        assertEquals(0, collector.maxScoreTerminator.totalCollected);
+        assertEquals(2, collector.maxScoreTerminator.totalToCollect);
+        collector.maxScoreTerminator.setIntervalBits(0);
+
+        LeafCollector leafCollector0 = collector.getLeafCollector(reader.leaves().get(0));
+        LeafCollector leafCollector1 = collector2.getLeafCollector(reader.leaves().get(1));
+        LeafCollector leafCollector2 = collector2.getLeafCollector(reader.leaves().get(2));
+
+        leafCollector1.collect(0);
+        leafCollector0.collect(0); // we have collected numHits=2 docs, but there are better ones remaining
+        expectThrows(CollectionTerminatedException.class, () -> leafCollector1.collect(1));
+        expectThrows(CollectionTerminatedException.class, () -> leafCollector0.collect(1)); // now we have all the top docs
+        expectThrows(CollectionTerminatedException.class, () -> leafCollector2.collect(0));
+      }
+    }
+  }
+
+  public void testRandomMaxScoreTermination() throws Exception {
 
 Review comment:
   Good idea - I'll try it! There is definitely the potential for a lurking bug here.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] msokolov commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves

Posted by GitBox <gi...@apache.org>.
msokolov commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves
URL: https://github.com/apache/lucene-solr/pull/1235#discussion_r374969190
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/search/MaxScoreTerminator.java
 ##########
 @@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * MaxScoreTerminator is notified periodically by leaf collectors calling {@link #update}
+ * with their worst (ie maximum) score
+ * and how many hits they have collected.  When enough hits are collected, Scoreboard notifies
+ * noncompetitive leaf collectors when they can stop (early terminate) by returning true from
+ * its {@link #update} method.
+ * 
+ * <p>At any moment, N leaves have reported their counts of documents collected; documents are
+ * collected in score order, so these counts represent the best for each leaf. And we also track
+ * the scores of the lowest-scoring (most recently collected) document in each leaf.</p>
+ *
+ * <p>Once the total number of documents collected reaches the requested total (numHits),
+ * the worst-scoring leaf can no longer contribute any documents to the results, so it can be
+ * terminated, and any leaves whose scores rise above that worst score is no longer competitive and
+ * can also be terminated. If we kept a global priority queue we could update the global maximum
+ * competitive score, and use that as a termination threshold, but assuming this to be too costly
+ * due to thread contention, we seek to more cheaply update an upper bound on the worst score.
+ * Specifically, when a leaf is terminated, if the remaining leaves together have collected >= numHits,
+ * then we can update the maximum to the max of *their* max scores, excluding the terminated leaf's max
+ * from consideration.</p>
+ *
+ * <p> In practice this leads to a good bound on the number of documents collected, which tend to
+ * exceed numHits by a small factor.  When the documents are evenly distributed among N segments,
+ * we expect to collect approximately (N+1/N) * numHits documents. In a worst case, where *all*
+ * the best documents are in a single segment, we expect to collect something O(log N) ie
+ * (1/N + 1/N-1 + ... + 1) * numHits documents, which is still much better than
+ * the N * numHits we would collect with a naive strategy.
+ * </p>
+ */
+class MaxScoreTerminator {
+
+  // we use 2^5-1 to check the remainder with a bitwise operation
+  // private static final int DEFAULT_INTERVAL_BITS = 10;
+  private static final int DEFAULT_INTERVAL_BITS = 2;
+  private int interval;
+  int intervalMask;
+
+  /** The worst score for each leaf */
+  private final List<LeafState> leafStates;
+
+  /** The total number of docs to collect: from the Collector's numHits */
+  final int totalToCollect;
+
+  /** An upper bound on the number of docs "excluded" from max-score accounting due to early termination. */
+  private int numExcludedBound;
+
+  /** A lower bound on the total hits collected by all leaves */
+  int totalCollected;
+
+  /** the worst hit over all */
+  LeafState leafState;
+
+  MaxScoreTerminator(int totalToCollect) {
+    leafStates = new ArrayList<>();
+    this.totalToCollect = totalToCollect;
+    setIntervalBits(DEFAULT_INTERVAL_BITS);
+  }
+
+  synchronized LeafState add() {
+    LeafState newLeafState = new LeafState();
+    leafStates.add(newLeafState);
+    if (leafState == null) {
+      leafState = newLeafState;
+    }
+    return newLeafState;
+  }
+
+  // for testing
+  void setIntervalBits(int bitCount) {
+    interval = 1 << bitCount;
+    intervalMask = interval - 1;
+  }
+
+  /**
+   * Called by leaf collectors periodically to update their progress.
+   * @param newLeafState the leaf collector's current lowest score
+   * @return whether the collector should terminate
+   */
+  synchronized boolean update(LeafState newLeafState) {
+    totalCollected += interval;
+    //System.out.println(" scoreboard totalCollected = " + totalCollected + "/" + totalToCollect + " "
+    //  + newLeafState + " ? " + leafState + ":" + newLeafState.compareTo(leafState));
+    if (newLeafState.compareTo(leafState) >= 0 || newLeafState.resultCount >= totalToCollect) {
+      leafState = newLeafState;
+      if (totalCollected >= totalToCollect) {
+        // in this case, we have enough scores: remove the worst leaves
+        updateWorstHit();
+        // and tell the current leaf collector to terminate since it can
+        // no longer contribute any top hits
+        return true;
+      }
+    }
+    return false;
+  }
+
+  private void updateWorstHit() {
+    //System.out.println("scoreboard updateWorstHit");
+    if (leafStates.size() > 1) {
+      Collections.sort(leafStates);
 
 Review comment:
   Well, also we do not care what the state is in between the relatively infrequent events (I think?) when we cross the `totalToCollect` boundary, so we save the effort of many calls to `updateTop` or similar. In fact I think we can improve this here by only calling `updateWorstHit` when `totalCollected - numExcludedBound >= totalToCollect` rather than the current `totalCollected >= totalToCollect`  

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] mikemccand commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves

Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves
URL: https://github.com/apache/lucene-solr/pull/1235#discussion_r374725435
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/search/MaxScoreTerminator.java
 ##########
 @@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * MaxScoreTerminator is notified periodically by leaf collectors calling {@link #update}
+ * with their worst (ie maximum) score
+ * and how many hits they have collected.  When enough hits are collected, Scoreboard notifies
+ * noncompetitive leaf collectors when they can stop (early terminate) by returning true from
+ * its {@link #update} method.
+ * 
+ * <p>At any moment, N leaves have reported their counts of documents collected; documents are
+ * collected in score order, so these counts represent the best for each leaf. And we also track
+ * the scores of the lowest-scoring (most recently collected) document in each leaf.</p>
+ *
+ * <p>Once the total number of documents collected reaches the requested total (numHits),
+ * the worst-scoring leaf can no longer contribute any documents to the results, so it can be
+ * terminated, and any leaves whose scores rise above that worst score is no longer competitive and
+ * can also be terminated. If we kept a global priority queue we could update the global maximum
+ * competitive score, and use that as a termination threshold, but assuming this to be too costly
+ * due to thread contention, we seek to more cheaply update an upper bound on the worst score.
+ * Specifically, when a leaf is terminated, if the remaining leaves together have collected >= numHits,
+ * then we can update the maximum to the max of *their* max scores, excluding the terminated leaf's max
+ * from consideration.</p>
+ *
+ * <p> In practice this leads to a good bound on the number of documents collected, which tend to
+ * exceed numHits by a small factor.  When the documents are evenly distributed among N segments,
+ * we expect to collect approximately (N+1/N) * numHits documents. In a worst case, where *all*
+ * the best documents are in a single segment, we expect to collect something O(log N) ie
+ * (1/N + 1/N-1 + ... + 1) * numHits documents, which is still much better than
+ * the N * numHits we would collect with a naive strategy.
+ * </p>
+ */
+class MaxScoreTerminator {
+
+  // we use 2^5-1 to check the remainder with a bitwise operation
+  // private static final int DEFAULT_INTERVAL_BITS = 10;
+  private static final int DEFAULT_INTERVAL_BITS = 2;
+  private int interval;
+  int intervalMask;
+
+  /** The worst score for each leaf */
+  private final List<LeafState> leafStates;
+
+  /** The total number of docs to collect: from the Collector's numHits */
+  final int totalToCollect;
+
+  /** An upper bound on the number of docs "excluded" from max-score accounting due to early termination. */
+  private int numExcludedBound;
+
+  /** A lower bound on the total hits collected by all leaves */
+  int totalCollected;
+
+  /** the worst hit over all */
+  LeafState leafState;
+
+  MaxScoreTerminator(int totalToCollect) {
+    leafStates = new ArrayList<>();
+    this.totalToCollect = totalToCollect;
+    setIntervalBits(DEFAULT_INTERVAL_BITS);
+  }
+
+  synchronized LeafState add() {
+    LeafState newLeafState = new LeafState();
+    leafStates.add(newLeafState);
+    if (leafState == null) {
+      leafState = newLeafState;
+    }
+    return newLeafState;
+  }
+
+  // for testing
+  void setIntervalBits(int bitCount) {
+    interval = 1 << bitCount;
+    intervalMask = interval - 1;
+  }
+
+  /**
+   * Called by leaf collectors periodically to update their progress.
+   * @param newLeafState the leaf collector's current lowest score
+   * @return whether the collector should terminate
+   */
+  synchronized boolean update(LeafState newLeafState) {
+    totalCollected += interval;
+    //System.out.println(" scoreboard totalCollected = " + totalCollected + "/" + totalToCollect + " "
+    //  + newLeafState + " ? " + leafState + ":" + newLeafState.compareTo(leafState));
+    if (newLeafState.compareTo(leafState) >= 0 || newLeafState.resultCount >= totalToCollect) {
+      leafState = newLeafState;
+      if (totalCollected >= totalToCollect) {
+        // in this case, we have enough scores: remove the worst leaves
+        updateWorstHit();
+        // and tell the current leaf collector to terminate since it can
+        // no longer contribute any top hits
+        return true;
+      }
+    }
+    return false;
+  }
+
+  private void updateWorstHit() {
+    //System.out.println("scoreboard updateWorstHit");
+    if (leafStates.size() > 1) {
 
 Review comment:
   Hmm, but if there is exactly one leaf we never set the worst `LeafState`?  Do we somehow bypass all this logic if the index has only one segment?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] msokolov commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves

Posted by GitBox <gi...@apache.org>.
msokolov commented on a change in pull request #1235: LUCENE-8929: parallel early termination sharing counts and max scores across leaves
URL: https://github.com/apache/lucene-solr/pull/1235#discussion_r374954381
 
 

 ##########
 File path: lucene/core/src/java/org/apache/lucene/search/MaxScoreTerminator.java
 ##########
 @@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * MaxScoreTerminator is notified periodically by leaf collectors calling {@link #update}
+ * with their worst (ie maximum) score
+ * and how many hits they have collected.  When enough hits are collected, Scoreboard notifies
+ * noncompetitive leaf collectors when they can stop (early terminate) by returning true from
+ * its {@link #update} method.
+ * 
+ * <p>At any moment, N leaves have reported their counts of documents collected; documents are
+ * collected in score order, so these counts represent the best for each leaf. And we also track
+ * the scores of the lowest-scoring (most recently collected) document in each leaf.</p>
+ *
+ * <p>Once the total number of documents collected reaches the requested total (numHits),
+ * the worst-scoring leaf can no longer contribute any documents to the results, so it can be
+ * terminated, and any leaves whose scores rise above that worst score is no longer competitive and
+ * can also be terminated. If we kept a global priority queue we could update the global maximum
+ * competitive score, and use that as a termination threshold, but assuming this to be too costly
+ * due to thread contention, we seek to more cheaply update an upper bound on the worst score.
+ * Specifically, when a leaf is terminated, if the remaining leaves together have collected >= numHits,
+ * then we can update the maximum to the max of *their* max scores, excluding the terminated leaf's max
+ * from consideration.</p>
+ *
+ * <p> In practice this leads to a good bound on the number of documents collected, which tend to
+ * exceed numHits by a small factor.  When the documents are evenly distributed among N segments,
+ * we expect to collect approximately (N+1/N) * numHits documents. In a worst case, where *all*
+ * the best documents are in a single segment, we expect to collect something O(log N) ie
+ * (1/N + 1/N-1 + ... + 1) * numHits documents, which is still much better than
+ * the N * numHits we would collect with a naive strategy.
+ * </p>
+ */
+class MaxScoreTerminator {
+
+  // we use 2^5-1 to check the remainder with a bitwise operation
+  // private static final int DEFAULT_INTERVAL_BITS = 10;
+  private static final int DEFAULT_INTERVAL_BITS = 2;
+  private int interval;
+  int intervalMask;
+
+  /** The worst score for each leaf */
+  private final List<LeafState> leafStates;
+
+  /** The total number of docs to collect: from the Collector's numHits */
+  final int totalToCollect;
+
+  /** An upper bound on the number of docs "excluded" from max-score accounting due to early termination. */
+  private int numExcludedBound;
+
+  /** A lower bound on the total hits collected by all leaves */
+  int totalCollected;
+
+  /** the worst hit over all */
+  LeafState leafState;
+
+  MaxScoreTerminator(int totalToCollect) {
+    leafStates = new ArrayList<>();
+    this.totalToCollect = totalToCollect;
+    setIntervalBits(DEFAULT_INTERVAL_BITS);
+  }
+
+  synchronized LeafState add() {
+    LeafState newLeafState = new LeafState();
+    leafStates.add(newLeafState);
+    if (leafState == null) {
+      leafState = newLeafState;
+    }
+    return newLeafState;
+  }
+
+  // for testing
 
 Review comment:
   Yeah, I'll add a javadoc explaining how to call this and what value to use

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org