
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/02/10 16:19:27 UTC

[GitHub] [lucene-solr] balmukundblr opened a new pull request #2345: Benchmark custom

balmukundblr opened a new pull request #2345:
URL: https://github.com/apache/lucene-solr/pull/2345

# Description

Lucene Benchmark Scaling Problem with Reuters Corpus

While indexing 1 million documents with reuters21578 (plain-text documents derived from the reuters21578 corpus), we observed that index throughput does not scale with a higher number of index threads and instead degrades. The existing implementation uses a synchronized block that allows only one thread at a time to pick up a document/file from the list; this code is part of getNextDocData() in ReutersContentSource.java. With multiple index threads, this becomes a thread-contention bottleneck and prevents the system's CPU resources from being used efficiently.

# Solution
We developed a strategy to distribute the total number of files across the indexing threads, so that these threads work independently and in parallel.
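
As a rough illustration of the idea (not the code in this pull request), the distribution can be thought of as a strided assignment: with N threads, the thread in slot t reads files t, t + N, t + 2N, and so on, so no two threads ever ask for the same file and no lock is needed on the hot path. The class and method names below (`StridedPartition`, `filesForSlot`) are hypothetical and exist only for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of ReutersContentSource.
final class StridedPartition {

  /** Returns the file indexes for one thread slot: slot, slot + numThreads, slot + 2 * numThreads, ... */
  static List<Integer> filesForSlot(int slot, int numThreads, int numFiles) {
    List<Integer> assigned = new ArrayList<>();
    for (int i = slot; i < numFiles; i += numThreads) {
      assigned.add(i);
    }
    return assigned;
  }

  public static void main(String[] args) {
    int numThreads = 4; // assumed thread count for the example
    int numFiles = 10;  // assumed corpus size for the example
    for (int slot = 0; slot < numThreads; slot++) {
      System.out.println("slot " + slot + " -> " + filesForSlot(slot, numThreads, numFiles));
    }
  }
}
```

With 4 slots and 10 files, slot 0 gets files 0, 4, 8, slot 1 gets 1, 5, 9, slot 2 gets 2, 6 and slot 3 gets 3, 7. Every file is covered exactly once, provided every slot is actually served by a thread.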

# Tests

We mainly modified the existing getNextDocData() without altering its functionality, so we have not added any new test cases.

- Passed existing tests

# Checklist

Please review the following and check all that apply:

- [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms to the standards described there to the best of my ability.
- [ ] I have created a Jira issue and added the issue ID to my pull request title.
- [x] I have given Solr maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [x] I have developed this patch against the `master` branch.
- [x] I have run `./gradlew check`.
- [ ] I have added tests for my changes.
- [ ] I have added documentation for the [Ref Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) (for Solr changes only).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] balmukundblr commented on a change in pull request #2345: Benchmark custom

Posted by GitBox <gi...@apache.org>.

balmukundblr commented on a change in pull request #2345:
URL: https://github.com/apache/lucene-solr/pull/2345#discussion_r587723043



##########
File path: lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/ReutersContentSource.java
##########
@@ -102,19 +104,43 @@ public void close() throws IOException {
   public DocData getNextDocData(DocData docData) throws NoMoreDataException, IOException {
     Path f = null;
     String name = null;
-    synchronized (this) {
-      if (nextFile >= inputFiles.size()) {
-        // exhausted files, start a new round, unless forever set to false.
-        if (!forever) {
-          throw new NoMoreDataException();
-        }
-        nextFile = 0;
-        iteration++;
-      }
-      f = inputFiles.get(nextFile++);
-      name = f.toRealPath() + "_" + iteration;
+    int inputFilesSize = inputFiles.size();
+
+    /*
+     * synchronized (this) {
+     * if (nextFile >= inputFiles.size()) { // exhausted files, start a new round, unless forever set to false.
+     * if (!forever) {
+     *    throw new NoMoreDataException();
+     * }
+     * nextFile = 0;
+     * iteration++;
+     * }
+     * f = inputFiles.get(nextFile++);
+     * name = f.toRealPath() + "_" +iteration;
+     * }
+     */
+    if (!threadIndexCreated) {
+      createThreadIndex();
+    }
+
+    int index = (int) Thread.currentThread().getId() % threadIndex.length;
+    int fIndex = index + threadIndex[index] * threadIndex.length;
+    threadIndex[index]++;

Review comment:
       Although getId() is controlled by the JVM, in our case all the index threads are initialized at once, so there is a high chance of getting a sequential run of thread ids, which is also what we observed. However, we understand your concern and have tweaked our code so that it is guaranteed to reach every possible int from 0 .. threadIndex.length. We achieved this by setting a unique thread name and parsing it to calculate the index value.
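
A minimal sketch of what deriving the slot from a thread name could look like. The naming scheme (`IndexThread-<n>`), the class `ThreadNameSlot`, and the method `slotFromName` are assumptions made for this illustration; the comment above does not show the actual naming convention or parsing code.

```java
// Hypothetical helper for illustration only; the real naming scheme is not shown in this thread.
final class ThreadNameSlot {

  // Assumed prefix: threads are named "IndexThread-0", "IndexThread-1", ...
  static final String PREFIX = "IndexThread-";

  /** Parses the numeric suffix of the thread name and maps it onto one of numSlots slots. */
  static int slotFromName(String threadName, int numSlots) {
    if (!threadName.startsWith(PREFIX)) {
      throw new IllegalArgumentException("unexpected thread name: " + threadName);
    }
    int ordinal = Integer.parseInt(threadName.substring(PREFIX.length()));
    return ordinal % numSlots;
  }

  public static void main(String[] args) throws InterruptedException {
    int numSlots = 4; // assumed thread count for the example
    for (int i = 0; i < numSlots; i++) {
      Thread t = new Thread(
          () -> System.out.println(
              Thread.currentThread().getName()
                  + " -> slot "
                  + slotFromName(Thread.currentThread().getName(), numSlots)),
          PREFIX + i);
      t.start();
      t.join();
    }
  }
}
```

Because the names are assigned by the code that creates the threads, the ordinals 0 .. numSlots - 1 are each used exactly once, which is the coverage property that getId() alone cannot promise.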





[GitHub] [lucene-solr] balmukundblr commented on a change in pull request #2345: Benchmark custom

Posted by GitBox <gi...@apache.org>.

balmukundblr commented on a change in pull request #2345:
URL: https://github.com/apache/lucene-solr/pull/2345#discussion_r587706698



##########
File path: lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/ReutersContentSource.java
##########
@@ -102,19 +104,43 @@ public void close() throws IOException {
   public DocData getNextDocData(DocData docData) throws NoMoreDataException, IOException {
     Path f = null;
     String name = null;
-    synchronized (this) {
-      if (nextFile >= inputFiles.size()) {
-        // exhausted files, start a new round, unless forever set to false.
-        if (!forever) {
-          throw new NoMoreDataException();
-        }
-        nextFile = 0;
-        iteration++;
-      }
-      f = inputFiles.get(nextFile++);
-      name = f.toRealPath() + "_" + iteration;
+    int inputFilesSize = inputFiles.size();
+
+    /*
+     * synchronized (this) {
+     * if (nextFile >= inputFiles.size()) { // exhausted files, start a new round, unless forever set to false.
+     * if (!forever) {
+     *    throw new NoMoreDataException();
+     * }
+     * nextFile = 0;
+     * iteration++;
+     * }
+     * f = inputFiles.get(nextFile++);
+     * name = f.toRealPath() + "_" +iteration;
+     * }
+     */
+    if (!threadIndexCreated) {

Review comment:
       Sure, will do the required changes.





[GitHub] [lucene-solr] mikemccand commented on a change in pull request #2345: Benchmark custom

Posted by GitBox <gi...@apache.org>.

mikemccand commented on a change in pull request #2345:
URL: https://github.com/apache/lucene-solr/pull/2345#discussion_r581398351



##########
File path: lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/ReutersContentSource.java
##########
@@ -102,19 +104,43 @@ public void close() throws IOException {
   public DocData getNextDocData(DocData docData) throws NoMoreDataException, IOException {
     Path f = null;
     String name = null;
-    synchronized (this) {
-      if (nextFile >= inputFiles.size()) {
-        // exhausted files, start a new round, unless forever set to false.
-        if (!forever) {
-          throw new NoMoreDataException();
-        }
-        nextFile = 0;
-        iteration++;
-      }
-      f = inputFiles.get(nextFile++);
-      name = f.toRealPath() + "_" + iteration;
+    int inputFilesSize = inputFiles.size();
+
+    /*
+     * synchronized (this) {
+     * if (nextFile >= inputFiles.size()) { // exhausted files, start a new round, unless forever set to false.
+     * if (!forever) {
+     *    throw new NoMoreDataException();
+     * }
+     * nextFile = 0;
+     * iteration++;
+     * }
+     * f = inputFiles.get(nextFile++);
+     * name = f.toRealPath() + "_" +iteration;
+     * }
+     */
+    if (!threadIndexCreated) {

Review comment:
       `if (threadIndexCreated == false) {` instead (to reduce chance of accidental future refactoring bugs)?  This likely won't pass our code style checker (`gradle precommit`).

##########
File path: lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/ReutersContentSource.java
##########
@@ -102,19 +104,43 @@ public void close() throws IOException {
   public DocData getNextDocData(DocData docData) throws NoMoreDataException, IOException {
     Path f = null;
     String name = null;
-    synchronized (this) {
-      if (nextFile >= inputFiles.size()) {
-        // exhausted files, start a new round, unless forever set to false.
-        if (!forever) {
-          throw new NoMoreDataException();
-        }
-        nextFile = 0;
-        iteration++;
-      }
-      f = inputFiles.get(nextFile++);
-      name = f.toRealPath() + "_" + iteration;
+    int inputFilesSize = inputFiles.size();
+
+    /*
+     * synchronized (this) {
+     * if (nextFile >= inputFiles.size()) { // exhausted files, start a new round, unless forever set to false.
+     * if (!forever) {
+     *    throw new NoMoreDataException();
+     * }
+     * nextFile = 0;
+     * iteration++;
+     * }
+     * f = inputFiles.get(nextFile++);
+     * name = f.toRealPath() + "_" +iteration;
+     * }
+     */
+    if (!threadIndexCreated) {
+      createThreadIndex();
+    }
+
+    int index = (int) Thread.currentThread().getId() % threadIndex.length;
+    int fIndex = index + threadIndex[index] * threadIndex.length;
+    threadIndex[index]++;
+
+    // Sanity check, if # threads is greater than # input files, wrap index
+    if (index >= inputFilesSize) index %= inputFilesSize;

Review comment:
       Can you move the `index %= inputFilesSize` to newline and inside `{ ... }` body?

##########
File path: lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/ReutersContentSource.java
##########
@@ -102,19 +104,43 @@ public void close() throws IOException {
   public DocData getNextDocData(DocData docData) throws NoMoreDataException, IOException {
     Path f = null;
     String name = null;
-    synchronized (this) {
-      if (nextFile >= inputFiles.size()) {
-        // exhausted files, start a new round, unless forever set to false.
-        if (!forever) {
-          throw new NoMoreDataException();
-        }
-        nextFile = 0;
-        iteration++;
-      }
-      f = inputFiles.get(nextFile++);
-      name = f.toRealPath() + "_" + iteration;
+    int inputFilesSize = inputFiles.size();
+
+    /*
+     * synchronized (this) {

Review comment:
       Just delete this old code?  You are replacing it with a more concurrent version, yay!

##########
File path: lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/ReutersContentSource.java
##########
@@ -146,4 +172,11 @@ public synchronized void resetInputs() throws IOException {
     nextFile = 0;
     iteration = 0;
   }
+
+  private synchronized void createThreadIndex() {
+    if (!threadIndexCreated) {

Review comment:
       `== false` instead?  Or maybe change to `assert threadIndexCreated == false` since you also check this up above with a real `if` already?

##########
File path: lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/ReutersContentSource.java
##########
@@ -102,19 +104,43 @@ public void close() throws IOException {
   public DocData getNextDocData(DocData docData) throws NoMoreDataException, IOException {
     Path f = null;
     String name = null;
-    synchronized (this) {
-      if (nextFile >= inputFiles.size()) {
-        // exhausted files, start a new round, unless forever set to false.
-        if (!forever) {
-          throw new NoMoreDataException();
-        }
-        nextFile = 0;
-        iteration++;
-      }
-      f = inputFiles.get(nextFile++);
-      name = f.toRealPath() + "_" + iteration;
+    int inputFilesSize = inputFiles.size();
+
+    /*
+     * synchronized (this) {
+     * if (nextFile >= inputFiles.size()) { // exhausted files, start a new round, unless forever set to false.
+     * if (!forever) {
+     *    throw new NoMoreDataException();
+     * }
+     * nextFile = 0;
+     * iteration++;
+     * }
+     * f = inputFiles.get(nextFile++);
+     * name = f.toRealPath() + "_" +iteration;
+     * }
+     */
+    if (!threadIndexCreated) {
+      createThreadIndex();
+    }
+
+    int index = (int) Thread.currentThread().getId() % threadIndex.length;
+    int fIndex = index + threadIndex[index] * threadIndex.length;
+    threadIndex[index]++;

Review comment:
       I'm confused how this approach ensures that we will indeed index every document in the `inputFiles`?
   
   `Thread.currentThread().getId() % threadIndex.length` is not guaranteed to reach every possible int from `0 .. threadIndex.length`?
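
To make the concern concrete, here is a small sketch using made-up thread ids rather than real threads (the JVM does not promise any particular id sequence, so the values are assumptions chosen for illustration). The class `ThreadIdSlots` and method `slotsReached` are hypothetical names.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration only.
final class ThreadIdSlots {

  /** Returns the set of slots that the given thread ids map to under id % numSlots. */
  static Set<Integer> slotsReached(long[] threadIds, int numSlots) {
    Set<Integer> reached = new TreeSet<>();
    for (long id : threadIds) {
      reached.add((int) (id % numSlots));
    }
    return reached;
  }

  public static void main(String[] args) {
    // Consecutive ids happen to cover every slot.
    System.out.println(slotsReached(new long[] {12, 13, 14, 15}, 4));
    // Ids with gaps (other threads were created in between) leave slots unused.
    System.out.println(slotsReached(new long[] {12, 14, 16, 18}, 4));
  }
}
```

In the second case only slots 0 and 2 are reached, so under a strided scheme the files belonging to slots 1 and 3 would never be read, and two threads would share each of the remaining slots.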





[GitHub] [lucene-solr] balmukundblr commented on a change in pull request #2345: Benchmark custom

Posted by GitBox <gi...@apache.org>.

balmukundblr commented on a change in pull request #2345:
URL: https://github.com/apache/lucene-solr/pull/2345#discussion_r587707919



##########
File path: lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/ReutersContentSource.java
##########
@@ -146,4 +172,11 @@ public synchronized void resetInputs() throws IOException {
     nextFile = 0;
     iteration = 0;
   }
+
+  private synchronized void createThreadIndex() {
+    if (!threadIndexCreated) {

Review comment:
       Sure, will do the required changes.





[GitHub] [lucene-solr] balmukundblr commented on a change in pull request #2345: Benchmark custom

Posted by GitBox <gi...@apache.org>.

balmukundblr commented on a change in pull request #2345:
URL: https://github.com/apache/lucene-solr/pull/2345#discussion_r587707735



##########
File path: lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/ReutersContentSource.java
##########
@@ -102,19 +104,43 @@ public void close() throws IOException {
   public DocData getNextDocData(DocData docData) throws NoMoreDataException, IOException {
     Path f = null;
     String name = null;
-    synchronized (this) {
-      if (nextFile >= inputFiles.size()) {
-        // exhausted files, start a new round, unless forever set to false.
-        if (!forever) {
-          throw new NoMoreDataException();
-        }
-        nextFile = 0;
-        iteration++;
-      }
-      f = inputFiles.get(nextFile++);
-      name = f.toRealPath() + "_" + iteration;
+    int inputFilesSize = inputFiles.size();
+
+    /*
+     * synchronized (this) {

Review comment:
       Sure, will delete the commented-out code.



