You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/06/04 14:06:45 UTC

[GitHub] [lucene] mikemccand commented on a change in pull request #132: Parallel processing

mikemccand commented on a change in pull request #132:
URL: https://github.com/apache/lucene/pull/132#discussion_r645598176



##########
File path: lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/TaskSequence.java
##########
@@ -340,12 +340,17 @@ private int doParallelTasks() throws Exception {
 
     initTasksArray();
     ParallelTask t[] = runningParallelTasks = new ParallelTask[repetitions * tasks.size()];
+    //Get number of parallel threads from algo file and set it to use in ReuersContentSource.java's docCountArrInit()
+    this.getRunData().getConfig().setNumThreads(t.length);
     // prepare threads
     int index = 0;
     for (int k = 0; k < repetitions; k++) {
       for (int i = 0; i < tasksArray.length; i++) {
         final PerfTask task = tasksArray[i].clone();
-        t[index++] = new ParallelTask(task);
+        t[index] = new ParallelTask(task);
+        //Setting unique ThreadName with index value which is used in ReuersContentSource.java's getNextDocData()

Review comment:
       Can you strengthen the comment to state that we should NOT change this thread name, unless we also fix the String -> int parsing logic in `ReutersContentSource`?
   
   Actually, could we factor out this string part of the thread name into a `static final String` constant, e.g.`static final String PARALLEL_TASK_THREAD_NAME_PREFIX = "ParallelTaskThread";`, and reference that constant from both places?

##########
File path: lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/TaskSequence.java
##########
@@ -340,12 +340,17 @@ private int doParallelTasks() throws Exception {
 
     initTasksArray();
     ParallelTask t[] = runningParallelTasks = new ParallelTask[repetitions * tasks.size()];
+    //Get number of parallel threads from algo file and set it to use in ReuersContentSource.java's docCountArrInit()
+    this.getRunData().getConfig().setNumThreads(t.length);
     // prepare threads
     int index = 0;
     for (int k = 0; k < repetitions; k++) {
       for (int i = 0; i < tasksArray.length; i++) {
         final PerfTask task = tasksArray[i].clone();
-        t[index++] = new ParallelTask(task);
+        t[index] = new ParallelTask(task);
+        //Setting unique ThreadName with index value which is used in ReuersContentSource.java's getNextDocData()
+        t[index].setName("IndexThread-" + index);

Review comment:
       In general, parallel tasks might be running queries too right?   Maybe we should pick a more generic name?  Maybe `ParallelTaskThread-N`?

##########
File path: lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/ReutersContentSource.java
##########
@@ -100,21 +100,24 @@ public void close() throws IOException {
 
   @Override
   public DocData getNextDocData(DocData docData) throws NoMoreDataException, IOException {
-    Path f = null;
-    String name = null;
-    synchronized (this) {
-      if (nextFile >= inputFiles.size()) {
-        // exhausted files, start a new round, unless forever set to false.
-        if (!forever) {
-          throw new NoMoreDataException();
-        }
-        nextFile = 0;
-        iteration++;
-      }
-      f = inputFiles.get(nextFile++);
-      name = f.toRealPath() + "_" + iteration;
+    if (docCountArrCreated == false) {
+      docCountArrInit();
     }
 
+    //Extract ThreadIndex from unique ThreadName (at position 12), which is set with '"IndexThread-"+index', in TaskSequence.java's doParallelTasks()
+    int threadIndex = Integer.parseInt(Thread.currentThread().getName().substring(12));
+    assert (threadIndex >= 0 && threadIndex < docCountArr.length):"Please check threadIndex or docCountArr length";
+    int stride = threadIndex + docCountArr[threadIndex] * docCountArr.length;
+    int inFileSize = inputFiles.size();
+
+    //Modulo Operator covers all three possible senarios i.e. 1. If inputFiles.size() < Num Of Threads 2.inputFiles.size() == Num Of Threads 3.inputFiles.size() > Num Of Threads
+    int fileIndex = stride % inFileSize;

Review comment:
       Hmm do we already guard for the (degenerate) case of `inFileSize == 0`?  If not can we add some protection here, e.g. maybe throw a clear exception that there is nothing to index?

##########
File path: lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/ReutersContentSource.java
##########
@@ -100,21 +100,24 @@ public void close() throws IOException {
 
   @Override
   public DocData getNextDocData(DocData docData) throws NoMoreDataException, IOException {
-    Path f = null;
-    String name = null;
-    synchronized (this) {
-      if (nextFile >= inputFiles.size()) {
-        // exhausted files, start a new round, unless forever set to false.
-        if (!forever) {
-          throw new NoMoreDataException();
-        }
-        nextFile = 0;
-        iteration++;
-      }
-      f = inputFiles.get(nextFile++);
-      name = f.toRealPath() + "_" + iteration;
+    if (docCountArrCreated == false) {
+      docCountArrInit();
     }
 
+    //Extract ThreadIndex from unique ThreadName (at position 12), which is set with '"IndexThread-"+index', in TaskSequence.java's doParallelTasks()
+    int threadIndex = Integer.parseInt(Thread.currentThread().getName().substring(12));
+    assert (threadIndex >= 0 && threadIndex < docCountArr.length):"Please check threadIndex or docCountArr length";
+    int stride = threadIndex + docCountArr[threadIndex] * docCountArr.length;
+    int inFileSize = inputFiles.size();
+
+    //Modulo Operator covers all three possible senarios i.e. 1. If inputFiles.size() < Num Of Threads 2.inputFiles.size() == Num Of Threads 3.inputFiles.size() > Num Of Threads
+    int fileIndex = stride % inFileSize;
+    int iteration = stride / inFileSize;

Review comment:
       Thank you for improving this logic -- much easier to understand now!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org