Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2022/02/22 05:32:27 UTC

[GitHub] [hive] rbalamohan commented on a change in pull request #3037: HIVE-25958: Optimise BasicStatsNoJobTask.

rbalamohan commented on a change in pull request #3037:
URL: https://github.com/apache/hive/pull/3037#discussion_r811588175



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java
##########
@@ -223,6 +227,16 @@ public void run() {
         } else {
           fileList = HiveStatsUtils.getFileStatusRecurse(dir, -1, fs);
         }
+        ThreadPoolExecutor tpE = null;
+        ArrayList<Future<FileStats>> futures = null;

Review comment:
       Declare this as a List instead of an ArrayList?
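       A minimal sketch of that declaration, assuming the concrete ArrayList is still created in the branch below (names follow the diff above):

           // program to the interface in the declaration ...
           List<Future<FileStats>> futures = null;
           // ... and keep the concrete type at the construction site
           futures = new ArrayList<>(fileList.size());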

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java
##########
@@ -223,6 +227,16 @@ public void run() {
         } else {
           fileList = HiveStatsUtils.getFileStatusRecurse(dir, -1, fs);
         }
+        ThreadPoolExecutor tpE = null;
+        ArrayList<Future<FileStats>> futures = null;
+        int numThreads = HiveConf.getIntVar(jc, HiveConf.ConfVars.BASICSTATSTASKSMAXTHREADS);
+        if (fileList.size() > 1 && numThreads > 1) {
+          numThreads = Math.max(fileList.size(), numThreads);

Review comment:
       Limit this to 2 * available processors instead of fileList.size()? If the file listing has 1k entries, this shouldn't spin up 1k threads.
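       A rough sketch of that cap, reusing the names from the diff above (the 2x factor is only the suggestion here, not a settled constant):

           // never exceed the configured value, the number of files, or 2x the cores
           int maxThreads = 2 * Runtime.getRuntime().availableProcessors();
           numThreads = Math.min(numThreads, Math.min(fileList.size(), maxThreads));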

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java
##########
@@ -232,28 +246,33 @@ public void run() {
             if (file.getLen() == 0) {
               numFiles += 1;
             } else {
-              org.apache.hadoop.mapred.RecordReader<?, ?> recordReader = inputFormat.getRecordReader(dummySplit, jc, Reporter.NULL);
-              try {
-                if (recordReader instanceof StatsProvidingRecordReader) {
-                  StatsProvidingRecordReader statsRR;
-                  statsRR = (StatsProvidingRecordReader) recordReader;
-                  rawDataSize += statsRR.getStats().getRawDataSize();
-                  numRows += statsRR.getStats().getRowCount();
-                  fileSize += file.getLen();
-                  numFiles += 1;
-                  if (file.isErasureCoded()) {
-                    numErasureCodedFiles++;
-                  }
-                } else {
-                  throw new HiveException(String.format("Unexpected file found during reading footers for: %s ", file));
-                }
-              } finally {
-                recordReader.close();
+              FileStatProcessor fsp = new FileStatProcessor(file, inputFormat, dummySplit, jc);
+              if (tpE != null) {
+                futures.add(tpE.submit(fsp));

Review comment:
       Add exception handling? (e.g. cancel/kill the other submitted tasks when any one of them fails with an exception.)
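       One possible shape for this, as a rough sketch against the names in the diff (futures, tpE, FileStats); the aggregation of the returned stats is elided:

           try {
             for (Future<FileStats> future : futures) {
               // future.get() rethrows any task failure as ExecutionException
               FileStats stats = future.get();
               // ... aggregate stats into numRows / rawDataSize / fileSize ...
             }
           } catch (InterruptedException | ExecutionException e) {
             // best effort: stop whatever is still queued or running
             for (Future<FileStats> future : futures) {
               future.cancel(true);
             }
             tpE.shutdownNow();
             // surface the failure the same way the sequential path does
             throw new HiveException("Failed to collect file stats", e);
           }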




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


