You are viewing a plain text version of this content. The canonical link for it is here.
Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2020/03/12 06:14:31 UTC

[GitHub] [hive] pkumarsinha opened a new pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

pkumarsinha opened a new pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] aasha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
aasha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r393401308
 
 

 ##########
 File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestTableLevelReplicationScenarios.java
 ##########
 @@ -918,7 +922,9 @@ public void testRenameTableScenariosWithReplaceExternalTable() throws Throwable
     String newPolicy = primaryDbName + ".'(in[0-9]+)|(out1500)|(in2)'";
     dumpWithClause = Arrays.asList(
             "'" + HiveConf.ConfVars.REPL_INCLUDE_EXTERNAL_TABLES.varname + "'='true'",
-            "'" + HiveConf.ConfVars.REPL_BOOTSTRAP_EXTERNAL_TABLES.varname + "'='false'"
+            "'" + HiveConf.ConfVars.REPL_BOOTSTRAP_EXTERNAL_TABLES.varname + "'='false'",
 
 Review comment:
   Can be added to a separate method

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] aasha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
aasha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r394297682
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##########
 @@ -662,17 +690,48 @@ void dumpTable(String dbName, String tblName, String validTxnList, Path dbRoot,
     replLogger.tableLog(tblName, tableSpec.tableHandle.getTableType());
     if (tableSpec.tableHandle.getTableType().equals(TableType.EXTERNAL_TABLE)
             || Utils.shouldDumpMetaDataOnly(conf)) {
-      return;
+      return Collections.EMPTY_LIST;
+    }
+    return replPathMappings;
+  }
+
+  private void intitiateDataCopyTasks() {
+    Iterator<ExternalTableCopyTaskBuilder.DirCopyWork> extCopyWorkItr = work.getDirCopyIterator();
+    ReplOperationCompleteAckWork replDumpCompleteAckWork = new ReplOperationCompleteAckWork(dumpAckFile);
+    Task<ReplOperationCompleteAckWork> dumpCompleteAckWorkTask = TaskFactory.get(replDumpCompleteAckWork, conf);
+    List<Task<?>> childTasks = new ArrayList<>();
+    int maxTasks = conf.getIntVar(HiveConf.ConfVars.REPL_APPROX_MAX_LOAD_TASKS);
+    TaskTracker taskTracker = new TaskTracker(maxTasks);
+    while (taskTracker.canAddMoreTasks() && hasMoreCopyWork()) {
+      if (work.replPathIteratorInitialized() && extCopyWorkItr.hasNext()) {
+        childTasks.addAll(new ExternalTableCopyTaskBuilder(work, conf).tasks(taskTracker));
+      } else {
+        childTasks.addAll(ReplPathMapping.tasks(work, taskTracker, conf));
+      }
     }
-    for (ReplPathMapping replPathMapping: replPathMappings) {
-      Task<?> copyTask = ReplCopyTask.getLoadCopyTask(
-              tuple.replicationSpec, replPathMapping.getSrcPath(), replPathMapping.getTargetPath(), conf, false);
-      this.addDependentTask(copyTask);
-      LOG.info("Scheduled a repl copy task from [{}] to [{}]",
-              replPathMapping.getSrcPath(), replPathMapping.getTargetPath());
+    if (!childTasks.isEmpty()) {
+      boolean ackTaskAdded = false;
+      if (taskTracker.canAddMoreTasks()) {
+        childTasks.add(dumpCompleteAckWorkTask);
+        ackTaskAdded = true;
+      }
+      if (hasMoreCopyWork() || !ackTaskAdded) {
+        DAGTraversal.traverse(childTasks, new AddDependencyToLeaves(TaskFactory.get(work, conf)));
 
 Review comment:
   Add the ack only after no more work is pending

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] anishek commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
anishek commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r395570710
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##########
 @@ -78,18 +81,21 @@
 import java.io.IOException;
 import java.io.Serializable;
 import java.nio.charset.StandardCharsets;
+import java.util.Iterator;
 import java.util.Set;
 import java.util.HashSet;
 import java.util.List;
 import java.util.Arrays;
 import java.util.Collections;
 import java.util.Base64;
-import java.util.ArrayList;
+import java.util.LinkedList;
 import java.util.UUID;
+import java.util.ArrayList;
 import java.util.concurrent.TimeUnit;
 import static org.apache.hadoop.hive.ql.exec.repl.ReplExternalTables.Writer;
 
 public class ReplDumpTask extends Task<ReplDumpWork> implements Serializable {
+  private static final long serialVersionUID = 1L;
 
 Review comment:
   tablesForBootstrap can be private 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r393506687
 
 

 ##########
 File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestTableLevelReplicationScenarios.java
 ##########
 @@ -918,7 +922,9 @@ public void testRenameTableScenariosWithReplaceExternalTable() throws Throwable
     String newPolicy = primaryDbName + ".'(in[0-9]+)|(out1500)|(in2)'";
     dumpWithClause = Arrays.asList(
             "'" + HiveConf.ConfVars.REPL_INCLUDE_EXTERNAL_TABLES.varname + "'='true'",
-            "'" + HiveConf.ConfVars.REPL_BOOTSTRAP_EXTERNAL_TABLES.varname + "'='false'"
+            "'" + HiveConf.ConfVars.REPL_BOOTSTRAP_EXTERNAL_TABLES.varname + "'='false'",
 
 Review comment:
   Done

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r393723340
 
 

 ##########
 File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationScenariosExternalTables.java
 ##########
 @@ -436,16 +438,29 @@ public void externalTableIncrementalReplication() throws Throwable {
     }
 
     List<String> loadWithClause = externalTableBasePathWithClause();
-    replica.load(replicatedDbName, primaryDbName, loadWithClause)
+    replica.load(replicatedDbName, primaryDbName, withClause)
         .run("use " + replicatedDbName)
         .run("show tables like 't1'")
         .verifyResult("t1")
         .run("show partitions t1")
         .verifyResults(new String[] { "country=india", "country=us" })
         .run("select place from t1 order by place")
-        .verifyResults(new String[] { "bangalore", "mumbai", "pune" })
+        .verifyResults(new String[] {})
 
 Review comment:
   The copy was happening in repl load, change was prior to that.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r395953009
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##########
 @@ -78,18 +81,21 @@
 import java.io.IOException;
 import java.io.Serializable;
 import java.nio.charset.StandardCharsets;
+import java.util.Iterator;
 import java.util.Set;
 import java.util.HashSet;
 import java.util.List;
 import java.util.Arrays;
 import java.util.Collections;
 import java.util.Base64;
-import java.util.ArrayList;
+import java.util.LinkedList;
 import java.util.UUID;
+import java.util.ArrayList;
 import java.util.concurrent.TimeUnit;
 import static org.apache.hadoop.hive.ql.exec.repl.ReplExternalTables.Writer;
 
 public class ReplDumpTask extends Task<ReplDumpWork> implements Serializable {
+  private static final long serialVersionUID = 1L;
 
 Review comment:
   done

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] anishek commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
anishek commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r395573595
 
 

 ##########
 File path: ql/src/test/org/apache/hadoop/hive/ql/exec/repl/TestReplDumpTask.java
 ##########
 @@ -127,13 +128,15 @@ public void removeDBPropertyToPreventRenameWhenBootstrapDumpOfTableFails() throw
       private int tableDumpCount = 0;
 
       @Override
-      void dumpTable(String dbName, String tblName, String validTxnList, Path dbRoot, Path replDataDir,
-                     long lastReplId, Hive hiveDb, HiveWrapper.Tuple<Table> tuple)
+      List<EximUtil.ReplPathMapping> dumpTable(String dbName, String tblName, String validTxnList, Path dbRoot,
+                                               Path replDataDir, long lastReplId, Hive hiveDb,
+                                               HiveWrapper.Tuple<Table> tuple)
           throws Exception {
         tableDumpCount++;
         if (tableDumpCount > 1) {
           throw new TestException();
         }
+        return Collections.EMPTY_LIST;
 
 Review comment:
   better to use Collections.emptyList() here and other places, the generic cast would  be provided by this.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r395952920
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##########
 @@ -122,33 +128,38 @@ public String getName() {
   @Override
   public int execute() {
     try {
-      Hive hiveDb = getHive();
-      Path dumpRoot = new Path(conf.getVar(HiveConf.ConfVars.REPLDIR),
-              Base64.getEncoder().encodeToString(work.dbNameOrPattern.toLowerCase()
-                      .getBytes(StandardCharsets.UTF_8.name())));
-      Path previousDumpMetaPath = getPreviousDumpMetadataPath(dumpRoot);
-      Path previousHiveDumpPath =
-              previousDumpMetaPath != null ? new Path(previousDumpMetaPath, ReplUtils.REPL_HIVE_BASE_DIR) : null;
-      //If no previous dump is present or previous dump was loaded, proceed
-      if (shouldDump(previousHiveDumpPath)) {
-        Path currentDumpPath = new Path(dumpRoot, getNextDumpDir());
-        Path hiveDumpRoot = new Path(currentDumpPath, ReplUtils.REPL_HIVE_BASE_DIR);
-        DumpMetaData dmd = new DumpMetaData(hiveDumpRoot, conf);
-        // Initialize ReplChangeManager instance since we will require it to encode file URI.
-        ReplChangeManager.getInstance(conf);
-        Path cmRoot = new Path(conf.getVar(HiveConf.ConfVars.REPLCMDIR));
-        Long lastReplId;
-        if (previousHiveDumpPath == null) {
-          lastReplId = bootStrapDump(hiveDumpRoot, dmd, cmRoot, hiveDb);
+      if (work.dirCopyIteratorInitialized() || work.replPathIteratorInitialized()) {
+        intitiateDataCopyTasks();
+      } else {
+        Hive hiveDb = getHive();
+        Path dumpRoot = new Path(conf.getVar(HiveConf.ConfVars.REPLDIR),
+                Base64.getEncoder().encodeToString(work.dbNameOrPattern.toLowerCase()
+                        .getBytes(StandardCharsets.UTF_8.name())));
+        Path previousDumpMetaPath = getPreviousDumpMetadataPath(dumpRoot);
+        Path previousHiveDumpPath =
+                previousDumpMetaPath != null ? new Path(previousDumpMetaPath, ReplUtils.REPL_HIVE_BASE_DIR) : null;
+        //If no previous dump is present or previous dump was loaded, proceed
+        if (shouldDump(previousHiveDumpPath)) {
+          Path currentDumpPath = new Path(dumpRoot, getNextDumpDir());
+          Path hiveDumpRoot = new Path(currentDumpPath, ReplUtils.REPL_HIVE_BASE_DIR);
+          DumpMetaData dmd = new DumpMetaData(hiveDumpRoot, conf);
+          // Initialize ReplChangeManager instance since we will require it to encode file URI.
+          ReplChangeManager.getInstance(conf);
+          Path cmRoot = new Path(conf.getVar(HiveConf.ConfVars.REPLCMDIR));
+          Long lastReplId;
+          if (previousHiveDumpPath == null) {
+            lastReplId = bootStrapDump(hiveDumpRoot, dmd, cmRoot, hiveDb);
+          } else {
+            work.setEventFrom(getEventFromPreviousDumpMetadata(previousHiveDumpPath));
+            lastReplId = incrementalDump(hiveDumpRoot, dmd, cmRoot, hiveDb);
+          }
+          work.setResultValues(Arrays.asList(currentDumpPath.toUri().toString(), String.valueOf(lastReplId)));
+          deleteAllPreviousDumpMeta(dumpRoot, currentDumpPath);
+          work.setDumpAckFile(new Path(hiveDumpRoot, ReplUtils.DUMP_ACKNOWLEDGEMENT));
+          intitiateDataCopyTasks();
         } else {
-          work.setEventFrom(getEventFromPreviousDumpMetadata(previousHiveDumpPath));
-          lastReplId = incrementalDump(hiveDumpRoot, dmd, cmRoot, hiveDb);
+          LOG.warn("Previous Dump is not yet loaded");
         }
-        prepareReturnValues(Arrays.asList(currentDumpPath.toUri().toString(), String.valueOf(lastReplId)));
-        writeDumpCompleteAck(hiveDumpRoot);
-        deleteAllPreviousDumpMeta(dumpRoot, currentDumpPath);
-      } else {
-        LOG.warn("Previous Dump is not yet loaded");
 
 Review comment:
   Done

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] aasha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
aasha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r393660921
 
 

 ##########
 File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationScenariosExternalTables.java
 ##########
 @@ -713,9 +732,11 @@ public void testExternalTableDataPath() throws Exception {
 
   @Test
   public void testExternalTablesIncReplicationWithConcurrentDropTable() throws Throwable {
-    List<String> dumpWithClause = Collections.singletonList(
-            "'" + HiveConf.ConfVars.REPL_INCLUDE_EXTERNAL_TABLES.varname + "'='true'"
-    );
+    List<String> dumpWithClause = Arrays.asList(
 
 Review comment:
   The new method is still not used here.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r393723769
 
 

 ##########
 File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestTableLevelReplicationScenarios.java
 ##########
 @@ -918,7 +922,9 @@ public void testRenameTableScenariosWithReplaceExternalTable() throws Throwable
     String newPolicy = primaryDbName + ".'(in[0-9]+)|(out1500)|(in2)'";
     dumpWithClause = Arrays.asList(
             "'" + HiveConf.ConfVars.REPL_INCLUDE_EXTERNAL_TABLES.varname + "'='true'",
-            "'" + HiveConf.ConfVars.REPL_BOOTSTRAP_EXTERNAL_TABLES.varname + "'='false'"
+            "'" + HiveConf.ConfVars.REPL_BOOTSTRAP_EXTERNAL_TABLES.varname + "'='false'",
 
 Review comment:
   Yes, I am not planning to refactor all the places.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r394307963
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##########
 @@ -78,22 +81,26 @@
 import java.io.IOException;
 import java.io.Serializable;
 import java.nio.charset.StandardCharsets;
+import java.util.Iterator;
 import java.util.Set;
 import java.util.HashSet;
 import java.util.List;
 import java.util.Arrays;
 import java.util.Collections;
 import java.util.Base64;
-import java.util.ArrayList;
+import java.util.LinkedList;
 import java.util.UUID;
+import java.util.ArrayList;
 import java.util.concurrent.TimeUnit;
 import static org.apache.hadoop.hive.ql.exec.repl.ReplExternalTables.Writer;
 
 public class ReplDumpTask extends Task<ReplDumpWork> implements Serializable {
+  private static final long serialVersionUID = 1L;
   private static final String dumpSchema = "dump_dir,last_repl_id#string,string";
   private static final String FUNCTION_METADATA_FILE_NAME = EximUtil.METADATA_NAME;
   private static final long SLEEP_TIME = 60000;
   Set<String> tablesForBootstrap = new HashSet<>();
+  private Path dumpAckFile;
 
 Review comment:
   We don't store the dumpath anywhere. Across the multiple call this info will be lost. I guess, ReplDumpWork is a better place for this info.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] aasha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
aasha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r393402297
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##########
 @@ -122,25 +127,31 @@ public String getName() {
   @Override
   public int execute() {
     try {
-      Hive hiveDb = getHive();
-      Path dumpRoot = new Path(conf.getVar(HiveConf.ConfVars.REPLDIR),
-              Base64.getEncoder().encodeToString(work.dbNameOrPattern.toLowerCase()
-                      .getBytes(StandardCharsets.UTF_8.name())));
-      Path currentDumpPath = new Path(dumpRoot, getNextDumpDir());
-      Path hiveDumpRoot = new Path(currentDumpPath, ReplUtils.REPL_HIVE_BASE_DIR);
-      DumpMetaData dmd = new DumpMetaData(hiveDumpRoot, conf);
-      // Initialize ReplChangeManager instance since we will require it to encode file URI.
-      ReplChangeManager.getInstance(conf);
-      Path cmRoot = new Path(conf.getVar(HiveConf.ConfVars.REPLCMDIR));
-      Long lastReplId;
-      if (!dumpRoot.getFileSystem(conf).exists(dumpRoot)
-              || dumpRoot.getFileSystem(conf).listStatus(dumpRoot).length == 0) {
-        lastReplId = bootStrapDump(hiveDumpRoot, dmd, cmRoot, hiveDb);
+      //First Check if external table copy work has been initialized, if so, just do that and return.
 
 Review comment:
   Please add more comments here

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r393436614
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##########
 @@ -367,13 +379,14 @@ private Long incrementalDump(Path dumpRoot, DumpMetaData dmd, Path cmRoot, Hive
             // Dump external table locations if required.
             if (TableType.EXTERNAL_TABLE.equals(table.getTableType())
                   && shouldDumpExternalTableLocation()) {
-              writer.dataLocationDump(table);
+              extTableLocations.addAll(writer.dataLocationDump(table));
             }
 
             // Dump the table to be bootstrapped if required.
             if (shouldBootstrapDumpTable(table)) {
               HiveWrapper.Tuple<Table> tableTuple = new HiveWrapper(hiveDb, dbName).table(table);
-              dumpTable(dbName, tableName, validTxnList, dbRoot, dumpRoot, bootDumpBeginReplId, hiveDb, tableTuple);
+              replPathMappings.addAll(dumpTable(dbName, tableName, validTxnList, dbRoot, dumpRoot, bootDumpBeginReplId,
 
 Review comment:
   That's the existing issue with the way current iterators are being used and the patch doesn't address that. May be later, we may want to solve that in general. May be by using some disk based iterators. Will file a tracking JIRA for that.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r393463008
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##########
 @@ -122,25 +127,31 @@ public String getName() {
   @Override
   public int execute() {
     try {
-      Hive hiveDb = getHive();
-      Path dumpRoot = new Path(conf.getVar(HiveConf.ConfVars.REPLDIR),
-              Base64.getEncoder().encodeToString(work.dbNameOrPattern.toLowerCase()
-                      .getBytes(StandardCharsets.UTF_8.name())));
-      Path currentDumpPath = new Path(dumpRoot, getNextDumpDir());
-      Path hiveDumpRoot = new Path(currentDumpPath, ReplUtils.REPL_HIVE_BASE_DIR);
-      DumpMetaData dmd = new DumpMetaData(hiveDumpRoot, conf);
-      // Initialize ReplChangeManager instance since we will require it to encode file URI.
-      ReplChangeManager.getInstance(conf);
-      Path cmRoot = new Path(conf.getVar(HiveConf.ConfVars.REPLCMDIR));
-      Long lastReplId;
-      if (!dumpRoot.getFileSystem(conf).exists(dumpRoot)
-              || dumpRoot.getFileSystem(conf).listStatus(dumpRoot).length == 0) {
-        lastReplId = bootStrapDump(hiveDumpRoot, dmd, cmRoot, hiveDb);
+      //First Check if external table copy work has been initialized, if so, just do that and return.
+      if (work.dirCopyIteratorInitialized() || work.replPathIteratorInitialized()) {
+        intitiateDataCopyTasks();
 
 Review comment:
   We would like to have a control on how many tasks in parallel can run including external table copy. That was the existing behavior even in case of external table copy.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] aasha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
aasha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r393463746
 
 

 ##########
 File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationOnHDFSEncryptedZones.java
 ##########
 @@ -102,12 +104,20 @@ public void targetAndSourceHaveDifferentEncryptionZoneKeys() throws Throwable {
           put(HiveConf.ConfVars.REPLDIR.varname, primary.repldDir);
         }}, "test_key123");
 
+    List<String> dumpWithClause = Arrays.asList(
+            "'hive.repl.add.raw.reserved.namespace'='true'",
+            "'" + HiveConf.ConfVars.REPL_EXTERNAL_TABLE_BASE_DIR.varname + "'='"
+                    + replica.externalTableWarehouseRoot + "'",
+            "'distcp.options.skipcrccheck'=''",
+            "'" + HiveConf.ConfVars.HIVE_SERVER2_ENABLE_DOAS.varname + "'='false'",
+            "'" + HiveConf.ConfVars.HIVE_DISTCP_DOAS_USER.varname + "'='"
+                    + UserGroupInformation.getCurrentUser().getUserName() +"'");
 
 Review comment:
   Please add a test with scheduler as well

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r395952861
 
 

 ##########
 File path: ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/StageType.java
 ##########
 @@ -30,7 +30,7 @@
   REPL_TXN(15),
   REPL_INCREMENTAL_LOAD(16),
   SCHEDULED_QUERY_MAINT(17),
-  REPL_LOAD_COMPLETE_ACK(18);
+  REPL_OPERATION_COMPLETE_ACK(18);
 
   private final int value;
 
 
 Review comment:
   Removed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r395952745
 
 

 ##########
 File path: ql/src/test/org/apache/hadoop/hive/ql/exec/repl/TestReplDumpTask.java
 ##########
 @@ -127,13 +128,15 @@ public void removeDBPropertyToPreventRenameWhenBootstrapDumpOfTableFails() throw
       private int tableDumpCount = 0;
 
       @Override
-      void dumpTable(String dbName, String tblName, String validTxnList, Path dbRoot, Path replDataDir,
-                     long lastReplId, Hive hiveDb, HiveWrapper.Tuple<Table> tuple)
+      List<EximUtil.ReplPathMapping> dumpTable(String dbName, String tblName, String validTxnList, Path dbRoot,
+                                               Path replDataDir, long lastReplId, Hive hiveDb,
+                                               HiveWrapper.Tuple<Table> tuple)
           throws Exception {
         tableDumpCount++;
         if (tableDumpCount > 1) {
           throw new TestException();
         }
+        return Collections.EMPTY_LIST;
 
 Review comment:
   done

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] aasha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
aasha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r393399236
 
 

 ##########
 File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationOnHDFSEncryptedZones.java
 ##########
 @@ -102,12 +104,20 @@ public void targetAndSourceHaveDifferentEncryptionZoneKeys() throws Throwable {
           put(HiveConf.ConfVars.REPLDIR.varname, primary.repldDir);
         }}, "test_key123");
 
+    List<String> dumpWithClause = Arrays.asList(
 
 Review comment:
   Why is this needed?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r393762700
 
 

 ##########
 File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationScenariosExternalTables.java
 ##########
 @@ -436,16 +438,29 @@ public void externalTableIncrementalReplication() throws Throwable {
     }
 
     List<String> loadWithClause = externalTableBasePathWithClause();
-    replica.load(replicatedDbName, primaryDbName, loadWithClause)
+    replica.load(replicatedDbName, primaryDbName, withClause)
         .run("use " + replicatedDbName)
         .run("show tables like 't1'")
         .verifyResult("t1")
         .run("show partitions t1")
         .verifyResults(new String[] { "country=india", "country=us" })
         .run("select place from t1 order by place")
-        .verifyResults(new String[] { "bangalore", "mumbai", "pune" })
+        .verifyResults(new String[] {})
 
 Review comment:
   Because copy is happening during load.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] anishek commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
anishek commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r395584502
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##########
 @@ -404,7 +417,8 @@ private Long incrementalDump(Path dumpRoot, DumpMetaData dmd, Path cmRoot, Hive
       }
 
       Path dbRoot = getBootstrapDbRoot(dumpRoot, dbName, true);
-
+      List<Path> extTableLocations = new LinkedList<>();
+      List<EximUtil.ReplPathMapping> replPathMappings = new ArrayList<>();
 
 Review comment:
   why is this repl path mapping ? isnt this managed table data only ?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] aasha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
aasha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r393658865
 
 

 ##########
 File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationOnHDFSEncryptedZones.java
 ##########
 @@ -102,12 +104,20 @@ public void targetAndSourceHaveDifferentEncryptionZoneKeys() throws Throwable {
           put(HiveConf.ConfVars.REPLDIR.varname, primary.repldDir);
         }}, "test_key123");
 
+    List<String> dumpWithClause = Arrays.asList(
+            "'hive.repl.add.raw.reserved.namespace'='true'",
+            "'" + HiveConf.ConfVars.REPL_EXTERNAL_TABLE_BASE_DIR.varname + "'='"
+                    + replica.externalTableWarehouseRoot + "'",
+            "'distcp.options.skipcrccheck'=''",
 
 Review comment:
   Why do we need these extra configs

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] maheshk114 commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
maheshk114 commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r392815969
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##########
 @@ -386,8 +399,10 @@ private Long incrementalDump(Path dumpRoot, DumpMetaData dmd, Path cmRoot, Hive
         }
       }
       dumpTableListToDumpLocation(tableList, dumpRoot, dbName, conf);
+      List<ExternalTableCopyTaskBuilder.DirCopyWork> extTableCopyWorks = dirLocationsToCopy(extTableLocations);
 
 Review comment:
   if extTableLocations is empty ..no need to create the task

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r394308202
 
 

 ##########
 File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestScheduledReplicationScenarios.java
 ##########
 @@ -151,6 +151,70 @@ public void testAcidTablesReplLoadBootstrapIncr() throws Throwable {
               .run("drop table t1");
 
 
+    } finally {
+      primary.run("drop scheduled query s1");
+      replica.run("drop scheduled query s2");
+    }
+  }
+
+  @Test
+  public void testExternalTablesReplLoadBootstrapIncr() throws Throwable {
+    // Bootstrap
+    primary.run("use " + primaryDbName)
+            .run("create external table t1 (id int)")
+            .run("insert into t1 values(1)")
+            .run("insert into t1 values(2)");
+    try (ScheduledQueryExecutionService schqS =
+                 ScheduledQueryExecutionService.startScheduledQueryExecutorService(primary.hiveConf)) {
+      int next = 0;
+      ReplDumpWork.injectNextDumpDirForTest(String.valueOf(next));
+      primary.run("create scheduled query s1 every 10 minutes as repl dump " + primaryDbName);
+      primary.run("alter scheduled query s1 execute");
+      Thread.sleep(20000);
 
 Review comment:
   Ok

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] aasha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
aasha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r394299142
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##########
 @@ -78,22 +81,26 @@
 import java.io.IOException;
 import java.io.Serializable;
 import java.nio.charset.StandardCharsets;
+import java.util.Iterator;
 import java.util.Set;
 import java.util.HashSet;
 import java.util.List;
 import java.util.Arrays;
 import java.util.Collections;
 import java.util.Base64;
-import java.util.ArrayList;
+import java.util.LinkedList;
 import java.util.UUID;
+import java.util.ArrayList;
 import java.util.concurrent.TimeUnit;
 import static org.apache.hadoop.hive.ql.exec.repl.ReplExternalTables.Writer;
 
 public class ReplDumpTask extends Task<ReplDumpWork> implements Serializable {
+  private static final long serialVersionUID = 1L;
   private static final String dumpSchema = "dump_dir,last_repl_id#string,string";
   private static final String FUNCTION_METADATA_FILE_NAME = EximUtil.METADATA_NAME;
   private static final long SLEEP_TIME = 60000;
   Set<String> tablesForBootstrap = new HashSet<>();
+  private Path dumpAckFile;
 
 Review comment:
   This needn't be a member variable

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r393436940
 
 

 ##########
 File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestTableLevelReplicationScenarios.java
 ##########
 @@ -918,7 +922,9 @@ public void testRenameTableScenariosWithReplaceExternalTable() throws Throwable
     String newPolicy = primaryDbName + ".'(in[0-9]+)|(out1500)|(in2)'";
     dumpWithClause = Arrays.asList(
             "'" + HiveConf.ConfVars.REPL_INCLUDE_EXTERNAL_TABLES.varname + "'='true'",
-            "'" + HiveConf.ConfVars.REPL_BOOTSTRAP_EXTERNAL_TABLES.varname + "'='false'"
+            "'" + HiveConf.ConfVars.REPL_BOOTSTRAP_EXTERNAL_TABLES.varname + "'='false'",
 
 Review comment:
   Will see if we can do that. Problem was that the parameters were not fixed and were being set on case-to-case basis.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r393506405
 
 

 ##########
 File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationOnHDFSEncryptedZones.java
 ##########
 @@ -102,12 +104,20 @@ public void targetAndSourceHaveDifferentEncryptionZoneKeys() throws Throwable {
           put(HiveConf.ConfVars.REPLDIR.varname, primary.repldDir);
         }}, "test_key123");
 
+    List<String> dumpWithClause = Arrays.asList(
+            "'hive.repl.add.raw.reserved.namespace'='true'",
+            "'" + HiveConf.ConfVars.REPL_EXTERNAL_TABLE_BASE_DIR.varname + "'='"
+                    + replica.externalTableWarehouseRoot + "'",
+            "'distcp.options.skipcrccheck'=''",
+            "'" + HiveConf.ConfVars.HIVE_SERVER2_ENABLE_DOAS.varname + "'='false'",
+            "'" + HiveConf.ConfVars.HIVE_DISTCP_DOAS_USER.varname + "'='"
+                    + UserGroupInformation.getCurrentUser().getUserName() +"'");
 
 Review comment:
   Done

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r393436153
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##########
 @@ -122,25 +127,31 @@ public String getName() {
   @Override
   public int execute() {
     try {
-      Hive hiveDb = getHive();
-      Path dumpRoot = new Path(conf.getVar(HiveConf.ConfVars.REPLDIR),
-              Base64.getEncoder().encodeToString(work.dbNameOrPattern.toLowerCase()
-                      .getBytes(StandardCharsets.UTF_8.name())));
-      Path currentDumpPath = new Path(dumpRoot, getNextDumpDir());
-      Path hiveDumpRoot = new Path(currentDumpPath, ReplUtils.REPL_HIVE_BASE_DIR);
-      DumpMetaData dmd = new DumpMetaData(hiveDumpRoot, conf);
-      // Initialize ReplChangeManager instance since we will require it to encode file URI.
-      ReplChangeManager.getInstance(conf);
-      Path cmRoot = new Path(conf.getVar(HiveConf.ConfVars.REPLCMDIR));
-      Long lastReplId;
-      if (!dumpRoot.getFileSystem(conf).exists(dumpRoot)
-              || dumpRoot.getFileSystem(conf).listStatus(dumpRoot).length == 0) {
-        lastReplId = bootStrapDump(hiveDumpRoot, dmd, cmRoot, hiveDb);
+      //First Check if external table copy work has been initialized, if so, just do that and return.
 
 Review comment:
   Will do

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] anishek commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
anishek commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r395582361
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##########
 @@ -122,33 +128,38 @@ public String getName() {
   @Override
   public int execute() {
     try {
-      Hive hiveDb = getHive();
-      Path dumpRoot = new Path(conf.getVar(HiveConf.ConfVars.REPLDIR),
-              Base64.getEncoder().encodeToString(work.dbNameOrPattern.toLowerCase()
-                      .getBytes(StandardCharsets.UTF_8.name())));
-      Path previousDumpMetaPath = getPreviousDumpMetadataPath(dumpRoot);
-      Path previousHiveDumpPath =
-              previousDumpMetaPath != null ? new Path(previousDumpMetaPath, ReplUtils.REPL_HIVE_BASE_DIR) : null;
-      //If no previous dump is present or previous dump was loaded, proceed
-      if (shouldDump(previousHiveDumpPath)) {
-        Path currentDumpPath = new Path(dumpRoot, getNextDumpDir());
-        Path hiveDumpRoot = new Path(currentDumpPath, ReplUtils.REPL_HIVE_BASE_DIR);
-        DumpMetaData dmd = new DumpMetaData(hiveDumpRoot, conf);
-        // Initialize ReplChangeManager instance since we will require it to encode file URI.
-        ReplChangeManager.getInstance(conf);
-        Path cmRoot = new Path(conf.getVar(HiveConf.ConfVars.REPLCMDIR));
-        Long lastReplId;
-        if (previousHiveDumpPath == null) {
-          lastReplId = bootStrapDump(hiveDumpRoot, dmd, cmRoot, hiveDb);
+      if (work.dirCopyIteratorInitialized() || work.replPathIteratorInitialized()) {
+        intitiateDataCopyTasks();
+      } else {
+        Hive hiveDb = getHive();
+        Path dumpRoot = new Path(conf.getVar(HiveConf.ConfVars.REPLDIR),
+                Base64.getEncoder().encodeToString(work.dbNameOrPattern.toLowerCase()
+                        .getBytes(StandardCharsets.UTF_8.name())));
+        Path previousDumpMetaPath = getPreviousDumpMetadataPath(dumpRoot);
+        Path previousHiveDumpPath =
+                previousDumpMetaPath != null ? new Path(previousDumpMetaPath, ReplUtils.REPL_HIVE_BASE_DIR) : null;
+        //If no previous dump is present or previous dump was loaded, proceed
+        if (shouldDump(previousHiveDumpPath)) {
+          Path currentDumpPath = new Path(dumpRoot, getNextDumpDir());
+          Path hiveDumpRoot = new Path(currentDumpPath, ReplUtils.REPL_HIVE_BASE_DIR);
+          DumpMetaData dmd = new DumpMetaData(hiveDumpRoot, conf);
+          // Initialize ReplChangeManager instance since we will require it to encode file URI.
+          ReplChangeManager.getInstance(conf);
+          Path cmRoot = new Path(conf.getVar(HiveConf.ConfVars.REPLCMDIR));
+          Long lastReplId;
+          if (previousHiveDumpPath == null) {
+            lastReplId = bootStrapDump(hiveDumpRoot, dmd, cmRoot, hiveDb);
+          } else {
+            work.setEventFrom(getEventFromPreviousDumpMetadata(previousHiveDumpPath));
+            lastReplId = incrementalDump(hiveDumpRoot, dmd, cmRoot, hiveDb);
+          }
+          work.setResultValues(Arrays.asList(currentDumpPath.toUri().toString(), String.valueOf(lastReplId)));
+          deleteAllPreviousDumpMeta(dumpRoot, currentDumpPath);
+          work.setDumpAckFile(new Path(hiveDumpRoot, ReplUtils.DUMP_ACKNOWLEDGEMENT));
+          intitiateDataCopyTasks();
         } else {
-          work.setEventFrom(getEventFromPreviousDumpMetadata(previousHiveDumpPath));
-          lastReplId = incrementalDump(hiveDumpRoot, dmd, cmRoot, hiveDb);
+          LOG.warn("Previous Dump is not yet loaded");
         }
-        prepareReturnValues(Arrays.asList(currentDumpPath.toUri().toString(), String.valueOf(lastReplId)));
-        writeDumpCompleteAck(hiveDumpRoot);
-        deleteAllPreviousDumpMeta(dumpRoot, currentDumpPath);
-      } else {
-        LOG.warn("Previous Dump is not yet loaded");
 
 Review comment:
   this is not warn its info 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] aasha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
aasha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r393660304
 
 

 ##########
 File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationScenariosExternalTables.java
 ##########
 @@ -436,16 +438,29 @@ public void externalTableIncrementalReplication() throws Throwable {
     }
 
     List<String> loadWithClause = externalTableBasePathWithClause();
-    replica.load(replicatedDbName, primaryDbName, loadWithClause)
+    replica.load(replicatedDbName, primaryDbName, withClause)
         .run("use " + replicatedDbName)
         .run("show tables like 't1'")
         .verifyResult("t1")
         .run("show partitions t1")
         .verifyResults(new String[] { "country=india", "country=us" })
         .run("select place from t1 order by place")
-        .verifyResults(new String[] { "bangalore", "mumbai", "pune" })
+        .verifyResults(new String[] {})
 
 Review comment:
   How was this getting loaded in the older scenario

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r394307271
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##########
 @@ -662,17 +690,48 @@ void dumpTable(String dbName, String tblName, String validTxnList, Path dbRoot,
     replLogger.tableLog(tblName, tableSpec.tableHandle.getTableType());
     if (tableSpec.tableHandle.getTableType().equals(TableType.EXTERNAL_TABLE)
             || Utils.shouldDumpMetaDataOnly(conf)) {
-      return;
+      return Collections.EMPTY_LIST;
+    }
+    return replPathMappings;
+  }
+
+  private void intitiateDataCopyTasks() {
+    Iterator<ExternalTableCopyTaskBuilder.DirCopyWork> extCopyWorkItr = work.getDirCopyIterator();
+    ReplOperationCompleteAckWork replDumpCompleteAckWork = new ReplOperationCompleteAckWork(dumpAckFile);
+    Task<ReplOperationCompleteAckWork> dumpCompleteAckWorkTask = TaskFactory.get(replDumpCompleteAckWork, conf);
+    List<Task<?>> childTasks = new ArrayList<>();
+    int maxTasks = conf.getIntVar(HiveConf.ConfVars.REPL_APPROX_MAX_LOAD_TASKS);
+    TaskTracker taskTracker = new TaskTracker(maxTasks);
+    while (taskTracker.canAddMoreTasks() && hasMoreCopyWork()) {
+      if (work.replPathIteratorInitialized() && extCopyWorkItr.hasNext()) {
+        childTasks.addAll(new ExternalTableCopyTaskBuilder(work, conf).tasks(taskTracker));
+      } else {
+        childTasks.addAll(ReplPathMapping.tasks(work, taskTracker, conf));
+      }
     }
-    for (ReplPathMapping replPathMapping: replPathMappings) {
-      Task<?> copyTask = ReplCopyTask.getLoadCopyTask(
-              tuple.replicationSpec, replPathMapping.getSrcPath(), replPathMapping.getTargetPath(), conf, false);
-      this.addDependentTask(copyTask);
-      LOG.info("Scheduled a repl copy task from [{}] to [{}]",
-              replPathMapping.getSrcPath(), replPathMapping.getTargetPath());
+    if (!childTasks.isEmpty()) {
+      boolean ackTaskAdded = false;
+      if (taskTracker.canAddMoreTasks()) {
+        childTasks.add(dumpCompleteAckWorkTask);
+        ackTaskAdded = true;
+      }
+      if (hasMoreCopyWork() || !ackTaskAdded) {
+        DAGTraversal.traverse(childTasks, new AddDependencyToLeaves(TaskFactory.get(work, conf)));
 
 Review comment:
   Ack is being added at the end in these tasks as well

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] aasha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
aasha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r394492450
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##########
 @@ -193,20 +193,29 @@ private Long getEventFromPreviousDumpMetadata(Path previousDumpPath) throws Sema
   }
 
   private Path getPreviousDumpMetadataPath(Path dumpRoot) throws IOException {
+    FileStatus latestUpdatedStatus = null;
     FileSystem fs = dumpRoot.getFileSystem(conf);
     if (fs.exists(dumpRoot)) {
       FileStatus[] statuses = fs.listStatus(dumpRoot);
       if (statuses.length > 0)  {
 
 Review comment:
   It will return empty array

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r395952774
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplOperationCompleteAckWork.java
 ##########
 @@ -18,27 +18,28 @@
 
 package org.apache.hadoop.hive.ql.exec.repl;
 
+import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.hive.ql.plan.Explain;
 import org.apache.hadoop.hive.ql.plan.Explain.Level;
 
 import java.io.Serializable;
 
 /**
- * ReplLoadCompleteAckWork.
- * FS based Acknowledgement for repl load complete
+ * ReplOperationCompleteAckWork.
+ * FS based acknowledgement on repl dump and repl load completion.
  *
  */
-@Explain(displayName = "Repl Load Complete Ack", explainLevels = { Level.USER, Level.DEFAULT, Level.EXTENDED })
-public class ReplLoadCompleteAckWork implements Serializable {
+@Explain(displayName = "Repl Operation Complete Ack", explainLevels = { Level.USER, Level.DEFAULT, Level.EXTENDED })
 
 Review comment:
   Done

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r393462313
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##########
 @@ -386,8 +399,10 @@ private Long incrementalDump(Path dumpRoot, DumpMetaData dmd, Path cmRoot, Hive
         }
       }
       dumpTableListToDumpLocation(tableList, dumpRoot, dbName, conf);
+      List<ExternalTableCopyTaskBuilder.DirCopyWork> extTableCopyWorks = dirLocationsToCopy(extTableLocations);
 
 Review comment:
   Sure, will fix that.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] maheshk114 commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
maheshk114 commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r392815132
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##########
 @@ -122,25 +127,31 @@ public String getName() {
   @Override
   public int execute() {
     try {
-      Hive hiveDb = getHive();
-      Path dumpRoot = new Path(conf.getVar(HiveConf.ConfVars.REPLDIR),
-              Base64.getEncoder().encodeToString(work.dbNameOrPattern.toLowerCase()
-                      .getBytes(StandardCharsets.UTF_8.name())));
-      Path currentDumpPath = new Path(dumpRoot, getNextDumpDir());
-      Path hiveDumpRoot = new Path(currentDumpPath, ReplUtils.REPL_HIVE_BASE_DIR);
-      DumpMetaData dmd = new DumpMetaData(hiveDumpRoot, conf);
-      // Initialize ReplChangeManager instance since we will require it to encode file URI.
-      ReplChangeManager.getInstance(conf);
-      Path cmRoot = new Path(conf.getVar(HiveConf.ConfVars.REPLCMDIR));
-      Long lastReplId;
-      if (!dumpRoot.getFileSystem(conf).exists(dumpRoot)
-              || dumpRoot.getFileSystem(conf).listStatus(dumpRoot).length == 0) {
-        lastReplId = bootStrapDump(hiveDumpRoot, dmd, cmRoot, hiveDb);
+      //First Check if external table copy work has been initialized, if so, just do that and return.
+      if (work.dirCopyIteratorInitialized() || work.replPathIteratorInitialized()) {
+        intitiateDataCopyTasks();
 
 Review comment:
   why cant external table copy task goes in parallel to other tasks

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r395606610
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##########
 @@ -662,15 +696,37 @@ void dumpTable(String dbName, String tblName, String validTxnList, Path dbRoot,
     replLogger.tableLog(tblName, tableSpec.tableHandle.getTableType());
     if (tableSpec.tableHandle.getTableType().equals(TableType.EXTERNAL_TABLE)
             || Utils.shouldDumpMetaDataOnly(conf)) {
-      return;
+      return Collections.EMPTY_LIST;
     }
-    for (ReplPathMapping replPathMapping: replPathMappings) {
-      Task<?> copyTask = ReplCopyTask.getLoadCopyTask(
-              tuple.replicationSpec, replPathMapping.getSrcPath(), replPathMapping.getTargetPath(), conf, false);
-      this.addDependentTask(copyTask);
-      LOG.info("Scheduled a repl copy task from [{}] to [{}]",
-              replPathMapping.getSrcPath(), replPathMapping.getTargetPath());
+    return replPathMappings;
+  }
+
+  private void intitiateDataCopyTasks() throws SemanticException {
+    Iterator<ExternalTableCopyTaskBuilder.DirCopyWork> extCopyWorkItr = work.getDirCopyIterator();
+    List<Task<?>> childTasks = new ArrayList<>();
+    int maxTasks = conf.getIntVar(HiveConf.ConfVars.REPL_APPROX_MAX_LOAD_TASKS);
+    TaskTracker taskTracker = new TaskTracker(maxTasks);
+    while (taskTracker.canAddMoreTasks() && hasMoreCopyWork()) {
+      if (work.replPathIteratorInitialized() && extCopyWorkItr.hasNext()) {
+        childTasks.addAll(new ExternalTableCopyTaskBuilder(work, conf).tasks(taskTracker));
+      } else {
+        childTasks.addAll(ReplPathMapping.tasks(work, taskTracker, conf));
 
 Review comment:
   ReplPathMapping.tasks() is sort of util method. Ideally there should be a builder class like we have for dirCopyTasks: ExternalTableCopyTaskBuilder.
   In this case full fledged builder wasn't required. I can add a new Builder class for the same.
   May be, we can shift it to ReplCopyTask as a static method as ultimately it is giving back ReplCopyTasks.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] anishek commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
anishek commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r395570139
 
 

 ##########
 File path: ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/StageType.java
 ##########
 @@ -30,7 +30,7 @@
   REPL_TXN(15),
   REPL_INCREMENTAL_LOAD(16),
   SCHEDULED_QUERY_MAINT(17),
-  REPL_LOAD_COMPLETE_ACK(18);
+  REPL_OPERATION_COMPLETE_ACK(18);
 
   private final int value;
 
 
 Review comment:
   unused imports like HashMap, Util
   remove the private qualifier from constructor.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] aasha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
aasha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r393400475
 
 

 ##########
 File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationScenariosExternalTables.java
 ##########
 @@ -713,9 +732,11 @@ public void testExternalTableDataPath() throws Exception {
 
   @Test
   public void testExternalTablesIncReplicationWithConcurrentDropTable() throws Throwable {
-    List<String> dumpWithClause = Collections.singletonList(
-            "'" + HiveConf.ConfVars.REPL_INCLUDE_EXTERNAL_TABLES.varname + "'='true'"
-    );
+    List<String> dumpWithClause = Arrays.asList(
 
 Review comment:
   Can be separated in a common method

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] anishek commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
anishek commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r395579429
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##########
 @@ -662,15 +696,37 @@ void dumpTable(String dbName, String tblName, String validTxnList, Path dbRoot,
     replLogger.tableLog(tblName, tableSpec.tableHandle.getTableType());
     if (tableSpec.tableHandle.getTableType().equals(TableType.EXTERNAL_TABLE)
             || Utils.shouldDumpMetaDataOnly(conf)) {
-      return;
+      return Collections.EMPTY_LIST;
     }
-    for (ReplPathMapping replPathMapping: replPathMappings) {
-      Task<?> copyTask = ReplCopyTask.getLoadCopyTask(
-              tuple.replicationSpec, replPathMapping.getSrcPath(), replPathMapping.getTargetPath(), conf, false);
-      this.addDependentTask(copyTask);
-      LOG.info("Scheduled a repl copy task from [{}] to [{}]",
-              replPathMapping.getSrcPath(), replPathMapping.getTargetPath());
+    return replPathMappings;
+  }
+
+  private void intitiateDataCopyTasks() throws SemanticException {
+    Iterator<ExternalTableCopyTaskBuilder.DirCopyWork> extCopyWorkItr = work.getDirCopyIterator();
+    List<Task<?>> childTasks = new ArrayList<>();
+    int maxTasks = conf.getIntVar(HiveConf.ConfVars.REPL_APPROX_MAX_LOAD_TASKS);
+    TaskTracker taskTracker = new TaskTracker(maxTasks);
+    while (taskTracker.canAddMoreTasks() && hasMoreCopyWork()) {
+      if (work.replPathIteratorInitialized() && extCopyWorkItr.hasNext()) {
+        childTasks.addAll(new ExternalTableCopyTaskBuilder(work, conf).tasks(taskTracker));
+      } else {
+        childTasks.addAll(ReplPathMapping.tasks(work, taskTracker, conf));
+      }
+    }
 
 Review comment:
   why not just have just two statements like, with some return values added to child tasks.
   
       work.externalTableCopyTasks(taskTracker, conf);
       work.managedTableCopyTasks()
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] anishek commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
anishek commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r395577292
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplOperationCompleteAckWork.java
 ##########
 @@ -18,27 +18,28 @@
 
 package org.apache.hadoop.hive.ql.exec.repl;
 
+import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.hive.ql.plan.Explain;
 import org.apache.hadoop.hive.ql.plan.Explain.Level;
 
 import java.io.Serializable;
 
 /**
- * ReplLoadCompleteAckWork.
- * FS based Acknowledgement for repl load complete
+ * ReplOperationCompleteAckWork.
+ * FS based acknowledgement on repl dump and repl load completion.
  *
  */
-@Explain(displayName = "Repl Load Complete Ack", explainLevels = { Level.USER, Level.DEFAULT, Level.EXTENDED })
-public class ReplLoadCompleteAckWork implements Serializable {
+@Explain(displayName = "Repl Operation Complete Ack", explainLevels = { Level.USER, Level.DEFAULT, Level.EXTENDED })
 
 Review comment:
   Just have the class as AckWork.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r394307271
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##########
 @@ -662,17 +690,48 @@ void dumpTable(String dbName, String tblName, String validTxnList, Path dbRoot,
     replLogger.tableLog(tblName, tableSpec.tableHandle.getTableType());
     if (tableSpec.tableHandle.getTableType().equals(TableType.EXTERNAL_TABLE)
             || Utils.shouldDumpMetaDataOnly(conf)) {
-      return;
+      return Collections.EMPTY_LIST;
+    }
+    return replPathMappings;
+  }
+
+  private void intitiateDataCopyTasks() {
+    Iterator<ExternalTableCopyTaskBuilder.DirCopyWork> extCopyWorkItr = work.getDirCopyIterator();
+    ReplOperationCompleteAckWork replDumpCompleteAckWork = new ReplOperationCompleteAckWork(dumpAckFile);
+    Task<ReplOperationCompleteAckWork> dumpCompleteAckWorkTask = TaskFactory.get(replDumpCompleteAckWork, conf);
+    List<Task<?>> childTasks = new ArrayList<>();
+    int maxTasks = conf.getIntVar(HiveConf.ConfVars.REPL_APPROX_MAX_LOAD_TASKS);
+    TaskTracker taskTracker = new TaskTracker(maxTasks);
+    while (taskTracker.canAddMoreTasks() && hasMoreCopyWork()) {
+      if (work.replPathIteratorInitialized() && extCopyWorkItr.hasNext()) {
+        childTasks.addAll(new ExternalTableCopyTaskBuilder(work, conf).tasks(taskTracker));
+      } else {
+        childTasks.addAll(ReplPathMapping.tasks(work, taskTracker, conf));
+      }
     }
-    for (ReplPathMapping replPathMapping: replPathMappings) {
-      Task<?> copyTask = ReplCopyTask.getLoadCopyTask(
-              tuple.replicationSpec, replPathMapping.getSrcPath(), replPathMapping.getTargetPath(), conf, false);
-      this.addDependentTask(copyTask);
-      LOG.info("Scheduled a repl copy task from [{}] to [{}]",
-              replPathMapping.getSrcPath(), replPathMapping.getTargetPath());
+    if (!childTasks.isEmpty()) {
+      boolean ackTaskAdded = false;
+      if (taskTracker.canAddMoreTasks()) {
+        childTasks.add(dumpCompleteAckWorkTask);
+        ackTaskAdded = true;
+      }
+      if (hasMoreCopyWork() || !ackTaskAdded) {
+        DAGTraversal.traverse(childTasks, new AddDependencyToLeaves(TaskFactory.get(work, conf)));
 
 Review comment:
   Ack is being added at the end in these cases as well

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r393724555
 
 

 ##########
 File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationOnHDFSEncryptedZones.java
 ##########
 @@ -102,12 +104,20 @@ public void targetAndSourceHaveDifferentEncryptionZoneKeys() throws Throwable {
           put(HiveConf.ConfVars.REPLDIR.varname, primary.repldDir);
         }}, "test_key123");
 
+    List<String> dumpWithClause = Arrays.asList(
+            "'hive.repl.add.raw.reserved.namespace'='true'",
+            "'" + HiveConf.ConfVars.REPL_EXTERNAL_TABLE_BASE_DIR.varname + "'='"
+                    + replica.externalTableWarehouseRoot + "'",
+            "'distcp.options.skipcrccheck'=''",
 
 Review comment:
   That was part of existing test, I have used the same set of config in dump time

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r393432287
 
 

 ##########
 File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationScenariosExternalTables.java
 ##########
 @@ -397,17 +397,19 @@ public void externalTableWithPartitions() throws Throwable {
     primary.run("use " + primaryDbName)
             .run("insert into table t2 partition(country='france') values ('lyon')")
             .run("alter table t2 set location '" + tmpLocation2 + "'")
-            .dump(primaryDbName);
+            .dump(primaryDbName, withClause);
 
-    replica.load(replicatedDbName, primaryDbName, loadWithClause);
+    replica.load(replicatedDbName, primaryDbName, withClause);
     assertTablePartitionLocation(primaryDbName + ".t2", replicatedDbName + ".t2");
   }
 
   @Test
   public void externalTableIncrementalReplication() throws Throwable {
-    WarehouseInstance.Tuple tuple = primary.dumpWithCommand("repl dump " + primaryDbName);
+    List<String> withClause = externalTableBasePathWithClause();
+    String replDumpCommand = "repl dump " + primaryDbName
+            + " WITH (" + withClause.get(0) + "," + withClause.get(1) + ")";
+    WarehouseInstance.Tuple tuple = primary.dumpWithCommand(replDumpCommand);
 
 Review comment:
   Will change

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r395952912
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##########
 @@ -122,33 +128,38 @@ public String getName() {
   @Override
   public int execute() {
     try {
-      Hive hiveDb = getHive();
-      Path dumpRoot = new Path(conf.getVar(HiveConf.ConfVars.REPLDIR),
-              Base64.getEncoder().encodeToString(work.dbNameOrPattern.toLowerCase()
-                      .getBytes(StandardCharsets.UTF_8.name())));
-      Path previousDumpMetaPath = getPreviousDumpMetadataPath(dumpRoot);
-      Path previousHiveDumpPath =
-              previousDumpMetaPath != null ? new Path(previousDumpMetaPath, ReplUtils.REPL_HIVE_BASE_DIR) : null;
-      //If no previous dump is present or previous dump was loaded, proceed
 
 Review comment:
   Done

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] aasha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
aasha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r393400307
 
 

 ##########
 File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationScenariosExternalTables.java
 ##########
 @@ -436,16 +438,29 @@ public void externalTableIncrementalReplication() throws Throwable {
     }
 
     List<String> loadWithClause = externalTableBasePathWithClause();
-    replica.load(replicatedDbName, primaryDbName, loadWithClause)
+    replica.load(replicatedDbName, primaryDbName, withClause)
         .run("use " + replicatedDbName)
         .run("show tables like 't1'")
         .verifyResult("t1")
         .run("show partitions t1")
         .verifyResults(new String[] { "country=india", "country=us" })
         .run("select place from t1 order by place")
-        .verifyResults(new String[] { "bangalore", "mumbai", "pune" })
+        .verifyResults(new String[] {})
 
 Review comment:
   Why is there no data here?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r393433210
 
 

 ##########
 File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationScenariosExternalTables.java
 ##########
 @@ -436,16 +438,29 @@ public void externalTableIncrementalReplication() throws Throwable {
     }
 
     List<String> loadWithClause = externalTableBasePathWithClause();
-    replica.load(replicatedDbName, primaryDbName, loadWithClause)
+    replica.load(replicatedDbName, primaryDbName, withClause)
         .run("use " + replicatedDbName)
         .run("show tables like 't1'")
         .verifyResult("t1")
         .run("show partitions t1")
         .verifyResults(new String[] { "country=india", "country=us" })
         .run("select place from t1 order by place")
-        .verifyResults(new String[] { "bangalore", "mumbai", "pune" })
+        .verifyResults(new String[] {})
 
 Review comment:
   Since the modification happened after dump, data will be seen only after next dump-load cycle, which is being tested.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r394885186
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##########
 @@ -193,20 +193,29 @@ private Long getEventFromPreviousDumpMetadata(Path previousDumpPath) throws Sema
   }
 
   private Path getPreviousDumpMetadataPath(Path dumpRoot) throws IOException {
+    FileStatus latestUpdatedStatus = null;
     FileSystem fs = dumpRoot.getFileSystem(conf);
     if (fs.exists(dumpRoot)) {
       FileStatus[] statuses = fs.listStatus(dumpRoot);
       if (statuses.length > 0)  {
 
 Review comment:
   Done

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] anishek commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
anishek commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r395581381
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##########
 @@ -122,33 +128,38 @@ public String getName() {
   @Override
   public int execute() {
     try {
-      Hive hiveDb = getHive();
-      Path dumpRoot = new Path(conf.getVar(HiveConf.ConfVars.REPLDIR),
-              Base64.getEncoder().encodeToString(work.dbNameOrPattern.toLowerCase()
-                      .getBytes(StandardCharsets.UTF_8.name())));
-      Path previousDumpMetaPath = getPreviousDumpMetadataPath(dumpRoot);
-      Path previousHiveDumpPath =
-              previousDumpMetaPath != null ? new Path(previousDumpMetaPath, ReplUtils.REPL_HIVE_BASE_DIR) : null;
-      //If no previous dump is present or previous dump was loaded, proceed
 
 Review comment:
   why not move the above statement in the function before  this 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r395603150
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##########
 @@ -662,15 +696,37 @@ void dumpTable(String dbName, String tblName, String validTxnList, Path dbRoot,
     replLogger.tableLog(tblName, tableSpec.tableHandle.getTableType());
     if (tableSpec.tableHandle.getTableType().equals(TableType.EXTERNAL_TABLE)
             || Utils.shouldDumpMetaDataOnly(conf)) {
-      return;
+      return Collections.EMPTY_LIST;
     }
-    for (ReplPathMapping replPathMapping: replPathMappings) {
-      Task<?> copyTask = ReplCopyTask.getLoadCopyTask(
-              tuple.replicationSpec, replPathMapping.getSrcPath(), replPathMapping.getTargetPath(), conf, false);
-      this.addDependentTask(copyTask);
-      LOG.info("Scheduled a repl copy task from [{}] to [{}]",
-              replPathMapping.getSrcPath(), replPathMapping.getTargetPath());
+    return replPathMappings;
+  }
+
+  private void intitiateDataCopyTasks() throws SemanticException {
+    Iterator<ExternalTableCopyTaskBuilder.DirCopyWork> extCopyWorkItr = work.getDirCopyIterator();
+    List<Task<?>> childTasks = new ArrayList<>();
+    int maxTasks = conf.getIntVar(HiveConf.ConfVars.REPL_APPROX_MAX_LOAD_TASKS);
+    TaskTracker taskTracker = new TaskTracker(maxTasks);
+    while (taskTracker.canAddMoreTasks() && hasMoreCopyWork()) {
+      if (work.replPathIteratorInitialized() && extCopyWorkItr.hasNext()) {
+        childTasks.addAll(new ExternalTableCopyTaskBuilder(work, conf).tasks(taskTracker));
+      } else {
+        childTasks.addAll(ReplPathMapping.tasks(work, taskTracker, conf));
+      }
+    }
+    if (!childTasks.isEmpty()) {
+      DAGTraversal.traverse(childTasks, new AddDependencyToLeaves(TaskFactory.get(work, conf)));
+    } else {
+      prepareReturnValues(work.getResultValues());
+      childTasks.add(TaskFactory.get(new ReplOperationCompleteAckWork(work.getDumpAckFile()), conf));
     }
 
 Review comment:
   Yes, in my previous patch, I was adding the ack work as child task whenever it was possible by making use of tracker.canAddMoreTasks . Then the requirement came to run this prepareReturnValues(); as the last operation just before ack task.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r393427998
 
 

 ##########
 File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationOnHDFSEncryptedZones.java
 ##########
 @@ -102,12 +104,20 @@ public void targetAndSourceHaveDifferentEncryptionZoneKeys() throws Throwable {
           put(HiveConf.ConfVars.REPLDIR.varname, primary.repldDir);
         }}, "test_key123");
 
+    List<String> dumpWithClause = Arrays.asList(
 
 Review comment:
   Since the external table copy is happening on source, the configs are required to be passed at dump.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] aasha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
aasha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r393662263
 
 

 ##########
 File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestTableLevelReplicationScenarios.java
 ##########
 @@ -918,7 +922,9 @@ public void testRenameTableScenariosWithReplaceExternalTable() throws Throwable
     String newPolicy = primaryDbName + ".'(in[0-9]+)|(out1500)|(in2)'";
     dumpWithClause = Arrays.asList(
             "'" + HiveConf.ConfVars.REPL_INCLUDE_EXTERNAL_TABLES.varname + "'='true'",
-            "'" + HiveConf.ConfVars.REPL_BOOTSTRAP_EXTERNAL_TABLES.varname + "'='false'"
+            "'" + HiveConf.ConfVars.REPL_BOOTSTRAP_EXTERNAL_TABLES.varname + "'='false'",
 
 Review comment:
   Its still using its own configs. Is this done?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] aasha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
aasha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r393660435
 
 

 ##########
 File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationScenariosExternalTables.java
 ##########
 @@ -436,16 +436,29 @@ public void externalTableIncrementalReplication() throws Throwable {
     }
 
     List<String> loadWithClause = externalTableBasePathWithClause();
-    replica.load(replicatedDbName, primaryDbName, loadWithClause)
+    replica.load(replicatedDbName, primaryDbName, withClause)
         .run("use " + replicatedDbName)
         .run("show tables like 't1'")
         .verifyResult("t1")
         .run("show partitions t1")
         .verifyResults(new String[] { "country=india", "country=us" })
         .run("select place from t1 order by place")
-        .verifyResults(new String[] { "bangalore", "mumbai", "pune" })
 
 Review comment:
   How was it getting loaded here?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] aasha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
aasha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r393402843
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##########
 @@ -367,13 +379,14 @@ private Long incrementalDump(Path dumpRoot, DumpMetaData dmd, Path cmRoot, Hive
             // Dump external table locations if required.
             if (TableType.EXTERNAL_TABLE.equals(table.getTableType())
                   && shouldDumpExternalTableLocation()) {
-              writer.dataLocationDump(table);
+              extTableLocations.addAll(writer.dataLocationDump(table));
             }
 
             // Dump the table to be bootstrapped if required.
             if (shouldBootstrapDumpTable(table)) {
               HiveWrapper.Tuple<Table> tableTuple = new HiveWrapper(hiveDb, dbName).table(table);
-              dumpTable(dbName, tableName, validTxnList, dbRoot, dumpRoot, bootDumpBeginReplId, hiveDb, tableTuple);
+              replPathMappings.addAll(dumpTable(dbName, tableName, validTxnList, dbRoot, dumpRoot, bootDumpBeginReplId,
 
 Review comment:
   If there are too many external tables OOM?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r393723769
 
 

 ##########
 File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestTableLevelReplicationScenarios.java
 ##########
 @@ -918,7 +922,9 @@ public void testRenameTableScenariosWithReplaceExternalTable() throws Throwable
     String newPolicy = primaryDbName + ".'(in[0-9]+)|(out1500)|(in2)'";
     dumpWithClause = Arrays.asList(
             "'" + HiveConf.ConfVars.REPL_INCLUDE_EXTERNAL_TABLES.varname + "'='true'",
-            "'" + HiveConf.ConfVars.REPL_BOOTSTRAP_EXTERNAL_TABLES.varname + "'='false'"
+            "'" + HiveConf.ConfVars.REPL_BOOTSTRAP_EXTERNAL_TABLES.varname + "'='false'",
 
 Review comment:
   Yes, I was not planning to refactor all the places.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] anishek commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
anishek commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r395584857
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##########
 @@ -662,15 +696,37 @@ void dumpTable(String dbName, String tblName, String validTxnList, Path dbRoot,
     replLogger.tableLog(tblName, tableSpec.tableHandle.getTableType());
     if (tableSpec.tableHandle.getTableType().equals(TableType.EXTERNAL_TABLE)
             || Utils.shouldDumpMetaDataOnly(conf)) {
-      return;
+      return Collections.EMPTY_LIST;
     }
-    for (ReplPathMapping replPathMapping: replPathMappings) {
-      Task<?> copyTask = ReplCopyTask.getLoadCopyTask(
-              tuple.replicationSpec, replPathMapping.getSrcPath(), replPathMapping.getTargetPath(), conf, false);
-      this.addDependentTask(copyTask);
-      LOG.info("Scheduled a repl copy task from [{}] to [{}]",
-              replPathMapping.getSrcPath(), replPathMapping.getTargetPath());
+    return replPathMappings;
+  }
+
+  private void intitiateDataCopyTasks() throws SemanticException {
+    Iterator<ExternalTableCopyTaskBuilder.DirCopyWork> extCopyWorkItr = work.getDirCopyIterator();
+    List<Task<?>> childTasks = new ArrayList<>();
+    int maxTasks = conf.getIntVar(HiveConf.ConfVars.REPL_APPROX_MAX_LOAD_TASKS);
+    TaskTracker taskTracker = new TaskTracker(maxTasks);
+    while (taskTracker.canAddMoreTasks() && hasMoreCopyWork()) {
+      if (work.replPathIteratorInitialized() && extCopyWorkItr.hasNext()) {
+        childTasks.addAll(new ExternalTableCopyTaskBuilder(work, conf).tasks(taskTracker));
+      } else {
+        childTasks.addAll(ReplPathMapping.tasks(work, taskTracker, conf));
 
 Review comment:
   why is this static method here, there is nothing accessed from there why not just in the current class. is some encapsulation broken here ?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r395953040
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##########
 @@ -662,15 +696,37 @@ void dumpTable(String dbName, String tblName, String validTxnList, Path dbRoot,
     replLogger.tableLog(tblName, tableSpec.tableHandle.getTableType());
     if (tableSpec.tableHandle.getTableType().equals(TableType.EXTERNAL_TABLE)
             || Utils.shouldDumpMetaDataOnly(conf)) {
-      return;
+      return Collections.EMPTY_LIST;
     }
-    for (ReplPathMapping replPathMapping: replPathMappings) {
-      Task<?> copyTask = ReplCopyTask.getLoadCopyTask(
-              tuple.replicationSpec, replPathMapping.getSrcPath(), replPathMapping.getTargetPath(), conf, false);
-      this.addDependentTask(copyTask);
-      LOG.info("Scheduled a repl copy task from [{}] to [{}]",
-              replPathMapping.getSrcPath(), replPathMapping.getTargetPath());
+    return replPathMappings;
+  }
+
+  private void intitiateDataCopyTasks() throws SemanticException {
+    Iterator<ExternalTableCopyTaskBuilder.DirCopyWork> extCopyWorkItr = work.getDirCopyIterator();
+    List<Task<?>> childTasks = new ArrayList<>();
+    int maxTasks = conf.getIntVar(HiveConf.ConfVars.REPL_APPROX_MAX_LOAD_TASKS);
+    TaskTracker taskTracker = new TaskTracker(maxTasks);
+    while (taskTracker.canAddMoreTasks() && hasMoreCopyWork()) {
+      if (work.replPathIteratorInitialized() && extCopyWorkItr.hasNext()) {
+        childTasks.addAll(new ExternalTableCopyTaskBuilder(work, conf).tasks(taskTracker));
+      } else {
+        childTasks.addAll(ReplPathMapping.tasks(work, taskTracker, conf));
+      }
+    }
+    if (!childTasks.isEmpty()) {
+      DAGTraversal.traverse(childTasks, new AddDependencyToLeaves(TaskFactory.get(work, conf)));
+    } else {
+      prepareReturnValues(work.getResultValues());
+      childTasks.add(TaskFactory.get(new ReplOperationCompleteAckWork(work.getDumpAckFile()), conf));
     }
 
 Review comment:
   Have refactored a bit. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] aasha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
aasha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r394291913
 
 

 ##########
 File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestScheduledReplicationScenarios.java
 ##########
 @@ -151,6 +151,70 @@ public void testAcidTablesReplLoadBootstrapIncr() throws Throwable {
               .run("drop table t1");
 
 
+    } finally {
+      primary.run("drop scheduled query s1");
+      replica.run("drop scheduled query s2");
+    }
+  }
+
+  @Test
+  public void testExternalTablesReplLoadBootstrapIncr() throws Throwable {
+    // Bootstrap
+    primary.run("use " + primaryDbName)
+            .run("create external table t1 (id int)")
+            .run("insert into t1 values(1)")
+            .run("insert into t1 values(2)");
+    try (ScheduledQueryExecutionService schqS =
+                 ScheduledQueryExecutionService.startScheduledQueryExecutorService(primary.hiveConf)) {
+      int next = 0;
+      ReplDumpWork.injectNextDumpDirForTest(String.valueOf(next));
+      primary.run("create scheduled query s1 every 10 minutes as repl dump " + primaryDbName);
+      primary.run("alter scheduled query s1 execute");
+      Thread.sleep(20000);
 
 Review comment:
   You might need to increase this time. Dump is not getting completed

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r395603150
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##########
 @@ -662,15 +696,37 @@ void dumpTable(String dbName, String tblName, String validTxnList, Path dbRoot,
     replLogger.tableLog(tblName, tableSpec.tableHandle.getTableType());
     if (tableSpec.tableHandle.getTableType().equals(TableType.EXTERNAL_TABLE)
             || Utils.shouldDumpMetaDataOnly(conf)) {
-      return;
+      return Collections.EMPTY_LIST;
     }
-    for (ReplPathMapping replPathMapping: replPathMappings) {
-      Task<?> copyTask = ReplCopyTask.getLoadCopyTask(
-              tuple.replicationSpec, replPathMapping.getSrcPath(), replPathMapping.getTargetPath(), conf, false);
-      this.addDependentTask(copyTask);
-      LOG.info("Scheduled a repl copy task from [{}] to [{}]",
-              replPathMapping.getSrcPath(), replPathMapping.getTargetPath());
+    return replPathMappings;
+  }
+
+  private void intitiateDataCopyTasks() throws SemanticException {
+    Iterator<ExternalTableCopyTaskBuilder.DirCopyWork> extCopyWorkItr = work.getDirCopyIterator();
+    List<Task<?>> childTasks = new ArrayList<>();
+    int maxTasks = conf.getIntVar(HiveConf.ConfVars.REPL_APPROX_MAX_LOAD_TASKS);
+    TaskTracker taskTracker = new TaskTracker(maxTasks);
+    while (taskTracker.canAddMoreTasks() && hasMoreCopyWork()) {
+      if (work.replPathIteratorInitialized() && extCopyWorkItr.hasNext()) {
+        childTasks.addAll(new ExternalTableCopyTaskBuilder(work, conf).tasks(taskTracker));
+      } else {
+        childTasks.addAll(ReplPathMapping.tasks(work, taskTracker, conf));
+      }
+    }
+    if (!childTasks.isEmpty()) {
+      DAGTraversal.traverse(childTasks, new AddDependencyToLeaves(TaskFactory.get(work, conf)));
+    } else {
+      prepareReturnValues(work.getResultValues());
+      childTasks.add(TaskFactory.get(new ReplOperationCompleteAckWork(work.getDumpAckFile()), conf));
     }
 
 Review comment:
   Yes, in my previous patch, I was adding the ack work as child task whenever it was possible by making use of tracker.canAddMoreTasks . Then the requirement came to run this prepareReturnValues(); as the last operation just before ack task.
   Until prepareReturnValues() can be run as separate task, we can't add it as a future operation. There is no generic task we have for ad-hoc operation like this and I thought creating a new one would be overkill. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] anishek commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
anishek commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r395586086
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##########
 @@ -662,15 +696,37 @@ void dumpTable(String dbName, String tblName, String validTxnList, Path dbRoot,
     replLogger.tableLog(tblName, tableSpec.tableHandle.getTableType());
     if (tableSpec.tableHandle.getTableType().equals(TableType.EXTERNAL_TABLE)
             || Utils.shouldDumpMetaDataOnly(conf)) {
-      return;
+      return Collections.EMPTY_LIST;
     }
-    for (ReplPathMapping replPathMapping: replPathMappings) {
-      Task<?> copyTask = ReplCopyTask.getLoadCopyTask(
-              tuple.replicationSpec, replPathMapping.getSrcPath(), replPathMapping.getTargetPath(), conf, false);
-      this.addDependentTask(copyTask);
-      LOG.info("Scheduled a repl copy task from [{}] to [{}]",
-              replPathMapping.getSrcPath(), replPathMapping.getTargetPath());
+    return replPathMappings;
+  }
+
+  private void intitiateDataCopyTasks() throws SemanticException {
+    Iterator<ExternalTableCopyTaskBuilder.DirCopyWork> extCopyWorkItr = work.getDirCopyIterator();
+    List<Task<?>> childTasks = new ArrayList<>();
+    int maxTasks = conf.getIntVar(HiveConf.ConfVars.REPL_APPROX_MAX_LOAD_TASKS);
+    TaskTracker taskTracker = new TaskTracker(maxTasks);
+    while (taskTracker.canAddMoreTasks() && hasMoreCopyWork()) {
+      if (work.replPathIteratorInitialized() && extCopyWorkItr.hasNext()) {
+        childTasks.addAll(new ExternalTableCopyTaskBuilder(work, conf).tasks(taskTracker));
+      } else {
+        childTasks.addAll(ReplPathMapping.tasks(work, taskTracker, conf));
+      }
+    }
+    if (!childTasks.isEmpty()) {
+      DAGTraversal.traverse(childTasks, new AddDependencyToLeaves(TaskFactory.get(work, conf)));
+    } else {
+      prepareReturnValues(work.getResultValues());
+      childTasks.add(TaskFactory.get(new ReplOperationCompleteAckWork(work.getDumpAckFile()), conf));
     }
 
 Review comment:
   why is ack work only if the child task is empty, there is possibility of adding the ack work at the end of all the dag, if the dag can contain more nodes. may be need to move it out of else clause and do it based on tracker.canAddMoreTasks or something?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r394464144
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##########
 @@ -193,20 +193,29 @@ private Long getEventFromPreviousDumpMetadata(Path previousDumpPath) throws Sema
   }
 
   private Path getPreviousDumpMetadataPath(Path dumpRoot) throws IOException {
+    FileStatus latestUpdatedStatus = null;
     FileSystem fs = dumpRoot.getFileSystem(conf);
     if (fs.exists(dumpRoot)) {
       FileStatus[] statuses = fs.listStatus(dumpRoot);
       if (statuses.length > 0)  {
 
 Review comment:
   Is fs.listStatus() guaranteed to be not null even when the the path does not have any folder/files?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] aasha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
aasha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r393399950
 
 

 ##########
 File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationScenariosExternalTables.java
 ##########
 @@ -397,17 +397,19 @@ public void externalTableWithPartitions() throws Throwable {
     primary.run("use " + primaryDbName)
             .run("insert into table t2 partition(country='france') values ('lyon')")
             .run("alter table t2 set location '" + tmpLocation2 + "'")
-            .dump(primaryDbName);
+            .dump(primaryDbName, withClause);
 
-    replica.load(replicatedDbName, primaryDbName, loadWithClause);
+    replica.load(replicatedDbName, primaryDbName, withClause);
     assertTablePartitionLocation(primaryDbName + ".t2", replicatedDbName + ".t2");
   }
 
   @Test
   public void externalTableIncrementalReplication() throws Throwable {
-    WarehouseInstance.Tuple tuple = primary.dumpWithCommand("repl dump " + primaryDbName);
+    List<String> withClause = externalTableBasePathWithClause();
+    String replDumpCommand = "repl dump " + primaryDbName
+            + " WITH (" + withClause.get(0) + "," + withClause.get(1) + ")";
+    WarehouseInstance.Tuple tuple = primary.dumpWithCommand(replDumpCommand);
 
 Review comment:
   can use withClause directly

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] aasha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
aasha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r394459376
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##########
 @@ -193,20 +193,29 @@ private Long getEventFromPreviousDumpMetadata(Path previousDumpPath) throws Sema
   }
 
   private Path getPreviousDumpMetadataPath(Path dumpRoot) throws IOException {
+    FileStatus latestUpdatedStatus = null;
     FileSystem fs = dumpRoot.getFileSystem(conf);
     if (fs.exists(dumpRoot)) {
       FileStatus[] statuses = fs.listStatus(dumpRoot);
       if (statuses.length > 0)  {
 
 Review comment:
   We don't need this check

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r393723769
 
 

 ##########
 File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestTableLevelReplicationScenarios.java
 ##########
 @@ -918,7 +922,9 @@ public void testRenameTableScenariosWithReplaceExternalTable() throws Throwable
     String newPolicy = primaryDbName + ".'(in[0-9]+)|(out1500)|(in2)'";
     dumpWithClause = Arrays.asList(
             "'" + HiveConf.ConfVars.REPL_INCLUDE_EXTERNAL_TABLES.varname + "'='true'",
-            "'" + HiveConf.ConfVars.REPL_BOOTSTRAP_EXTERNAL_TABLES.varname + "'='false'"
+            "'" + HiveConf.ConfVars.REPL_BOOTSTRAP_EXTERNAL_TABLES.varname + "'='false'",
 
 Review comment:
   Missed it here, will fix

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r393506609
 
 

 ##########
 File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationScenariosExternalTables.java
 ##########
 @@ -713,9 +732,11 @@ public void testExternalTableDataPath() throws Exception {
 
   @Test
   public void testExternalTablesIncReplicationWithConcurrentDropTable() throws Throwable {
-    List<String> dumpWithClause = Collections.singletonList(
-            "'" + HiveConf.ConfVars.REPL_INCLUDE_EXTERNAL_TABLES.varname + "'='true'"
-    );
+    List<String> dumpWithClause = Arrays.asList(
 
 Review comment:
   Done

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] aasha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
aasha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r394293829
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##########
 @@ -122,33 +129,38 @@ public String getName() {
   @Override
   public int execute() {
     try {
-      Hive hiveDb = getHive();
-      Path dumpRoot = new Path(conf.getVar(HiveConf.ConfVars.REPLDIR),
-              Base64.getEncoder().encodeToString(work.dbNameOrPattern.toLowerCase()
-                      .getBytes(StandardCharsets.UTF_8.name())));
-      Path previousDumpMetaPath = getPreviousDumpMetadataPath(dumpRoot);
-      Path previousHiveDumpPath =
-              previousDumpMetaPath != null ? new Path(previousDumpMetaPath, ReplUtils.REPL_HIVE_BASE_DIR) : null;
-      //If no previous dump is present or previous dump was loaded, proceed
-      if (shouldDump(previousHiveDumpPath)) {
-        Path currentDumpPath = new Path(dumpRoot, getNextDumpDir());
-        Path hiveDumpRoot = new Path(currentDumpPath, ReplUtils.REPL_HIVE_BASE_DIR);
-        DumpMetaData dmd = new DumpMetaData(hiveDumpRoot, conf);
-        // Initialize ReplChangeManager instance since we will require it to encode file URI.
-        ReplChangeManager.getInstance(conf);
-        Path cmRoot = new Path(conf.getVar(HiveConf.ConfVars.REPLCMDIR));
-        Long lastReplId;
-        if (previousHiveDumpPath == null) {
-          lastReplId = bootStrapDump(hiveDumpRoot, dmd, cmRoot, hiveDb);
+      if (work.dirCopyIteratorInitialized() || work.replPathIteratorInitialized()) {
+        intitiateDataCopyTasks();
+      } else {
+        Hive hiveDb = getHive();
+        Path dumpRoot = new Path(conf.getVar(HiveConf.ConfVars.REPLDIR),
+                Base64.getEncoder().encodeToString(work.dbNameOrPattern.toLowerCase()
+                        .getBytes(StandardCharsets.UTF_8.name())));
+        Path previousDumpMetaPath = getPreviousDumpMetadataPath(dumpRoot);
+        Path previousHiveDumpPath =
+                previousDumpMetaPath != null ? new Path(previousDumpMetaPath, ReplUtils.REPL_HIVE_BASE_DIR) : null;
+        //If no previous dump is present or previous dump was loaded, proceed
+        if (shouldDump(previousHiveDumpPath)) {
+          Path currentDumpPath = new Path(dumpRoot, getNextDumpDir());
+          Path hiveDumpRoot = new Path(currentDumpPath, ReplUtils.REPL_HIVE_BASE_DIR);
+          DumpMetaData dmd = new DumpMetaData(hiveDumpRoot, conf);
+          // Initialize ReplChangeManager instance since we will require it to encode file URI.
+          ReplChangeManager.getInstance(conf);
+          Path cmRoot = new Path(conf.getVar(HiveConf.ConfVars.REPLCMDIR));
+          Long lastReplId;
+          if (previousHiveDumpPath == null) {
+            lastReplId = bootStrapDump(hiveDumpRoot, dmd, cmRoot, hiveDb);
+          } else {
+            work.setEventFrom(getEventFromPreviousDumpMetadata(previousHiveDumpPath));
+            lastReplId = incrementalDump(hiveDumpRoot, dmd, cmRoot, hiveDb);
+          }
+          prepareReturnValues(Arrays.asList(currentDumpPath.toUri().toString(), String.valueOf(lastReplId)));
+          deleteAllPreviousDumpMeta(dumpRoot, currentDumpPath);
 
 Review comment:
   This should be done at the end

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] anishek commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
anishek commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r395575495
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##########
 @@ -662,15 +696,37 @@ void dumpTable(String dbName, String tblName, String validTxnList, Path dbRoot,
     replLogger.tableLog(tblName, tableSpec.tableHandle.getTableType());
     if (tableSpec.tableHandle.getTableType().equals(TableType.EXTERNAL_TABLE)
             || Utils.shouldDumpMetaDataOnly(conf)) {
-      return;
+      return Collections.EMPTY_LIST;
     }
-    for (ReplPathMapping replPathMapping: replPathMappings) {
-      Task<?> copyTask = ReplCopyTask.getLoadCopyTask(
-              tuple.replicationSpec, replPathMapping.getSrcPath(), replPathMapping.getTargetPath(), conf, false);
-      this.addDependentTask(copyTask);
-      LOG.info("Scheduled a repl copy task from [{}] to [{}]",
-              replPathMapping.getSrcPath(), replPathMapping.getTargetPath());
+    return replPathMappings;
+  }
+
+  private void intitiateDataCopyTasks() throws SemanticException {
+    Iterator<ExternalTableCopyTaskBuilder.DirCopyWork> extCopyWorkItr = work.getDirCopyIterator();
+    List<Task<?>> childTasks = new ArrayList<>();
+    int maxTasks = conf.getIntVar(HiveConf.ConfVars.REPL_APPROX_MAX_LOAD_TASKS);
+    TaskTracker taskTracker = new TaskTracker(maxTasks);
+    while (taskTracker.canAddMoreTasks() && hasMoreCopyWork()) {
+      if (work.replPathIteratorInitialized() && extCopyWorkItr.hasNext()) {
+        childTasks.addAll(new ExternalTableCopyTaskBuilder(work, conf).tasks(taskTracker));
+      } else {
+        childTasks.addAll(ReplPathMapping.tasks(work, taskTracker, conf));
+      }
+    }
+    if (!childTasks.isEmpty()) {
 
 Review comment:
   better to not do the negate check and inverse the if else block by removing the !

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
pkumarsinha commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r393762700
 
 

 ##########
 File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationScenariosExternalTables.java
 ##########
 @@ -436,16 +438,29 @@ public void externalTableIncrementalReplication() throws Throwable {
     }
 
     List<String> loadWithClause = externalTableBasePathWithClause();
-    replica.load(replicatedDbName, primaryDbName, loadWithClause)
+    replica.load(replicatedDbName, primaryDbName, withClause)
         .run("use " + replicatedDbName)
         .run("show tables like 't1'")
         .verifyResult("t1")
         .run("show partitions t1")
         .verifyResults(new String[] { "country=india", "country=us" })
         .run("select place from t1 order by place")
-        .verifyResults(new String[] { "bangalore", "mumbai", "pune" })
+        .verifyResults(new String[] {})
 
 Review comment:
   Because copy was happening during load.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] anishek commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation

Posted by GitBox <gi...@apache.org>.
anishek commented on a change in pull request #951: HIVE-22997 : Copy external table to target during Repl Dump operation
URL: https://github.com/apache/hive/pull/951#discussion_r395573044
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
 ##########
 @@ -122,33 +128,38 @@ public String getName() {
   @Override
   public int execute() {
     try {
-      Hive hiveDb = getHive();
-      Path dumpRoot = new Path(conf.getVar(HiveConf.ConfVars.REPLDIR),
-              Base64.getEncoder().encodeToString(work.dbNameOrPattern.toLowerCase()
-                      .getBytes(StandardCharsets.UTF_8.name())));
-      Path previousDumpMetaPath = getPreviousDumpMetadataPath(dumpRoot);
-      Path previousHiveDumpPath =
-              previousDumpMetaPath != null ? new Path(previousDumpMetaPath, ReplUtils.REPL_HIVE_BASE_DIR) : null;
-      //If no previous dump is present or previous dump was loaded, proceed
-      if (shouldDump(previousHiveDumpPath)) {
-        Path currentDumpPath = new Path(dumpRoot, getNextDumpDir());
-        Path hiveDumpRoot = new Path(currentDumpPath, ReplUtils.REPL_HIVE_BASE_DIR);
-        DumpMetaData dmd = new DumpMetaData(hiveDumpRoot, conf);
-        // Initialize ReplChangeManager instance since we will require it to encode file URI.
-        ReplChangeManager.getInstance(conf);
-        Path cmRoot = new Path(conf.getVar(HiveConf.ConfVars.REPLCMDIR));
-        Long lastReplId;
-        if (previousHiveDumpPath == null) {
-          lastReplId = bootStrapDump(hiveDumpRoot, dmd, cmRoot, hiveDb);
+      if (work.dirCopyIteratorInitialized() || work.replPathIteratorInitialized()) {
+        intitiateDataCopyTasks();
 
 Review comment:
   should be initiateDataCopyTasks

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org