Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2021/02/17 06:14:21 UTC

[GitHub] [hive] pkumarsinha commented on a change in pull request #1954: HIVE-24750. Create a single copy task for external tables within default DB location.

pkumarsinha commented on a change in pull request #1954:
URL: https://github.com/apache/hive/pull/1954#discussion_r577052902



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
##########
@@ -679,6 +688,16 @@ private Long incrementalDump(Path dumpRoot, DumpMetaData dmd, Path cmRoot, Hive
               LOG.debug(te.getMessage());
             }
           }
+          // if it is not a table level replication, add a single task for
+          // the database external location.
+          if (isExternalTablePresent && shouldDumpExternalTableLocation()
+              && isSingleTaskForExternalDb) {
+            // Using the lower case of the database name, to keep it

Review comment:
       nit: this can be accommodated in fewer lines
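
       For example, something like this would do (a sketch only):

           // For non-table-level replication, add a single copy task for the database external location.
           if (isExternalTablePresent && shouldDumpExternalTableLocation() && isSingleTaskForExternalDb) {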

##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -600,6 +600,11 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "This is the fully qualified base directory on the target/replica warehouse under which data for "
             + "external tables is stored. This is relative base path and hence prefixed to the source "
             + "external table path on target cluster."),
+    REPL_EXTERNAL_WAREHOUSE_SINGLE_TASK("hive.repl.external.warehouse.single.task",
+        false, "Should create single copy task for all the tables "
+        + "within the external database location, Would require more memory "

Review comment:
       " external database location " -> "database default location for external tables"?

##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -600,6 +600,11 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "This is the fully qualified base directory on the target/replica warehouse under which data for "
             + "external tables is stored. This is relative base path and hence prefixed to the source "
             + "external table path on target cluster."),
+    REPL_EXTERNAL_WAREHOUSE_SINGLE_TASK("hive.repl.external.warehouse.single.task",

Review comment:
       hive.repl.external.warehouse.single.copy.task?
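
       For instance, the renamed entry might look like this (a sketch only;
       the description is shortened here, and folds in the wording suggested
       above):

           REPL_EXTERNAL_WAREHOUSE_SINGLE_COPY_TASK(
               "hive.repl.external.warehouse.single.copy.task", false,
               "Should create a single copy task for all the tables within "
                   + "the database default location for external tables. ..."),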

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
##########
@@ -679,6 +688,16 @@ private Long incrementalDump(Path dumpRoot, DumpMetaData dmd, Path cmRoot, Hive
               LOG.debug(te.getMessage());
             }
           }
+          // if it is not a table level replication, add a single task for
+          // the database external location.

Review comment:
       the database default location for external tables.
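
       For example, the reworded comment:

           // If it is not a table-level replication, add a single task for
           // the database default location for external tables.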

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplExternalTables.java
##########
@@ -149,10 +150,15 @@ void dataLocationDump(Table table, FileList fileList, HiveConf conf)
             "only External tables can be writen via this writer, provided table is " + table
                 .getTableType());
       }
-      Path fullyQualifiedDataLocation =
-          PathBuilder.fullyQualifiedHDFSUri(table.getDataLocation(), FileSystem.get(hiveConf));
-      write(lineFor(table.getTableName(), fullyQualifiedDataLocation, hiveConf));
-      dirLocationToCopy(fileList, fullyQualifiedDataLocation, conf);
+      Path fullyQualifiedDataLocation = PathBuilder
+          .fullyQualifiedHDFSUri(table.getDataLocation(),
+              FileSystem.get(hiveConf));
+      if (isTableLevelReplication || !FileUtils
+          .isPathWithinSubtree(table.getDataLocation(), dbLoc)) {
+        write(lineFor(table.getTableName(), fullyQualifiedDataLocation,
+            hiveConf));
+        dirLocationToCopy(fileList, fullyQualifiedDataLocation, conf);

Review comment:
       Is the table location added for partitioned tables as well?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
##########
@@ -921,6 +940,10 @@ Long bootStrapDump(Path dumpRoot, DumpMetaData dmd, Path cmRoot, Hive hiveDb)
         String uniqueKey = Utils.setDbBootstrapDumpState(hiveDb, dbName);
         Exception caught = null;
         try (Writer writer = new Writer(dbRoot, conf)) {
+          boolean isSingleTaskForExternalDb =

Review comment:
       nit: the naming could be improved to avoid confusion: isSingleCopyTaskForExternalTables?
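
       For example, with the suggested name, mirroring the initialization
       shown in the incremental path (sketch):

           boolean isSingleCopyTaskForExternalTables =
               conf.getBoolVar(REPL_EXTERNAL_WAREHOUSE_SINGLE_TASK)
                   && work.replScope.includeAllTables();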

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java
##########
@@ -654,14 +655,22 @@ private Long incrementalDump(Path dumpRoot, DumpMetaData dmd, Path cmRoot, Hive
         Path dbRootData = new Path(bootstrapRoot, EximUtil.DATA_PATH_NAME + File.separator + dbName);
         boolean dataCopyAtLoad = conf.getBoolVar(HiveConf.ConfVars.REPL_RUN_DATA_COPY_TASKS_ON_TARGET);
         try (Writer writer = new Writer(dumpRoot, conf)) {
+          Path dbPath = null;
+          boolean isSingleTaskForExternalDb =
+              conf.getBoolVar(REPL_EXTERNAL_WAREHOUSE_SINGLE_TASK)
+                  && work.replScope.includeAllTables();
+          boolean isExternalTablePresent = false;
           for (String tableName : Utils.matchesTbl(hiveDb, dbName, work.replScope)) {
             try {
               Table table = hiveDb.getTable(dbName, tableName);
 
               // Dump external table locations if required.
               if (TableType.EXTERNAL_TABLE.equals(table.getTableType())
                       && shouldDumpExternalTableLocation()) {
-                writer.dataLocationDump(table, extTableFileList, conf);
+                dbPath = new Path(hiveDb.getDatabase(dbName).getLocationUri());
+                writer.dataLocationDump(table, extTableFileList, dbPath,

Review comment:
       Isn't the data copy happening using _file_list_external rather than _external_tables_info, which is due for deletion (https://issues.apache.org/jira/browse/HIVE-24718)?

##########
File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationScenariosExternalTables.java
##########
@@ -1375,6 +1375,103 @@ public void differentCatalogIncrementalReplication() throws Throwable {
     primary.run("drop database if exists " + sparkDbName + " cascade");
   }
 
+  @Test
+  public void testDatabaseLevelCopy() throws Throwable {
+    Path externalTableLocation =
+        new Path("/" + testName.getMethodName() + "/" + primaryDbName + "/" + "a/");
+    DistributedFileSystem fs = primary.miniDFSCluster.getFileSystem();
+    fs.mkdirs(externalTableLocation, new FsPermission("777"));
+
+    Path externalTablePartitionLocation =
+        new Path("/" + testName.getMethodName() + "/" + primaryDbName + "/" + "part/");
+    fs.mkdirs(externalTablePartitionLocation, new FsPermission("777"));
+
+    List<String> withClause = Arrays.asList(
+        "'distcp.options.update'='','hive.repl.external.warehouse.single"
+            + ".task'='true'");

Review comment:
       Add a test with this config disabled as well
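
       A rough skeleton for such a test (hypothetical name and minimal
       assertions, reusing the helpers from this class):

           @Test
           public void testDatabaseLevelCopyConfigDisabled() throws Throwable {
             List<String> withClause = Arrays.asList(
                 "'distcp.options.update'='',"
                     + "'hive.repl.external.warehouse.single.task'='false'");
             WarehouseInstance.Tuple tuple = primary.run("use " + primaryDbName)
                 .run("create external table b (id int)")
                 .run("insert into b values(5)")
                 .dump(primaryDbName, withClause);
             // With the config disabled, the table location should be listed
             // individually; there should be no database-level entry.
             assertExternalFileInfo(Arrays.asList("b"), tuple.dumpLocation,
                 primaryDbName, false);
           }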

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplExternalTables.java
##########
@@ -149,10 +150,15 @@ void dataLocationDump(Table table, FileList fileList, HiveConf conf)
             "only External tables can be writen via this writer, provided table is " + table
                 .getTableType());
       }
-      Path fullyQualifiedDataLocation =
-          PathBuilder.fullyQualifiedHDFSUri(table.getDataLocation(), FileSystem.get(hiveConf));
-      write(lineFor(table.getTableName(), fullyQualifiedDataLocation, hiveConf));
-      dirLocationToCopy(fileList, fullyQualifiedDataLocation, conf);
+      Path fullyQualifiedDataLocation = PathBuilder
+          .fullyQualifiedHDFSUri(table.getDataLocation(),
+              FileSystem.get(hiveConf));
+      if (isTableLevelReplication || !FileUtils
+          .isPathWithinSubtree(table.getDataLocation(), dbLoc)) {
+        write(lineFor(table.getTableName(), fullyQualifiedDataLocation,
+            hiveConf));
+        dirLocationToCopy(fileList, fullyQualifiedDataLocation, conf);

Review comment:
       If the intention is to attempt a table-level single copy for partitioned tables as well, then that should also be protected by the config 'REPL_EXTERNAL_WAREHOUSE_SINGLE_TASK'.
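
       i.e., something along these lines (a sketch; assumes the flag is
       threaded through to the writer, with the renaming suggested earlier):

           if (isTableLevelReplication || !isSingleCopyTaskForExternalTables
               || !FileUtils.isPathWithinSubtree(table.getDataLocation(), dbLoc)) {
             write(lineFor(table.getTableName(), fullyQualifiedDataLocation, hiveConf));
             dirLocationToCopy(fileList, fullyQualifiedDataLocation, conf);
           }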

##########
File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationScenariosExternalTables.java
##########
@@ -1375,6 +1375,103 @@ public void differentCatalogIncrementalReplication() throws Throwable {
     primary.run("drop database if exists " + sparkDbName + " cascade");
   }
 
+  @Test
+  public void testDatabaseLevelCopy() throws Throwable {
+    Path externalTableLocation =
+        new Path("/" + testName.getMethodName() + "/" + primaryDbName + "/" + "a/");
+    DistributedFileSystem fs = primary.miniDFSCluster.getFileSystem();
+    fs.mkdirs(externalTableLocation, new FsPermission("777"));
+
+    Path externalTablePartitionLocation =
+        new Path("/" + testName.getMethodName() + "/" + primaryDbName + "/" + "part/");
+    fs.mkdirs(externalTablePartitionLocation, new FsPermission("777"));
+
+    List<String> withClause = Arrays.asList(
+        "'distcp.options.update'='','hive.repl.external.warehouse.single"
+            + ".task'='true'");
+
+    // Create one table within the warehouse location, one outside it, and
+    // one with a partition outside the default location.
+    WarehouseInstance.Tuple tuple =
+        primary.run("use " + primaryDbName)
+            .run("create external table a (i int, j int) "
+                + "row format delimited fields terminated by ',' "
+                + "location '" + externalTableLocation.toUri() + "'")
+            .run("insert into a values(1,2)")
+            .run("create external table b (id int)")
+            .run("insert into b values(5)")
+            .run("create external table c (place string) partitioned by (country "
+                + "string)")
+            .run("insert into table c partition(country='india') values "
+                + "('bangalore')")
+            .run("ALTER TABLE c ADD PARTITION (country='france') LOCATION '"
+                + externalTablePartitionLocation.toString() + "'")
+            .run("insert into c partition(country='france') values('paris')")
+        .dump(primaryDbName, withClause);
+
+    // Do a load and verify all the data is there.
+    replica.load(replicatedDbName, primaryDbName, withClause)
+        .run("use " + replicatedDbName)
+        .run("select i from a where j=2")
+        .verifyResult("1")
+        .run("select id from b")
+        .verifyResult("5")
+        .run("select place from c where country='india'")
+        .verifyResult("bangalore")
+        .run("select place from c where country='france'")
+        .verifyResult("paris");
+
+    // Check the copy tasks created post bootstrap. They should cover the
+    // database location, table 'a', since it is outside the default
+    // location, and table 'c', since its partition is outside the default
+    // location.
+    assertExternalFileInfo(Arrays.asList(primaryDbName.toLowerCase(), "a", "c"),
+        tuple.dumpLocation, primaryDbName, false);
+
+    // Add more data to the tables, do an incremental run, and create two
+    // more tables, one inside and one outside the default location.
+
+    externalTableLocation = new Path(
+        "/" + testName.getMethodName() + "/" + primaryDbName + "/" + "newout/");
+    fs.mkdirs(externalTableLocation, new FsPermission("777"));
+    tuple =
+        primary.run("use " + primaryDbName)
+            .run("insert into a values(3,4)")
+            .run("insert into b values(6)")
+            .run("insert into table c partition(country='india') values "
+                + "('delhi')")
+            .run("insert into c partition(country='france') values('lyon')")
+            .run("create external table newin (id int)")
+            .run("insert into newin values(1)")
+            .run("create external table newout(id int) row format delimited "
+                + "fields terminated by ',' location '" + externalTableLocation
+                .toUri() + "'")
+            .run("insert into newout values(2)")
+            .dump(primaryDbName, withClause);
+
+    // Do an incremental load and check if all the old and new data is there.
+    replica.load(replicatedDbName, primaryDbName, withClause)
+        .run("use " + replicatedDbName)
+        .run("select i from a where j=4")
+        .verifyResult("3")
+        .run("select id from b")
+        .verifyResults(new String[]{"5", "6"})
+        .run("select place from c where country='india'")
+        .verifyResults(new String[]{"bangalore", "delhi"})
+        .run("select place from c where country='france'")
+        .verifyResults(new String[]{"paris", "lyon"})
+        .run("select id from newin")
+        .verifyResult("1")
+        .run("select id from newout")
+        .verifyResult("2");
+
+    // The new table inside the warehouse shouldn't be listed, but the table
+    // created outside should be, in addition to the entries from the
+    // previous run.
+
+    assertExternalFileInfo(

Review comment:
       Validation should be on the _file_list_external file.

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplExternalTables.java
##########
@@ -149,10 +150,15 @@ void dataLocationDump(Table table, FileList fileList, HiveConf conf)
             "only External tables can be writen via this writer, provided table is " + table
                 .getTableType());
       }
-      Path fullyQualifiedDataLocation =
-          PathBuilder.fullyQualifiedHDFSUri(table.getDataLocation(), FileSystem.get(hiveConf));
-      write(lineFor(table.getTableName(), fullyQualifiedDataLocation, hiveConf));
-      dirLocationToCopy(fileList, fullyQualifiedDataLocation, conf);
+      Path fullyQualifiedDataLocation = PathBuilder
+          .fullyQualifiedHDFSUri(table.getDataLocation(),
+              FileSystem.get(hiveConf));
+      if (isTableLevelReplication || !FileUtils

Review comment:
       nit: format the code
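
       For example:

           if (isTableLevelReplication
               || !FileUtils.isPathWithinSubtree(table.getDataLocation(), dbLoc)) {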




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org