Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/12/01 12:20:55 UTC

[GitHub] [iceberg] marton-bod commented on a change in pull request #1840: Hive: Refactor HiveIcebergStorageHandler tests to use catalogs as parameters

marton-bod commented on a change in pull request #1840:
URL: https://github.com/apache/iceberg/pull/1840#discussion_r533361246



##########
File path: mr/src/test/java/org/apache/iceberg/mr/hive/TestTables.java
##########
@@ -83,12 +92,107 @@ public Tables tables() {
   public abstract String locationForCreateTableSQL(TableIdentifier identifier);
 
   /**
-   * If the {@link Catalogs#LOCATION} is needed for {@link Catalogs#loadTable(Configuration, Properties)} then this
-   * method should provide the location string. It should return <code>null</code> if the location is not needed.
-   * @param identifier The table identifier
-   * @return The location string for loadTable operation
+   * If creating the Hive table independently is needed for the given Catalog then this should return the Hive SQL
+   * string which is needed to be executed.
+   * @param identifier The table identifier (the namespace should be "default")
+   * @return The SQL string - which should be executed, null - if it is not needed.
+   */
+  public String createHiveTableSQL(TableIdentifier identifier) {
+    return String.format("CREATE TABLE default.%s STORED BY '%s' %s", identifier.name(),
+        HiveIcebergStorageHandler.class.getName(), locationForCreateTableSQL(identifier));
+  }
+
+  /**
+   * Loads the given table from the actual catalog. Overridden by HadoopTables, since the parameter of the
+   * {@link Tables#load(String)} should be the full path of the table metadata directory
+   * @param identifier The table we want to load
+   * @return The Table loaded from the Catalog
    */
-  public abstract String loadLocation(TableIdentifier identifier);
+  public Table loadTable(TableIdentifier identifier) {
+    return tables.load(identifier.toString());
+  }
+
+  /**
+   * Creates a Hive test table. Creates the Iceberg table/data and creates the corresponding Hive table as well when
+   * needed. The table will be in the 'default' database.
+   * @param shell The HiveShell used for Hive table creation
+   * @param tableName The name of the test table
+   * @param schema The schema used for the table creation
+   * @param fileFormat The file format used for writing the data
+   * @param records The records with which the table is populated
+   * @throws IOException If there is an error writing data
+   */
+  public void createTable(TestHiveShell shell, String tableName, Schema schema, FileFormat fileFormat,
+      List<Record> records) throws IOException {
+    createIcebergTable(shell.getHiveConf(), tableName, schema, fileFormat, records);
+    String createHiveSQL = createHiveTableSQL(TableIdentifier.of("default", tableName));
+    if (createHiveSQL != null) {
+      shell.executeStatement(createHiveSQL);
+    }
+  }
+
+  /**
+   * Creates a Hive test table. Creates the Iceberg table/data and creates the corresponding Hive table as well when

Review comment:
       nit: the doc here is the same as above, even though the behaviour is different because of the extra data generation step
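
For instance, the second overload's javadoc could call out the generation step, something along these lines (wording is only a suggestion, not the final doc):

```java
/**
 * Creates a Hive test table populated with generated data. Generates the records, creates the Iceberg
 * table with them, and creates the corresponding Hive table as well when needed. The table will be in
 * the 'default' database.
 */
```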

##########
File path: mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandler.java
##########
@@ -145,41 +142,47 @@
       ImmutableSet.of("bucketing_version", StatsSetupConst.ROW_COUNT,
           StatsSetupConst.RAW_DATA_SIZE, StatsSetupConst.TOTAL_SIZE, StatsSetupConst.NUM_FILES, "numFilesErasureCoded");
 
-  private static TestHiveShell shell;
-
   private static final List<Type> SUPPORTED_TYPES =
           ImmutableList.of(Types.BooleanType.get(), Types.IntegerType.get(), Types.LongType.get(),
                   Types.FloatType.get(), Types.DoubleType.get(), Types.DateType.get(), Types.TimestampType.withZone(),
                   Types.TimestampType.withoutZone(), Types.StringType.get(), Types.BinaryType.get(),
                   Types.DecimalType.of(3, 1));
 
-  private TestTables testTables;
-
-  public abstract TestTables testTables(Configuration conf, TemporaryFolder tmp) throws IOException;
-
-  @Parameters(name = "fileFormat={0}, engine={1}")
+  @Parameters(name = "fileFormat={0}, engine={1}, catalog={2}")
   public static Collection<Object[]> parameters() {
-    Collection<Object[]> testParams = new ArrayList<>();
-    testParams.add(new Object[] { FileFormat.PARQUET, "mr" });
-    testParams.add(new Object[] { FileFormat.ORC, "mr" });
-    testParams.add(new Object[] { FileFormat.AVRO, "mr" });
-
-    // include Tez tests only for Java 8
     String javaVersion = System.getProperty("java.specification.version");
-    if (javaVersion.equals("1.8")) {
-      testParams.add(new Object[] { FileFormat.PARQUET, "tez" });
-      testParams.add(new Object[] { FileFormat.ORC, "tez" });
-      testParams.add(new Object[] { FileFormat.AVRO, "tez" });
+
+    Collection<Object[]> testParams = new ArrayList<>();
+    for (FileFormat fileFormat : fileFormats) {
+      for (String engine : executionEngines) {
+        // include Tez tests only for Java 8
+        if (javaVersion.equals("1.8")) {

Review comment:
       I think this currently means the test params are only populated for Java 8. The logic should add all file formats and all catalog types first, and then include Tez as an engine only if Java 8 is detected; otherwise only mr.
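
A rough sketch of the suggested ordering, slotting into the existing parameters() method; the `catalogs` list is a hypothetical placeholder for the catalog types, and the variable names simply follow the diff:

```java
@Parameters(name = "fileFormat={0}, engine={1}, catalog={2}")
public static Collection<Object[]> parameters() {
  Collection<Object[]> testParams = new ArrayList<>();

  // Tez is only exercised on Java 8; "mr" runs for every combination regardless of the JVM version
  boolean includeTez = "1.8".equals(System.getProperty("java.specification.version"));

  for (FileFormat fileFormat : fileFormats) {   // e.g. PARQUET, ORC, AVRO
    for (String catalog : catalogs) {           // hypothetical list of catalog types
      testParams.add(new Object[] { fileFormat, "mr", catalog });
      if (includeTez) {
        testParams.add(new Object[] { fileFormat, "tez", catalog });
      }
    }
  }
  return testParams;
}
```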

##########
File path: mr/src/test/java/org/apache/iceberg/mr/hive/TestTables.java
##########
@@ -83,12 +92,107 @@ public Tables tables() {
   public abstract String locationForCreateTableSQL(TableIdentifier identifier);
 
   /**
-   * If the {@link Catalogs#LOCATION} is needed for {@link Catalogs#loadTable(Configuration, Properties)} then this
-   * method should provide the location string. It should return <code>null</code> if the location is not needed.
-   * @param identifier The table identifier
-   * @return The location string for loadTable operation
+   * If creating the Hive table independently is needed for the given Catalog then this should return the Hive SQL
+   * string which is needed to be executed.
+   * @param identifier The table identifier (the namespace should be "default")
+   * @return The SQL string - which should be executed, null - if it is not needed.
+   */
+  public String createHiveTableSQL(TableIdentifier identifier) {
+    return String.format("CREATE TABLE default.%s STORED BY '%s' %s", identifier.name(),
+        HiveIcebergStorageHandler.class.getName(), locationForCreateTableSQL(identifier));
+  }
+
+  /**
+   * Loads the given table from the actual catalog. Overridden by HadoopTables, since the parameter of the
+   * {@link Tables#load(String)} should be the full path of the table metadata directory
+   * @param identifier The table we want to load
+   * @return The Table loaded from the Catalog
    */
-  public abstract String loadLocation(TableIdentifier identifier);
+  public Table loadTable(TableIdentifier identifier) {
+    return tables.load(identifier.toString());
+  }
+
+  /**
+   * Creates a Hive test table. Creates the Iceberg table/data and creates the corresponding Hive table as well when
+   * needed. The table will be in the 'default' database.
+   * @param shell The HiveShell used for Hive table creation
+   * @param tableName The name of the test table
+   * @param schema The schema used for the table creation
+   * @param fileFormat The file format used for writing the data
+   * @param records The records with which the table is populated
+   * @throws IOException If there is an error writing data
+   */
+  public void createTable(TestHiveShell shell, String tableName, Schema schema, FileFormat fileFormat,
+      List<Record> records) throws IOException {
+    createIcebergTable(shell.getHiveConf(), tableName, schema, fileFormat, records);
+    String createHiveSQL = createHiveTableSQL(TableIdentifier.of("default", tableName));
+    if (createHiveSQL != null) {
+      shell.executeStatement(createHiveSQL);
+    }
+  }
+
+  /**
+   * Creates a Hive test table. Creates the Iceberg table/data and creates the corresponding Hive table as well when
+   * needed. The table will be in the 'default' database.
+   * @param shell The HiveShell used for Hive table creation
+   * @param tableName The name of the test table
+   * @param schema The schema used for the table creation
+   * @param fileFormat The file format used for writing the data
+   * @param numRecords The number of records should be generated and stored in the table
+   * @param seed The seed used for the random record generation
+   * @throws IOException If there is an error writing data
+   */
+  public List<Record> createTableWithGeneratedRecords(TestHiveShell shell, String tableName, Schema schema,
+      FileFormat fileFormat, int numRecords, long seed) throws IOException {

Review comment:
       Would it make sense to inline the seed variable, since it's always 0 in the tests? The extra argument also makes the function calls in the tests harder to parse/eyeball.
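
One option, assuming the seeded variant is still worth keeping, is a shorter overload that fixes the seed (a sketch only, using the names from the diff):

```java
// Convenience overload with the seed fixed at 0, which is what every current test passes anyway;
// call sites in the tests then stay short and easy to eyeball.
public List<Record> createTableWithGeneratedRecords(TestHiveShell shell, String tableName, Schema schema,
    FileFormat fileFormat, int numRecords) throws IOException {
  return createTableWithGeneratedRecords(shell, tableName, schema, fileFormat, numRecords, 0L);
}
```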

##########
File path: mr/src/test/java/org/apache/iceberg/mr/hive/TestTables.java
##########
@@ -83,12 +92,107 @@ public Tables tables() {
   public abstract String locationForCreateTableSQL(TableIdentifier identifier);
 
   /**
-   * If the {@link Catalogs#LOCATION} is needed for {@link Catalogs#loadTable(Configuration, Properties)} then this
-   * method should provide the location string. It should return <code>null</code> if the location is not needed.
-   * @param identifier The table identifier
-   * @return The location string for loadTable operation
+   * If creating the Hive table independently is needed for the given Catalog then this should return the Hive SQL
+   * string which is needed to be executed.
+   * @param identifier The table identifier (the namespace should be "default")
+   * @return The SQL string - which should be executed, null - if it is not needed.
+   */
+  public String createHiveTableSQL(TableIdentifier identifier) {
+    return String.format("CREATE TABLE default.%s STORED BY '%s' %s", identifier.name(),

Review comment:
       quick question: shouldn't we allow passing in a custom database name via identifier.namespace()? Is there any reason we have to force "default"?
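
A possible variant that honours the identifier's namespace while keeping "default" as a fallback (just a sketch, not the final code):

```java
// Hypothetical version of createHiveTableSQL that respects identifier.namespace() instead of
// hard-coding the "default" database
public String createHiveTableSQL(TableIdentifier identifier) {
  String database = identifier.hasNamespace() ? identifier.namespace().level(0) : "default";
  return String.format("CREATE TABLE %s.%s STORED BY '%s' %s", database, identifier.name(),
      HiveIcebergStorageHandler.class.getName(), locationForCreateTableSQL(identifier));
}
```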

##########
File path: mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandler.java
##########
@@ -77,10 +73,11 @@
 import static org.junit.runners.Parameterized.Parameters;
 
 @RunWith(Parameterized.class)
-public abstract class HiveIcebergStorageHandlerBaseTest {
+public class TestHiveIcebergStorageHandler {
+  private static final FileFormat[] fileFormats =

Review comment:
       Can we use upper-case naming for these two static final variables, i.e. FILE_FORMATS and EXEC_ENGINES?
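
I.e. something like this, with the values taken from the current parameter list (sketch only):

```java
private static final FileFormat[] FILE_FORMATS =
    new FileFormat[] { FileFormat.PARQUET, FileFormat.ORC, FileFormat.AVRO };
private static final String[] EXEC_ENGINES = new String[] { "mr", "tez" };
```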

##########
File path: mr/src/test/java/org/apache/iceberg/mr/hive/TestTables.java
##########
@@ -83,12 +92,107 @@ public Tables tables() {
   public abstract String locationForCreateTableSQL(TableIdentifier identifier);
 
   /**
-   * If the {@link Catalogs#LOCATION} is needed for {@link Catalogs#loadTable(Configuration, Properties)} then this
-   * method should provide the location string. It should return <code>null</code> if the location is not needed.
-   * @param identifier The table identifier
-   * @return The location string for loadTable operation
+   * If creating the Hive table independently is needed for the given Catalog then this should return the Hive SQL
+   * string which is needed to be executed.
+   * @param identifier The table identifier (the namespace should be "default")
+   * @return The SQL string - which should be executed, null - if it is not needed.
+   */
+  public String createHiveTableSQL(TableIdentifier identifier) {
+    return String.format("CREATE TABLE default.%s STORED BY '%s' %s", identifier.name(),
+        HiveIcebergStorageHandler.class.getName(), locationForCreateTableSQL(identifier));
+  }
+
+  /**
+   * Loads the given table from the actual catalog. Overridden by HadoopTables, since the parameter of the
+   * {@link Tables#load(String)} should be the full path of the table metadata directory
+   * @param identifier The table we want to load
+   * @return The Table loaded from the Catalog
    */
-  public abstract String loadLocation(TableIdentifier identifier);
+  public Table loadTable(TableIdentifier identifier) {
+    return tables.load(identifier.toString());
+  }
+
+  /**
+   * Creates a Hive test table. Creates the Iceberg table/data and creates the corresponding Hive table as well when
+   * needed. The table will be in the 'default' database.
+   * @param shell The HiveShell used for Hive table creation
+   * @param tableName The name of the test table
+   * @param schema The schema used for the table creation
+   * @param fileFormat The file format used for writing the data
+   * @param records The records with which the table is populated
+   * @throws IOException If there is an error writing data
+   */
+  public void createTable(TestHiveShell shell, String tableName, Schema schema, FileFormat fileFormat,
+      List<Record> records) throws IOException {
+    createIcebergTable(shell.getHiveConf(), tableName, schema, fileFormat, records);
+    String createHiveSQL = createHiveTableSQL(TableIdentifier.of("default", tableName));
+    if (createHiveSQL != null) {

Review comment:
       I think this string will never be null
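
For context, the guard would only matter if some TestTables implementation opts out of the separate Hive DDL, e.g. a hypothetical override like:

```java
// Hypothetical override for a catalog where the Hive table already exists once the Iceberg table is
// created, so no extra CREATE TABLE statement needs to be executed
@Override
public String createHiveTableSQL(TableIdentifier identifier) {
  return null;
}
```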

##########
File path: mr/src/test/java/org/apache/iceberg/mr/hive/TestTables.java
##########
@@ -83,12 +92,107 @@ public Tables tables() {
   public abstract String locationForCreateTableSQL(TableIdentifier identifier);
 
   /**
-   * If the {@link Catalogs#LOCATION} is needed for {@link Catalogs#loadTable(Configuration, Properties)} then this
-   * method should provide the location string. It should return <code>null</code> if the location is not needed.
-   * @param identifier The table identifier
-   * @return The location string for loadTable operation
+   * If creating the Hive table independently is needed for the given Catalog then this should return the Hive SQL
+   * string which is needed to be executed.
+   * @param identifier The table identifier (the namespace should be "default")
+   * @return The SQL string - which should be executed, null - if it is not needed.

Review comment:
       It doesn't seem like this will return null in any scenario - when should it?
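
If no implementation ever returns null, the contract and the call site could be tightened, roughly like this (a sketch, not the final code):

```java
// If createHiveTableSQL() is guaranteed to return a statement, the call site in createTable() can
// drop the null guard entirely:
shell.executeStatement(createHiveTableSQL(TableIdentifier.of("default", tableName)));
```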




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org