You are viewing a plain text version of this content. The canonical link for it is here.
Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2022/05/25 16:12:24 UTC

[GitHub] [hive] kasakrisz opened a new pull request, #3324: HIVE-26264: Iceberg integration: Fetch virtual columns on demand

kasakrisz opened a new pull request, #3324:
URL: https://github.com/apache/hive/pull/3324

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://cwiki.apache.org/confluence/display/Hive/HowToContribute
     2. Ensure that you have created an issue on the Hive project JIRA: https://issues.apache.org/jira/projects/HIVE/summary
     3. Ensure you have added or run the appropriate tests for your PR: 
     4. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP]HIVE-XXXXX:  Your PR title ...'.
     5. Be sure to keep the PR description updated to reflect all changes.
     6. Please write your PR title to summarize what this PR proposes.
     7. If possible, provide a concise example to reproduce the issue for a faster review.
   
   -->
   
   ### What changes were proposed in this pull request?
   * Add new boolean configuration to jobconf: `iceberg.mr.fetch.virtual.columns` and `hive.io.file.fetch.virtual.columns`
   * Populate the config value by TablesScanDesc related to TS operators in the plan
   * In case of iceberg read operation initialize the expected schema by the table's columns defined and add the virtual columns if the setting `iceberg.mr.fetch.virtual.columns` is true
   * Add new config setting to jobconf which is passed to serde init from FS operator: `file.sink.write.operation.`
   * In case of iceberg serde initialize the schema based on `file.sink.write.operation.`
   
   ### Why are the changes needed?
   An execution plan can have multiple TS and FS operators and some of them need virtual columns others not.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   ```
   mvn test -Dtest.output.overwrite -DskipSparkTests -Dtest=TestIcebergCliDriver -Dqfile=query_iceberg_virtualcol.q -pl itests/qtest-iceberg -Pitests -Piceberg
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kasakrisz commented on a diff in pull request #3324: HIVE-26264: Iceberg integration: Fetch virtual columns on demand

Posted by GitBox <gi...@apache.org>.
kasakrisz commented on code in PR #3324:
URL: https://github.com/apache/hive/pull/3324#discussion_r886842968


##########
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java:
##########
@@ -549,4 +534,43 @@ private static Schema schemaWithoutConstantsAndMeta(Schema readSchema, Map<Integ
     }
   }
 
+  public static class VirtualColumnAwareIterator<T> implements CloseableIterator<T> {

Review Comment:
   Moved and replaced `4` to `FILE_READ_META_COLS.size()`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kasakrisz commented on a diff in pull request #3324: HIVE-26264: Iceberg integration: Fetch virtual columns on demand

Posted by GitBox <gi...@apache.org>.
kasakrisz commented on code in PR #3324:
URL: https://github.com/apache/hive/pull/3324#discussion_r886540554


##########
ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:
##########
@@ -15005,6 +15004,12 @@ private AcidUtils.Operation getAcidType(String destination) {
             AcidUtils.Operation.INSERT);
   }
 
+  private Context.Operation getWriteOperation(String destination) {

Review Comment:
   The value in `destination` has a changing part and it is can not be mapped easily to an enum constant



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kasakrisz commented on a diff in pull request #3324: HIVE-26264: Iceberg integration: Fetch virtual columns on demand

Posted by GitBox <gi...@apache.org>.
kasakrisz commented on code in PR #3324:
URL: https://github.com/apache/hive/pull/3324#discussion_r886538949


##########
ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:
##########
@@ -11433,6 +11435,7 @@ private Operator genTablePlan(String alias, QB qb) throws SemanticException {
       // Determine row schema for TSOP.
       // Include column names from SerDe, the partition and virtual columns.
       rwsch = new RowResolver();
+

Review Comment:
   reverted



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kasakrisz merged pull request #3324: HIVE-26264: Iceberg integration: Fetch virtual columns on demand

Posted by GitBox <gi...@apache.org>.
kasakrisz merged PR #3324:
URL: https://github.com/apache/hive/pull/3324


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a diff in pull request #3324: HIVE-26264: Iceberg integration: Fetch virtual columns on demand

Posted by GitBox <gi...@apache.org>.
pvary commented on code in PR #3324:
URL: https://github.com/apache/hive/pull/3324#discussion_r886632668


##########
ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:
##########
@@ -15005,6 +15004,12 @@ private AcidUtils.Operation getAcidType(String destination) {
             AcidUtils.Operation.INSERT);
   }
 
+  private Context.Operation getWriteOperation(String destination) {

Review Comment:
   Is this reading the operation set by `HiveCustomStorageHandlerUtils.setWrite(hconf, tableName)`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a diff in pull request #3324: HIVE-26264: Iceberg integration: Fetch virtual columns on demand

Posted by GitBox <gi...@apache.org>.
pvary commented on code in PR #3324:
URL: https://github.com/apache/hive/pull/3324#discussion_r886591933


##########
ql/src/java/org/apache/hadoop/hive/ql/exec/FetchTask.java:
##########
@@ -78,8 +78,8 @@ public void initialize(QueryState queryState, QueryPlan queryPlan, TaskQueue tas
       if (source instanceof TableScanOperator) {
         TableScanOperator ts = (TableScanOperator) source;
         // push down projections
-        ColumnProjectionUtils.appendReadColumns(
-            job, ts.getNeededColumnIDs(), ts.getNeededColumns(), ts.getNeededNestedColumnPaths());
+        ColumnProjectionUtils.appendReadColumns(job, ts.getNeededColumnIDs(), ts.getNeededColumns(),
+                ts.getNeededNestedColumnPaths(), ts.conf.hasVirtualCols());

Review Comment:
   nit: Shall we expose `hasVirtualCols` on TSOperator instead of exposing and using the `conf`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kasakrisz commented on a diff in pull request #3324: HIVE-26264: Iceberg integration: Fetch virtual columns on demand

Posted by GitBox <gi...@apache.org>.
kasakrisz commented on code in PR #3324:
URL: https://github.com/apache/hive/pull/3324#discussion_r886536461


##########
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java:
##########
@@ -259,22 +258,27 @@ public void initialize(InputSplit split, TaskAttemptContext newContext) {
       this.inMemoryDataModel = conf.getEnum(InputFormatConfig.IN_MEMORY_DATA_MODEL,
               InputFormatConfig.InMemoryDataModel.GENERIC);
       this.currentIterator = open(tasks.next(), expectedSchema).iterator();
-      Operation operation = HiveIcebergStorageHandler.operation(conf, conf.get(Catalogs.NAME));
-      this.updateOrDelete = Operation.DELETE.equals(operation) || Operation.UPDATE.equals(operation);
+      this.fetchVirtualColumns = InputFormatConfig.fetchVirtualColumns(conf);
     }
 
     @Override
     public boolean nextKeyValue() throws IOException {
       while (true) {
         if (currentIterator.hasNext()) {
           current = currentIterator.next();
-          if (updateOrDelete) {
+          if (fetchVirtualColumns) {
             GenericRecord rec = (GenericRecord) current;
             PositionDeleteInfo.setIntoConf(conf,
                 IcebergAcidUtil.parseSpecId(rec),
                 IcebergAcidUtil.computePartitionHash(rec),
                 IcebergAcidUtil.parseFilePath(rec),
                 IcebergAcidUtil.parseFilePosition(rec));
+            GenericRecord tmp = GenericRecord.create(
+                    new Schema(expectedSchema.columns().subList(4, expectedSchema.columns().size())));
+            for (int i = 4; i < expectedSchema.columns().size(); ++i) {

Review Comment:
   Moved



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a diff in pull request #3324: HIVE-26264: Iceberg integration: Fetch virtual columns on demand

Posted by GitBox <gi...@apache.org>.
pvary commented on code in PR #3324:
URL: https://github.com/apache/hive/pull/3324#discussion_r881897659


##########
ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java:
##########
@@ -739,6 +742,11 @@ protected void initializeOp(Configuration hconf) throws HiveException {
     }
   }
 
+  private void setWriteOperation(Configuration conf) {

Review Comment:
   This is the same as we write out/read back stuff in the `HiveIcebergStorageHandler.operation`.
   If it is not removed from there we should create commonly used methods for it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a diff in pull request #3324: HIVE-26264: Iceberg integration: Fetch virtual columns on demand

Posted by GitBox <gi...@apache.org>.
pvary commented on code in PR #3324:
URL: https://github.com/apache/hive/pull/3324#discussion_r886584030


##########
ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java:
##########
@@ -739,6 +742,11 @@ protected void initializeOp(Configuration hconf) throws HiveException {
     }
   }
 
+  private void setWriteOperation(Configuration conf) {

Review Comment:
   Would it make sense to keep the read/set part in the same class?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kasakrisz commented on a diff in pull request #3324: HIVE-26264: Iceberg integration: Fetch virtual columns on demand

Posted by GitBox <gi...@apache.org>.
kasakrisz commented on code in PR #3324:
URL: https://github.com/apache/hive/pull/3324#discussion_r886711886


##########
ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java:
##########
@@ -739,6 +742,11 @@ protected void initializeOp(Configuration hconf) throws HiveException {
     }
   }
 
+  private void setWriteOperation(Configuration conf) {

Review Comment:
   Moved both get/set to `HiveCustomStorageHandlerUtils`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kasakrisz commented on a diff in pull request #3324: HIVE-26264: Iceberg integration: Fetch virtual columns on demand

Posted by GitBox <gi...@apache.org>.
kasakrisz commented on code in PR #3324:
URL: https://github.com/apache/hive/pull/3324#discussion_r886714583


##########
ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:
##########
@@ -15005,6 +15004,12 @@ private AcidUtils.Operation getAcidType(String destination) {
             AcidUtils.Operation.INSERT);
   }
 
+  private Context.Operation getWriteOperation(String destination) {

Review Comment:
   No, these `destinations` are coming from the QueryParserInfo objects getQB().getParseInfo().getClauseNames().iterator().next();
   and set in the UpdateDeleteSA like
   ```
         rewrittenCtx.setOperation(Context.Operation.DELETE);
         rewrittenCtx.addDestNamePrefix(1, Context.DestClausePrefix.DELETE);
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a diff in pull request #3324: HIVE-26264: Iceberg integration: Fetch virtual columns on demand

Posted by GitBox <gi...@apache.org>.
pvary commented on code in PR #3324:
URL: https://github.com/apache/hive/pull/3324#discussion_r886783428


##########
ql/src/java/org/apache/hadoop/hive/ql/exec/FetchTask.java:
##########
@@ -78,8 +78,8 @@ public void initialize(QueryState queryState, QueryPlan queryPlan, TaskQueue tas
       if (source instanceof TableScanOperator) {
         TableScanOperator ts = (TableScanOperator) source;
         // push down projections
-        ColumnProjectionUtils.appendReadColumns(
-            job, ts.getNeededColumnIDs(), ts.getNeededColumns(), ts.getNeededNestedColumnPaths());
+        ColumnProjectionUtils.appendReadColumns(job, ts.getNeededColumnIDs(), ts.getNeededColumns(),
+                ts.getNeededNestedColumnPaths(), ts.conf.hasVirtualCols());

Review Comment:
   Do we usually expose the config object of the operators for tasks?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kasakrisz commented on a diff in pull request #3324: HIVE-26264: Iceberg integration: Fetch virtual columns on demand

Posted by GitBox <gi...@apache.org>.
kasakrisz commented on code in PR #3324:
URL: https://github.com/apache/hive/pull/3324#discussion_r886717199


##########
ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java:
##########
@@ -616,6 +617,8 @@ protected void initializeOp(Configuration hconf) throws HiveException {
       initializeSpecPath();
       fs = specPath.getFileSystem(hconf);
 
+      hconf.set(WRITE_OPERATION_CONFIG_PREFIX + getConf().getTableInfo().getTableName(),

Review Comment:
   Moved both get/set write operation to HiveCustomStorageHandlerUtils.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a diff in pull request #3324: HIVE-26264: Iceberg integration: Fetch virtual columns on demand

Posted by GitBox <gi...@apache.org>.
pvary commented on code in PR #3324:
URL: https://github.com/apache/hive/pull/3324#discussion_r881892934


##########
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java:
##########
@@ -259,22 +258,27 @@ public void initialize(InputSplit split, TaskAttemptContext newContext) {
       this.inMemoryDataModel = conf.getEnum(InputFormatConfig.IN_MEMORY_DATA_MODEL,
               InputFormatConfig.InMemoryDataModel.GENERIC);
       this.currentIterator = open(tasks.next(), expectedSchema).iterator();
-      Operation operation = HiveIcebergStorageHandler.operation(conf, conf.get(Catalogs.NAME));
-      this.updateOrDelete = Operation.DELETE.equals(operation) || Operation.UPDATE.equals(operation);
+      this.fetchVirtualColumns = InputFormatConfig.fetchVirtualColumns(conf);
     }
 
     @Override
     public boolean nextKeyValue() throws IOException {
       while (true) {
         if (currentIterator.hasNext()) {
           current = currentIterator.next();
-          if (updateOrDelete) {
+          if (fetchVirtualColumns) {
             GenericRecord rec = (GenericRecord) current;
             PositionDeleteInfo.setIntoConf(conf,
                 IcebergAcidUtil.parseSpecId(rec),
                 IcebergAcidUtil.computePartitionHash(rec),
                 IcebergAcidUtil.parseFilePath(rec),
                 IcebergAcidUtil.parseFilePosition(rec));
+            GenericRecord tmp = GenericRecord.create(

Review Comment:
   We should not create the `tmp` record for every value. We should reuse a previously created record



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kasakrisz commented on a diff in pull request #3324: HIVE-26264: Iceberg integration: Fetch virtual columns on demand

Posted by GitBox <gi...@apache.org>.
kasakrisz commented on code in PR #3324:
URL: https://github.com/apache/hive/pull/3324#discussion_r886537498


##########
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java:
##########
@@ -259,22 +258,27 @@ public void initialize(InputSplit split, TaskAttemptContext newContext) {
       this.inMemoryDataModel = conf.getEnum(InputFormatConfig.IN_MEMORY_DATA_MODEL,
               InputFormatConfig.InMemoryDataModel.GENERIC);
       this.currentIterator = open(tasks.next(), expectedSchema).iterator();
-      Operation operation = HiveIcebergStorageHandler.operation(conf, conf.get(Catalogs.NAME));
-      this.updateOrDelete = Operation.DELETE.equals(operation) || Operation.UPDATE.equals(operation);
+      this.fetchVirtualColumns = InputFormatConfig.fetchVirtualColumns(conf);
     }
 
     @Override
     public boolean nextKeyValue() throws IOException {
       while (true) {
         if (currentIterator.hasNext()) {
           current = currentIterator.next();
-          if (updateOrDelete) {
+          if (fetchVirtualColumns) {
             GenericRecord rec = (GenericRecord) current;
             PositionDeleteInfo.setIntoConf(conf,
                 IcebergAcidUtil.parseSpecId(rec),
                 IcebergAcidUtil.computePartitionHash(rec),
                 IcebergAcidUtil.parseFilePath(rec),
                 IcebergAcidUtil.parseFilePosition(rec));
+            GenericRecord tmp = GenericRecord.create(

Review Comment:
   Created a separate class to handle and wrap this. `VirtualColumnAwareIterator`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a diff in pull request #3324: HIVE-26264: Iceberg integration: Fetch virtual columns on demand

Posted by GitBox <gi...@apache.org>.
pvary commented on code in PR #3324:
URL: https://github.com/apache/hive/pull/3324#discussion_r886590600


##########
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java:
##########
@@ -549,4 +534,43 @@ private static Schema schemaWithoutConstantsAndMeta(Schema readSchema, Map<Integ
     }
   }
 
+  public static class VirtualColumnAwareIterator<T> implements CloseableIterator<T> {

Review Comment:
   nit: maybe move this to the IcebergAcidUtil, so we do not have to use magic numbers, like `4`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a diff in pull request #3324: HIVE-26264: Iceberg integration: Fetch virtual columns on demand

Posted by GitBox <gi...@apache.org>.
pvary commented on code in PR #3324:
URL: https://github.com/apache/hive/pull/3324#discussion_r881898206


##########
ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java:
##########
@@ -932,7 +940,9 @@ protected void createBucketForFileIdx(FSPaths fsp, int filesIdx)
             && !FileUtils.mkdir(fs, outPath.getParent(), hconf)) {
           LOG.warn("Unable to create directory with inheritPerms: " + outPath);
         }
-        fsp.outWriters[filesIdx] = HiveFileFormatUtils.getHiveRecordWriter(jc, conf.getTableInfo(),
+        JobConf jobConf = new JobConf(jc);

Review Comment:
   Why do we need to copy the jobConf?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a diff in pull request #3324: HIVE-26264: Iceberg integration: Fetch virtual columns on demand

Posted by GitBox <gi...@apache.org>.
pvary commented on code in PR #3324:
URL: https://github.com/apache/hive/pull/3324#discussion_r881900188


##########
ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:
##########
@@ -11433,6 +11435,7 @@ private Operator genTablePlan(String alias, QB qb) throws SemanticException {
       // Determine row schema for TSOP.
       // Include column names from SerDe, the partition and virtual columns.
       rwsch = new RowResolver();
+

Review Comment:
   nit: maybe just a mistake?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a diff in pull request #3324: HIVE-26264: Iceberg integration: Fetch virtual columns on demand

Posted by GitBox <gi...@apache.org>.
pvary commented on code in PR #3324:
URL: https://github.com/apache/hive/pull/3324#discussion_r881899259


##########
ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java:
##########
@@ -932,7 +940,9 @@ protected void createBucketForFileIdx(FSPaths fsp, int filesIdx)
             && !FileUtils.mkdir(fs, outPath.getParent(), hconf)) {
           LOG.warn("Unable to create directory with inheritPerms: " + outPath);
         }
-        fsp.outWriters[filesIdx] = HiveFileFormatUtils.getHiveRecordWriter(jc, conf.getTableInfo(),
+        JobConf jobConf = new JobConf(jc);
+        setWriteOperation(jobConf);
+        fsp.outWriters[filesIdx] = HiveFileFormatUtils.getHiveRecordWriter(jobConf, conf.getTableInfo(),

Review Comment:
   Why not just add the operation as a parameter to the `getHiveRecordWriter` method? Passing parameters in a conf seems like a bad practice, if not necessary.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a diff in pull request #3324: HIVE-26264: Iceberg integration: Fetch virtual columns on demand

Posted by GitBox <gi...@apache.org>.
pvary commented on code in PR #3324:
URL: https://github.com/apache/hive/pull/3324#discussion_r886787413


##########
ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java:
##########
@@ -739,6 +742,11 @@ protected void initializeOp(Configuration hconf) throws HiveException {
     }
   }
 
+  private void setWriteOperation(Configuration conf) {

Review Comment:
   Thx



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kasakrisz commented on a diff in pull request #3324: HIVE-26264: Iceberg integration: Fetch virtual columns on demand

Posted by GitBox <gi...@apache.org>.
kasakrisz commented on code in PR #3324:
URL: https://github.com/apache/hive/pull/3324#discussion_r886542319


##########
ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java:
##########
@@ -932,7 +940,9 @@ protected void createBucketForFileIdx(FSPaths fsp, int filesIdx)
             && !FileUtils.mkdir(fs, outPath.getParent(), hconf)) {
           LOG.warn("Unable to create directory with inheritPerms: " + outPath);
         }
-        fsp.outWriters[filesIdx] = HiveFileFormatUtils.getHiveRecordWriter(jc, conf.getTableInfo(),
+        JobConf jobConf = new JobConf(jc);
+        setWriteOperation(jobConf);
+        fsp.outWriters[filesIdx] = HiveFileFormatUtils.getHiveRecordWriter(jobConf, conf.getTableInfo(),

Review Comment:
   Changing the method signature would alters all file formats Hive supports



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kasakrisz commented on a diff in pull request #3324: HIVE-26264: Iceberg integration: Fetch virtual columns on demand

Posted by GitBox <gi...@apache.org>.
kasakrisz commented on code in PR #3324:
URL: https://github.com/apache/hive/pull/3324#discussion_r886716379


##########
ql/src/java/org/apache/hadoop/hive/ql/exec/FetchTask.java:
##########
@@ -78,8 +78,8 @@ public void initialize(QueryState queryState, QueryPlan queryPlan, TaskQueue tas
       if (source instanceof TableScanOperator) {
         TableScanOperator ts = (TableScanOperator) source;
         // push down projections
-        ColumnProjectionUtils.appendReadColumns(
-            job, ts.getNeededColumnIDs(), ts.getNeededColumns(), ts.getNeededNestedColumnPaths());
+        ColumnProjectionUtils.appendReadColumns(job, ts.getNeededColumnIDs(), ts.getNeededColumns(),
+                ts.getNeededNestedColumnPaths(), ts.conf.hasVirtualCols());

Review Comment:
   I don't see the benefit of exposing this on TSOperator. Maybe the call here would be shorter.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a diff in pull request #3324: HIVE-26264: Iceberg integration: Fetch virtual columns on demand

Posted by GitBox <gi...@apache.org>.
pvary commented on code in PR #3324:
URL: https://github.com/apache/hive/pull/3324#discussion_r881893733


##########
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java:
##########
@@ -259,22 +258,27 @@ public void initialize(InputSplit split, TaskAttemptContext newContext) {
       this.inMemoryDataModel = conf.getEnum(InputFormatConfig.IN_MEMORY_DATA_MODEL,
               InputFormatConfig.InMemoryDataModel.GENERIC);
       this.currentIterator = open(tasks.next(), expectedSchema).iterator();
-      Operation operation = HiveIcebergStorageHandler.operation(conf, conf.get(Catalogs.NAME));
-      this.updateOrDelete = Operation.DELETE.equals(operation) || Operation.UPDATE.equals(operation);
+      this.fetchVirtualColumns = InputFormatConfig.fetchVirtualColumns(conf);
     }
 
     @Override
     public boolean nextKeyValue() throws IOException {
       while (true) {
         if (currentIterator.hasNext()) {
           current = currentIterator.next();
-          if (updateOrDelete) {
+          if (fetchVirtualColumns) {
             GenericRecord rec = (GenericRecord) current;
             PositionDeleteInfo.setIntoConf(conf,
                 IcebergAcidUtil.parseSpecId(rec),
                 IcebergAcidUtil.computePartitionHash(rec),
                 IcebergAcidUtil.parseFilePath(rec),
                 IcebergAcidUtil.parseFilePosition(rec));
+            GenericRecord tmp = GenericRecord.create(
+                    new Schema(expectedSchema.columns().subList(4, expectedSchema.columns().size())));
+            for (int i = 4; i < expectedSchema.columns().size(); ++i) {

Review Comment:
   For this kind of conversion we usually create a method in `IcebergAcidUtil`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a diff in pull request #3324: HIVE-26264: Iceberg integration: Fetch virtual columns on demand

Posted by GitBox <gi...@apache.org>.
pvary commented on code in PR #3324:
URL: https://github.com/apache/hive/pull/3324#discussion_r886633690


##########
ql/src/java/org/apache/hadoop/hive/ql/security/authorization/HiveCustomStorageHandlerUtils.java:
##########
@@ -48,4 +54,13 @@ public static Map<String, String> getTableProperties(Table table) {
             .ifPresent(tblProps::putAll);
         return tblProps;
     }
+
+    public static Context.Operation operation(Configuration conf, String tableName) {

Review Comment:
   So maybe another method like:
   `HiveCustomStorageHandlerUtils.setOperstion(hconf, tableName, operation)`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kasakrisz commented on a diff in pull request #3324: HIVE-26264: Iceberg integration: Fetch virtual columns on demand

Posted by GitBox <gi...@apache.org>.
kasakrisz commented on code in PR #3324:
URL: https://github.com/apache/hive/pull/3324#discussion_r886570597


##########
ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java:
##########
@@ -739,6 +742,11 @@ protected void initializeOp(Configuration hconf) throws HiveException {
     }
   }
 
+  private void setWriteOperation(Configuration conf) {

Review Comment:
   Moved the read part to InputFormatConfig.java, but it is still set in FileSinkOperator.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a diff in pull request #3324: HIVE-26264: Iceberg integration: Fetch virtual columns on demand

Posted by GitBox <gi...@apache.org>.
pvary commented on code in PR #3324:
URL: https://github.com/apache/hive/pull/3324#discussion_r881902028


##########
ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:
##########
@@ -15005,6 +15004,12 @@ private AcidUtils.Operation getAcidType(String destination) {
             AcidUtils.Operation.INSERT);
   }
 
+  private Context.Operation getWriteOperation(String destination) {

Review Comment:
   Maybe a public method where we set the operation? Also for the conversion we could use enum



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a diff in pull request #3324: HIVE-26264: Iceberg integration: Fetch virtual columns on demand

Posted by GitBox <gi...@apache.org>.
pvary commented on code in PR #3324:
URL: https://github.com/apache/hive/pull/3324#discussion_r886629784


##########
ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java:
##########
@@ -616,6 +617,8 @@ protected void initializeOp(Configuration hconf) throws HiveException {
       initializeSpecPath();
       fs = specPath.getFileSystem(hconf);
 
+      hconf.set(WRITE_OPERATION_CONFIG_PREFIX + getConf().getTableInfo().getTableName(),

Review Comment:
   Could we just do this like:
   `HiveCustomStorageHandlerUtils.setWrite(hconf, tableName)`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kasakrisz commented on a diff in pull request #3324: HIVE-26264: Iceberg integration: Fetch virtual columns on demand

Posted by GitBox <gi...@apache.org>.
kasakrisz commented on code in PR #3324:
URL: https://github.com/apache/hive/pull/3324#discussion_r886717900


##########
ql/src/java/org/apache/hadoop/hive/ql/security/authorization/HiveCustomStorageHandlerUtils.java:
##########
@@ -48,4 +54,13 @@ public static Map<String, String> getTableProperties(Table table) {
             .ifPresent(tblProps::putAll);
         return tblProps;
     }
+
+    public static Context.Operation operation(Configuration conf, String tableName) {

Review Comment:
   Yes, this is what I did in my last commit. Only the method name is different :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a diff in pull request #3324: HIVE-26264: Iceberg integration: Fetch virtual columns on demand

Posted by GitBox <gi...@apache.org>.
pvary commented on code in PR #3324:
URL: https://github.com/apache/hive/pull/3324#discussion_r886590600


##########
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java:
##########
@@ -549,4 +534,43 @@ private static Schema schemaWithoutConstantsAndMeta(Schema readSchema, Map<Integ
     }
   }
 
+  public static class VirtualColumnAwareIterator<T> implements CloseableIterator<T> {

Review Comment:
   nit: maybe move this class to the IcebergAcidUtil, so we do not have to use magic numbers, like `4`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a diff in pull request #3324: HIVE-26264: Iceberg integration: Fetch virtual columns on demand

Posted by GitBox <gi...@apache.org>.
pvary commented on code in PR #3324:
URL: https://github.com/apache/hive/pull/3324#discussion_r886789104


##########
ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:
##########
@@ -15005,6 +15004,12 @@ private AcidUtils.Operation getAcidType(String destination) {
             AcidUtils.Operation.INSERT);
   }
 
+  private Context.Operation getWriteOperation(String destination) {

Review Comment:
   Thanks for the explanation



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kasakrisz commented on a diff in pull request #3324: HIVE-26264: Iceberg integration: Fetch virtual columns on demand

Posted by GitBox <gi...@apache.org>.
kasakrisz commented on code in PR #3324:
URL: https://github.com/apache/hive/pull/3324#discussion_r886809647


##########
ql/src/java/org/apache/hadoop/hive/ql/exec/FetchTask.java:
##########
@@ -78,8 +78,8 @@ public void initialize(QueryState queryState, QueryPlan queryPlan, TaskQueue tas
       if (source instanceof TableScanOperator) {
         TableScanOperator ts = (TableScanOperator) source;
         // push down projections
-        ColumnProjectionUtils.appendReadColumns(
-            job, ts.getNeededColumnIDs(), ts.getNeededColumns(), ts.getNeededNestedColumnPaths());
+        ColumnProjectionUtils.appendReadColumns(job, ts.getNeededColumnIDs(), ts.getNeededColumns(),
+                ts.getNeededNestedColumnPaths(), ts.conf.hasVirtualCols());

Review Comment:
   Unfortunately it is not consistent when we are expose or not. Example.:
   https://github.com/apache/hive/blob/6626b5564ee206db5a656d2f611ed71f10a0ffc1/ql/src/java/org/apache/hadoop/hive/ql/exec/FetchTask.java#L86
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on pull request #3324: HIVE-26264: Iceberg integration: Fetch virtual columns on demand

Posted by GitBox <gi...@apache.org>.
pvary commented on PR #3324:
URL: https://github.com/apache/hive/pull/3324#issuecomment-1137555779

   Overall, I like this change very much!
   Much more readable and clean!
   Left some comments.
   
   Thanks,
   Peter


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org