Posted to commits@hudi.apache.org by "huangxiaopingRD (via GitHub)" <gi...@apache.org> on 2023/02/24 04:58:15 UTC

[GitHub] [hudi] huangxiaopingRD opened a new issue, #8036: [SUPPORT] Migration of hudi tables encountered issue related to metadata column

huangxiaopingRD opened a new issue, #8036:
URL: https://github.com/apache/hudi/issues/8036

   **Describe the problem you faced**
   
   We have a workflow of the form `hive table (upstream) -> hive table (downstream)`, and we want to change it to `hudi table (upstream) -> hive table (downstream)`. However, there is a problem: the downstream job may use SQL such as `insert into hive_table select * from hudi_table`. In that case, the number of columns read no longer matches the number of columns in the target table, because expanding the star (`*`) also returns Hudi's metadata columns.
   
   Our initial solution is to add a rule to Spark: when processing the execution plan, if a column is a Hudi metadata column introduced by star expansion, the rule removes it and returns the plan without the metadata columns.
   
   I wonder if the hudi community has a better solution for such a case. 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] huangxiaopingRD commented on issue #8036: [SUPPORT] Migration of hudi tables encountered issue related to metadata column

Posted by "huangxiaopingRD (via GitHub)" <gi...@apache.org>.

huangxiaopingRD commented on issue #8036:
URL: https://github.com/apache/hudi/issues/8036#issuecomment-1455335723

   > Good question.
   > 
   > Depending on which SQL tool you use, you can look into how to select all columns except a few. You can then explicitly exclude the hoodie meta columns in your insert into statement.
   > 
   > For example, in Spark SQL you can do the following:
   > 
   > spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true")
   > 
   > #select all columns except a,b
   > sql("select `(a|b)?+.+` from tmp").show()
   > #+---+---+
   > #| id|  c|
   > #+---+---+
   > #|  1|  4|
   > #+---+---+
   > 
   > Ref: https://stackoverflow.com/questions/63127263/how-to-select-all-columns-except-2-of-them-from-a-large-table-on-pyspark-sql
   > 
   > Hive: https://stackoverflow.com/questions/51227890/hive-how-to-select-all-but-one-column
   
   Thanks @nsivabalan, this is a better way. However, we want the fix to be transparent to users, so we finally decided to inject a Spark rule that handles this case at the system level.



[GitHub] [hudi] huangxiaopingRD closed issue #8036: [SUPPORT] Migration of hudi tables encountered issue related to metadata column

Posted by "huangxiaopingRD (via GitHub)" <gi...@apache.org>.

huangxiaopingRD closed issue #8036: [SUPPORT] Migration of hudi tables encountered issue related to metadata column
URL: https://github.com/apache/hudi/issues/8036



[GitHub] [hudi] nsivabalan commented on issue #8036: [SUPPORT] Migration of hudi tables encountered issue related to metadata column

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.

nsivabalan commented on issue #8036:
URL: https://github.com/apache/hudi/issues/8036#issuecomment-1453987238

   Good question.
   
   Depending on which SQL tool you use, you can look into how to select all columns except a few. You can then explicitly exclude the hoodie meta columns in your insert into statement.
   
   For example, in Spark SQL you can do the following:
   
   spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true")
   
   #select all columns except a,b
   sql("select `(a|b)?+.+` from tmp").show()
   #+---+---+
   #| id|  c|
   #+---+---+
   #|  1|  4|
   #+---+---+
   
   Ref: https://stackoverflow.com/questions/63127263/how-to-select-all-columns-except-2-of-them-from-a-large-table-on-pyspark-sql
   
   Hive: https://stackoverflow.com/questions/51227890/hive-how-to-select-all-but-one-column
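
Applied to the case in this issue, the same idea might look like the sketch below. It is only an illustration, not something posted in the thread: the helper names (`data_columns`, `insert_without_meta`, `meta_column_regex`) are hypothetical, and the column list assumes Hudi's five default `_hoodie_*` metadata columns. The functions only build SQL strings, so the result still has to be passed to `spark.sql(...)`.

```python
# Assumption: these are the metadata columns Hudi adds to each record by default.
HOODIE_META_COLUMNS = (
    "_hoodie_commit_time",
    "_hoodie_commit_seqno",
    "_hoodie_record_key",
    "_hoodie_partition_path",
    "_hoodie_file_name",
)


def data_columns(columns):
    """Return the non-metadata columns, preserving their original order."""
    return [c for c in columns if c not in HOODIE_META_COLUMNS]


def insert_without_meta(source, target, columns):
    """Build an INSERT that lists the data columns explicitly instead of using *."""
    select_list = ", ".join(f"`{c}`" for c in data_columns(columns))
    return f"INSERT INTO {target} SELECT {select_list} FROM {source}"


def meta_column_regex():
    """Build a quoted-regex column expression in the `(a|b)?+.+` style shown above.

    It is only interpreted as a regex when
    spark.sql.parser.quotedRegexColumnNames is set to true.
    """
    alternation = "|".join(HOODIE_META_COLUMNS)
    return f"`({alternation})?+.+`"


if __name__ == "__main__":
    # Hypothetical schema: the Hudi metadata columns followed by two data columns.
    cols = list(HOODIE_META_COLUMNS) + ["id", "c"]
    print(insert_without_meta("hudi_table", "hive_table", cols))
    print(f"SELECT {meta_column_regex()} FROM hudi_table")
```

In a Spark session, the column list could come from `spark.table("hudi_table").columns`. The explicit list does not depend on any parser setting, whereas the regex form requires the `SET` statement shown earlier to have been run first.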
   

