You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/04/24 23:30:48 UTC

[GitHub] [incubator-hudi] umehrot2 opened a new pull request #1559: [HUDI-838] Support schema from HoodieCommitMetadata for HiveSync

umehrot2 opened a new pull request #1559:
URL: https://github.com/apache/incubator-hudi/pull/1559


   *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a pull request.*
   
   ## What is the purpose of the pull request
   
   Support to store and read schema from **HoodieCommitMetadata** has been added in Hudi recently. This PR integrated Hive Sync to read schema from the metadata, and fallback to reading from files only if the schema from metadata is unavailable (for backwards compatibility).
   
   This is a pre-requisite to https://jira.apache.org/jira/browse/HUDI-620 where we want to store the merged bootstrap schema in commit metadata and obtain it during hive-sync. This avoids us having to look up the indexes for external files and merging the schemas just for hive-sync.
   
   In addition, this PR does a fix in HoodieWrite client to fallback on reading schema from the files in case schema from metadata is not available (for validating the schema).
   
   ## Brief change log
   
   - Fix in HoodieWriteClient to fallback on reading schema from files
   - Hive sync changes and unit tests to read schema from HoodieCommitMetadata
   
   ## Verify this pull request
   
   This change added unit tests.
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] codecov-io commented on pull request #1559: [HUDI-838] Support schema from HoodieCommitMetadata for HiveSync

Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #1559:
URL: https://github.com/apache/incubator-hudi/pull/1559#issuecomment-621759888


   # [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1559?src=pr&el=h1) Report
   > Merging [#1559](https://codecov.io/gh/apache/incubator-hudi/pull/1559?src=pr&el=desc) into [master](https://codecov.io/gh/apache/incubator-hudi/commit/06dae30297ea02ab122c9029a54f7927e8212039&el=desc) will **decrease** coverage by `0.08%`.
   > The diff coverage is `53.84%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/incubator-hudi/pull/1559/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1559?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #1559      +/-   ##
   ============================================
   - Coverage     71.85%   71.76%   -0.09%     
     Complexity      294      294              
   ============================================
     Files           385      385              
     Lines         16540    16562      +22     
     Branches       1661     1666       +5     
   ============================================
   + Hits          11884    11886       +2     
   - Misses         3925     3943      +18     
   - Partials        731      733       +2     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/incubator-hudi/pull/1559?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [...ain/java/org/apache/hudi/avro/HoodieAvroUtils.java](https://codecov.io/gh/apache/incubator-hudi/pull/1559/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvYXZyby9Ib29kaWVBdnJvVXRpbHMuamF2YQ==) | `79.83% <0.00%> (-4.99%)` | `0.00 <0.00> (ø)` | |
   | [.../apache/hudi/common/table/TableSchemaResolver.java](https://codecov.io/gh/apache/incubator-hudi/pull/1559/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL1RhYmxlU2NoZW1hUmVzb2x2ZXIuamF2YQ==) | `56.71% <70.58%> (-5.47%)` | `0.00 <0.00> (ø)` | |
   | [...c/main/java/org/apache/hudi/table/HoodieTable.java](https://codecov.io/gh/apache/incubator-hudi/pull/1559/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvSG9vZGllVGFibGUuamF2YQ==) | `83.59% <100.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | [...in/java/org/apache/hudi/hive/HoodieHiveClient.java](https://codecov.io/gh/apache/incubator-hudi/pull/1559/diff?src=pr&el=tree#diff-aHVkaS1oaXZlLXN5bmMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGl2ZS9Ib29kaWVIaXZlQ2xpZW50LmphdmE=) | `74.63% <100.00%> (ø)` | `0.00 <0.00> (ø)` | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1559?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1559?src=pr&el=footer). Last update [06dae30...bbf94c0](https://codecov.io/gh/apache/incubator-hudi/pull/1559?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] umehrot2 commented on a change in pull request #1559: [HUDI-838] Support schema from HoodieCommitMetadata for HiveSync

Posted by GitBox <gi...@apache.org>.
umehrot2 commented on a change in pull request #1559:
URL: https://github.com/apache/incubator-hudi/pull/1559#discussion_r417020450



##########
File path: hudi-common/src/main/java/org/apache/hudi/common/table/TableSchemaResolver.java
##########
@@ -178,6 +193,17 @@ public Schema convertParquetSchemaToAvro(MessageType parquetSchema) {
     return avroSchemaConverter.convert(parquetSchema);
   }
 
+  /**
+   * Convert a avro scheme to the parquet format.
+   *
+   * @param schema The avro schema to convert
+   * @return The converted parquet schema
+   */
+  public MessageType convertAvroSchemaToParquet(Schema schema) {

Review comment:
       There is nothing really that we can re-use from ParquetUtils for the purpose of this PR. The APIs in ParquetUtils class accept a file path from which to read. However, here it first needs to go through the active timeline and find out the file path and then read the schema. The reading from file functions of this class can internally re-use some of the APIs from ParquetUtils but I don't think we should touch it as part of this PR.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1559: [HUDI-838] Support schema from HoodieCommitMetadata for HiveSync

Posted by GitBox <gi...@apache.org>.
bvaradar commented on a change in pull request #1559:
URL: https://github.com/apache/incubator-hudi/pull/1559#discussion_r421851356



##########
File path: hudi-common/src/main/java/org/apache/hudi/common/table/TableSchemaResolver.java
##########
@@ -178,6 +193,17 @@ public Schema convertParquetSchemaToAvro(MessageType parquetSchema) {
     return avroSchemaConverter.convert(parquetSchema);
   }
 
+  /**
+   * Convert a avro scheme to the parquet format.
+   *
+   * @param schema The avro schema to convert
+   * @return The converted parquet schema
+   */
+  public MessageType convertAvroSchemaToParquet(Schema schema) {

Review comment:
       the current code change is fine.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] umehrot2 commented on a change in pull request #1559: [HUDI-838] Support schema from HoodieCommitMetadata for HiveSync

Posted by GitBox <gi...@apache.org>.
umehrot2 commented on a change in pull request #1559:
URL: https://github.com/apache/incubator-hudi/pull/1559#discussion_r417019322



##########
File path: hudi-common/src/main/java/org/apache/hudi/common/table/TableSchemaResolver.java
##########
@@ -145,23 +146,37 @@ public MessageType getDataSchema() throws Exception {
    * @return Avro schema for this table
    * @throws Exception
    */
-  public Schema getTableSchema() throws Exception {
-    return convertParquetSchemaToAvro(getDataSchema());
+  public Schema getTableSchemaInAvroFormat() throws Exception {

Review comment:
       Done.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1559: [HUDI-838] Support schema from HoodieCommitMetadata for HiveSync

Posted by GitBox <gi...@apache.org>.
bvaradar commented on a change in pull request #1559:
URL: https://github.com/apache/incubator-hudi/pull/1559#discussion_r416891281



##########
File path: hudi-common/src/main/java/org/apache/hudi/common/table/TableSchemaResolver.java
##########
@@ -145,23 +146,37 @@ public MessageType getDataSchema() throws Exception {
    * @return Avro schema for this table
    * @throws Exception
    */
-  public Schema getTableSchema() throws Exception {
-    return convertParquetSchemaToAvro(getDataSchema());
+  public Schema getTableSchemaInAvroFormat() throws Exception {
+    Option<Schema> schemaFromCommitMetadata = getTableSchemaFromCommitMetadata();
+    return schemaFromCommitMetadata.isPresent() ? schemaFromCommitMetadata.get() :
+           convertParquetSchemaToAvro(getDataSchema());
+  }
+
+  /**
+   * Gets the schema for a hoodie table in Parquet format.
+   *
+   * @return Parquet schema for the table
+   * @throws Exception
+   */
+  public MessageType getTableSchemaInParquetFormat() throws Exception {
+    Option<Schema> schemaFromCommitMetadata = getTableSchemaFromCommitMetadata();
+    return schemaFromCommitMetadata.isPresent() ? convertAvroSchemaToParquet(schemaFromCommitMetadata.get()) :
+           getDataSchema();
   }
 
   /**
    * Gets the schema for a hoodie table in Avro format from the HoodieCommitMetadata of the last commit.
    *
    * @return Avro schema for this table
-   * @throws Exception
    */
-  public Schema getTableSchemaFromCommitMetadata() throws Exception {
+  private Option<Schema> getTableSchemaFromCommitMetadata() {
     try {
       HoodieTimeline timeline = metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants();
       byte[] data = timeline.getInstantDetails(timeline.lastInstant().get()).get();
       HoodieCommitMetadata metadata = HoodieCommitMetadata.fromBytes(data, HoodieCommitMetadata.class);
       String existingSchemaStr = metadata.getMetadata(HoodieCommitMetadata.SCHEMA_KEY);
-      return new Schema.Parser().parse(existingSchemaStr);
+      return StringUtils.isNullOrEmpty(existingSchemaStr) ? Option.empty() :

Review comment:
       On a related note :  As we are start to rely on avro schema to be present in commit-metadata, we should store avro-schema as first-level entity in commit metadata instead of storing it in extra-metadata map and handle upgrade-downgrade (Added https://jira.apache.org/jira/browse/HUDI-844) 

##########
File path: hudi-common/src/main/java/org/apache/hudi/common/table/TableSchemaResolver.java
##########
@@ -145,23 +146,37 @@ public MessageType getDataSchema() throws Exception {
    * @return Avro schema for this table
    * @throws Exception
    */
-  public Schema getTableSchema() throws Exception {
-    return convertParquetSchemaToAvro(getDataSchema());
+  public Schema getTableSchemaInAvroFormat() throws Exception {
+    Option<Schema> schemaFromCommitMetadata = getTableSchemaFromCommitMetadata();
+    return schemaFromCommitMetadata.isPresent() ? schemaFromCommitMetadata.get() :
+           convertParquetSchemaToAvro(getDataSchema());
+  }
+
+  /**
+   * Gets the schema for a hoodie table in Parquet format.
+   *
+   * @return Parquet schema for the table
+   * @throws Exception
+   */
+  public MessageType getTableSchemaInParquetFormat() throws Exception {

Review comment:
       You can introduce a getTableAvroSchemaFromDataFile to return in avro format. 

##########
File path: hudi-common/src/main/java/org/apache/hudi/common/table/TableSchemaResolver.java
##########
@@ -178,6 +193,17 @@ public Schema convertParquetSchemaToAvro(MessageType parquetSchema) {
     return avroSchemaConverter.convert(parquetSchema);
   }
 
+  /**
+   * Convert a avro scheme to the parquet format.
+   *
+   * @param schema The avro schema to convert
+   * @return The converted parquet schema
+   */
+  public MessageType convertAvroSchemaToParquet(Schema schema) {

Review comment:
       Please check ParquetUtils class for similar APIs




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1559: [HUDI-838] Support schema from HoodieCommitMetadata for HiveSync

Posted by GitBox <gi...@apache.org>.
bvaradar commented on a change in pull request #1559:
URL: https://github.com/apache/incubator-hudi/pull/1559#discussion_r416879633



##########
File path: hudi-common/src/main/java/org/apache/hudi/common/table/TableSchemaResolver.java
##########
@@ -145,23 +146,37 @@ public MessageType getDataSchema() throws Exception {
    * @return Avro schema for this table
    * @throws Exception
    */
-  public Schema getTableSchema() throws Exception {
-    return convertParquetSchemaToAvro(getDataSchema());
+  public Schema getTableSchemaInAvroFormat() throws Exception {

Review comment:
       Rename to just getTableAvroSchema ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] umehrot2 commented on a change in pull request #1559: [HUDI-838] Support schema from HoodieCommitMetadata for HiveSync

Posted by GitBox <gi...@apache.org>.
umehrot2 commented on a change in pull request #1559:
URL: https://github.com/apache/incubator-hudi/pull/1559#discussion_r417019457



##########
File path: hudi-common/src/main/java/org/apache/hudi/common/table/TableSchemaResolver.java
##########
@@ -145,23 +146,37 @@ public MessageType getDataSchema() throws Exception {
    * @return Avro schema for this table
    * @throws Exception
    */
-  public Schema getTableSchema() throws Exception {
-    return convertParquetSchemaToAvro(getDataSchema());
+  public Schema getTableSchemaInAvroFormat() throws Exception {
+    Option<Schema> schemaFromCommitMetadata = getTableSchemaFromCommitMetadata();
+    return schemaFromCommitMetadata.isPresent() ? schemaFromCommitMetadata.get() :
+           convertParquetSchemaToAvro(getDataSchema());
+  }
+
+  /**
+   * Gets the schema for a hoodie table in Parquet format.
+   *
+   * @return Parquet schema for the table
+   * @throws Exception
+   */
+  public MessageType getTableSchemaInParquetFormat() throws Exception {

Review comment:
       Added.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] umehrot2 commented on a change in pull request #1559: [HUDI-838] Support schema from HoodieCommitMetadata for HiveSync

Posted by GitBox <gi...@apache.org>.
umehrot2 commented on a change in pull request #1559:
URL: https://github.com/apache/incubator-hudi/pull/1559#discussion_r417020636



##########
File path: hudi-common/src/main/java/org/apache/hudi/common/table/TableSchemaResolver.java
##########
@@ -145,23 +146,37 @@ public MessageType getDataSchema() throws Exception {
    * @return Avro schema for this table
    * @throws Exception
    */
-  public Schema getTableSchema() throws Exception {
-    return convertParquetSchemaToAvro(getDataSchema());
+  public Schema getTableSchemaInAvroFormat() throws Exception {
+    Option<Schema> schemaFromCommitMetadata = getTableSchemaFromCommitMetadata();
+    return schemaFromCommitMetadata.isPresent() ? schemaFromCommitMetadata.get() :
+           convertParquetSchemaToAvro(getDataSchema());
+  }
+
+  /**
+   * Gets the schema for a hoodie table in Parquet format.
+   *
+   * @return Parquet schema for the table
+   * @throws Exception
+   */
+  public MessageType getTableSchemaInParquetFormat() throws Exception {
+    Option<Schema> schemaFromCommitMetadata = getTableSchemaFromCommitMetadata();
+    return schemaFromCommitMetadata.isPresent() ? convertAvroSchemaToParquet(schemaFromCommitMetadata.get()) :
+           getDataSchema();
   }
 
   /**
    * Gets the schema for a hoodie table in Avro format from the HoodieCommitMetadata of the last commit.
    *
    * @return Avro schema for this table
-   * @throws Exception
    */
-  public Schema getTableSchemaFromCommitMetadata() throws Exception {
+  private Option<Schema> getTableSchemaFromCommitMetadata() {
     try {
       HoodieTimeline timeline = metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants();
       byte[] data = timeline.getInstantDetails(timeline.lastInstant().get()).get();
       HoodieCommitMetadata metadata = HoodieCommitMetadata.fromBytes(data, HoodieCommitMetadata.class);
       String existingSchemaStr = metadata.getMetadata(HoodieCommitMetadata.SCHEMA_KEY);
-      return new Schema.Parser().parse(existingSchemaStr);
+      return StringUtils.isNullOrEmpty(existingSchemaStr) ? Option.empty() :

Review comment:
       Agreed. This would be a good addition and make it cleaner.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] umehrot2 commented on pull request #1559: [HUDI-838] Support schema from HoodieCommitMetadata for HiveSync

Posted by GitBox <gi...@apache.org>.
umehrot2 commented on pull request #1559:
URL: https://github.com/apache/incubator-hudi/pull/1559#issuecomment-625491869


   > @umehrot2 : There needs to be a followup task to reduce the extra-hop in generating Hive schema. https://issues.apache.org/jira/browse/HUDI-865
   > 
   > Otherwise, looks good. Please rebase and resolve conflicts. I will take one more pass after that.
   
   Yeah reducing the extra hop can be a good optimization. If no one else picks it by then, I can take it after our rfc-12 work.
   
   Resolved conflicts and pushed a new version. @bvaradar 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org