Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/01/18 09:36:10 UTC

[GitHub] [incubator-hudi] yihua opened a new pull request #1246: [HUDI-552] Fix the schema mismatch in Row-to-Avro conversion

yihua opened a new pull request #1246: [HUDI-552] Fix the schema mismatch in Row-to-Avro conversion
URL: https://github.com/apache/incubator-hudi/pull/1246
 
 
   ## What is the purpose of the pull request
   
   This PR addresses [HUDI-552](https://issues.apache.org/jira/browse/HUDI-552).  The DeltaStreamer failed when the `FilebasedSchemaProvider` supplied the source/target schema in Avro and data with the same columns was ingested from a `RowSource`.  The root cause is that Spark automatically converts all fields to nullable when writing parquet files, for compatibility reasons.  If the source Avro schema has non-null fields, `AvroConversionUtils.createRdd` still uses the schema from the DataFrame to convert each Row to an Avro record, so the resulting records carry a schema that differs from the provided one (in nullability only).
   
   To fix this issue, the Avro schema, if it exists, is passed to the conversion function to reconstruct the correct StructType for conversion.
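
The nullability mismatch and the fix can be illustrated with a small, dependency-free sketch. This is not Hudi's code: `Field` and `alignNullability` are hypothetical stand-ins for Spark's `StructType` fields and for the schema reconstruction done in `createRdd`, and the `fare` column is made up for the example. It only assumes that fields are matched by name and that the declared (Avro) schema wins on nullability.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class NullabilitySketch {

  /** Toy stand-in for a schema field; NOT a Spark or Avro class. */
  static final class Field {
    final String name;
    final String type;
    final boolean nullable;

    Field(String name, String type, boolean nullable) {
      this.name = name;
      this.type = type;
      this.nullable = nullable;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof Field)) {
        return false;
      }
      Field other = (Field) o;
      return name.equals(other.name) && type.equals(other.type) && nullable == other.nullable;
    }

    @Override
    public int hashCode() {
      return Objects.hash(name, type, nullable);
    }
  }

  /**
   * Rebuilds the row schema, taking nullability from the declared schema
   * for every field that also exists there (matched by name). Fields that
   * are missing from the declared schema are kept unchanged.
   */
  static List<Field> alignNullability(List<Field> rowSchema, List<Field> declaredSchema) {
    Map<String, Field> declaredByName = new HashMap<>();
    for (Field f : declaredSchema) {
      declaredByName.put(f.name, f);
    }
    List<Field> aligned = new ArrayList<>();
    for (Field f : rowSchema) {
      Field declared = declaredByName.get(f.name);
      boolean nullable = declared != null ? declared.nullable : f.nullable;
      aligned.add(new Field(f.name, f.type, nullable));
    }
    return aligned;
  }

  public static void main(String[] args) {
    // Schema declared by the user, with one non-null field.
    List<Field> declared = Arrays.asList(
        new Field("_row_key", "string", false),
        new Field("fare", "double", true));
    // Schema as read back from parquet: every field has become nullable.
    List<Field> fromParquet = Arrays.asList(
        new Field("_row_key", "string", true),
        new Field("fare", "double", true));

    System.out.println(fromParquet.equals(declared));                             // false
    System.out.println(alignNullability(fromParquet, declared).equals(declared)); // true
  }
}
```

Matching by name rather than by position is a choice made for this sketch; it keeps the example independent of column order.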
   
   ## Brief change log
   
     - Passed the Avro schema to `createRdd` to generate the correct StructType for conversion in `DeltaSync.readFromSource` and `AvroConversionUtils.createRdd`
     - Added new tests to verify the logic (some of the new tests fail without this schema fix)
   
   ## Verify this pull request
   
   This change added tests and can be verified as follows:
   
     - Added tests in `TestHoodieDeltaStreamer` to test the `HoodieDeltaStreamer` with `ParquetDFSSource` under different configurations
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] yihua commented on issue #1246: [HUDI-552] Fix the schema mismatch in Row-to-Avro conversion

Posted by GitBox <gi...@apache.org>.
yihua commented on issue #1246: [HUDI-552] Fix the schema mismatch in Row-to-Avro conversion
URL: https://github.com/apache/incubator-hudi/pull/1246#issuecomment-575950765
 
 
   @vinothchandar I fixed the tests.  They pass locally.


[GitHub] [incubator-hudi] yihua commented on a change in pull request #1246: [HUDI-552] Fix the schema mismatch in Row-to-Avro conversion

Posted by GitBox <gi...@apache.org>.
yihua commented on a change in pull request #1246: [HUDI-552] Fix the schema mismatch in Row-to-Avro conversion
URL: https://github.com/apache/incubator-hudi/pull/1246#discussion_r368785363
 
 

 ##########
 File path: hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHoodieDeltaStreamer.java
 ##########
 @@ -620,6 +636,62 @@ public void testDistributedTestDataSource() {
     Assert.assertEquals(1000, c);
   }
 
+  private static void prepareParquetDFSFiles(int numRecords) throws IOException {
+    String path = PARQUET_SOURCE_ROOT + "/1.parquet";
+    HoodieTestDataGenerator dataGenerator = new HoodieTestDataGenerator();
+    Helpers.saveParquetToDFS(Helpers.toGenericRecords(
+        dataGenerator.generateInserts("000", numRecords), dataGenerator), new Path(path));
+  }
+
+  private void prepareParquetDFSSource(boolean useSchemaProvider, boolean hasTransformer) throws IOException {
+    // Properties used for testing delta-streamer with Parquet source
+    TypedProperties parquetProps = new TypedProperties();
+    parquetProps.setProperty("include", "base.properties");
+    parquetProps.setProperty("hoodie.datasource.write.recordkey.field", "_row_key");
+    parquetProps.setProperty("hoodie.datasource.write.partitionpath.field", "not_there");
+    if (useSchemaProvider) {
+      parquetProps.setProperty("hoodie.deltastreamer.schemaprovider.source.schema.file", dfsBasePath + "/source.avsc");
+      if (hasTransformer) {
+        parquetProps.setProperty("hoodie.deltastreamer.schemaprovider.source.schema.file", dfsBasePath + "/target.avsc");
 
 Review comment:
   Good catch.  I've found and fixed that in #1165.


[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1246: [HUDI-552] Fix the schema mismatch in Row-to-Avro conversion

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #1246: [HUDI-552] Fix the schema mismatch in Row-to-Avro conversion
URL: https://github.com/apache/incubator-hudi/pull/1246#discussion_r368772209
 
 

 ##########
 File path: hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHoodieDeltaStreamer.java
 ##########
 @@ -620,6 +636,62 @@ public void testDistributedTestDataSource() {
     Assert.assertEquals(1000, c);
   }
 
+  private static void prepareParquetDFSFiles(int numRecords) throws IOException {
+    String path = PARQUET_SOURCE_ROOT + "/1.parquet";
+    HoodieTestDataGenerator dataGenerator = new HoodieTestDataGenerator();
+    Helpers.saveParquetToDFS(Helpers.toGenericRecords(
+        dataGenerator.generateInserts("000", numRecords), dataGenerator), new Path(path));
+  }
+
+  private void prepareParquetDFSSource(boolean useSchemaProvider, boolean hasTransformer) throws IOException {
+    // Properties used for testing delta-streamer with Parquet source
+    TypedProperties parquetProps = new TypedProperties();
+    parquetProps.setProperty("include", "base.properties");
+    parquetProps.setProperty("hoodie.datasource.write.recordkey.field", "_row_key");
+    parquetProps.setProperty("hoodie.datasource.write.partitionpath.field", "not_there");
+    if (useSchemaProvider) {
+      parquetProps.setProperty("hoodie.deltastreamer.schemaprovider.source.schema.file", dfsBasePath + "/source.avsc");
+      if (hasTransformer) {
+        parquetProps.setProperty("hoodie.deltastreamer.schemaprovider.source.schema.file", dfsBasePath + "/target.avsc");
 
 Review comment:
   Is the key for this property right? Shouldn't it be ".....target.schema.file"?


[GitHub] [incubator-hudi] vinothchandar merged pull request #1246: [HUDI-552] Fix the schema mismatch in Row-to-Avro conversion

Posted by GitBox <gi...@apache.org>.
vinothchandar merged pull request #1246: [HUDI-552] Fix the schema mismatch in Row-to-Avro conversion
URL: https://github.com/apache/incubator-hudi/pull/1246
 
 
   
