Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2023/01/18 02:33:41 UTC

[GitHub] [hudi] LinMingQiang opened a new issue, #7691: [SUPPORT] Flink's schema conflicts with spark's schema.

LinMingQiang opened a new issue, #7691:
URL: https://github.com/apache/hudi/issues/7691

   
   **Describe the problem you faced**
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Initialize hudi table with spark
   2. Write using flink
   
   **Expected behavior**
   
   
   **Environment Description**
   
   * Hudi version : master
   
   * Spark version : 3.x
   
   * Hive version : 
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) :
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   
   **Stacktrace**
   
   ```
   Caused by: org.apache.hudi.exception.HoodieException: Failed schema compatibility check for writerSchema :{"type":"record","name":"record","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"id","type":"string"},{"name":"name","type":["null","string"],"default":null},{"name":"age","type":["null","long"],"default":null},{"name":"ts","type":["null","long"],"default":null},{"name":"par","type":["null","string"],"default":null}]}, table schema :{"type":"record","name":"ittestsparkwritethenflinkwrite_record","namespace":"hoodie.ittestsparkwritethenflinkwrite","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"id","type":"string"},{"name":"name","type":["null","string"],"default":null},{"name":"age","type":["null","long"],"default":null},{"name":"ts","type":["null","long"],"default":null},{"name":"par","type":["null","string"],"default":null}]}, base path :file:/Users/hunter/workspace/lmq/hudi-master-debug/hudi-debug/hudi-debug-flink/target/ITTestSparkWriteThenFlinkWrite
   	at org.apache.hudi.table.HoodieTable.validateSchema(HoodieTable.java:812)
   	at org.apache.hudi.table.HoodieTable.validateUpsertSchema(HoodieTable.java:818)
   	... 19 more
   ```
   
   The conflict arises from the schema `namespace`: the writer schema's full name is `record`, while the table schema's is `hoodie.ittestsparkwritethenflinkwrite.ittestsparkwritethenflinkwrite_record`.
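   The two schemas in the error have byte-for-byte identical field lists, so the only difference the check can trip on is the record's full name. A minimal sketch (the `fullName` helper is hypothetical, modeled on Avro's `Schema#getFullName`; values are copied from the stacktrace):

```java
// Minimal sketch: the two schemas above differ only in their record
// full name (namespace + "." + name), not in their fields.
public class SchemaNameMismatch {
    // Hypothetical helper mirroring Avro's Schema#getFullName.
    static String fullName(String namespace, String name) {
        return (namespace == null || namespace.isEmpty())
            ? name
            : namespace + "." + name;
    }

    public static void main(String[] args) {
        // Writer schema (Flink): name "record", no namespace.
        String writer = fullName(null, "record");
        // Table schema (Spark): namespace "hoodie.<table>", name "<table>_record".
        String reader = fullName("hoodie.ittestsparkwritethenflinkwrite",
                                 "ittestsparkwritethenflinkwrite_record");
        System.out.println(writer + " vs " + reader + " -> " + writer.equals(reader));
    }
}
```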


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] TengHuo commented on issue #7691: [SUPPORT] Flink's schema conflicts with spark's schema.

Posted by GitBox <gi...@apache.org>.
TengHuo commented on issue #7691:
URL: https://github.com/apache/hudi/issues/7691#issuecomment-1386540907

   Is this the same issue as this one? It was an Avro schema namespace inconsistency issue we found before.
   https://github.com/apache/hudi/issues/7284




[GitHub] [hudi] TengHuo commented on issue #7691: [SUPPORT] Flink's schema conflicts with spark's schema.

Posted by "TengHuo (via GitHub)" <gi...@apache.org>.
TengHuo commented on issue #7691:
URL: https://github.com/apache/hudi/issues/7691#issuecomment-1413118913

   > > @TengHuo @danny0405 What's the followup here? If it's a bug, is it going to be fixed in 0.13.0?
   >
   > I fixed it in this PR #7694
   
   @LinMingQiang The PR should fix the compatibility check error only. When resolving an Avro schema, the generator compares the full names of the writer and reader schemas if the type is `FIXED`, `ENUM`, `ARRAY`, `MAP`, `RECORD` or `UNION`. If the full names do not match, the generator may return a `Symbol.error`, which is then thrown as an `AvroTypeException` in `ResolvingDecoder#doAction`.
   
   Code reference: `org.apache.avro.io.parsing.ResolvingGrammarGenerator#generate(Schema writer, Schema reader, Map<LitS, Symbol> seen)`
   
   ```java
     public Symbol generate(Schema writer, Schema reader,
                                   Map<LitS, Symbol> seen) throws IOException
     {
       final Schema.Type writerType = writer.getType();
       final Schema.Type readerType = reader.getType();
   
       if (writerType == readerType) {
         switch (writerType) {
         case NULL:
           return Symbol.NULL;
         case BOOLEAN:
           return Symbol.BOOLEAN;
         case INT:
           return Symbol.INT;
         case LONG:
           return Symbol.LONG;
         case FLOAT:
           return Symbol.FLOAT;
         case DOUBLE:
           return Symbol.DOUBLE;
         case STRING:
           return Symbol.STRING;
         case BYTES:
           return Symbol.BYTES;
         case FIXED:
           if (writer.getFullName().equals(reader.getFullName())
               && writer.getFixedSize() == reader.getFixedSize()) {
             return Symbol.seq(Symbol.intCheckAction(writer.getFixedSize()),
                 Symbol.FIXED);
           }
           break;
   
         case ENUM:
           if (writer.getFullName() == null
                   || writer.getFullName().equals(reader.getFullName())) {
             return Symbol.seq(mkEnumAdjust(writer.getEnumSymbols(),
                     reader.getEnumSymbols()), Symbol.ENUM);
           }
           break;
   
         case ARRAY:
           return Symbol.seq(Symbol.repeat(Symbol.ARRAY_END,
                   generate(writer.getElementType(),
                   reader.getElementType(), seen)),
               Symbol.ARRAY_START);
   
         case MAP:
           return Symbol.seq(Symbol.repeat(Symbol.MAP_END,
                   generate(writer.getValueType(),
                   reader.getValueType(), seen), Symbol.STRING),
               Symbol.MAP_START);
         case RECORD:
           return resolveRecords(writer, reader, seen);
         case UNION:
           return resolveUnion(writer, reader, seen);
         default:
           throw new AvroTypeException("Unkown type for schema: " + writerType);
         }
       } else {  // writer and reader are of different types
         if (writerType == Schema.Type.UNION) {
           return resolveUnion(writer, reader, seen);
         }
   
         switch (readerType) {
         case LONG:
           switch (writerType) {
           case INT:
             return Symbol.resolve(super.generate(writer, seen), Symbol.LONG);
           }
           break;
   
         case FLOAT:
           switch (writerType) {
           case INT:
           case LONG:
             return Symbol.resolve(super.generate(writer, seen), Symbol.FLOAT);
           }
           break;
   
         case DOUBLE:
           switch (writerType) {
           case INT:
           case LONG:
           case FLOAT:
             return Symbol.resolve(super.generate(writer, seen), Symbol.DOUBLE);
           }
           break;
   
         case BYTES:
           switch (writerType) {
           case STRING:
             return Symbol.resolve(super.generate(writer, seen), Symbol.BYTES);
           }
           break;
   
         case STRING:
           switch (writerType) {
           case BYTES:
             return Symbol.resolve(super.generate(writer, seen), Symbol.STRING);
           }
           break;
   
         case UNION:
           int j = bestBranch(reader, writer, seen);
           if (j >= 0) {
             Symbol s = generate(writer, reader.getTypes().get(j), seen);
             return Symbol.seq(Symbol.unionAdjustAction(j, s), Symbol.UNION);
           }
           break;
         case NULL:
         case BOOLEAN:
         case INT:
         case ENUM:
         case ARRAY:
         case MAP:
         case RECORD:
         case FIXED:
           break;
         default:
           throw new RuntimeException("Unexpected schema type: " + readerType);
         }
       }
       return Symbol.error("Found " + writer.getFullName()
                           + ", expecting " + reader.getFullName());
     }
   ```
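   The name gate in `generate` above can be reduced to a toy sketch (a simplification, not the real Avro classes): when none of the same-type cases match, e.g. because `RECORD` full names differ, control falls through to the final `Symbol.error(...)` line.

```java
// Toy model of the fallthrough at the end of
// ResolvingGrammarGenerator#generate: mismatched full names end in
// Symbol.error("Found <writer>, expecting <reader>").
public class ResolveSketch {
    // Returns null when the full names match (field-by-field resolution
    // would continue), or the error text modeled on Symbol.error otherwise.
    static String resolveRecordNames(String writerFullName, String readerFullName) {
        if (writerFullName.equals(readerFullName)) {
            return null;
        }
        return "Found " + writerFullName + ", expecting " + readerFullName;
    }

    public static void main(String[] args) {
        // Full names taken from the stacktrace in the issue description.
        System.out.println(resolveRecordNames(
            "record",
            "hoodie.ittestsparkwritethenflinkwrite.ittestsparkwritethenflinkwrite_record"));
    }
}
```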




[GitHub] [hudi] TengHuo commented on issue #7691: [SUPPORT] Flink's schema conflicts with spark's schema.

Posted by "TengHuo (via GitHub)" <gi...@apache.org>.
TengHuo commented on issue #7691:
URL: https://github.com/apache/hudi/issues/7691#issuecomment-1413224065

   > @TengHuo @danny0405 What's the followup here? If it's a bug, is it going to be fixed in 0.13.0?
   
   Hi @codope, I think it is an inconsistent-behaviour issue between Spark and Flink. We may need to fix it on both the Spark side and the Flink side at the same time. I added some detail in the PR: https://github.com/apache/hudi/pull/7307#issuecomment-1413220811
   
   Do you have any suggestions about how to fix it?




[GitHub] [hudi] TengHuo commented on issue #7691: [SUPPORT] Flink's schema conflicts with spark's schema.

Posted by GitBox <gi...@apache.org>.
TengHuo commented on issue #7691:
URL: https://github.com/apache/hudi/issues/7691#issuecomment-1386565809

   > Not the same. The current issue is the schema compatibility problem between Flink and Spark.
   
   Yeah, not the same, but I think they are similar. In #7284, we found that the Spark side uses a pattern to build the Avro schema namespace, e.g. the namespace of the writer schema is `"namespace": "hoodie.test_mor_tab"` ("hoodie" is a prefix, "test_mor_tab" is our test Hudi table name), but the reader schema uses a constant name `"name": "Record"`, which causes the inconsistency. Details here: https://github.com/apache/hudi/issues/7284#issuecomment-1324899843
   
   As Danny mentioned above, the Flink side uses a constant namespace named 'record'; it looks like this will cause a schema mismatch as well whenever the Avro schema is generated by Spark.
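   To make the mismatch concrete, here is a hedged sketch of the two naming conventions described above (the builder methods are hypothetical illustrations modeled on this comment and the stacktrace, not Hudi's actual code):

```java
// Hypothetical illustration of the two schema-naming conventions
// discussed above; any table-derived name on the Spark side can never
// equal the Flink side's constant name.
public class NamespaceConventions {
    // Spark-side pattern: namespace "hoodie.<table>", name "<table>_record"
    // (as seen in the stacktrace in the issue description).
    static String sparkFullName(String tableName) {
        return "hoodie." + tableName + "." + tableName + "_record";
    }

    // Flink-side: a constant record name with no table-derived namespace.
    static String flinkFullName() {
        return "record";
    }

    public static void main(String[] args) {
        System.out.println(sparkFullName("test_mor_tab")); // table-dependent
        System.out.println(flinkFullName());               // constant
    }
}
```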




[GitHub] [hudi] LinMingQiang commented on issue #7691: [SUPPORT] Flink's schema conflicts with spark's schema.

Posted by "LinMingQiang (via GitHub)" <gi...@apache.org>.
LinMingQiang commented on issue #7691:
URL: https://github.com/apache/hudi/issues/7691#issuecomment-1412120640

   > @TengHuo @danny0405 What's the followup here? If it's a bug, is it going to be fixed in 0.13.0?
   
   I fixed it in this PR: https://github.com/apache/hudi/pull/7694




[GitHub] [hudi] danny0405 commented on issue #7691: [SUPPORT] Flink's schema conflicts with spark's schema.

Posted by GitBox <gi...@apache.org>.
danny0405 commented on issue #7691:
URL: https://github.com/apache/hudi/issues/7691#issuecomment-1386486937

   Okay, seems like a bug: Flink uses the constant namespace named 'record' when generating the Avro schema. Does that cause the incompatibility? Can you file a JIRA to address and fix this?




[GitHub] [hudi] LinMingQiang commented on issue #7691: [SUPPORT] Flink's schema conflicts with spark's schema.

Posted by GitBox <gi...@apache.org>.
LinMingQiang commented on issue #7691:
URL: https://github.com/apache/hudi/issues/7691#issuecomment-1386558629

   Not the same. The current issue is the schema compatibility problem between Flink and Spark.




[GitHub] [hudi] codope commented on issue #7691: [SUPPORT] Flink's schema conflicts with spark's schema.

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope commented on issue #7691:
URL: https://github.com/apache/hudi/issues/7691#issuecomment-1411986047

   @TengHuo @danny0405 What's the followup here? If it's a bug, is it going to be fixed in 0.13.0?

