Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/04/14 21:19:40 UTC

[GitHub] [incubator-hudi] afilipchik opened a new pull request #1514: [HUDI-774] Addressing incorrect Spark to Avro schema generation

afilipchik opened a new pull request #1514: [HUDI-774] Addressing incorrect Spark to Avro schema generation
URL: https://github.com/apache/incubator-hudi/pull/1514
 
 
   ## What is the purpose of the pull request
   
   The Avro schema generated from a Spark DataFrame (after a SQL transformation) has incorrect defaults: in nullable unions the order of types is wrong ("null" must be first) and no default is set. This makes every schema change incompatible, although it only affects compactions.
   
   This is a reference PR. The solution is a bit hacky, as I modify the schema in place using reflection; there is probably a safer way to rebuild the schema.
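
A minimal sketch of the non-reflective alternative hinted at above, assuming the schema is handled as parsed Avro JSON (Python dicts and lists) rather than through the Avro Java API. The function name mirrors `rewriteIncorrectDefaults` from the change log, but this is an illustration under those assumptions, not the code in the PR; the sample schema `spark_like` is hypothetical. It rebuilds a copy in which every nullable union lists "null" first and nullable fields without a default get a null default.

```python
def _null_first(branches):
    """Move the "null" branch of a union to the front, keeping the rest in order."""
    if "null" in branches:
        return ["null"] + [b for b in branches if b != "null"]
    return branches


def _rewrite_field(field):
    out = dict(field)
    if isinstance(field["type"], list) and field.get("default") is not None:
        # An explicit non-null default must match the first union branch,
        # so the branch order is left alone here.
        out["type"] = [rewrite_incorrect_defaults(b) for b in field["type"]]
        return out
    out["type"] = rewrite_incorrect_defaults(field["type"])
    if isinstance(out["type"], list) and out["type"][0] == "null":
        out.setdefault("default", None)  # serialized as "default": null
    return out


def rewrite_incorrect_defaults(schema):
    """Return a rebuilt copy of an Avro schema (parsed JSON); the input is not mutated."""
    if isinstance(schema, list):  # union
        return _null_first([rewrite_incorrect_defaults(b) for b in schema])
    if isinstance(schema, dict):
        out = dict(schema)
        kind = out.get("type")
        if kind == "record":
            out["fields"] = [_rewrite_field(f) for f in out.get("fields", [])]
        elif kind == "array":
            out["items"] = rewrite_incorrect_defaults(out["items"])
        elif kind == "map":
            out["values"] = rewrite_incorrect_defaults(out["values"])
        elif isinstance(kind, (dict, list)):
            out["type"] = rewrite_incorrect_defaults(kind)
        return out
    return schema  # primitive name or reference to a named type


# Hypothetical schema shaped like the problem described: "null" last, no default.
spark_like = {
    "type": "record",
    "name": "Row",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "city", "type": ["string", "null"]},
    ],
}
fixed = rewrite_incorrect_defaults(spark_like)
```

Building a new structure instead of mutating the input avoids the reflection step, at the cost of copying the schema once per conversion.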
   
   ## Brief change log
   
     - Added HoodieAvroUtils.rewriteIncorrectDefaults
     - Updated AvroConversionHelper and AvroConversionUtils to use it after SchemaConverters.toAvroType
   
   ## Verify this pull request
   
   Added a test to HoodieAvroUtilsTest to verify that the rewriter works properly.
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [hudi] hudi-bot commented on pull request #1514: [WIP] [HUDI-774] Addressing incorrect Spark to Avro schema generation

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #1514:
URL: https://github.com/apache/hudi/pull/1514#issuecomment-961587438


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "3baebd032231974a7e7d9410b5bfeb879c9790b1",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "3baebd032231974a7e7d9410b5bfeb879c9790b1",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 3baebd032231974a7e7d9410b5bfeb879c9790b1 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot removed a comment on pull request #1514: [WIP] [HUDI-774] Addressing incorrect Spark to Avro schema generation

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #1514:
URL: https://github.com/apache/hudi/pull/1514#issuecomment-914631306


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "3baebd032231974a7e7d9410b5bfeb879c9790b1",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "3baebd032231974a7e7d9410b5bfeb879c9790b1",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 3baebd032231974a7e7d9410b5bfeb879c9790b1 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run travis` re-run the last Travis build
    - `@hudi-bot run azure` re-run the last Azure build
   </details>



[GitHub] [hudi] vinothchandar commented on pull request #1514: [WIP] [HUDI-774] Addressing incorrect Spark to Avro schema generation

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #1514:
URL: https://github.com/apache/hudi/pull/1514#issuecomment-683870135


   We are thinking about moving the spark datasource path to the native writer in the next release. So punting on this till then 



[GitHub] [hudi] afilipchik commented on pull request #1514: [HUDI-774] Addressing incorrect Spark to Avro schema generation

Posted by GitBox <gi...@apache.org>.
afilipchik commented on pull request #1514:
URL: https://github.com/apache/hudi/pull/1514#issuecomment-642303533


   so, what should i do? 



[GitHub] [hudi] vinothchandar commented on pull request #1514: [HUDI-774] Addressing incorrect Spark to Avro schema generation

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #1514:
URL: https://github.com/apache/hudi/pull/1514#issuecomment-642831847


   @afilipchik I am pausing coz diverging too much from spark-avro is a maintenance headache.. Do we try to upstream these to spark directly? Probably a better path? 



[GitHub] [hudi] nsivabalan commented on pull request #1514: [WIP] [HUDI-774] Addressing incorrect Spark to Avro schema generation

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on pull request #1514:
URL: https://github.com/apache/hudi/pull/1514#issuecomment-751319810


   @vinothchandar : do you think we need to pick this up sometime?



[GitHub] [hudi] vinothchandar commented on pull request #1514: [HUDI-774] Addressing incorrect Spark to Avro schema generation

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #1514:
URL: https://github.com/apache/hudi/pull/1514#issuecomment-633348943


   Maybe we can skip the row to Avro conversion altogether in the long run. For now I suggest we take the approach with minimal maintenance overhead.



[GitHub] [incubator-hudi] bvaradar commented on pull request #1514: [HUDI-774] Addressing incorrect Spark to Avro schema generation

Posted by GitBox <gi...@apache.org>.
bvaradar commented on pull request #1514:
URL: https://github.com/apache/incubator-hudi/pull/1514#issuecomment-629480819


   Also pinging @umehrot2 to get your help in reviewing this as you are familiar with this part.



[GitHub] [incubator-hudi] nsivabalan commented on pull request #1514: [HUDI-774] Addressing incorrect Spark to Avro schema generation

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on pull request #1514:
URL: https://github.com/apache/incubator-hudi/pull/1514#issuecomment-629554509


   @bvaradar @vinothchandar : adding the null and default logic looks good to me. Do you folks suggest creating a new Schema altogether for a neat solution, or doing it in place with reflection as the patch does as of now?
   



[GitHub] [incubator-hudi] xushiyan commented on pull request #1514: [WIP] [HUDI-774] Addressing incorrect Spark to Avro schema generation

Posted by GitBox <gi...@apache.org>.
xushiyan commented on pull request #1514:
URL: https://github.com/apache/incubator-hudi/pull/1514#issuecomment-628915411


   @afilipchik The master codebase has been migrated to JUnit 5. Please kindly upgrade the usage to JUnit 5 APIs.



[GitHub] [incubator-hudi] umehrot2 commented on pull request #1514: [HUDI-774] Addressing incorrect Spark to Avro schema generation

Posted by GitBox <gi...@apache.org>.
umehrot2 commented on pull request #1514:
URL: https://github.com/apache/incubator-hudi/pull/1514#issuecomment-629568357


   @afilipchik Seems like the spark-avro schema converter itself generates an incorrect schema when we want to have **default value** as **null**. Is that the main concern addressed in this PR? If that's the case, we should avoid using the Spark library for conversion altogether, and have in-house logic to generate the correct schema in the first place.
   
   Also, under what scenario do we see failures because of this? It would be good to have a test case that currently fails, for better understanding.
   



[GitHub] [incubator-hudi] xushiyan edited a comment on pull request #1514: [WIP] [HUDI-774] Addressing incorrect Spark to Avro schema generation

Posted by GitBox <gi...@apache.org>.
xushiyan edited a comment on pull request #1514:
URL: https://github.com/apache/incubator-hudi/pull/1514#issuecomment-628915411


   @afilipchik The master codebase has been migrated to JUnit 5. Please kindly rebase and update the usage to JUnit 5 APIs where applicable.



[GitHub] [incubator-hudi] vinothchandar commented on issue #1514: [HUDI-774] [WIP] Addressing incorrect Spark to Avro schema generation

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1514:
URL: https://github.com/apache/incubator-hudi/pull/1514#issuecomment-617540306


   cc @umehrot2  can you please help review this



[GitHub] [incubator-hudi] afilipchik commented on pull request #1514: [HUDI-774] Addressing incorrect Spark to Avro schema generation

Posted by GitBox <gi...@apache.org>.
afilipchik commented on pull request #1514:
URL: https://github.com/apache/incubator-hudi/pull/1514#issuecomment-632839321


   @umehrot2 yep, it is an attempt to fix the schema generated by spark-avro. Moving generation in-house makes sense but, if I recall correctly, the issue is not coming from Spark itself but from the underlying library it uses. So it can be a bit of work to rewrite it. 
   
   On the test case -> the incoming dataset is transformed using Spark SQL, with the schema derived from the query result (NullTargetConverter). Then we add a new field to the output, write a batch and run a compaction. At this point the new schema can't be used to read the old data, as it will fail on the new fields that have no default. 
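
The failure in this scenario follows from Avro's schema resolution rule: a reader field that is absent from the writer schema must carry a default, otherwise reading fails. A rough sketch of that check over parsed schema JSON; `unreadable_fields` and the sample schemas are hypothetical names for illustration, and field aliases are ignored for brevity:

```python
def unreadable_fields(writer_schema, reader_schema):
    """List reader fields that old data cannot satisfy.

    A field qualifies when the writer schema lacks it and the reader
    schema declares no default for it.
    """
    writer_names = {f["name"] for f in writer_schema["fields"]}
    return [
        f["name"]
        for f in reader_schema["fields"]
        if f["name"] not in writer_names and "default" not in f
    ]


# Schema the old batch was written with.
old_schema = {
    "type": "record",
    "name": "Row",
    "fields": [{"name": "id", "type": "long"}],
}

# New schema as described in the PR: nullable field added, but no default.
new_schema_without_default = {
    "type": "record",
    "name": "Row",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "city", "type": ["string", "null"]},
    ],
}

# Same field with "null" first and a null default.
new_schema_with_default = {
    "type": "record",
    "name": "Row",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "city", "type": ["null", "string"], "default": None},
    ],
}

broken = unreadable_fields(old_schema, new_schema_without_default)
readable = unreadable_fields(old_schema, new_schema_with_default)
```

Here `broken` names the added field and `readable` is empty, which is the difference the rewriter is meant to produce.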



[GitHub] [incubator-hudi] nsivabalan commented on pull request #1514: [HUDI-774] Addressing incorrect Spark to Avro schema generation

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on pull request #1514:
URL: https://github.com/apache/incubator-hudi/pull/1514#issuecomment-629640734


   Thanks Udit. I have no idea why I didn't review it fully. I just reviewed the HoodieAvroConversionUtils and thought there would be a follow-up PR to integrate and write tests. My bad. 

