Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/02/10 18:17:06 UTC

[GitHub] [hudi] rubenssoto opened a new issue #2563: [Feature Request] Full Schema Evolution

rubenssoto opened a new issue #2563:
URL: https://github.com/apache/hudi/issues/2563


   Hello,
   
   If there is a better place to ask for a feature, please let me know.
   
   
   https://issues.apache.org/jira/projects/HUDI/issues/HUDI-1540?filter=updatedrecently
   
   I am very impressed with the number of new features planned for Hudi. You are doing a great job, thank you so much!
   
   Please correct me if I am wrong, but Hudi today doesn't have full schema evolution. For example, if my destination table has a column that my dataframe doesn't have, the upsert will fail. In some of my cases a column sometimes exists and sometimes doesn't.
   
   Thank you!
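
Editorial sketch of the scenario described above, where the incoming batch lacks a column that the destination table has. It is a minimal plain-Python illustration, not Hudi code: the helper name `align_to_table_columns` and the sample column names are hypothetical, and it assumes the missing column is nullable in the table. In Spark the equivalent step would be adding a null-valued column cast to the table's type before the upsert; whether that is sufficient for a given Hudi version is not confirmed in this thread.

```python
def align_to_table_columns(rows, table_columns):
    """Return copies of rows keyed by the table's columns.

    Columns missing from a row are filled with None; columns that the
    table does not have are dropped (a simplification for this sketch).
    """
    return [{col: row.get(col) for col in table_columns} for row in rows]


# Hypothetical table with two columns; the incoming batch only carries F1.
table_columns = ["F1", "F2"]
batch = [{"F1": 1}, {"F1": 2}]

aligned = align_to_table_columns(batch, table_columns)
print(aligned)
```

Padding with None only helps when the table declares the column as nullable; a required column would still reject the padded rows.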
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] bvaradar commented on issue #2563: [Feature Request] Full Schema Evolution

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #2563:
URL: https://github.com/apache/hudi/issues/2563#issuecomment-777189845


   @nsivabalan: Can you kindly track this request?





[GitHub] [hudi] rubenssoto commented on issue #2563: [Feature Request] Full Schema Evolution

Posted by GitBox <gi...@apache.org>.
rubenssoto commented on issue #2563:
URL: https://github.com/apache/hudi/issues/2563#issuecomment-789181676


   I will try to reproduce this with a simple script and get back to you.





[GitHub] [hudi] nsivabalan commented on issue #2563: [Feature Request] Full Schema Evolution

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2563:
URL: https://github.com/apache/hudi/issues/2563#issuecomment-784284438


   @rubenssoto: Sorry, this is a bit confusing. Let's simplify.
   Let's say there are only two fields, F1 and F2 (commit_version).
   Is this your scenario?
   The existing dataset in Hudi has only F1,
   and your new batch of ingestion has schema F1, F2?
   Can you please clarify?
   Basically, what is the schema of the original Hudi dataset,
   and what is the schema of the new batch being written?
   





[GitHub] [hudi] nsivabalan commented on issue #2563: [Feature Request] Full Schema Evolution

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2563:
URL: https://github.com/apache/hudi/issues/2563#issuecomment-781391686


   Thanks.





[GitHub] [hudi] rubenssoto commented on issue #2563: [Feature Request] Full Schema Evolution

Posted by GitBox <gi...@apache.org>.
rubenssoto commented on issue #2563:
URL: https://github.com/apache/hudi/issues/2563#issuecomment-783862335


   `Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to merge old record into new file for key 7176859 from old file s3://dl/courier_api/customer_address/3ee388f2-fa45-437a-a279-d9e3e3369bbd-0_9-137-2635_20210223033155.parquet to new file s3://ld/courier_api/customer_address/3ee388f2-fa45-437a-a279-d9e3e3369bbd-0_9-377-7189_20210223035129.parquet with writerSchema {
     "type" : "record",
     "name" : "customer_address_record",
     "namespace" : "hoodie.customer_address",
     "fields" : [ {
       "name" : "_hoodie_commit_time",
       "type" : [ "null", "string" ],
       "doc" : "",
       "default" : null
     }, {
       "name" : "_hoodie_commit_seqno",
       "type" : [ "null", "string" ],
       "doc" : "",
       "default" : null
     }, {
       "name" : "_hoodie_record_key",
       "type" : [ "null", "string" ],
       "doc" : "",
       "default" : null
     }, {
       "name" : "_hoodie_partition_path",
       "type" : [ "null", "string" ],
       "doc" : "",
       "default" : null
     }, {
       "name" : "_hoodie_file_name",
       "type" : [ "null", "string" ],
       "doc" : "",
       "default" : null
     }, {
       "name" : "Op",
       "type" : [ "string", "null" ]
     }, {
       "name" : "LineCreatedTimestamp",
       "type" : [ "string", "null" ]
     }, {
       "name" : "created_date",
       "type" : [ {
         "type" : "long",
         "logicalType" : "timestamp-micros"
       }, "null" ]
     }, {
       "name" : "updated_date",
       "type" : [ {
         "type" : "long",
         "logicalType" : "timestamp-micros"
       }, "null" ]
     }, {
       "name" : "id",
       "type" : [ "int", "null" ]
     }, {
       "name" : "address_type",
       "type" : [ "string", "null" ]
     }, {
       "name" : "name",
       "type" : [ "string", "null" ]
     }, {
       "name" : "customer_email",
       "type" : [ "string", "null" ]
     }, {
       "name" : "street",
       "type" : [ "string", "null" ]
     }, {
       "name" : "number",
       "type" : [ "string", "null" ]
     }, {
       "name" : "address_line2",
       "type" : [ "string", "null" ]
     }, {
       "name" : "city",
       "type" : [ "string", "null" ]
     }, {
       "name" : "province",
       "type" : [ "string", "null" ]
     }, {
       "name" : "zipcode",
       "type" : [ "string", "null" ]
     }, {
       "name" : "country",
       "type" : [ "string", "null" ]
     }, {
       "name" : "neighborhood",
       "type" : [ "string", "null" ]
     }, {
       "name" : "latitude",
       "type" : [ "double", "null" ]
     }, {
       "name" : "longitude",
       "type" : [ "double", "null" ]
     }, {
       "name" : "commit_version",
       "type" : "long"
     }, {
       "name" : "_hoodie_is_deleted",
       "type" : "boolean"
     } ]
   }
   	at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:256)
   	at org.apache.hudi.table.action.commit.AbstractMergeHelper$UpdateHandler.consumeOneRecord(AbstractMergeHelper.java:122)
   	at org.apache.hudi.table.action.commit.AbstractMergeHelper$UpdateHandler.consumeOneRecord(AbstractMergeHelper.java:112)
   	at org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37)
   	at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121)
   	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   	... 3 more
   Caused by: java.lang.RuntimeException: Null-value for required field: commit_version
   	at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:194)
   	at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165)
   	at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128)
   	at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:299)
   	at org.apache.hudi.io.storage.HoodieParquetWriter.writeAvro(HoodieParquetWriter.java:94)
   	at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:251)
   	... 8 more
   
   Driver stacktrace:
   	at jobs.TableProcessor.start(TableProcessor.scala:104)
   	at TableProcessorWrapper$.$anonfun$main$2(TableProcessorWrapper.scala:23)
   	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
   	at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
   	at scala.util.Success.$anonfun$map$1(Try.scala:255)
   	at scala.util.Success.map(Try.scala:213)
   	at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
   	at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
   	at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
   	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
   	at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402)
   	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
   	at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
   	at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
   	at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
   
   	 ApplicationMaster host: ip-10-0-53-212.us-west-2.compute.internal
   	 ApplicationMaster RPC port: 41723
   	 queue: default
   	 start time: 1614052265461
   	 final status: FAILED
   	 tracking URL: http://ip-10-0-49-168.us-west-2.compute.internal:20888/proxy/application_1613496813774_2805/
   	 user: hadoop`
   
   
   
   @nsivabalan I got this error. I have a table without the column commit_version (a column that I created). I added commit_version in my script, and the new data tries to update the old records.
   
   Is this problem addressed too?
   
   Thank you so much.





[GitHub] [hudi] nsivabalan commented on issue #2563: [Feature Request] Full Schema Evolution

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2563:
URL: https://github.com/apache/hudi/issues/2563#issuecomment-787195135


   Adding a new column should work with Hudi.
   https://gist.github.com/nsivabalan/dd604527bd5ad62a08272a34425f5fad
   Can you check whether you are creating the new column with a null value set or a default value set?
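
Editorial sketch of why nullability matters here. In the writer schema from the stack trace in this thread, `commit_version` has type `"long"` rather than a union with `"null"`, and the failure is `Null-value for required field: commit_version`. The plain-Python model below only mimics that check; it is not Hudi or parquet-avro code, the helper names are hypothetical, the two-field schema is a trimmed copy of the one in the trace, and it does not model Avro defaults.

```python
def is_nullable(avro_type):
    """True if an Avro type is a union that includes "null"."""
    return isinstance(avro_type, list) and "null" in avro_type


def fields_unsafe_for_old_records(writer_schema, old_record):
    """Names of writer-schema fields that an old record cannot populate:
    the record has no value for them and the schema does not allow null."""
    return [
        field["name"]
        for field in writer_schema["fields"]
        if old_record.get(field["name"]) is None
        and not is_nullable(field["type"])
    ]


# Trimmed version of the writer schema shown in the stack trace.
required_schema = {
    "type": "record",
    "name": "customer_address_record",
    "fields": [
        {"name": "id", "type": ["int", "null"]},
        {"name": "commit_version", "type": "long"},
    ],
}

# Same schema, but with the new column declared as nullable.
nullable_schema = {
    "type": "record",
    "name": "customer_address_record",
    "fields": [
        {"name": "id", "type": ["int", "null"]},
        {"name": "commit_version", "type": ["null", "long"], "default": None},
    ],
}

# A record written before commit_version existed.
old_record = {"id": 7176859}

print(fields_unsafe_for_old_records(required_schema, old_record))
print(fields_unsafe_for_old_records(nullable_schema, old_record))
```

Under this model the required variant flags `commit_version` for the old record, while the nullable variant flags nothing, which matches the suggestion to create the new column as nullable.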
   





[GitHub] [hudi] vinothchandar commented on issue #2563: [Feature Request] Full Schema Evolution

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #2563:
URL: https://github.com/apache/hudi/issues/2563#issuecomment-789156216


   cc @n3nash given you were looking into this





[GitHub] [hudi] rubenssoto commented on issue #2563: [Feature Request] Full Schema Evolution

Posted by GitBox <gi...@apache.org>.
rubenssoto commented on issue #2563:
URL: https://github.com/apache/hudi/issues/2563#issuecomment-784313085


   Yes, that's right.
   
   My existing dataset has only column F1; imagine that F1 is my primary key.
   My new batch of data has two fields, F1 and F2, and the batch contains both inserts and updates.
   
   With this scenario, I had that problem.
   
   @nsivabalan 





[GitHub] [hudi] nsivabalan commented on issue #2563: [Feature Request] Full Schema Evolution

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2563:
URL: https://github.com/apache/hudi/issues/2563#issuecomment-781034857


   Yes, we don't support full schema evolution as of today, but it is definitely on our radar: https://issues.apache.org/jira/browse/HUDI-631
   https://issues.apache.org/jira/browse/HUDI-1129
   Let us know if you need anything more than what is already being tracked, or feel free to update the Jiras. Feel free to close out this issue.
   








[GitHub] [hudi] rubenssoto closed issue #2563: [Feature Request] Full Schema Evolution

Posted by GitBox <gi...@apache.org>.
rubenssoto closed issue #2563:
URL: https://github.com/apache/hudi/issues/2563


   





[GitHub] [hudi] rubenssoto commented on issue #2563: [Feature Request] Full Schema Evolution

Posted by GitBox <gi...@apache.org>.
rubenssoto commented on issue #2563:
URL: https://github.com/apache/hudi/issues/2563#issuecomment-781385127


   @nsivabalan good to know, I will close this ticket.

