Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/02/10 18:17:06 UTC
[GitHub] [hudi] rubenssoto opened a new issue #2563: [Feature Request] Full Schema Evolution
rubenssoto opened a new issue #2563:
URL: https://github.com/apache/hudi/issues/2563
Hello,
If there is a better place to ask for a feature, please let me know.
https://issues.apache.org/jira/projects/HUDI/issues/HUDI-1540?filter=updatedrecently
I am very impressed with the number of new features that Hudi will have in the future, you are doing a great job, thank you so much!!!
If I am wrong, please correct me, but Hudi today doesn't have full schema evolution. For example, if my destination table has a column that my dataframe doesn't have, the upsert will fail. I have some cases where a column sometimes exists and sometimes doesn't.
thank you!
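The scenario above (an incoming batch missing a column the table already has) can be sketched in plain Python. This is an illustrative workaround, not Hudi's API: pad the batch with any table columns it is missing before writing, so the write schema always matches the table schema. The column names here are hypothetical.

```python
# Illustrative sketch (plain Python, not Hudi code): pad an incoming
# batch with any table columns it is missing, filling them with None
# (null), so every write carries the full table schema.

table_columns = ["id", "name", "commit_version"]

def pad_batch(rows, columns):
    """Return rows where every table column is present; missing ones
    become None."""
    return [{col: row.get(col) for col in columns} for row in rows]

batch = [{"id": 1, "name": "a"}]  # commit_version sometimes absent
padded = pad_batch(batch, table_columns)
print(padded)  # [{'id': 1, 'name': 'a', 'commit_version': None}]
```

In Spark this would correspond roughly to a `withColumn(c, lit(None).cast(...))` per missing column, which only works if those columns are nullable in the table schema.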
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] bvaradar commented on issue #2563: [Feature Request] Full Schema Evolution
bvaradar commented on issue #2563:
URL: https://github.com/apache/hudi/issues/2563#issuecomment-777189845
@nsivabalan : Can you kindly track this request ?
[GitHub] [hudi] rubenssoto commented on issue #2563: [Feature Request] Full Schema Evolution
rubenssoto commented on issue #2563:
URL: https://github.com/apache/hudi/issues/2563#issuecomment-789181676
I will try to simulate it with a simple script and get back to you.
[GitHub] [hudi] nsivabalan commented on issue #2563: [Feature Request] Full Schema Evolution
nsivabalan commented on issue #2563:
URL: https://github.com/apache/hudi/issues/2563#issuecomment-784284438
@rubenssoto : sorry, this is a bit confusing. Let's simplify it.
Let's say there are only two fields, F1 and F2 (commit_version).
Is this your scenario: the existing dataset in Hudi has only F1, and your new batch of ingestion has schema F1, F2?
Can you please clarify? Basically, what is the schema of the original Hudi dataset, and what is the schema of the new batch of writes?
[GitHub] [hudi] nsivabalan commented on issue #2563: [Feature Request] Full Schema Evolution
nsivabalan commented on issue #2563:
URL: https://github.com/apache/hudi/issues/2563#issuecomment-781391686
thanks.
[GitHub] [hudi] rubenssoto commented on issue #2563: [Feature Request] Full Schema Evolution
rubenssoto commented on issue #2563:
URL: https://github.com/apache/hudi/issues/2563#issuecomment-783862335
`Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to merge old record into new file for key 7176859 from old file s3://dl/courier_api/customer_address/3ee388f2-fa45-437a-a279-d9e3e3369bbd-0_9-137-2635_20210223033155.parquet to new file s3://ld/courier_api/customer_address/3ee388f2-fa45-437a-a279-d9e3e3369bbd-0_9-377-7189_20210223035129.parquet with writerSchema {
"type" : "record",
"name" : "customer_address_record",
"namespace" : "hoodie.customer_address",
"fields" : [ {
"name" : "_hoodie_commit_time",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_commit_seqno",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_record_key",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_partition_path",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_file_name",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "Op",
"type" : [ "string", "null" ]
}, {
"name" : "LineCreatedTimestamp",
"type" : [ "string", "null" ]
}, {
"name" : "created_date",
"type" : [ {
"type" : "long",
"logicalType" : "timestamp-micros"
}, "null" ]
}, {
"name" : "updated_date",
"type" : [ {
"type" : "long",
"logicalType" : "timestamp-micros"
}, "null" ]
}, {
"name" : "id",
"type" : [ "int", "null" ]
}, {
"name" : "address_type",
"type" : [ "string", "null" ]
}, {
"name" : "name",
"type" : [ "string", "null" ]
}, {
"name" : "customer_email",
"type" : [ "string", "null" ]
}, {
"name" : "street",
"type" : [ "string", "null" ]
}, {
"name" : "number",
"type" : [ "string", "null" ]
}, {
"name" : "address_line2",
"type" : [ "string", "null" ]
}, {
"name" : "city",
"type" : [ "string", "null" ]
}, {
"name" : "province",
"type" : [ "string", "null" ]
}, {
"name" : "zipcode",
"type" : [ "string", "null" ]
}, {
"name" : "country",
"type" : [ "string", "null" ]
}, {
"name" : "neighborhood",
"type" : [ "string", "null" ]
}, {
"name" : "latitude",
"type" : [ "double", "null" ]
}, {
"name" : "longitude",
"type" : [ "double", "null" ]
}, {
"name" : "commit_version",
"type" : "long"
}, {
"name" : "_hoodie_is_deleted",
"type" : "boolean"
} ]
}
at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:256)
at org.apache.hudi.table.action.commit.AbstractMergeHelper$UpdateHandler.consumeOneRecord(AbstractMergeHelper.java:122)
at org.apache.hudi.table.action.commit.AbstractMergeHelper$UpdateHandler.consumeOneRecord(AbstractMergeHelper.java:112)
at org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37)
at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
... 3 more
Caused by: java.lang.RuntimeException: Null-value for required field: commit_version
at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:194)
at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128)
at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:299)
at org.apache.hudi.io.storage.HoodieParquetWriter.writeAvro(HoodieParquetWriter.java:94)
at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:251)
... 8 more
Driver stacktrace:
at jobs.TableProcessor.start(TableProcessor.scala:104)
at TableProcessorWrapper$.$anonfun$main$2(TableProcessorWrapper.scala:23)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
at scala.util.Success.$anonfun$map$1(Try.scala:255)
at scala.util.Success.map(Try.scala:213)
at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
ApplicationMaster host: ip-10-0-53-212.us-west-2.compute.internal
ApplicationMaster RPC port: 41723
queue: default
start time: 1614052265461
final status: FAILED
tracking URL: http://ip-10-0-49-168.us-west-2.compute.internal:20888/proxy/application_1613496813774_2805/
user: hadoop`
@nsivabalan I had this error. I have a table without the column commit_version (it is a column that I created); I added commit_version in my script, and the new data tries to update the old records.
Is this problem addressed too?
Thank you so much.
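Note in the trace above that commit_version is declared as a bare `"type" : "long"`, with no `"null"` branch and no default, unlike the earlier fields. A minimal Python sketch (not Hudi or Parquet code) mimicking the required-field check that produces "Null-value for required field" when old records lacking the column are rewritten during the merge:

```python
# Sketch of why the merge fails: the writer schema declares
# commit_version as a required "long" (no "null" union branch), so
# old records written before the column existed cannot be rewritten.

REQUIRED = object()  # marker: field has no null branch and no default

writer_schema = {
    "id": ["int", "null"],       # nullable union -> missing value is OK
    "commit_version": REQUIRED,  # bare "long" -> value is mandatory
}

def write_record(record, schema):
    """Mimic the required-field check: reject a null/missing value for
    any field whose schema has no null branch."""
    for field, field_type in schema.items():
        if record.get(field) is None and field_type is REQUIRED:
            raise RuntimeError(f"Null-value for required field: {field}")
    return record

old_record = {"id": 7176859}  # written before commit_version existed
try:
    write_record(old_record, writer_schema)
except RuntimeError as e:
    print(e)  # Null-value for required field: commit_version
```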
[GitHub] [hudi] nsivabalan commented on issue #2563: [Feature Request] Full Schema Evolution
nsivabalan commented on issue #2563:
URL: https://github.com/apache/hudi/issues/2563#issuecomment-787195135
Adding a new column should work with Hudi:
https://gist.github.com/nsivabalan/dd604527bd5ad62a08272a34425f5fad
Can you check whether you are creating the new column with a null value or a default value set?
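Continuing the plain-Python illustration (an assumption-level sketch, not Hudi internals): if the new column is added as a nullable field with a null default, the evolved Avro schema becomes a union like `["null", "long"]`, and old records that lack the column resolve to the default instead of failing.

```python
# Sketch: with a nullable commit_version and a null default, records
# written before the column existed resolve cleanly during the merge.

nullable_schema = {
    "commit_version": {"type": ["null", "long"], "default": None},
}

def resolve(record, schema):
    """Fill a missing/None value with the field default when the
    field type is a union that includes "null"."""
    out = dict(record)
    for field, spec in schema.items():
        if out.get(field) is None:
            if "null" in spec["type"]:
                out[field] = spec["default"]
            else:
                raise RuntimeError(f"Null-value for required field: {field}")
    return out

print(resolve({"id": 7176859}, nullable_schema))
# {'id': 7176859, 'commit_version': None}
```

In Spark, `df.withColumn("commit_version", lit(None).cast("long"))` produces a nullable column, whereas a literal like `lit(1)` can yield a non-nullable field in the derived Avro schema.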
[GitHub] [hudi] vinothchandar commented on issue #2563: [Feature Request] Full Schema Evolution
vinothchandar commented on issue #2563:
URL: https://github.com/apache/hudi/issues/2563#issuecomment-789156216
cc @n3nash given you were looking into this
[GitHub] [hudi] rubenssoto commented on issue #2563: [Feature Request] Full Schema Evolution
rubenssoto commented on issue #2563:
URL: https://github.com/apache/hudi/issues/2563#issuecomment-784313085
Yeah, that's right.
My existing dataset has only column F1; imagine that F1 is my primary key.
My new batch of data has two fields, F1 and F2, and that new batch has inserts and updates.
With this scenario, I had that problem.
@nsivabalan
[GitHub] [hudi] nsivabalan commented on issue #2563: [Feature Request] Full Schema Evolution
nsivabalan commented on issue #2563:
URL: https://github.com/apache/hudi/issues/2563#issuecomment-781034857
Yeah, we don't support full schema evolution as of today, but it's definitely on our radar:
https://issues.apache.org/jira/browse/HUDI-631
https://issues.apache.org/jira/browse/HUDI-1129
Let us know if you need anything more than what is being tracked already, or feel free to update the JIRAs. Feel free to close out this issue.
[GitHub] [hudi] rubenssoto closed issue #2563: [Feature Request] Full Schema Evolution
rubenssoto closed issue #2563:
URL: https://github.com/apache/hudi/issues/2563
[GitHub] [hudi] rubenssoto commented on issue #2563: [Feature Request] Full Schema Evolution
rubenssoto commented on issue #2563:
URL: https://github.com/apache/hudi/issues/2563#issuecomment-781385127
@nsivabalan good to know, I will close this ticket.