Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/04/02 19:41:34 UTC

[GitHub] [incubator-hudi] symfrog opened a new issue #1480: [SUPPORT]

symfrog opened a new issue #1480: [SUPPORT]
URL: https://github.com/apache/incubator-hudi/issues/1480
 
 
   **Describe the problem you faced**
   
   Is there any way to retain the commit instant time for records when using delta streamer with a Hudi table source? 
   
   I took a look at the code, and it does not seem possible. 
   
   I am trying to migrate tables, but would like downstream clients to be able to continue doing incremental pulls transparently using their existing instant time values after migration. 
   
   Is there any other way to achieve this? 
   
   It seems it might be possible by constructing a HoodieWriteClient directly and using startCommitWithTime (https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java#L867). Would this be a viable route?
   
   **Environment Description**
   
   * Hudi version : 0.5.2
   
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] xushiyan commented on issue #1480: [SUPPORT] Backwards Incompatible Schema Evolution

Posted by GitBox <gi...@apache.org>.
xushiyan commented on issue #1480: [SUPPORT] Backwards Incompatible Schema Evolution
URL: https://github.com/apache/incubator-hudi/issues/1480#issuecomment-609466504
 
 
   @vinothchandar Yes, the exporter tool can be used for this purpose, with some changes. It currently supports copying a Hudi dataset as is. For this migration use case, we could extend the feature to apply a transformation when `--output-format hudi` is set, using a custom `Transformer`.
   
   MOR is a bit troublesome because of log file conversions, so could we start with COW table support? Does this work for your case? @symfrog 
   
   The splitting/merging use cases are feasible as well; the exporter would need some more logic to take multiple source/target paths, and the `Transformer` interface would need some work to support multiple datasets.
   
   @vinothchandar Are my thoughts above aligned with yours?


[GitHub] [incubator-hudi] symfrog commented on issue #1480: [SUPPORT]

Posted by GitBox <gi...@apache.org>.
symfrog commented on issue #1480: [SUPPORT]
URL: https://github.com/apache/incubator-hudi/issues/1480#issuecomment-608114543
 
 
   @bvaradar the purpose is to handle an unavoidable schema evolution that is not backward compatible: we would maintain the original tables for some period of time so that downstream clients can migrate to the new set of tables. 
   
   The new set of tables would be a transformation (e.g. rename columns) of the original tables. 
   
   However, we would like downstream clients to be able to use their instant values to continue to do incremental pulls without receiving data they have already processed when they switch over to the new tables (conforming to the new schema). 
   
   The new tables would be created during an initialization process to ingest all the data from the old tables and transform it to the new schema. After this initialization process, we would like the instant timestamps to be the same in the new target tables after the transformation so that downstream clients can continue to use their existing instant values while performing incremental pull queries. 
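
A minimal sketch of why the preserved instants matter, assuming an incremental pull returns records whose `_hoodie_commit_time` is strictly greater than the client's begin instant. Plain dicts stand in for records; the helper name and the sample instants are hypothetical, not Hudi API.

```python
# Illustrative only: plain dicts stand in for Hudi records.
# `_hoodie_commit_time` is the Hudi metadata field holding the instant time;
# the helper name and the sample instants are hypothetical.

def incremental_pull(records, begin_instant):
    """Return records committed strictly after begin_instant."""
    # Instant times are fixed-width yyyyMMddHHmmss strings, so plain string
    # comparison matches chronological order.
    return [r for r in records if r["_hoodie_commit_time"] > begin_instant]


old_table = [
    {"id": 1, "_hoodie_commit_time": "20200301120000"},
    {"id": 2, "_hoodie_commit_time": "20200315120000"},
    {"id": 3, "_hoodie_commit_time": "20200401120000"},
]

# Naive migration: every record is rewritten under one new bootstrap instant.
rebootstrapped = [dict(r, _hoodie_commit_time="20200402194134") for r in old_table]

# Migration that keeps the original instant of every record.
preserved = [dict(r) for r in old_table]

checkpoint = "20200315120000"  # last instant a downstream client processed

ids_rebootstrapped = [r["id"] for r in incremental_pull(rebootstrapped, checkpoint)]
ids_preserved = [r["id"] for r in incremental_pull(preserved, checkpoint)]
```

With a single bootstrap commit, a client resuming from its old checkpoint receives every record again; with preserved instants it receives only the record it has not yet seen.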


[GitHub] [incubator-hudi] symfrog commented on issue #1480: [SUPPORT] Backwards Incompatible Schema Evolution

Posted by GitBox <gi...@apache.org>.
symfrog commented on issue #1480: [SUPPORT] Backwards Incompatible Schema Evolution
URL: https://github.com/apache/incubator-hudi/issues/1480#issuecomment-608650572
 
 
   @vinothchandar yes, exactly, some schema evolution operations may also involve the splitting or merging of tables
   
   


[GitHub] [incubator-hudi] bvaradar commented on issue #1480: [SUPPORT]

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1480: [SUPPORT]
URL: https://github.com/apache/incubator-hudi/issues/1480#issuecomment-608081240
 
 
   It is not clear what the advantage of retaining instant timestamps between pipelines is. Can you elaborate on the issue you face when migrating tables? 
   
   To your code question: commits are not the only actions that work on instant times. Other background actions such as compaction and cleaning use internally generated timestamps, so those need to be handled too. My suggestion is to first make sure that this way of operating has a valid use case and is really needed. 
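
To make the point about non-commit actions concrete, here is a rough sketch that groups timeline files by action. It assumes each file in `.hoodie` is named `<instant>.<action>[.<state>]`; the sample file names are made up and the helper is hypothetical.

```python
from collections import defaultdict

# Assumed layout: each timeline file is named "<instant>.<action>[.<state>]",
# e.g. "20200402194134.commit" or "20200402200000.compaction.requested".
# The file names below are made-up samples, not taken from a real table.

def group_instants_by_action(timeline_files):
    """Map each action (commit, clean, compaction, ...) to its sorted instants."""
    by_action = defaultdict(set)
    for name in timeline_files:
        parts = name.split(".")
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # skip hoodie.properties and other non-instant files
        instant, action = parts[0], parts[1]
        by_action[action].add(instant)
    return {action: sorted(instants) for action, instants in by_action.items()}


sample = [
    "hoodie.properties",
    "20200401120000.commit",
    "20200401130000.clean",
    "20200402194134.commit",
    "20200402200000.compaction.requested",
]
actions = group_instants_by_action(sample)
```

Any scheme that replays commit instants into a new table would also have to decide what to do with the instants listed under the other actions.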
   
   


[GitHub] [incubator-hudi] xushiyan commented on issue #1480: [SUPPORT] Backwards Incompatible Schema Evolution

Posted by GitBox <gi...@apache.org>.
xushiyan commented on issue #1480: [SUPPORT] Backwards Incompatible Schema Evolution
URL: https://github.com/apache/incubator-hudi/issues/1480#issuecomment-610745060
 
 
   @bvaradar Yes, I marked HUDI-767 for 0.6.0. I'll put HUDI-768 on the waiting list for the moment 😄 


[GitHub] [incubator-hudi] xushiyan commented on issue #1480: [SUPPORT] Backwards Incompatible Schema Evolution

Posted by GitBox <gi...@apache.org>.
xushiyan commented on issue #1480: [SUPPORT] Backwards Incompatible Schema Evolution
URL: https://github.com/apache/incubator-hudi/issues/1480#issuecomment-610106545
 
 
   @vinothchandar filed
   https://jira.apache.org/jira/browse/HUDI-767
   https://jira.apache.org/jira/browse/HUDI-768
   
   @bvaradar @vinothchandar ok to close this?


[GitHub] [incubator-hudi] vinothchandar commented on issue #1480: [SUPPORT] Backwards Incompatible Schema Evolution

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1480: [SUPPORT] Backwards Incompatible Schema Evolution
URL: https://github.com/apache/incubator-hudi/issues/1480#issuecomment-608529410
 
 
   > we would like the instant timestamps to be the same in the new target tables after the transformation so that downstream clients can continue to use their existing instant values while performing incremental pull queries. 
   
   IIUC the current initialization process hands you a single commit for the first ingest, but you basically want a physical copy of the old data as the new data, with just renamed fields/a new schema. In general, this may be worth supporting in the new exporter tool, cc @xushiyan, wdyt? Essentially, something that preserves file names and just transforms the data. 
   
   For now, even if you create those commit timeline files yourself in `.hoodie`, it may not work, since the metadata inside will point to files that do not exist in the new table. Here's an approach that could work: write a small program that will 
   
   - First copy the `.hoodie` folder to the new table location
   - Then list all files (directly using fs.listStatus()) and keep only those whose commit time < latest commit time in the `.hoodie` folder you copied above
   - Read all files using AvroParquetReader to get an RDD[GenericRecord] (if it's MOR, we need more work), then adjust the schema to derive a new RDD[GenericRecord]
   - Write this out using HoodieAvroParquetWriter back into the same file names
   
   Essentially, you will have the same file names and the same timeline (`.hoodie`) metadata, just with a different schema. 
   
   Let's also wait to hear from @xushiyan; maybe the exporter tool could be reused here.
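
The file selection and field renaming steps above can be sketched as follows. This is Python for illustration only and does not use AvroParquetReader or HoodieAvroParquetWriter. It assumes COW base files are named `<fileId>_<writeToken>_<instantTime>.parquet`; the helpers, file names, and rename mapping are hypothetical. The list above says `<`; this sketch uses `<=` on the assumption that files written by the latest commit itself should also be carried over.

```python
# Assumption: COW base files are named "<fileId>_<writeToken>_<instantTime>.parquet".
# All file names, the rename mapping, and the helpers are hypothetical.

def instant_of(file_name):
    """Extract the instant time from a base file name, or None if it does not match."""
    if not file_name.endswith(".parquet"):
        return None
    stem = file_name[: -len(".parquet")]
    parts = stem.split("_")
    if len(parts) != 3 or not parts[2].isdigit():
        return None
    return parts[2]


def files_to_rewrite(file_names, latest_instant):
    """Keep base files whose instant is at or before the latest copied commit."""
    kept = []
    for name in file_names:
        instant = instant_of(name)
        # Fixed-width instants compare correctly as strings.
        if instant is not None and instant <= latest_instant:
            kept.append(name)
    return kept


def rename_fields(record, mapping):
    """Rename record keys per mapping; unmapped keys (e.g. metadata) are kept."""
    return {mapping.get(key, key): value for key, value in record.items()}


files = [
    "a1b2-0_0-1-1_20200301120000.parquet",
    "a1b2-0_0-5-9_20200402194134.parquet",
    "c3d4-0_0-7-2_20200405000000.parquet",
    ".hoodie_partition_metadata",
]
selected = files_to_rewrite(files, "20200402194134")
renamed = rename_fields(
    {"_hoodie_commit_time": "20200301120000", "user_name": "a"},
    {"user_name": "username"},
)
```

Each selected file would then be rewritten under its original name with the renamed records, so the copied timeline metadata still points at files that exist.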
   
   
   
   
   
   
   


[GitHub] [incubator-hudi] vinothchandar edited a comment on issue #1480: [SUPPORT] Backwards Incompatible Schema Evolution

Posted by GitBox <gi...@apache.org>.
vinothchandar edited a comment on issue #1480: [SUPPORT] Backwards Incompatible Schema Evolution
URL: https://github.com/apache/incubator-hudi/issues/1480#issuecomment-609914596
 
 
   >we could start with COW tables support?
   
   sg. as long as we throw a loud exception saying MOR + transformer is not supported :) 
   
   Time to file a JIRA?
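
A rough sketch of the kind of guard meant here, assuming the exporter knows the table type and whether a transformer was configured. The function and exception names are hypothetical; only the `COPY_ON_WRITE` / `MERGE_ON_READ` table type names come from Hudi.

```python
class UnsupportedTableTypeError(Exception):
    """Raised when a transformer is requested for a table type that cannot be handled."""


def check_export_request(table_type, transformer):
    """Fail loudly for MOR + transformer; allow every other combination."""
    # table_type is "COPY_ON_WRITE" or "MERGE_ON_READ"; transformer is None
    # when the table is copied as is.
    if transformer is not None and table_type == "MERGE_ON_READ":
        raise UnsupportedTableTypeError(
            "MOR + transformer is not supported; only COW tables can be transformed"
        )
    return True
```

Failing before any file is copied keeps a half-transformed MOR table from ever being written.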



[GitHub] [incubator-hudi] symfrog commented on issue #1480: [SUPPORT] Backwards Incompatible Schema Evolution

Posted by GitBox <gi...@apache.org>.
symfrog commented on issue #1480: [SUPPORT] Backwards Incompatible Schema Evolution
URL: https://github.com/apache/incubator-hudi/issues/1480#issuecomment-609489074
 
 
   @xushiyan Yes, thanks, that would work. I am using COW for the tables.


[GitHub] [incubator-hudi] bvaradar closed issue #1480: [SUPPORT] Backwards Incompatible Schema Evolution

Posted by GitBox <gi...@apache.org>.
bvaradar closed issue #1480: [SUPPORT] Backwards Incompatible Schema Evolution
URL: https://github.com/apache/incubator-hudi/issues/1480
 
 
   


[GitHub] [incubator-hudi] bvaradar commented on issue #1480: [SUPPORT] Backwards Incompatible Schema Evolution

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1480: [SUPPORT] Backwards Incompatible Schema Evolution
URL: https://github.com/apache/incubator-hudi/issues/1480#issuecomment-610133949
 
 
   Thanks @xushiyan. If you are planning to have the JIRAs done in 0.6.0, can you mark the fix versions accordingly? 
