Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/07/05 11:00:34 UTC

[GitHub] [hudi] RajasekarSribalan opened a new issue #1794: [SUPPORT] Hudi delete operation but HiveSync failed

RajasekarSribalan opened a new issue #1794:
URL: https://github.com/apache/hudi/issues/1794


   **Describe the problem you faced**
   
   Hi, we are doing upserts and deletes on Hudi COW tables. It is a Spark Streaming app that reads data from Kafka and upserts it into Hudi. Below is the pseudocode:
   
   1. var df = read from Kafka
   2. df.persist() // we persist the DataFrame because a single DataFrame can contain both upsert and delete records, so we filter them based on a U or D flag
   3. Filter only the upsert records and upsert them into Hudi
   4. Filter only the delete records and delete them from Hudi
   5. df.unpersist()
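
A minimal sketch of steps 2 to 4 above, using plain Scala collections in place of a Spark DataFrame so that it runs without a cluster. The `ChangeRecord` shape, the `op` field and the `SplitByOperation` object are hypothetical names chosen for illustration; the real job would apply the same two filters to the persisted DataFrame.

```scala
// Hypothetical record shape: `op` is "U" for an upsert or "D" for a delete.
final case class ChangeRecord(key: String, op: String, value: String)

object SplitByOperation {
  // Returns (upserts, deletes). Rows carrying any other flag are dropped.
  def split(batch: Seq[ChangeRecord]): (Seq[ChangeRecord], Seq[ChangeRecord]) = {
    val upserts = batch.filter(_.op == "U")
    val deletes = batch.filter(_.op == "D")
    (upserts, deletes)
  }

  def main(args: Array[String]): Unit = {
    val batch = Seq(
      ChangeRecord("k1", "U", "a"),
      ChangeRecord("k2", "D", ""),
      ChangeRecord("k3", "U", "c")
    )
    val (upserts, deletes) = split(batch)
    // Each subset would then go to its own Hudi write (upsert or delete).
    println(s"upserts=${upserts.map(_.key)} deletes=${deletes.map(_.key)}")
  }
}
```

The batch is read twice, once per filter, which is the reason the pseudocode persists the DataFrame before splitting it.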
   
   While doing the delete, it throws the error below. My question: do we need to sync with Hive even for a delete operation? Please confirm.
   
   20/07/05 10:19:20 ERROR hive.HiveSyncTool: Got runtime exception when hive syncing
   java.lang.IllegalArgumentException: Could not find any data file written for commit [20200705101913__commit__COMPLETED], could not get schema for table /user/admin/hudi/users, Metadata :HoodieCommitMetadata{partitionToWriteStats={}, compacted=false, extraMetadata={ROLLING_STAT={
     "partitionToRollingStats" : {
       "" : {
         "d398058e-f8f4-4772-9fcb-012318ac8f47-0" : {
           "fileId" : "d398058e-f8f4-4772-9fcb-012318ac8f47-0",
           "inserts" : 989333,
           "upserts" : 11,
           "deletes" : 0,
           "totalInputWriteBytesToDisk" : 0,
           "totalInputWriteBytesOnDisk" : 49443028
         },
         "eed1f67c-8c46-425f-b740-2e21b84c6f13-0" : {
           "fileId" : "eed1f67c-8c46-425f-b740-2e21b84c6f13-0",
           "inserts" : 1263360,
           "upserts" : 16,
           "deletes" : 0,
           "totalInputWriteBytesToDisk" : 0,
           "totalInputWriteBytesOnDisk" : 49672386
         },
         "e9f38e55-acf2-4bd2-b568-def7361f2f29-0" : {
           "fileId" : "e9f38e55-acf2-4bd2-b568-def7361f2f29-0",
           "inserts" : 946616,
           "upserts" : 6,
           "deletes" : 0,
           "totalInputWriteBytesToDisk" : 0,
           "totalInputWriteBytesOnDisk" : 45686395
         },
         "8a93afac-d60e-41bb-a3e1-edd793e2a932-0" : {
           "fileId" : "8a93afac-d60e-41bb-a3e1-edd793e2a932-0",
           "inserts" : 482202,
           "upserts" : 0,
           "deletes" : 0,
           "totalInputWriteBytesToDisk" : 0,
           "totalInputWriteBytesOnDisk" : 49744729
    
   
   **Expected behavior**
   
   A clear and concise description of what you expected to happen.
   
   **Environment Description**
   
   * Hudi version : 0.5.2
   
   * Spark version : Cloudera Spark 2.2.0
   
   * Hive version : Cloudera Hive 1.1
   
   * Hadoop version :2.6
   
   * Storage (HDFS/S3/GCS..) :HDFS
   
   * Running on Docker? (yes/no) :No
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #1794: [SUPPORT] Hudi delete operation but HiveSync failed

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #1794:
URL: https://github.com/apache/hudi/issues/1794#issuecomment-657723942


   Sure. I will try to reproduce. Will update in two days. 





[GitHub] [hudi] vinothchandar commented on issue #1794: [SUPPORT] Hudi delete operation but HiveSync failed

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1794:
URL: https://github.com/apache/hudi/issues/1794#issuecomment-655881102


   @nsivabalan  might be worth reproducing this in the docker environment?





[GitHub] [hudi] nsivabalan commented on issue #1794: [SUPPORT] Hudi delete operation but HiveSync failed

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #1794:
URL: https://github.com/apache/hudi/issues/1794#issuecomment-659617807


   I couldn't reproduce it locally (everything worked). In general, though, Hive sync is required only when the partition list changes (for example, partitions are added or removed). When I tried, the changes were reflected even without running a Hive sync. So you can safely skip Hive sync for commits that contain only deletes.
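
One way to apply this advice is to derive the Hive sync flag from the write operation, so that delete-only writes never trigger a sync. The sketch below is an illustration, not Hudi code: the `HiveSyncToggle` object and the `withHiveSync` helper are hypothetical names, and the two strings are the datasource option keys behind `OPERATION_OPT_KEY` and `HIVE_SYNC_ENABLED_OPT_KEY`. It assumes, as stated above, that deletes do not change the partition list.

```scala
object HiveSyncToggle {
  // String forms of the Hudi datasource options used in this thread.
  val OperationKey = "hoodie.datasource.write.operation"
  val HiveSyncEnabledKey = "hoodie.datasource.hive_sync.enable"

  // Hypothetical helper: returns the base options plus the operation,
  // with Hive sync turned off only when the operation is "delete".
  def withHiveSync(base: Map[String, String], operation: String): Map[String, String] = {
    val syncEnabled = operation != "delete"
    base + (OperationKey -> operation) + (HiveSyncEnabledKey -> syncEnabled.toString)
  }

  def main(args: Array[String]): Unit = {
    val common = Map("hoodie.table.name" -> "users")
    println(withHiveSync(common, "upsert"))
    println(withHiveSync(common, "delete"))
  }
}
```

The resulting map can be passed to the writer with `.options(...)` in place of the separate `OPERATION_OPT_KEY` and `HIVE_SYNC_ENABLED_OPT_KEY` calls, leaving the rest of the write unchanged.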





[GitHub] [hudi] nsivabalan closed issue #1794: [SUPPORT] Hudi delete operation but HiveSync failed

Posted by GitBox <gi...@apache.org>.
nsivabalan closed issue #1794:
URL: https://github.com/apache/hudi/issues/1794


   





[GitHub] [hudi] vinothchandar commented on issue #1794: [SUPPORT] Hudi delete operation but HiveSync failed

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1794:
URL: https://github.com/apache/hudi/issues/1794#issuecomment-655211543


   `20200705101913__commit__COMPLETED` indicates that the commit actually completed successfully. I think the error is saying that the commit file does not reference a data file from which it can fetch the schema.
   
   Do you mind sharing the `20200705101913.commit` file from the timeline? @bhasudha, can you check whether there are any changes in 0.5.3 around this? My understanding is that we currently write the schema together with the commit file, so it does not even have to look for a data file.
   
   @RajasekarSribalan, if you are up for it and it is easy to do, can you give the master branch a shot?





[GitHub] [hudi] RajasekarSribalan edited a comment on issue #1794: [SUPPORT] Hudi delete operation but HiveSync failed

Posted by GitBox <gi...@apache.org>.
RajasekarSribalan edited a comment on issue #1794:
URL: https://github.com/apache/hudi/issues/1794#issuecomment-654588991


   @bhasudha just an update: our jobs are not failing, but we get this error for the hard delete operation. Below is the command we use on the DataFrame for the delete operation.
   
   My concern is whether Hudi is doing a rollback because of the error. I hope it is not; please confirm.
   
   ERROR hive.HiveSyncTool: Got runtime exception when hive syncing
   java.lang.IllegalArgumentException: Could not find any data file written for commit [20200705101913__commit__COMPLETED], could not get schema for table /user/admin/hudi/users, Metadata :HoodieCommitMetadata{partitionToWriteStats={}, compacted=false, extraMetadata={ROLLING_STAT={
     "partitionToRollingStats" : {
       "" : {
         "d398058e-f8f4-4772-9fcb-012318ac8f47-0" : {
           "fileId" : "d398058e-f8f4-4772-9fcb-012318ac8f47-0",
           "inserts" : 989333,
   
   deleteDataframe.write
                   .format("hudi")
                   .options(getQuickstartWriteConfigs)
                   .option(OPERATION_OPT_KEY, "delete")
                   .option(PRECOMBINE_FIELD_OPT_KEY, hudi_precombine_key)
                   .option(RECORDKEY_FIELD_OPT_KEY, hudi_key)
                   .option(PARTITIONPATH_FIELD_OPT_KEY, "")
                   .option(TABLE_NAME, tablename)
                   .option(TABLE_TYPE_OPT_KEY, "COPY_ON_WRITE")
                   .option(HIVE_SYNC_ENABLED_OPT_KEY, "true")
                   .option(HIVE_URL_OPT_KEY, "jdbc:hive2://XXXXXX:10000/;principal=XXXX/XXXXXXX")
                   .option(HIVE_DATABASE_OPT_KEY, hudi_db)
                   .option(HIVE_TABLE_OPT_KEY, tablename)
                   .option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, classOf[NonPartitionedExtractor].getName)
                   .option(PAYLOAD_CLASS_OPT_KEY, "org.apache.hudi.EmptyHoodieRecordPayload")
                   .mode(Append)
                   .save("/user/XXXXX/hudi/" + tablename)
   
   
   





[GitHub] [hudi] nsivabalan edited a comment on issue #1794: [SUPPORT] Hudi delete operation but HiveSync failed

Posted by GitBox <gi...@apache.org>.
nsivabalan edited a comment on issue #1794:
URL: https://github.com/apache/hudi/issues/1794#issuecomment-659617807


   I couldn't reproduce it locally (everything worked). In general, though, Hive sync is required only when the partition list changes (for example, partitions are added or removed). When I tried a commit with only deletes, the changes were reflected even without running a Hive sync. So you can safely skip Hive sync for commits that contain only deletes.





[GitHub] [hudi] bhasudha commented on issue #1794: [SUPPORT] Hudi delete operation but HiveSync failed

Posted by GitBox <gi...@apache.org>.
bhasudha commented on issue #1794:
URL: https://github.com/apache/hudi/issues/1794#issuecomment-654249075


   @RajasekarSribalan  quick questions for clarity:
   
   - Do you intend to take the deletes from Kafka and apply them to Hudi as well? Or is that strictly transformation logic that filters them out in Spark Streaming before anything is loaded into the Hudi table?
   - If the answer to the first question is yes, how are you doing the delete: soft or hard? https://hudi.apache.org/docs/writing_data.html#deletes 





[GitHub] [hudi] RajasekarSribalan commented on issue #1794: [SUPPORT] Hudi delete operation but HiveSync failed

Posted by GitBox <gi...@apache.org>.
RajasekarSribalan commented on issue #1794:
URL: https://github.com/apache/hudi/issues/1794#issuecomment-654281760


   @bhasudha 
    
   - We have transformation logic for deletes and hence we are filtering the delete records.
   
   - We are doing hard deletes, and "delete" is the operation we set in the Spark write for the delete records.





[GitHub] [hudi] nsivabalan commented on issue #1794: [SUPPORT] Hudi delete operation but HiveSync failed

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #1794:
URL: https://github.com/apache/hudi/issues/1794#issuecomment-660027464


   Please reopen if you need anything. 





[GitHub] [hudi] RajasekarSribalan edited a comment on issue #1794: [SUPPORT] Hudi delete operation but HiveSync failed

Posted by GitBox <gi...@apache.org>.
RajasekarSribalan edited a comment on issue #1794:
URL: https://github.com/apache/hudi/issues/1794#issuecomment-655261425


   Thanks @vinothchandar @bhasudha. I'll try to fetch the commit file, but for now I have
   disabled Hive sync for the delete operation, and I no longer get this error at
   all. Do you have any comments on this?
   
   As I said, when I enable Hive sync for the delete operation I get this error, but
   not when I disable it.
   
   On Wed, 8 Jul 2020, 6:07 am vinoth chandar, <no...@github.com>
   wrote:
   
   > 20200705101913__commit__COMPLETED indicates the commit was completed
   > successfully actually.. I think it's saying the commit file does not have a
   > data file, from which it can fetch schema. ..
   >
   > Do you mind sharing the 20200705101913.commit file from timeline?
   > @bhasudha <https://github.com/bhasudha> can you check if there are any
   > changes in 0.5.3 around this. My understanding is that currently we are
   > writing the schema together with the commit file, so it does not even have
   > to look for a data file..
   >
   > @RajasekarSribalan <https://github.com/RajasekarSribalan> if you are up
   > for it, and easy to do, can you give master branch a shot?
   >



[GitHub] [hudi] RajasekarSribalan commented on issue #1794: [SUPPORT] Hudi delete operation but HiveSync failed

Posted by GitBox <gi...@apache.org>.
RajasekarSribalan commented on issue #1794:
URL: https://github.com/apache/hudi/issues/1794#issuecomment-654588991


   @bhasudha just an update: our jobs are not failing, but we get this error for the hard delete operation. Below is the command we use on the DataFrame for the delete operation.
   
   My concern is whether Hudi is doing a rollback before the error. I hope it is not; please confirm.
   
   ERROR hive.HiveSyncTool: Got runtime exception when hive syncing
   java.lang.IllegalArgumentException: Could not find any data file written for commit [20200705101913__commit__COMPLETED], could not get schema for table /user/admin/hudi/users, Metadata :HoodieCommitMetadata{partitionToWriteStats={}, compacted=false, extraMetadata={ROLLING_STAT={
     "partitionToRollingStats" : {
       "" : {
         "d398058e-f8f4-4772-9fcb-012318ac8f47-0" : {
           "fileId" : "d398058e-f8f4-4772-9fcb-012318ac8f47-0",
           "inserts" : 989333,
   
   deleteDataframe.write
                   .format("hudi")
                   .options(getQuickstartWriteConfigs)
                   .option(OPERATION_OPT_KEY, "delete")
                   .option(PRECOMBINE_FIELD_OPT_KEY, hudi_precombine_key)
                   .option(RECORDKEY_FIELD_OPT_KEY, hudi_key)
                   .option(PARTITIONPATH_FIELD_OPT_KEY, "")
                   .option(TABLE_NAME, tablename)
                   .option(TABLE_TYPE_OPT_KEY, "COPY_ON_WRITE")
                   .option(HIVE_SYNC_ENABLED_OPT_KEY, "true")
                   .option(HIVE_URL_OPT_KEY, "jdbc:hive2://XXXXXX:10000/;principal=XXXX/XXXXXXX")
                   .option(HIVE_DATABASE_OPT_KEY, hudi_db)
                   .option(HIVE_TABLE_OPT_KEY, tablename)
                   .option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, classOf[NonPartitionedExtractor].getName)
                   .option(PAYLOAD_CLASS_OPT_KEY, "org.apache.hudi.EmptyHoodieRecordPayload")
                   .mode(Append)
                   .save("/user/XXXXX/hudi/" + tablename)
   
   
   





[GitHub] [hudi] RajasekarSribalan commented on issue #1794: [SUPPORT] Hudi delete operation but HiveSync failed

Posted by GitBox <gi...@apache.org>.
RajasekarSribalan commented on issue #1794:
URL: https://github.com/apache/hudi/issues/1794#issuecomment-655261425


   Thanks Vinoth. I'll try to fetch the commit file, but for now I have
   disabled Hive sync for the delete operation, and I no longer get this error at
   all. Do you have any comments on this?
   
   As I said, when I enable Hive sync for the delete operation I get this error, but
   not when I disable it.
   
   On Wed, 8 Jul 2020, 6:07 am vinoth chandar, <no...@github.com>
   wrote:
   
   > 20200705101913__commit__COMPLETED indicates the commit was completed
   > successfully actually.. I think it's saying the commit file does not have a
   > data file, from which it can fetch schema. ..
   >
   > Do you mind sharing the 20200705101913.commit file from timeline?
   > @bhasudha <https://github.com/bhasudha> can you check if there are any
   > changes in 0.5.3 around this. My understanding is that currently we are
   > writing the schema together with the commit file, so it does not even have
   > to look for a data file..
   >
   > @RajasekarSribalan <https://github.com/RajasekarSribalan> if you are up
   > for it, and easy to do, can you give master branch a shot?
   >



[GitHub] [hudi] RajasekarSribalan commented on issue #1794: [SUPPORT] Hudi delete operation but HiveSync failed

Posted by GitBox <gi...@apache.org>.
RajasekarSribalan commented on issue #1794:
URL: https://github.com/apache/hudi/issues/1794#issuecomment-660030474


   Thank you for your response 😊
   
   On Fri, 17 Jul 2020, 4:02 pm Sivabalan Narayanan, <no...@github.com>
   wrote:
   
   > Closed #1794 <https://github.com/apache/hudi/issues/1794>.
   >