Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/04/12 07:30:22 UTC

[GitHub] [hudi] kasured opened a new issue, #5298: [SUPPORT] File is deleted during inline compaction on MOR table causing subsequent FileNotFoundException on a reader

kasured opened a new issue, #5298:
URL: https://github.com/apache/hudi/issues/5298

   **Describe the problem you faced**
   
   When inline compaction is turned on and a compaction plan has completed, the resulting commit file references a file that was deleted during the compaction process. This later causes readers to fail with a FileNotFoundException.
   
   **To Reproduce**
   
   I can reproduce the issue consistently. After the first compaction completes, all subsequent reads fail because the commit file references a parquet file that has already been deleted from storage. Please see the Additional Context section for more details. The issue can only be reproduced when multiple tables are used within the same SparkSession.
   
   **Expected behavior**
   
   After inline compaction, the commit files in the .hoodie folder should be in sync with the files on the file system. Also, no files should be deleted during the compaction.
   
   **Environment Description**
   
   * EMR version: 6.5.0
   
   * Hudi version : 0.9.0-amzn-1	
   
   * Spark version : 3.1.2
   
   * Hadoop version : 3.2.1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   **Additional context**
   
   We are using Spark streaming with Kafka topics as a source. The pipeline is: Topic -> foreachBatch -> DataFrame write -> Hudi MOR table. For each table we use the following related configuration options:
   
   ```
           "hoodie.datasource.write.table.type" = "MERGE_ON_READ"
           "hoodie.datasource.write.hive_style_partitioning" = "true"
           "hoodie.finalize.write.parallelism" = "4"
           "hoodie.upsert.shuffle.parallelism" = "4"
           "hoodie.compact.inline" = "true"
           "hoodie.compact.inline.max.delta.seconds" = "3600"
           "hoodie.compact.inline.trigger.strategy" = "TIME_ELAPSED"
           "hoodie.clean.automatic" = "true"
           "hoodie.cleaner.policy" = "KEEP_LATEST_COMMITS"
           "hoodie.cleaner.commits.retained" = "18"
           "hoodie.metadata.cleaner.commits.retained" = "18"
           "hoodie.keep.min.commits" = "36"
           "hoodie.keep.max.commits" = "72"
           "hoodie.clustering.inline" = "false"
           "hoodie.clustering.inline.max.commits" = "4"
           "hoodie.clustering.plan.strategy.target.file.max.bytes" = "1073741824"
           "hoodie.clustering.plan.strategy.small.file.limit" = "629145600"
           "hoodie.metadata.enable" = "false"
           "hoodie.metadata.keep.min.commits" = "36"
           "hoodie.metadata.keep.max.commits" = "72"
           "hoodie.datasource.compaction.async.enable" = "true"
   ```
   **Course of Events**
   
   Let us take the file that the reader fails to find, 4ce009e6-5622-4874-bb5d-a11e3bb9eaa3-0_1-1177-14259_20220411202305.parquet, and trace what happened to it.
   
   * The compaction has completed, and no clean has run yet that could have deleted the old files
   ╔═════════════════════════╤═══════════╤═══════════════════════════════╗
   ║ Compaction Instant Time │ State     │ Total FileIds to be Compacted ║
   ╠═════════════════════════╪═══════════╪═══════════════════════════════╣
   ║ 20220411202305          │ COMPLETED │ 3                             ║
   ╚═════════════════════════╧═══════════╧═══════════════════════════════╝
   
   ╔═══════════╤═════════════════════════╤═════════════════════╤══════════════════╗
   ║ CleanTime │ EarliestCommandRetained │ Total Files Deleted │ Total Time Taken ║
   ╠═══════════╧═════════════════════════╧═════════════════════╧══════════════════╣
   ║ (empty)                                                                      ║
   ╚══════════════════════════════════════════════════════════════════════════════╝
   
   * On S3 we can see the following timeline files for the compaction. Note the modification times:
   20220411202305.commit                  commit       April 11, 2022, 22:23:55 (UTC+02:00)
   20220411202305.compaction.inflight     inflight     April 11, 2022, 22:23:08 (UTC+02:00)
   20220411202305.compaction.requested    requested    April 11, 2022, 22:23:07 (UTC+02:00)
   
   * On S3 we can see the following objects for this file group:
   4ce009e6-5622-4874-bb5d-a11e3bb9eaa3-0_1-1177-14259_20220411202305.parquet    Delete marker    April 11, 2022, 22:23:55 (UTC+02:00)
   4ce009e6-5622-4874-bb5d-a11e3bb9eaa3-0_1-1177-14259_20220411202305.parquet    parquet          April 11, 2022, 22:23:28 (UTC+02:00)
   
   4ce009e6-5622-4874-bb5d-a11e3bb9eaa3-0_1-1198-14280_20220411202305.parquet    parquet          April 11, 2022, 22:23:54 (UTC+02:00)
   4ce009e6-5622-4874-bb5d-a11e3bb9eaa3-0_2-75-1434_20220411191603.parquet       parquet          April 11, 2022, 21:19:15 (UTC+02:00)
   
   Note that the file in question received a delete marker at 22:23:55, the same moment the compaction commit was written. Note also that the only part of the file name that changed is the writeToken: from that moment on, the new file 4ce009e6-5622-4874-bb5d-a11e3bb9eaa3-0_1-1198-14280_20220411202305.parquet is the one present on S3. However, this new file is not reflected in 20220411202305.commit, which still references the deleted file, as can be seen below:
   
   ```
   "fileId" : "4ce009e6-5622-4874-bb5d-a11e3bb9eaa3-0",
   "path" : "cluster=96/shard=14377/4ce009e6-5622-4874-bb5d-a11e3bb9eaa3-0_1-1177-14259_20220411202305.parquet",
   "prevCommit" : "20220411191603",
   "numWrites" : 122486,
   "numDeletes" : 0,
   "numUpdateWrites" : 122457,
   "numInserts" : 0,
   "totalWriteBytes" : 7528604,
   "totalWriteErrors" : 0,
   "tempPath" : null,
   "partitionPath" : "cluster=96/shard=14377",
   "totalLogRecords" : 846489,
   "totalLogFilesCompacted" : 7,
   "totalLogSizeCompacted" : 325539587,
   "totalUpdatedRecordsCompacted" : 122457,
   "totalLogBlocks" : 7,
   "totalCorruptLogBlock" : 0,
   "totalRollbackBlocks" : 0,
   "fileSizeInBytes" : 7528604,
   "minEventTime" : null,
   "maxEventTime" : null
   
   "fileIdAndRelativePaths" : {
       "4ce009e6-5622-4874-bb5d-a11e3bb9eaa3-0" : "cluster=96/shard=14377/4ce009e6-5622-4874-bb5d-a11e3bb9eaa3-0_1-1177-14259_20220411202305.parquet",
       "0139f10d-7a88-481b-b5df-6516500076b0-0" : "cluster=96/shard=14377/0139f10d-7a88-481b-b5df-6516500076b0-0_0-1177-14258_20220411202305.parquet",
       "21000940-a573-4c46-8ad5-79003ac9daf5-0" : "cluster=96/shard=14377/21000940-a573-4c46-8ad5-79003ac9daf5-0_2-1177-14260_20220411202305.parquet"
     },
   "totalRecordsDeleted" : 0,
   "totalLogRecordsCompacted" : 2543875,
   "totalLogFilesCompacted" : 21,
   "totalCompactedRecordsUpdated" : 368137,
   "totalLogFilesSize" : 978467842,
   "totalScanTime" : 45421,
   ``` 
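
For illustration, the base file names in this report appear to follow the layout `<fileId>_<writeToken>_<instantTime>.parquet`. The following is a minimal Python sketch, not Hudi code: the helper `parse_base_file_name` and the `BaseFileName` type are hypothetical names introduced here, and the three-part layout is an assumption based on the names in the S3 listing. It splits the deleted file name and the surviving one and reports which part differs.

```python
from typing import NamedTuple


class BaseFileName(NamedTuple):
    file_id: str
    write_token: str
    instant_time: str


def parse_base_file_name(name: str) -> BaseFileName:
    """Split '<fileId>_<writeToken>_<instantTime>.parquet' into its parts.

    Hypothetical helper for illustration only; it assumes the three-part
    layout seen in the S3 listing of this report.
    """
    stem = name.rsplit(".", 1)[0]  # drop the '.parquet' extension
    file_id, write_token, instant_time = stem.split("_")
    return BaseFileName(file_id, write_token, instant_time)


# The file referenced by the commit (now delete-marked on S3).
deleted = parse_base_file_name(
    "4ce009e6-5622-4874-bb5d-a11e3bb9eaa3-0_1-1177-14259_20220411202305.parquet")
# The file that is actually present on S3 after the compaction.
kept = parse_base_file_name(
    "4ce009e6-5622-4874-bb5d-a11e3bb9eaa3-0_1-1198-14280_20220411202305.parquet")

# Names of the fields whose values differ between the two files.
differing = [f for f in BaseFileName._fields if getattr(deleted, f) != getattr(kept, f)]
print(differing)
```

Both names share the fileId and the instant time, so only the write token distinguishes the file recorded in the commit from the one left on storage.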
   
   * Now when I check the file system view with the command `show fsview latest`, Hudi shows the new file, not the deleted one
   ```
   ║ Partition               │ FileId                                 │ Base-Instant   │ Data-File                                               │ Data-File Size │ Num Delta Files │ Total Delta Size │ Delta Size - compaction scheduled │ Delta Size - compaction unscheduled │ Delta To Base Ratio - compaction scheduled │ Delta To Base Ratio - compaction unscheduled │ Delta Files - compaction scheduled                                                                           │ Delta Files - compaction unscheduled ║
   
   ║ cluster=96/shard=14377/ │ 4ce009e6-5622-4874-bb5d-a11e3bb9eaa3-0 │ 20220411202305 │ s3://some-bucket/landing-zone/some_table/cluster=96/shard=14377/4ce009e6-5622-4874-bb5d-a11e3bb9eaa3-0_1-1198-14280_20220411202305.parquet │ 7.2 MB         │ 4               │ 178.2 MB         │ 178.2 MB                          │ 0.0 B                               │ 24.814273985207016         │ 0.0                                          │ [HoodieLogFile{pathStr='s3://some-bucket/landing-zone/some_table/cluster=96/shard=14377/.4ce009e6-5622-4874-bb5d-a11e3bb9eaa3-0_20220411202305.log.4_0-1758-20808', fileLen=46785516}, HoodieLogFile{pathStr='s3://some-bucket/landing-zone/some_table/cluster=96/shard=14377/.4ce009e6-5622-4874-bb5d-a11e3bb9eaa3-0_20220411202305.log.3_1-1610-19092', fileLen=46440363}, HoodieLogFile{pathStr='s3://some-bucket/landing-zone/some_table/cluster=96/shard=14377/.4ce009e6-5622-4874-bb5d-a11e3bb9eaa3-0_20220411202305.log.2_0-1450-17202', fileLen=47068201}, HoodieLogFile{pathStr='s3://some-bucket/landing-zone/some_table/cluster=96/shard=14377/.4ce009e6-5622-4874-bb5d-a11e3bb9eaa3-0_20220411202305.log.1_0-1297-15405', fileLen=46545393}] │ []
   ```
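
The mismatch between the commit file and the storage listing can be expressed as a simple set difference. The sketch below is plain Python under stated assumptions: the function `dangling_paths` is a hypothetical helper, and the two input lists are typed in by hand from the commit excerpt and the S3 listing in this report. In practice they would be read from the commit file and from an object listing.

```python
def dangling_paths(committed_paths, live_paths):
    """Return committed paths that do not exist on storage, sorted.

    Hypothetical helper for illustration; not part of Hudi.
    """
    return sorted(set(committed_paths) - set(live_paths))


partition = "cluster=96/shard=14377"
file_id = "4ce009e6-5622-4874-bb5d-a11e3bb9eaa3-0"

# Path recorded for this file group in 20220411202305.commit.
committed = [f"{partition}/{file_id}_1-1177-14259_20220411202305.parquet"]

# Base files of this file group that are live (not delete-marked) on S3.
live = [
    f"{partition}/{file_id}_1-1198-14280_20220411202305.parquet",
    f"{partition}/{file_id}_2-75-1434_20220411191603.parquet",
]

missing = dangling_paths(committed, live)
print(missing)
```

A non-empty result means the commit references at least one file that a reader will not be able to open, which matches the FileNotFoundException in the stack trace.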
   
   **Tried options**
   
   * Turn off File Sizing by setting hoodie.parquet.small.file.limit to 0 to make sure the file is not deleted
   
   * With one table the inline compaction is working as expected	
   
   **Stacktrace**
   
   ```
   Lost task 0.0 in stage 1.0 (TID 1) (ip.ec2.internal executor 2): java.io.FileNotFoundException: No such file or directory 's3://some-bucket/landing-zone/some_table/cluster=96/shard=14377/4ce009e6-5622-4874-bb5d-a11e3bb9eaa3-0_1-1177-14259_20220411202305.parquet'
   	at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:521)
   	at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:694)
   	at org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(HadoopInputFile.java:61)
   	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:456)
   	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.footerFileMetaData$lzycompute$1(ParquetFileFormat.scala:318)
   	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.footerFileMetaData$1(ParquetFileFormat.scala:317)
   	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:319)
   	at org.apache.hudi.HoodieMergeOnReadRDD.read(HoodieMergeOnReadRDD.scala:105)
   	at org.apache.hudi.HoodieMergeOnReadRDD.compute(HoodieMergeOnReadRDD.scala:77)
   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
   	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
   	at org.apache.spark.scheduler.Task.run(Task.scala:131)
   	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   	at java.lang.Thread.run(Thread.java:750)
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] kasured commented on issue #5298: [SUPPORT] File is deleted during inline compaction on MOR table causing subsequent FileNotFoundException on a reader

Posted by GitBox <gi...@apache.org>.
kasured commented on issue #5298:
URL: https://github.com/apache/hudi/issues/5298#issuecomment-1102918926

   I can see that the fix version is 0.11.0. Can this patch be safely backported to 0.9.0 though?




[GitHub] [hudi] kasured commented on issue #5298: [SUPPORT] File is deleted during inline compaction on MOR table causing subsequent FileNotFoundException on a reader

Posted by GitBox <gi...@apache.org>.
kasured commented on issue #5298:
URL: https://github.com/apache/hudi/issues/5298#issuecomment-1104409226

   @nsivabalan We were able to reproduce a similar scenario locally. Please use the following repository to check and confirm on your end: https://github.com/kasured/hudi-compaction-5298




[GitHub] [hudi] umehrot2 commented on issue #5298: [SUPPORT] File is deleted during inline compaction on MOR table causing subsequent FileNotFoundException on a reader

Posted by GitBox <gi...@apache.org>.
umehrot2 commented on issue #5298:
URL: https://github.com/apache/hudi/issues/5298#issuecomment-1103314606

   @kasured I am from the AWS EMR team. I hope you have opened a ticket with AWS support, so that we can work through that channel to backport this fix and provide you with patched jars.




[GitHub] [hudi] rahil-c commented on issue #5298: [SUPPORT] File is deleted during inline compaction on MOR table causing subsequent FileNotFoundException on a reader

Posted by GitBox <gi...@apache.org>.
rahil-c commented on issue #5298:
URL: https://github.com/apache/hudi/issues/5298#issuecomment-1116915923

   @kasured If you have opened a case with AWS EMR support, we have a backport of the fix for hudi 0.9.0 we can provide you. Let us know so we can close this thread out for now.




[GitHub] [hudi] nsivabalan commented on issue #5298: [SUPPORT] File is deleted during inline compaction on MOR table causing subsequent FileNotFoundException on a reader

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #5298:
URL: https://github.com/apache/hudi/issues/5298#issuecomment-1100686361

   Yes, I really appreciate your digging in deeper.
   Let me try to understand the concurrency here.
   What do you mean by multiple concurrent streaming writes? Are there 3 streams reading from different upstream sources and writing to 1 Hudi table? Or one streaming pipeline which writes to 3 different Hudi tables? Or 3 different streaming pipelines writing to 3 different Hudi tables but using the same Spark session?
   
   




[GitHub] [hudi] kasured commented on issue #5298: [SUPPORT] File is deleted during inline compaction on MOR table causing subsequent FileNotFoundException on a reader

Posted by GitBox <gi...@apache.org>.
kasured commented on issue #5298:
URL: https://github.com/apache/hudi/issues/5298#issuecomment-1100717281

   @nsivabalan Sure, let me provide more details. There is a StreamingQuery entity which is started by Spark to consume the stream. This is basically what we use, as described here: https://hudi.apache.org/docs/compaction#spark-structured-streaming
   
   So what we do is create multiple StreamingQuery streams and start them. Each of them consumes from a single Kafka topic and writes to a single Hudi table. So it is `3 different streaming pipeline writing to 3 diff hudi table but using same spark session`, with the only exception that we use 3 different SparkSession objects. All of them reuse a single SparkContext, which is okay as there should be only one SparkContext per JVM.
   
   As to 4753, I have already listed it in the section **Possibly Related Issues** as HUDI-3370. However, from what I checked, it is related to the metadata service, which we do not use ("hoodie.metadata.enable" = "false"). Could it also be relevant even if we do not use the metadata table? I am asking because we are using 0.9.0 from Amazon and I would need to replace it with a patched build.




[GitHub] [hudi] rahil-c commented on issue #5298: [SUPPORT] File is deleted during inline compaction on MOR table causing subsequent FileNotFoundException on a reader

Posted by GitBox <gi...@apache.org>.
rahil-c commented on issue #5298:
URL: https://github.com/apache/hudi/issues/5298#issuecomment-1112477307

   @kasured I'm currently looking into this and was able to reproduce the issue after running your repro at [https://github.com/kasured/hudi-compaction-5298](https://github.com/kasured/hudi-compaction-5298/tree/main/src/main/scala/com/example/hudi).
   
   When I built Hudi 0.11.0 (from master) with profile `spark-3.1.2` and provided my 0.11.0 `spark bundle jar` to your sample repro, I no longer got the `FileNotFoundException`. I'm currently checking whether the fix https://github.com/apache/hudi/pull/4753 can be backported to 0.9.0.




[GitHub] [hudi] nsivabalan commented on issue #5298: [SUPPORT] File is deleted during inline compaction on MOR table causing subsequent FileNotFoundException on a reader

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #5298:
URL: https://github.com/apache/hudi/issues/5298#issuecomment-1210064535

   Closing this issue as it has already been fixed. Thanks for raising the issue.




[GitHub] [hudi] nsivabalan closed issue #5298: [SUPPORT] File is deleted during inline compaction on MOR table causing subsequent FileNotFoundException on a reader

Posted by GitBox <gi...@apache.org>.
nsivabalan closed issue #5298: [SUPPORT] File is deleted during inline compaction on MOR table causing subsequent FileNotFoundException on a reader
URL: https://github.com/apache/hudi/issues/5298




[GitHub] [hudi] kasured commented on issue #5298: [SUPPORT] File is deleted during inline compaction on MOR table causing subsequent FileNotFoundException on a reader

Posted by GitBox <gi...@apache.org>.
kasured commented on issue #5298:
URL: https://github.com/apache/hudi/issues/5298#issuecomment-1098508902

   After changing the code and removing foreachBatch, we were able to fix the issue described in https://github.com/apache/hudi/issues/2043#issuecomment-682100271. However, the issue reported here is now reproducible with both the inline and the async variants of compaction.
   
   I have updated the section **Main observations so far**. 




[GitHub] [hudi] nsivabalan commented on issue #5298: [SUPPORT] File is deleted during inline compaction on MOR table causing subsequent FileNotFoundException on a reader

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #5298:
URL: https://github.com/apache/hudi/issues/5298#issuecomment-1100687324

   By the way, we did fix an issue where Spark's lazy initialization and caching of results could result in wrong files in the commit metadata: https://github.com/apache/hudi/pull/4753. It looks like it matches exactly what you are reporting.
   Can you try applying the patch and let us know if you still see the issue?




[GitHub] [hudi] Khushbukela commented on issue #5298: [SUPPORT] File is deleted during inline compaction on MOR table causing subsequent FileNotFoundException on a reader

Posted by GitBox <gi...@apache.org>.
Khushbukela commented on issue #5298:
URL: https://github.com/apache/hudi/issues/5298#issuecomment-1346010346

   > Closing this issue as its already been fixed. thanks for raising the issue.
   
   Hi @nsivabalan can you please help to solve the above issue?
   I am facing the same problem.  




[GitHub] [hudi] nsivabalan commented on issue #5298: [SUPPORT] File is deleted during inline compaction on MOR table causing subsequent FileNotFoundException on a reader

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #5298:
URL: https://github.com/apache/hudi/issues/5298#issuecomment-1100580381

   @kasured: before I dive in, a few pointers on the write configs used.
   1. I see you have enabled both inline and async compaction. I guess that with the streaming sink to Hudi only async compaction is possible, and for a MOR table Hudi automatically does async compaction. So you can probably remove these configs:
   ```
   "hoodie.compact.inline" = "true"
   "hoodie.datasource.compaction.async.enable" = "true"
   ```
   
   2. I also see you have enabled clustering. Can we disable clustering and see if the issue is still reproducible?
   
   with these changes, can you let us know if the problem still persists? 
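
As a rough illustration of the first pointer, the sketch below flags a set of writer options in which both compaction switches are turned on. It is plain Python and an assumption-laden example: `compaction_mode_conflicts` is a hypothetical helper, not a Hudi API, and the rule it applies only mirrors the suggestion in this comment, not any validation performed by Hudi itself.

```python
def compaction_mode_conflicts(options):
    """Return the compaction switches set to "true" if more than one is enabled.

    Hypothetical check for illustration; Hudi does not expose this function.
    """
    switches = (
        "hoodie.compact.inline",
        "hoodie.datasource.compaction.async.enable",
    )
    enabled = [k for k in switches if str(options.get(k, "false")).lower() == "true"]
    return enabled if len(enabled) > 1 else []


# Subset of the options from the issue description.
reported = {
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.compact.inline": "true",
    "hoodie.datasource.compaction.async.enable": "true",
}
conflicts = compaction_mode_conflicts(reported)
print(conflicts)

# The same options with the inline switch removed.
async_only = {k: v for k, v in reported.items() if k != "hoodie.compact.inline"}
print(compaction_mode_conflicts(async_only))
```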
   




[GitHub] [hudi] nsivabalan commented on issue #5298: [SUPPORT] File is deleted during inline compaction on MOR table causing subsequent FileNotFoundException on a reader

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #5298:
URL: https://github.com/apache/hudi/issues/5298#issuecomment-1102609068

   It's not related to the metadata table as such. Essentially, the actual data files written as part of the compaction commit could be different from what is found in the compaction commit metadata. So, when reconciling markers, we may delete unintended files.
   Yes, it is applicable even if you don't enable metadata.
   
   and thanks for clarifying your use-case. I get it now.
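
To make the failure mode concrete, here is a deliberately simplified Python model of marker reconciliation. It is a sketch, not Hudi's implementation: `reconcile_against_markers` is a stand-in written for this example, the file names are taken from the issue description, and the idea that the write stats were evaluated twice with different results is an assumption that is consistent with the observations in this thread but not confirmed by them.

```python
def reconcile_against_markers(marker_paths, write_stat_paths):
    """Simplified model: a file that has a marker but is absent from the
    write stats is treated as a leftover of a retried task and deleted."""
    return set(marker_paths) - set(write_stat_paths)


first_attempt = "4ce009e6-5622-4874-bb5d-a11e3bb9eaa3-0_1-1177-14259_20220411202305.parquet"
second_attempt = "4ce009e6-5622-4874-bb5d-a11e3bb9eaa3-0_1-1198-14280_20220411202305.parquet"

# Both attempts wrote a data file, so both left a marker.
markers = {first_attempt, second_attempt}

# Assumption: the write stats are evaluated twice and the results disagree.
stats_seen_by_reconcile = {second_attempt}
stats_written_to_commit = {first_attempt}

deleted = reconcile_against_markers(markers, stats_seen_by_reconcile)
on_storage = markers - deleted
dangling = stats_written_to_commit - on_storage

print(sorted(deleted))   # files removed as "duplicates"
print(sorted(dangling))  # files the commit references but storage no longer has
```

In this model the file from the first attempt is removed as a duplicate, yet it is the one recorded in the commit, which reproduces the state seen on S3 in the issue description.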




[GitHub] [hudi] kasured commented on issue #5298: [SUPPORT] File is deleted during inline compaction on MOR table causing subsequent FileNotFoundException on a reader

Posted by GitBox <gi...@apache.org>.
kasured commented on issue #5298:
URL: https://github.com/apache/hudi/issues/5298#issuecomment-1100619414

   @nsivabalan Thank you for looking into this. I have updated the configuration in the description as it was a little out of date. As you can see, I have tried multiple options since the ticket was created.
   
   1. In the first iteration I had foreachBatch, which prevented async compaction from happening (please see the linked issues). After the code was rewritten to use only structured streaming constructs, async compaction started to be scheduled and executed. I have tried both inline enabled with async disabled and vice versa, and the issue that I describe is reproduced in both cases.
   2. I am not sure what you mean, as I have clustering disabled explicitly with "hoodie.clustering.inline" = "false". I have also not seen any clustering actions either in the .hoodie folder or in the logs.
   
   All in all, please check the two sections **Main observations so far** and **Tried Options**. They are up to date and summarize everything I have tried so far.




[GitHub] [hudi] umehrot2 commented on issue #5298: [SUPPORT] File is deleted during inline compaction on MOR table causing subsequent FileNotFoundException on a reader

Posted by GitBox <gi...@apache.org>.
umehrot2 commented on issue #5298:
URL: https://github.com/apache/hudi/issues/5298#issuecomment-1103323341

   @kasured is it possible for you to have 3 separate Spark applications that do not share a Spark context?




[GitHub] [hudi] kasured commented on issue #5298: [SUPPORT] File is deleted during inline compaction on MOR table causing subsequent FileNotFoundException on a reader

Posted by GitBox <gi...@apache.org>.
kasured commented on issue #5298:
URL: https://github.com/apache/hudi/issues/5298#issuecomment-1099994068

   Upon further investigation and after enabling additional logs on EMR, the deletion of the file during compaction is happening in the class org.apache.hudi.table.HoodieTable#reconcileAgainstMarkers
   
   ```
   if (!invalidDataPaths.isEmpty()) {
     LOG.info("Removing duplicate data files created due to spark retries before committing. Paths=" + invalidDataPaths);
     // ...
   }
   ```
   
   However, later in the logs this file is written and committed in the instant:
   ```
   INFO SparkRDDWriteClient: Committing Compaction 20220414232316. Finished with result HoodieCommitMetadata{partitionToWriteStats={cluster=96/shard=14377=[HoodieWriteStat{fileId='9d9f72e9-9381-40d0-af0c-cb48c25bd78d-0', path='cluster=96/shard=14377/9d9f72e9-9381-40d0-af0c-cb48c25bd78d-0_0-617-7132_20220414232316.parquet', prevCommit='20220414225217', numWrites=122886, numDeletes=0, numUpdateWrites=121939, totalWriteBytes=23331178, totalWriteErrors=0, tempPath='null', partitionPath='cluster=96/shard=14377', totalLogRecords=341027, totalLogFilesCompacted=3, totalLogSizeCompacted=285373803, totalUpdatedRecordsCompacted=121939, totalLogBlocks=9, totalCorruptLogBlock=0, totalRollbackBlocks=0}]}, compacted=true,
   ```
   So it leaves the system in an inconsistent state. It looks like a concurrency issue to me.
   
   I will try to submit multiple StreamingQuery instances in different threads by leveraging Spark scheduling pools. I will report back with the results.

