Posted to commits@hudi.apache.org by "PhantomHunt (via GitHub)" <gi...@apache.org> on 2023/04/25 10:28:36 UTC

[GitHub] [hudi] PhantomHunt opened a new issue, #8572: [SUPPORT] Getting java.io.FileNotFoundException when reading MOR table.

PhantomHunt opened a new issue, #8572:
URL: https://github.com/apache/hudi/issues/8572

   We have created a Hudi data lake with version 0.13.0.
   We need to read data from a few tables in an incremental fashion.
   To fetch the active timeline instants for a MoR table, we are using the following piece of code, where basePath is the S3 bucket path where the data lies:
   ```
   list_timestamps = []

   metaClient = (
       spark._jvm.org.apache.hudi.common.table.HoodieTableMetaClient.builder()
           .setConf(spark._jsc.hadoopConfiguration())
           .setBasePath(basePath)
           .setLoadActiveTimelineOnLoad(True)
           .build()
   )

   timeline = metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants()

   instants = timeline.getInstants().collect(
       spark._jvm.java.util.stream.Collectors.toList()
   ).toArray()

   for instant in instants:
       list_timestamps.append(instant.getTimestamp())
   ```
   The output looks like this:
   `["20230410110310171", "20230410111802858", "20230410135802426", "20230410233706724", "20230411070325158", "20230412075305123", "20230412104308890", "20230412112440414", "20230412123348380", "20230412123408573", "20230412123426951", "20230412143444989", "20230412143503391", "20230413104721504", "20230413120831774", "20230413122750909", "20230413153023354", "20230414045300420", "20230414105813727", "20230414110336441", "20230414111346898", "20230414142833034", "20230414145746900", "20230414145806366", "20230418070525211", "20230418095219696", "20230419055721930", "20230419065820905", "20230419100813940", "20230419111328181", "20230419160833254", "20230420055335537", "20230420055920454", "20230420061332289", "20230420080546225", "20230420080606133", "20230420081457928", "20230420090733990", "20230420092736555", "20230420093835449", "20230420135133130"]`
   
   When we tried to read the MoR table:
   ```
   mor = {
       'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
       'hoodie.datasource.query.type': 'incremental',
       'hoodie.datasource.read.begin.instanttime': '20230331130832572',
       'hoodie.datasource.read.end.instanttime': '20230410110310171'
   }
   try:
       df = spark.read.format("org.apache.hudi").options(**mor).load(path_to_table)
       return df
   except Exception as e:
       log.msg(e, "e")
       return None
   ```
   We got the following error:
   ```
   23/04/20 14:25:33 ERROR Executor: Exception in task 0.0 in stage 11.0 (TID 36)
   java.io.FileNotFoundException: No such file or directory: s3a://****/2df00b9b-9fae-45a4-8492-e11ef16740b3-0_0-298-497_20230410110310171.parquet
   ```
   
   The writer configuration is:
   ```
   writer_config = {
       'fs.s3a.impl': 'org.apache.hadoop.fs.s3a.S3AFileSystem',
       'hoodie.datasource.write.operation': 'upsert',
       'hoodie.datasource.write.precombine.field': 'cdc_timestamp',
       'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
       'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS',
       'hoodie.schema.on.read.enable': 'true',
       'hoodie.datasource.write.reconcile.schema': 'true',
       'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.NonpartitionedKeyGenerator',
       'hoodie.table.name': table_name,
       'hoodie.datasource.write.recordkey.field': 'id',
       'hoodie.datasource.write.table.name': table_name,
       'hoodie.upsert.shuffle.parallelism': 200,
       'hoodie.keep.max.commits': 50,
       'hoodie.keep.min.commits': 40,
       'hoodie.cleaner.commits.retained': 30
   }
   ```
   Language - Python
   Hudi Version - 0.13.0
   Job Type - Python script on EC2
   Table Type - Non Partitioned MOR
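   With `hoodie.cleaner.commits.retained` set to 30 under the KEEP_LATEST_COMMITS policy, data files for commits older than the latest 30 become eligible for cleaning, which is exactly what makes old begin instants unreadable. A toy illustration of that retention arithmetic (pure Python, a simplified model for illustration only, not Hudi's actual cleaner logic):

   ```python
   def commits_with_data(all_commits, commits_retained=30):
       """Commit instants whose data files the cleaner still retains,
       under the KEEP_LATEST_COMMITS policy (simplified model)."""
       # Hudi instant times are fixed-width numeric strings, so sorting
       # them as strings orders them chronologically.
       return sorted(all_commits)[-commits_retained:]

   # 41 commits, like the timeline listed above; only the latest 30
   # still have their data files under commits_retained=30.
   timeline = [f"{i:03d}" for i in range(1, 42)]
   retained = commits_with_data(timeline, commits_retained=30)
   print(retained[0])  # oldest commit whose files still exist -> "012"
   ```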


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] nsivabalan commented on issue #8572: [SUPPORT] Getting java.io.FileNotFoundException when reading MOR table.

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #8572:
URL: https://github.com/apache/hudi/issues/8572#issuecomment-1538995505

   You are using some internal APIs, so getCommitsTimeline will give you both cleaned-up and non-cleaned commits.
   The right one to use is:
   ```
   timeline=metaClient.getActiveTimeline().getCleanerTimeline().filterCompletedInstants()
   ```
   




[GitHub] [hudi] PhantomHunt commented on issue #8572: [SUPPORT] Getting java.io.FileNotFoundException when reading MOR table.

Posted by "PhantomHunt (via GitHub)" <gi...@apache.org>.
PhantomHunt commented on issue #8572:
URL: https://github.com/apache/hudi/issues/8572#issuecomment-1532771681

   @nsivabalan any updates on this?




[GitHub] [hudi] PhantomHunt commented on issue #8572: [SUPPORT] Getting java.io.FileNotFoundException when reading MOR table.

Posted by "PhantomHunt (via GitHub)" <gi...@apache.org>.
PhantomHunt commented on issue #8572:
URL: https://github.com/apache/hudi/issues/8572#issuecomment-1539885087

   > You are using some internal APIs, so getCommitsTimeline will give you both cleaned-up and non-cleaned commits. The right one to use is:
   > 
   > ```
   > timeline=metaClient.getActiveTimeline().getCleanerTimeline().filterCompletedInstants()
   > ```
   
   Hi @nsivabalan, thanks for the help.
   We tried the code as you suggested, but it didn't give us the desired output.
   
   We observed that getCleanerTimeline() just gives the instants at which the cleaner ran. What we actually need is the list of commit instants that still exist after the cleaner has run.
   
   For example, we inserted data into the Hudi table at instants 1 through 7. Then, at the 8th instant, the cleaner ran and cleaned instants 1 and 2. We intend to fetch all the remaining committed instants as output, i.e. [3, 4, 5, 6, 7].
   
   However, neither of the following code blocks returns that output:
   committed instants: `metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants()` returns [1, 2, 3, 4, 5, 6, 7]
   cleaner instants: `metaClient.getActiveTimeline().getCleanerTimeline().filterCompletedInstants()` returns [8]
   
   So, can you please suggest other internal APIs or approaches that can give us the committed instants that actually exist after the cleaner has run?
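   The desired filtering can be sketched in plain Python. This is an illustrative workaround, not a Hudi API: it assumes you already have the full list of completed commit instants (fetched as in the issue description) and the earliest commit the last clean retained (the clean metadata records this), and simply drops everything older:

   ```python
   def surviving_commits(all_commits, earliest_retained):
       """Commit instants whose data files the cleaner left in place.
       all_commits: completed commit instant times from the commits timeline.
       earliest_retained: earliest commit the last clean retained (from clean metadata).
       """
       # Fixed-width Hudi instant times compare chronologically as strings.
       return [ts for ts in sorted(all_commits) if ts >= earliest_retained]

   # Using the example above: commits 1..7 existed, the clean removed 1 and 2,
   # so commit 3 is the earliest instant retained.
   print(surviving_commits(["1", "2", "3", "4", "5", "6", "7"], "3"))
   # -> ['3', '4', '5', '6', '7']
   ```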




[GitHub] [hudi] nsivabalan commented on issue #8572: [SUPPORT] Getting java.io.FileNotFoundException when reading MOR table.

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #8572:
URL: https://github.com/apache/hudi/issues/8572#issuecomment-1523538698

   Generally, an incremental query will work only if the cleaner has not run over the requested range.
   For example, if you have 100 commits in your timeline and the cleaner has cleaned up the data pertaining to the first 25 commits, an incremental query is bound to fail if it targets those first 25 commits. But if the incremental query targets the latest 75 commits, you should not see any FileNotFound issue.
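   That rule of thumb can be expressed as a small guard before issuing the read. This is a pure-Python sketch with hypothetical names; the list of surviving commit instants would come from the timeline, as discussed above:

   ```python
   def is_incremental_window_safe(surviving_commits, begin_ts):
       """True if begin_ts is not older than the oldest commit that still
       has its data files, i.e. the cleaner has not cleaned past it."""
       if not surviving_commits:
           return False
       # Fixed-width Hudi instant times compare chronologically as strings.
       return begin_ts >= min(surviving_commits)

   # 100 commits "001".."100"; the cleaner removed data for the first 25.
   surviving = [f"{i:03d}" for i in range(26, 101)]
   print(is_incremental_window_safe(surviving, "010"))  # False: in the cleaned range
   print(is_incremental_window_safe(surviving, "030"))  # True: in the surviving range
   ```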
   




[GitHub] [hudi] PhantomHunt commented on issue #8572: [SUPPORT] Getting java.io.FileNotFoundException when reading MOR table.

Posted by "PhantomHunt (via GitHub)" <gi...@apache.org>.
PhantomHunt commented on issue #8572:
URL: https://github.com/apache/hudi/issues/8572#issuecomment-1537954172

   Any updates?




[GitHub] [hudi] PhantomHunt commented on issue #8572: [SUPPORT] Getting java.io.FileNotFoundException when reading MOR table.

Posted by "PhantomHunt (via GitHub)" <gi...@apache.org>.
PhantomHunt commented on issue #8572:
URL: https://github.com/apache/hudi/issues/8572#issuecomment-1527698250

   Any updates on this?




[GitHub] [hudi] PhantomHunt commented on issue #8572: [SUPPORT] Getting java.io.FileNotFoundException when reading MOR table.

Posted by "PhantomHunt (via GitHub)" <gi...@apache.org>.
PhantomHunt commented on issue #8572:
URL: https://github.com/apache/hudi/issues/8572#issuecomment-1523877536

   > Generally, an incremental query will work only if the cleaner has not run over the requested range. For example, if you have 100 commits in your timeline and the cleaner has cleaned up the data pertaining to the first 25 commits, an incremental query is bound to fail if it targets those first 25 commits. But if the incremental query targets the latest 75 commits, you should not see any FileNotFound issue.
   
   Hi Shiv,
   Thanks for the reply.
   I agree with your point that if the cleaner removed the first 25 commits, then any incremental query over that earlier commit range would result in a FileNotFound error.
   
   However, my original intention was to ask why getActiveTimeline() returned the instants that were removed by the cleaner. Shouldn't it have returned only the remaining 75?
   
   Is there another function I can use to retrieve the list of instants preserved after the cleaner has removed some?

