Posted to commits@hudi.apache.org by "PhantomHunt (via GitHub)" <gi...@apache.org> on 2023/04/25 10:28:36 UTC

[GitHub] [hudi] PhantomHunt opened a new issue, #8572: [SUPPORT] Getting java.io.FileNotFoundException when reading MOR table.

PhantomHunt opened a new issue, #8572:
URL: https://github.com/apache/hudi/issues/8572

   We have created a Hudi data lake with version 0.13.0.
   We need to read data from a few tables in an incremental fashion.
   To fetch the active timeline instants for a MoR table, we are using the following piece of code, where basePath is the S3 bucket path where the data lies:
   ```
   list_timestamps = []

   metaClient = (
       spark._jvm.org.apache.hudi.common.table.HoodieTableMetaClient.builder()
           .setConf(spark._jsc.hadoopConfiguration())
           .setBasePath(basePath)
           .setLoadActiveTimelineOnLoad(True)
           .build()
   )

   timeline = metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants()

   instants = timeline.getInstants().collect(
       spark._jvm.java.util.stream.Collectors.toList()
   ).toArray()

   for instant in instants:
       list_timestamps.append(instant.getTimestamp())
   ```
   The output looks like this:
   `["20230410110310171", "20230410111802858", "20230410135802426", "20230410233706724", "20230411070325158", "20230412075305123", "20230412104308890", "20230412112440414", "20230412123348380", "20230412123408573", "20230412123426951", "20230412143444989", "20230412143503391", "20230413104721504", "20230413120831774", "20230413122750909", "20230413153023354", "20230414045300420", "20230414105813727", "20230414110336441", "20230414111346898", "20230414142833034", "20230414145746900", "20230414145806366", "20230418070525211", "20230418095219696", "20230419055721930", "20230419065820905", "20230419100813940", "20230419111328181", "20230419160833254", "20230420055335537", "20230420055920454", "20230420061332289", "20230420080546225", "20230420080606133", "20230420081457928", "20230420090733990", "20230420092736555", "20230420093835449", "20230420135133130"]`
   
   When we tried to read the MoR table:
   ```
   mor = {
       'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
       'hoodie.datasource.query.type': 'incremental',
       'hoodie.datasource.read.begin.instanttime': '20230331130832572',
       'hoodie.datasource.read.end.instanttime': '20230410110310171'
   }
   try:
       df = spark.read.format("org.apache.hudi").options(**mor).load(path_to_table)
       return df
   except Exception as e:
       log.msg(e, "e")
       return None
   ```
   We got the following error:
   ```
   23/04/20 14:25:33 ERROR Executor: Exception in task 0.0 in stage 11.0 (TID 36)
   java.io.FileNotFoundException: No such file or directory: s3a://****/2df00b9b-9fae-45a4-8492-e11ef16740b3-0_0-298-497_20230410110310171.parquet
   ```
   
   The writer configuration is:
   ```
   writer_config = {
       'fs.s3a.impl': 'org.apache.hadoop.fs.s3a.S3AFileSystem',
       'hoodie.datasource.write.operation': 'upsert',
       'hoodie.datasource.write.precombine.field': 'cdc_timestamp',
       'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
       'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS',
       'hoodie.schema.on.read.enable': 'true',
       'hoodie.datasource.write.reconcile.schema': 'true',
       'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.NonpartitionedKeyGenerator',
       'hoodie.table.name': table_name,
       'hoodie.datasource.write.recordkey.field': 'id',
       'hoodie.datasource.write.table.name': table_name,
       'hoodie.upsert.shuffle.parallelism': 200,
       'hoodie.keep.max.commits': 50,
       'hoodie.keep.min.commits': 40,
       'hoodie.cleaner.commits.retained': 30
   }
   ```
   Language - Python
   Hudi Version - 0.13.0
   Job Type - Python script on EC2
   Table Type - Non Partitioned MOR
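   With `hoodie.cleaner.commits.retained` set to 30 under the KEEP_LATEST_COMMITS policy, data files for commits older than the latest 30 become eligible for cleaning, which is exactly what makes old begin instants unreadable. A toy illustration of that retention arithmetic (pure Python, a simplified model for illustration only, not Hudi's actual cleaner logic):

   ```python
   def commits_with_data(all_commits, commits_retained=30):
       """Commit instants whose data files the cleaner still retains,
       under the KEEP_LATEST_COMMITS policy (simplified model)."""
       # Hudi instant times are fixed-width numeric strings, so sorting
       # them as strings orders them chronologically.
       return sorted(all_commits)[-commits_retained:]

   # 41 commits, like the timeline listed above; only the latest 30
   # still have their data files under commits_retained=30.
   timeline = [f"{i:03d}" for i in range(1, 42)]
   retained = commits_with_data(timeline, commits_retained=30)
   print(retained[0])  # oldest commit whose files still exist -> "012"
   ```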


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] nsivabalan commented on issue #8572: [SUPPORT] Getting java.io.FileNotFoundException when reading MOR table.

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #8572:
URL: https://github.com/apache/hudi/issues/8572#issuecomment-1538995505

   You are using some internal APIs, so getCommitsTimeline will give you both cleaned-up and non-cleaned commits.
   The right one to use is:
   ```
   timeline=metaClient.getActiveTimeline().getCleanerTimeline().filterCompletedInstants()
   ```
   




[GitHub] [hudi] PhantomHunt commented on issue #8572: [SUPPORT] Getting java.io.FileNotFoundException when reading MOR table.

Posted by "PhantomHunt (via GitHub)" <gi...@apache.org>.
PhantomHunt commented on issue #8572:
URL: https://github.com/apache/hudi/issues/8572#issuecomment-1532771681

   @nsivabalan any updates on this?




[GitHub] [hudi] PhantomHunt commented on issue #8572: [SUPPORT] Getting java.io.FileNotFoundException when reading MOR table.

Posted by "PhantomHunt (via GitHub)" <gi...@apache.org>.
PhantomHunt commented on issue #8572:
URL: https://github.com/apache/hudi/issues/8572#issuecomment-1539885087

   > You are using some internal APIs, so getCommitsTimeline will give you both cleaned-up and non-cleaned commits. The right one to use is:
   > 
   > ```
   > timeline=metaClient.getActiveTimeline().getCleanerTimeline().filterCompletedInstants()
   > ```
   
   Hi @nsivabalan, thanks for the help.
   We tried the code as you suggested, but it didn't give us the desired output.
   
   We observed that getCleanerTimeline() just gives the instants at which the cleaner ran. What we actually need is the list of commit instants that still exist after the cleaner has run.
   
   For example, we inserted data into the Hudi table at instants 1 through 7. Then, at the 8th instant, the cleaner ran and cleaned instants 1 and 2. We intend to fetch all the remaining committed instants as output, i.e. [3, 4, 5, 6, 7].
   
   However, neither of the following code blocks returns that output:
   committed instants: `metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants()` returns [1, 2, 3, 4, 5, 6, 7]
   cleaner instants: `metaClient.getActiveTimeline().getCleanerTimeline().filterCompletedInstants()` returns [8]
   
   So, can you please suggest other internal APIs or approaches that can give us the committed instants that actually exist after the cleaner has run?
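   The desired filtering can be sketched in plain Python. This is an illustrative workaround, not a Hudi API: it assumes you already have the full list of completed commit instants (fetched as in the issue description) and the earliest commit the last clean retained (the clean metadata records this), and simply drops everything older:

   ```python
   def surviving_commits(all_commits, earliest_retained):
       """Commit instants whose data files the cleaner left in place.
       all_commits: completed commit instant times from the commits timeline.
       earliest_retained: earliest commit the last clean retained (from clean metadata).
       """
       # Fixed-width Hudi instant times compare chronologically as strings.
       return [ts for ts in sorted(all_commits) if ts >= earliest_retained]

   # Using the example above: commits 1..7 existed, the clean removed 1 and 2,
   # so commit 3 is the earliest instant retained.
   print(surviving_commits(["1", "2", "3", "4", "5", "6", "7"], "3"))
   # -> ['3', '4', '5', '6', '7']
   ```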




[GitHub] [hudi] nsivabalan commented on issue #8572: [SUPPORT] Getting java.io.FileNotFoundException when reading MOR table.

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #8572:
URL: https://github.com/apache/hudi/issues/8572#issuecomment-1523538698

   Generally, an incremental query will work only if the cleaner has not run over the requested range.
   For example, if you have 100 commits in your timeline and the cleaner has cleaned up the data pertaining to the first 25 commits, an incremental query is bound to fail if it targets those first 25 commits. But if the incremental query targets the latest 75 commits, you should not see any FileNotFound issue.
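   That rule of thumb can be expressed as a small guard before issuing the read. This is a pure-Python sketch with hypothetical names; the list of surviving commit instants would come from the timeline, as discussed above:

   ```python
   def is_incremental_window_safe(surviving_commits, begin_ts):
       """True if begin_ts is not older than the oldest commit that still
       has its data files, i.e. the cleaner has not cleaned past it."""
       if not surviving_commits:
           return False
       # Fixed-width Hudi instant times compare chronologically as strings.
       return begin_ts >= min(surviving_commits)

   # 100 commits "001".."100"; the cleaner removed data for the first 25.
   surviving = [f"{i:03d}" for i in range(26, 101)]
   print(is_incremental_window_safe(surviving, "010"))  # False: in the cleaned range
   print(is_incremental_window_safe(surviving, "030"))  # True: in the surviving range
   ```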
   




[GitHub] [hudi] PhantomHunt commented on issue #8572: [SUPPORT] Getting java.io.FileNotFoundException when reading MOR table.

Posted by "PhantomHunt (via GitHub)" <gi...@apache.org>.
PhantomHunt commented on issue #8572:
URL: https://github.com/apache/hudi/issues/8572#issuecomment-1537954172

   Any updates?




[GitHub] [hudi] PhantomHunt commented on issue #8572: [SUPPORT] Getting java.io.FileNotFoundException when reading MOR table.

Posted by "PhantomHunt (via GitHub)" <gi...@apache.org>.
PhantomHunt commented on issue #8572:
URL: https://github.com/apache/hudi/issues/8572#issuecomment-1527698250

   Any updates on this?




[GitHub] [hudi] PhantomHunt commented on issue #8572: [SUPPORT] Getting java.io.FileNotFoundException when reading MOR table.

Posted by "PhantomHunt (via GitHub)" <gi...@apache.org>.
PhantomHunt commented on issue #8572:
URL: https://github.com/apache/hudi/issues/8572#issuecomment-1523877536

   > Generally, an incremental query will work only if the cleaner has not run over the requested range. For example, if you have 100 commits in your timeline and the cleaner has cleaned up the data pertaining to the first 25 commits, an incremental query is bound to fail if it targets those first 25 commits. But if the incremental query targets the latest 75 commits, you should not see any FileNotFound issue.
   
   Hi Shiv,
   Thanks for the reply.
   I agree with your point that if the cleaner removed the first 25 commits, then any incremental query over that earlier commit range would result in a FileNotFound error.
   
   However, my original intention was to ask why getActiveTimeline() returned the instants that were removed by the cleaner. Shouldn't it have returned only the remaining 75?
   
   Is there another function I can use to retrieve the list of instants preserved after the cleaner has removed some?

