Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2019/03/23 12:31:04 UTC

[GitHub] [incubator-hudi] smdahmed opened a new issue #613: Too many S3 Connections being opened while df.write

URL: https://github.com/apache/incubator-hudi/issues/613
 
 
   1. Table clean is called for every dataframe write (https://github.com/apache/incubator-hudi/blob/9e59da7fd96d8af0587efa2f34c89f56320d4c91/hoodie-client/src/main/java/com/uber/hoodie/table/HoodieCopyOnWriteTable.java#L293).
   2. This calls cleanPartitionPaths (https://github.com/apache/incubator-hudi/blob/9e59da7fd96d8af0587efa2f34c89f56320d4c91/hoodie-client/src/main/java/com/uber/hoodie/table/HoodieCopyOnWriteTable.java#L305).
   3. For a date-partitioned table with, say, 3 years' worth of data, the clean opens about 365*3 = 1095 S3 HTTPS connections from Spark.
   4. Even after the df.write call returns, these connections are never closed until the Spark application exits. See the sample pseudocode below, which illustrates the issue:
     ```scala
      val hudiDF = spark.read.json(.....)        // step 1: read the source data
      hudiDF.write.format("com.uber.hoodie").... // step 2: write it as a Hoodie dataset on S3

      /* step 3: do something else - write another Hoodie dataframe, unrelated to
         the one above, to another S3 location */
     ```
   
   At step 3, none of the 1000+ S3 connections opened during step 2 have been closed, so any subsequent write of another Hoodie dataframe in the same application will fail. The S3 connection pool property "spark.hadoop.fs.s3a.connection.maximum" is usually set to 500, or at most 1500. I think the file objects opened during the cleaning stage of a write should be closed by the respective executor tasks.
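
   To make that expectation concrete, here is a minimal sketch of the close-on-completion pattern. The scanPartitions helper is hypothetical (it is not Hudi's actual cleaning code); only listStatus, open and close are the standard Hadoop FileSystem calls:

   ```scala
   import org.apache.hadoop.fs.{FSDataInputStream, FileSystem, Path}

   // Hedged sketch with a hypothetical helper: one pass per partition path,
   // mirroring the cleanPartitionPaths fan-out, with every borrowed S3A
   // connection returned via close(). Skipping the finally block is what
   // would let ~1095 connections pile up in the 3-year example above.
   def scanPartitions(fs: FileSystem, partitionPaths: Seq[Path]): Unit = {
     partitionPaths.foreach { partition =>
       fs.listStatus(partition).foreach { status =>
         val in: FSDataInputStream = fs.open(status.getPath) // borrows one pooled connection
         try {
           // ... inspect the file to decide whether it is stale ...
         } finally {
           in.close() // returns the connection; without this it is pinned until JVM exit
         }
       }
     }
   }
   ```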

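   Until the leak itself is fixed, one stopgap (my assumption, not anything the Hudi docs prescribe) is to size the S3A connection pool above the expected partition count; fs.s3a.connection.maximum is the standard Hadoop S3A pool-size property:

   ```scala
   import org.apache.spark.sql.SparkSession

   // Workaround sketch, not a fix: raise the S3A pool ceiling above the
   // expected partition count (1095 in the 3-year example above).
   val spark = SparkSession.builder()
     .appName("hudi-s3-write") // hypothetical application name
     .config("spark.hadoop.fs.s3a.connection.maximum", "2000")
     .getOrCreate()
   ```

   This only postpones exhaustion, of course; the underlying fix is still for the cleaning tasks to close what they open.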