Posted to issues@iceberg.apache.org by "dchristle (via GitHub)" <gi...@apache.org> on 2023/01/24 18:01:02 UTC

[GitHub] [iceberg] dchristle commented on issue #3703: DeleteOrphanFiles or ExpireSnapshots outofmemory

dchristle commented on issue #3703:
URL: https://github.com/apache/iceberg/issues/3703#issuecomment-1402362942

   I'm following up to say I got `deleteOrphanFiles` to complete successfully. After increasing the driver memory, I was confused that I saw no output in the logs apart from an occasional `RetryHttpInitializer: Encountered status code 503 when sending DELETE request to URL` error. I let it run for more than 24 hours; the driver seemed to be hung rather than deleting any orphan files.
   
   In other GitHub issues on deleting orphan files, increasing the number of threads is mentioned. I modified my Spark job to do this with `.executeDeleteWith`:
   
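   ```
   import java.util.concurrent.Executors
   import org.apache.iceberg.spark.actions.SparkActions
   
   // Fixed pool of 30 threads used to run the file deletes concurrently
   val executorService = Executors.newFixedThreadPool(30)
   
   SparkActions
       .get()
       .deleteOrphanFiles(icebergTable)
       .executeDeleteWith(executorService)
       .execute()
   ```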
   
   The frequency of the 503 retry errors went up. My interpretation is these errors have some small fixed probability of occurring on a Google Storage delete operation. Since there are now 30 concurrent delete operations, the log message is seen more frequently.
   
   I let this new job run for about 36 hours and it finished deleting orphan files successfully. I wonder if there's some way to emit periodic log messages indicating the number of files that have been deleted, perhaps every 5 minutes. Once my driver had sufficient memory, the deletes were likely happening correctly, but as a user, I was confused when I didn't see any log output. The delete orphan files operation is different from other maintenance operations -- it can't be seen in the Spark UI as a job or stage.
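   
   Until something like that exists, progress could probably be tracked on the user side. Below is a minimal sketch, not something I ran as part of the job above: `CountingDelete` is a hypothetical helper (it is not part of Iceberg) that wraps a delete function, counts each call, and prints the running total at a fixed interval. It assumes the action's `deleteWith(Consumer<String>)` hook is used to plug it in, and `println` stands in for a real logger.
   
   ```scala
   import java.util.concurrent.{Executors, ThreadFactory, TimeUnit}
   import java.util.concurrent.atomic.AtomicLong
   import java.util.function.Consumer
   
   // Hypothetical helper, not part of Iceberg: counts deletes and reports
   // the running total every `periodSeconds` seconds.
   final class CountingDelete(delete: String => Unit, periodSeconds: Long)
       extends Consumer[String] with AutoCloseable {
   
     private val deleted = new AtomicLong(0)
   
     // Daemon thread so the reporter never keeps the JVM alive on its own
     private val reporter = Executors.newSingleThreadScheduledExecutor(
       new ThreadFactory {
         def newThread(r: Runnable): Thread = {
           val t = new Thread(r, "delete-progress")
           t.setDaemon(true)
           t
         }
       })
   
     reporter.scheduleAtFixedRate(
       new Runnable {
         def run(): Unit = println(s"Deleted ${deleted.get()} files so far")
       },
       periodSeconds, periodSeconds, TimeUnit.SECONDS)
   
     // Called once per file; safe to call from many threads at once
     override def accept(path: String): Unit = {
       delete(path)
       deleted.incrementAndGet()
     }
   
     def count: Long = deleted.get()
   
     override def close(): Unit = reporter.shutdownNow()
   }
   
   // Possible wiring (assumes `icebergTable` and `executorService` from above):
   //   val counting = new CountingDelete(p => icebergTable.io().deleteFile(p), 300)
   //   SparkActions.get().deleteOrphanFiles(icebergTable)
   //     .deleteWith(counting)
   //     .executeDeleteWith(executorService)
   //     .execute()
   //   counting.close()
   ```
   
   The counter only increments after the wrapped delete returns, so a delete that throws is not counted. Whether a custom delete function changes how the action batches deletes may depend on the Iceberg version, so treat the wiring as an assumption to verify.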
   
   Any thoughts on adding some periodic log outputs? @RussellSpitzer 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

