Posted to common-issues@hadoop.apache.org by "Ahmar Suhail (Jira)" <ji...@apache.org> on 2022/06/15 14:41:00 UTC

[jira] [Comment Edited] (HADOOP-18291) SingleFilePerBlockCache does not have a limit

    [ https://issues.apache.org/jira/browse/HADOOP-18291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554632#comment-17554632 ] 

Ahmar Suhail edited comment on HADOOP-18291 at 6/15/22 2:40 PM:
----------------------------------------------------------------

[~stevel@apache.org] I noticed that files never get deleted from the disk cache, though the access pattern mentioned above is probably not very common. Is this something we need to worry about and fix now? If yes, what's a reasonable initial limit for the cache?

 

Edit: All files are deleted on close(), but not otherwise. Wondering if this could be a problem for longer-running processes.
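For example, a long-running process that keeps a prefetching stream open and reads randomly would keep adding block files that are only reclaimed at close(). A rough illustration (bucket, path and block size below are made up, not real config):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LongLivedReader {
  public static void main(String[] args) throws Exception {
    Path path = new Path("s3a://example-bucket/large-object"); // placeholder path
    FileSystem fs = path.getFileSystem(new Configuration());
    long blockSize = 8 * 1024 * 1024; // assumed prefetch block size
    try (FSDataInputStream in = fs.open(path)) {
      for (long block = 0; block < 1000; block++) {
        in.seek(block * blockSize + 10); // jump into the next block
        in.read();                       // partially read block may be spilled to disk
      }
      // stream stays open for hours; spilled block files accumulate on disk
    } // only here, at close(), are the cached files cleaned up
  }
}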

 


was (Author: JIRAUSER283484):
[~stevel@apache.org] I noticed that files never get deleted from the disk cache, though the access pattern mentioned above is probably not very common. Is this something we need to worry about and fix now? If yes, what's a reasonable initial limit for the cache?

> SingleFilePerBlockCache does not have a limit
> ---------------------------------------------
>
>                 Key: HADOOP-18291
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18291
>             Project: Hadoop Common
>          Issue Type: Sub-task
>            Reporter: Ahmar Suhail
>            Priority: Major
>
> Currently there is no limit on the size of the disk cache. This means we could end up with a large number of files on disk, especially for access patterns that are very random and do not always read the block fully.
>  
> e.g.:
> in.seek(5);
> in.read();
> in.seek(blockSize + 10); // block 0 gets saved to disk as it's not fully read
> in.read();
> in.seek(2 * blockSize + 10); // block 1 gets saved to disk
> ... and so on
>  
> The in-memory cache is bounded and by default has a limit of 72MB (9 blocks). When a block is fully read and a seek is issued, it is released [here|https://github.com/apache/hadoop/blob/feature-HADOOP-18028-s3a-prefetch/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/read/S3CachingInputStream.java#L109]. We could also delete the on-disk file for the block at that point, if it exists.
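> Something along these lines could be called from that release path (just a sketch; the helper name is made up, not the actual SingleFilePerBlockCache/S3CachingInputStream API):
>
> import java.io.IOException;
> import java.nio.file.Files;
> import java.nio.file.Path;
>
> // Illustrative only: drop the block's backing file from the disk cache
> // once the block has been fully read and released.
> final class BlockFileCleanup {
>
>   /** Deletes the cache file written for a block, if one exists. */
>   static boolean deleteIfCached(Path blockFile) throws IOException {
>     // no-op (returns false) when the block was never spilled to disk
>     return Files.deleteIfExists(blockFile);
>   }
> }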
>  
> We could also add an upper limit on disk space and, when it is reached, delete the file storing the block furthest from the current block (similar to the in-memory cache's eviction policy).
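> A possible shape for that, assuming we track which block went to which file (class and method names are illustrative, not the actual implementation):
>
> import java.io.IOException;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.util.HashMap;
> import java.util.Map;
>
> // Sketch of a bounded disk cache: at most maxBlocks files are kept, and
> // when the limit is hit the block furthest from the one currently being
> // read is deleted, mirroring the in-memory cache policy.
> final class BoundedDiskBlockCache {
>
>   private final int maxBlocks;
>   private final Map<Integer, Path> blockFiles = new HashMap<>();
>
>   BoundedDiskBlockCache(int maxBlocks) {
>     this.maxBlocks = maxBlocks;
>   }
>
>   /** Records a newly written block file, evicting first if at the limit. */
>   synchronized void put(int blockNumber, Path file, int currentBlock)
>       throws IOException {
>     if (blockFiles.size() >= maxBlocks) {
>       evictFurthestFrom(currentBlock);
>     }
>     blockFiles.put(blockNumber, file);
>   }
>
>   /** Deletes the cached file whose block is furthest from currentBlock. */
>   private void evictFurthestFrom(int currentBlock) throws IOException {
>     int victim = -1;
>     int maxDistance = -1;
>     for (int blockNumber : blockFiles.keySet()) {
>       int distance = Math.abs(blockNumber - currentBlock);
>       if (distance > maxDistance) {
>         maxDistance = distance;
>         victim = blockNumber;
>       }
>     }
>     if (victim >= 0) {
>       Files.deleteIfExists(blockFiles.remove(victim));
>     }
>   }
> }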



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org