Posted to issues@flink.apache.org by "Vinay (JIRA)" <ji...@apache.org> on 2017/07/29 14:05:02 UTC

[jira] [Commented] (FLINK-7289) Memory allocation of RocksDB can be problematic in container environments

    [ https://issues.apache.org/jira/browse/FLINK-7289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16106133#comment-16106133 ] 

Vinay commented on FLINK-7289:
------------------------------

Hi Stefan,

I have mainly used RocksDB on EMR, backed by SSDs and 122 GB of memory.
Although the FLASH_SSD preset is a good starting point, it does not provide control over the amount of memory used, so I tuned some parameters with the configuration below:

{code:java}
DBOptions:
  (in addition to the FLASH_SSD preset, add the following)
  max_background_compactions       : 4

ColumnFamilyOptions:
  write_buffer_size                : 512 MB
  block_cache_size                 : 128 MB
  max_write_buffer_number          : 5
  min_write_buffer_number_to_merge : 2
  cache_index_and_filter_blocks    : true
  optimize_filters_for_hits        : true
{code}
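For reference, here is a sketch of how such settings could be wired in through the RocksDB state backend's {code:java} OptionsFactory {code} hook. The setter names follow the RocksDB Java API; the values simply mirror the list above and are illustrative, not recommendations:

{code:java}
import org.apache.flink.contrib.streaming.state.OptionsFactory;
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.DBOptions;

// Illustrative sketch: applies the settings discussed above on top of
// whatever options Flink has already configured (e.g. the SSD preset).
public class TunedRocksDbOptions implements OptionsFactory {

    @Override
    public DBOptions createDBOptions(DBOptions current) {
        return current.setMaxBackgroundCompactions(4);
    }

    @Override
    public ColumnFamilyOptions createColumnFamilyOptions(ColumnFamilyOptions current) {
        BlockBasedTableConfig table = new BlockBasedTableConfig()
                .setBlockCacheSize(128 * 1024 * 1024L)  // 128 MB block cache
                .setCacheIndexAndFilterBlocks(true);    // charge index/filter blocks to the cache
        return current
                .setWriteBufferSize(512 * 1024 * 1024L) // 512 MB memtable
                .setMaxWriteBufferNumber(5)
                .setMinWriteBufferNumberToMerge(2)
                .setOptimizeFiltersForHits(true)        // no bloom filters on the last level
                .setTableFormatConfig(table);
    }
}
{code}

The factory would then be registered on the backend via {code:java} backend.setOptions(new TunedRocksDbOptions()) {code} after choosing the predefined SSD options.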

According to the documentation, when {code:java} optimize_filters_for_hits: true {code} is set, RocksDB does not build bloom filters for the last level, which contains about 90% of the database. The memory used for bloom filters is therefore roughly 10x smaller.
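As a rough sanity check on these numbers, the worst-case footprint per column family (Flink creates one column family per registered state) can be estimated from the write-buffer settings alone. This is a hypothetical back-of-the-envelope helper, not part of Flink or RocksDB, and it ignores index blocks, filters, and iterator memory:

{code:java}
// Hypothetical back-of-the-envelope estimate; not part of Flink or RocksDB.
public class RocksDbMemoryEstimate {

    // Worst case: all write buffers full and unflushed, plus the block cache.
    static long worstCaseBytes(long writeBufferSize, int maxWriteBufferNumber, long blockCacheSize) {
        return writeBufferSize * maxWriteBufferNumber + blockCacheSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long perColumnFamily = worstCaseBytes(512 * mb, 5, 128 * mb);
        // 512 MB * 5 write buffers + 128 MB block cache = 2688 MB
        System.out.println(perColumnFamily / mb + " MB per column family");
    }
}
{code}

With several stateful operators per TaskManager, multiple such instances add up quickly, which is one reason a single memory-limit knob is hard to derive from these parameters.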

Since RocksDB uses a lot of native memory, that memory is not reclaimed if we cancel the job. For example: if a job has been running for 1 hour and is using 50 GB, cancelling it from the UI does not release the memory.
I observed this when running the job on YARN.

In order to reclaim the memory I had to manually run the following commands on each node of the EMR cluster:
{code:java}
# drop_caches values: 1 = page cache, 2 = dentries and inodes, 3 = both
sync; echo 3 > /proc/sys/vm/drop_caches
sync; echo 2 > /proc/sys/vm/drop_caches
sync; echo 1 > /proc/sys/vm/drop_caches
{code}


> Memory allocation of RocksDB can be problematic in container environments
> -------------------------------------------------------------------------
>
>                 Key: FLINK-7289
>                 URL: https://issues.apache.org/jira/browse/FLINK-7289
>             Project: Flink
>          Issue Type: Improvement
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.2.0, 1.3.0, 1.4.0
>            Reporter: Stefan Richter
>
> Flink's RocksDB based state backend allocates native memory. The amount of allocated memory by RocksDB is not under the control of Flink or the JVM and can (theoretically) grow without limits.
> In container environments, this can be problematic because the process can exceed the memory budget of the container, and the process will get killed. Currently, there is no other option than trusting RocksDB to be well behaved and to follow its memory configurations. However, limiting RocksDB's memory usage is not as easy as setting a single limit parameter. The memory limit is determined by an interplay of several configuration parameters, which is almost impossible to get right for users. Even worse, multiple RocksDB instances can run inside the same process and make reasoning about the configuration also dependent on the Flink job.
> Some information about the memory management in RocksDB can be found here:
> https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB
> https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide
> We should try to figure out ways to help users in one or more of the following ways:
> - Some way to autotune or calculate the RocksDB configuration.
> - Conservative default values.
> - Additional documentation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)