You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@flink.apache.org by "施晓罡(星罡)" <xi...@alibaba-inc.com> on 2017/02/20 02:06:42 UTC

回复:Checkpointing with RocksDB as statebackend

Hi Vinay
Can you provide the LOG file in RocksDB? It helps a lot to figure out the problems becuse it records the options and the events happened during the execution. Otherwise configured, it should locate at the path set in System.getProperty("java.io.tmpdir"). 
Typically, a large amount of memory is consumed by RocksDB to store necessary indices. To avoid the unlimited growth in the memory consumption, you can put these indices into block cache (set CacheIndexAndFilterBlock to true) and properly set the block cache size. 
You can also increase the number of backgroud threads to improve the performance of flushes and compactions (via MaxBackgroundFlushes and MaxBackgroudCompactions).
In YARN clusters, task managers will be killed if their memory utilization exceeds the allocation size. Currently Flink does not count the memory used by RocksDB in the allocation. We are working on fine-grained resource allocation (see FLINK-5131). It may help to avoid such problems.
May the information helps you.
Regards,Xiaogang

------------------------------------------------------------------发件人:Vinay Patil <vi...@gmail.com>发送时间:2017年2月17日(星期五) 21:19收件人:user <us...@flink.apache.org>主 题:Re: Checkpointing with RocksDB as statebackend
Hi Guys,

There seems to be some issue with RocksDB memory utilization.

Within few minutes of job run the physical memory usage increases by 4-5 GB and it keeps on increasing.
I have tried different options for Max Buffer Size(30MB, 64MB, 128MB , 512MB) and Min Buffer to Merge as 2, but the physical memory keeps on increasing.

According to RocksDB documentation, these are the main options on which flushing to storage is based.

Can you please point me where am I doing wrong. I have tried different configuration options but each time the Task Manager is getting killed after some time :)
Regards,Vinay Patil

On Thu, Feb 16, 2017 at 6:02 PM, Vinay Patil <vi...@gmail.com> wrote:
I think its more of related to RocksDB, I am also not aware about RocksDB but reading the tuning guide to understand the important values that can be set
Regards,Vinay Patil

On Thu, Feb 16, 2017 at 5:48 PM, Stefan Richter [via Apache Flink User Mailing List archive.] <ml...@n4.nabble.com> wrote:


	What kind of problem are we talking about? S3 related or RocksDB related. I am not aware of problems with RocksDB per se. I think seeing logs for this would be very helpful.
Am 16.02.2017 um 11:56 schrieb Aljoscha Krettek <[hidden email]>:
[hidden email] and [hidden email] could this be the same problem that you recently saw when working with other people?

On Wed, 15 Feb 2017 at 17:23 Vinay Patil <[hidden email]> wrote:
Hi Guys,

Can anyone please help me with this issue
Regards,Vinay Patil

On Wed, Feb 15, 2017 at 6:17 PM, Vinay Patil <[hidden email]> wrote:
Hi Ted,

I have 3 boxes in my pipeline , 1st and 2nd box containing source and s3 sink and the 3rd box is window operator followed by chained operators and a s3 sink

So in the details link section I can see that that S3 sink is taking time for the acknowledgement and it is not even going to the window operator chain.

But as shown in the snapshot ,checkpoint id 19 did not get any acknowledgement. Not sure what is causing the issue
Regards,Vinay Patil

On Wed, Feb 15, 2017 at 5:51 PM, Ted Yu [via Apache Flink User Mailing List archive.] <[hidden email]> wrote:


	What did the More Details link say ?


Thanks 


> On Feb 15, 2017, at 3:11 AM, vinay patil <[hidden email]> wrote:

> 

> Hi,

> 

> I have kept the checkpointing interval to 6secs and minimum pause between

> checkpoints to 5secs, while testing the pipeline I have observed that that

> for some checkpoints it is taking long time , as you can see in the attached

> snapshot checkpoint id 19 took the maximum time before it gets failed,

> although it has not received any acknowledgements, now during this 10minutes

> the entire pipeline did not make any progress and no data was getting

> processed. (For Ex : In 13minutes 20M records were processed and when the

> checkpoint took time there was no progress for the next 10minutes)

> 

> I have even tried to set max checkpoint timeout to 3min, but in that case as

> well multiple checkpoints were getting failed.

> 

> I have set RocksDB FLASH_SSD_OPTION 

> What could be the issue ? 

> 

> P.S. I am writing to 3 S3 sinks 

> 

> checkpointing_issue.PNG
> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/n11640/checkpointing_issue.PNG>  

> 

> 

> 

> --

> View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Checkpointing-with-RocksDB-as-statebackend-tp11640.html
> Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.



	
	
	
	

	

	
	
		If you reply to this email, your message will be added to the discussion below:
		http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Checkpointing-with-RocksDB-as-statebackend-tp11640p11641.html
	
	
		To start a new topic under Apache Flink User Mailing List archive., email [hidden email] 

		To unsubscribe from Apache Flink User Mailing List archive., click here.

		NAML
	



	
	
	
	

	

	
	
		If you reply to this email, your message will be added to the discussion below:
		http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Checkpointing-with-RocksDB-as-statebackend-tp11640p11673.html
	
	
		To start a new topic under Apache Flink User Mailing List archive., email ml-node+s2336050n1h83@n4.nabble.com 

		To unsubscribe from Apache Flink User Mailing List archive., click here.

		NAML