
Posted to user@flink.apache.org by vinay patil <vi...@gmail.com> on 2017/02/15 11:11:11 UTC

Checkpointing with RocksDB as statebackend

Hi,

I have set the checkpointing interval to 6 seconds and the minimum pause
between checkpoints to 5 seconds. While testing the pipeline, I observed
that some checkpoints take a long time. As you can see in the attached
snapshot, checkpoint id 19 took the longest before it failed, and it never
received any acknowledgements. During those 10 minutes the entire pipeline
made no progress and no data was processed. (For example: 20M records were
processed in 13 minutes, and while the checkpoint was stuck there was no
progress for the next 10 minutes.)

I have even tried setting the checkpoint timeout to 3 minutes, but in that
case as well multiple checkpoints failed.

I have set the RocksDB FLASH_SSD_OPTIMIZED option.
What could be the issue?

P.S. I am writing to 3 S3 sinks.

checkpointing_issue.PNG
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/n11640/checkpointing_issue.PNG>  
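
For readers following along, the interplay of the interval and minimum-pause settings described in this message can be sketched with a small model. This is a simplified illustration in plain Java, not Flink's actual scheduler: it assumes only one checkpoint runs at a time, and that the next checkpoint starts at the later of (previous start + interval) and (previous end + minimum pause). The class name, method name, and the sample durations are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class CheckpointTimeline {

    /**
     * Returns the start time (in ms) of each checkpoint under a simplified model:
     * a checkpoint starts no earlier than previousStart + intervalMs and
     * no earlier than previousEnd + minPauseMs.
     */
    public static List<Long> startTimes(long intervalMs, long minPauseMs, long[] durationsMs) {
        List<Long> starts = new ArrayList<>();
        long start = 0L;
        for (long duration : durationsMs) {
            starts.add(start);
            long end = start + duration;
            // The later of the two constraints decides when the next checkpoint may start.
            start = Math.max(start + intervalMs, end + minPauseMs);
        }
        return starts;
    }

    public static void main(String[] args) {
        // Hypothetical checkpoint durations: 0.5 s, 3 s, 0.5 s.
        long[] durations = {500L, 3_000L, 500L};
        // 6 s interval and 5 s minimum pause, as in the message above.
        System.out.println(startTimes(6_000L, 5_000L, durations));
    }
}
```

In this model, with a 6 second interval and a 5 second pause, any checkpoint that takes longer than 1 second pushes the next one back. Separately, Flink's default checkpoint timeout is 10 minutes, which may be why checkpoint 19 ran for about 10 minutes before it was declared failed.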



--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Checkpointing-with-RocksDB-as-statebackend-tp11640.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.

Re: Checkpointing with RocksDB as statebackend

Posted by "施晓罡(星罡)" <xi...@alibaba-inc.com>.

Hi Vinay,
Can you provide the RocksDB LOG file? It helps a lot in figuring out problems because it records the options and the events that happened during execution. Unless configured otherwise, it should be located under the path returned by System.getProperty("java.io.tmpdir").
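
As a rough illustration of where to look, the following plain-Java sketch prints the JVM temp directory and searches beneath it for files named LOG, which is the file name RocksDB uses for its info log. The class name, the helper names, and the search depth of 6 are assumptions made for this example; the actual directory layout depends on how the state backend is configured.

```java
import java.io.IOException;
import java.nio.file.FileVisitOption;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;

public class FindRocksDbLog {

    /** True if the path's file name is exactly "LOG". */
    public static boolean isInfoLog(Path file) {
        Path name = file.getFileName();
        return name != null && name.toString().equals("LOG");
    }

    /** Collects regular files named "LOG" under root, descending at most maxDepth levels. */
    public static List<Path> findLogFiles(Path root, int maxDepth) throws IOException {
        List<Path> found = new ArrayList<>();
        Files.walkFileTree(root, EnumSet.noneOf(FileVisitOption.class), maxDepth,
                new SimpleFileVisitor<Path>() {
                    @Override
                    public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                        if (attrs.isRegularFile() && isInfoLog(file)) {
                            found.add(file);
                        }
                        return FileVisitResult.CONTINUE;
                    }

                    @Override
                    public FileVisitResult visitFileFailed(Path file, IOException exc) {
                        // Skip entries that cannot be read (for example, other users' temp dirs).
                        return FileVisitResult.CONTINUE;
                    }
                });
        return found;
    }

    public static void main(String[] args) throws IOException {
        Path tmpDir = Paths.get(System.getProperty("java.io.tmpdir"));
        System.out.println("Searching under " + tmpDir);
        for (Path log : findLogFiles(tmpDir, 6)) {
            System.out.println(log);
        }
    }
}
```

This needs to run on the machine where the task manager runs, and ideally as the same user, since the temp directory can differ per host and per JVM configuration.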
Typically, RocksDB consumes a large amount of memory to store the necessary indices. To keep memory consumption from growing without bound, you can put these indices into the block cache (set CacheIndexAndFilterBlocks to true) and set the block cache size appropriately.
You can also increase the number of background threads to improve the performance of flushes and compactions (via MaxBackgroundFlushes and MaxBackgroundCompactions).
In YARN clusters, task managers are killed if their memory utilization exceeds the allocated size. Currently, Flink does not count the memory used by RocksDB in the allocation. We are working on fine-grained resource allocation (see FLINK-5131), which may help to avoid such problems.
I hope this information helps.
Regards,
Xiaogang

------------------------------------------------------------------
From: Vinay Patil <vi...@gmail.com>
Sent: Friday, February 17, 2017, 21:19
To: user <us...@flink.apache.org>
Subject: Re: Checkpointing with RocksDB as statebackend

Hi Guys,

There seems to be some issue with RocksDB memory utilization.

Within a few minutes of the job starting, physical memory usage increases by 4-5 GB, and it keeps on increasing.
I have tried different values for Max Buffer Size (30MB, 64MB, 128MB, 512MB) with Min Buffer to Merge set to 2, but physical memory usage keeps increasing.

According to the RocksDB documentation, these are the main options that control flushing to storage.

Can you please point out what I am doing wrong? I have tried different configuration options, but each time the Task Manager gets killed after some time :)
Regards,
Vinay Patil


Re: Checkpointing with RocksDB as statebackend

Posted by Vinay Patil <vi...@gmail.com>.

Hi Guys,

There seems to be some issue with RocksDB memory utilization.

Within a few minutes of the job starting, physical memory usage increases by
4-5 GB, and it keeps on increasing.
I have tried different values for Max Buffer Size (30MB, 64MB, 128MB,
512MB) with Min Buffer to Merge set to 2, but physical memory usage keeps
increasing.

According to the RocksDB documentation, these are the main options that
control flushing to storage.
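
To put rough numbers on the buffer settings above, here is a back-of-the-envelope sketch in plain Java. It relies on the RocksDB rule that each column family can hold up to max_write_buffer_number memtables of write_buffer_size bytes each, and it assumes that every registered state gets its own column family. The counts used here (4 write buffers, 5 states) are made-up illustration values, not taken from this job, and the estimate ignores the block cache, index and filter blocks, and other native memory.

```java
public class MemtableEstimate {

    /** Upper bound on memtable bytes: one full set of write buffers per column family. */
    public static long maxMemtableBytes(long writeBufferBytes, int maxWriteBuffers, int columnFamilies) {
        return writeBufferBytes * maxWriteBuffers * columnFamilies;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long[] bufferSizesMb = {30, 64, 128, 512}; // buffer sizes mentioned in the thread
        int maxWriteBuffers = 4;                   // hypothetical max_write_buffer_number
        int columnFamilies = 5;                    // hypothetical number of registered states
        for (long sizeMb : bufferSizesMb) {
            long bytes = maxMemtableBytes(sizeMb * mb, maxWriteBuffers, columnFamilies);
            System.out.println(sizeMb + " MB buffers -> up to " + (bytes / mb)
                    + " MB of memtables per RocksDB instance");
        }
    }
}
```

If this model applies, the bound grows linearly with the buffer size, and it multiplies again by the number of RocksDB instances on a task manager (one per parallel instance of an operator with keyed state). That would be one way in which larger buffers increase physical memory use rather than reduce it.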

Can you please point out what I am doing wrong? I have tried different
configuration options, but each time the Task Manager gets killed
after some time :)

Regards,
Vinay Patil

On Thu, Feb 16, 2017 at 6:02 PM, Vinay Patil <vi...@gmail.com>
wrote:

> I think it is more related to RocksDB. I am not familiar with RocksDB
> either, but I am reading the tuning guide to understand the important
> values that can be set.
>
> Regards,
> Vinay Patil

Re: Checkpointing with RocksDB as statebackend

Posted by vinay patil <vi...@gmail.com>.

I think it is more related to RocksDB. I am not familiar with RocksDB
either, but I am reading the tuning guide to understand the important
values that can be set.

Regards,
Vinay Patil

On Thu, Feb 16, 2017 at 5:48 PM, Stefan Richter [via Apache Flink User
Mailing List archive.] <ml...@n4.nabble.com> wrote:

> What kind of problem are we talking about? S3 related or RocksDB related?
> I am not aware of problems with RocksDB per se. I think seeing logs for
> this would be very helpful.




--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Checkpointing-with-RocksDB-as-statebackend-tp11640p11674.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.

Re: Checkpointing with RocksDB as statebackend

Posted by Stefan Richter <s....@data-artisans.com>.

What kind of problem are we talking about? S3 related or RocksDB related? I am not aware of problems with RocksDB per se. I think seeing logs for this would be very helpful.

> On 16.02.2017 at 11:56, Aljoscha Krettek <al...@apache.org> wrote:
> 
> +Stefan Richter <ma...@data-artisans.com> and +Stephan Ewen <ma...@apache.org> could this be the same problem that you recently saw when working with other people?

Re: Checkpointing with RocksDB as statebackend

Posted by vinay patil <vi...@gmail.com>.

Hi Aljoscha,

Which problem are you referring to?

I am seeing unexpected stalls that last for a long time.

Also, one thing I have observed with the FLASH_SSD_OPTIMIZED option is that
it uses more physical memory and does not flush the data to storage.

I am trying to figure out the best RocksDB values for my configuration.
I am currently running the job on c3.4xlarge EC2 instances.

Regards,
Vinay Patil

On Thu, Feb 16, 2017 at 4:22 PM, Aljoscha Krettek [via Apache Flink User
Mailing List archive.] <ml...@n4.nabble.com> wrote:

> [hidden email] and [hidden email] could this be the same problem that you
> recently saw when working with other people?




--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Checkpointing-with-RocksDB-as-statebackend-tp11640p11672.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.

Re: Checkpointing with RocksDB as statebackend

Posted by Aljoscha Krettek <al...@apache.org>.

+Stefan Richter <s....@data-artisans.com> and +Stephan Ewen
<se...@apache.org> could this be the same problem that you recently saw
when working with other people?

On Wed, 15 Feb 2017 at 17:23 Vinay Patil <vi...@gmail.com> wrote:

> Hi Guys,
>
> Can anyone please help me with this issue
>
> Regards,
> Vinay Patil

Re: Checkpointing with RocksDB as statebackend

Posted by Vinay Patil <vi...@gmail.com>.

Hi Guys,

Can anyone please help me with this issue?

Regards,
Vinay Patil

On Wed, Feb 15, 2017 at 6:17 PM, Vinay Patil <vi...@gmail.com>
wrote:

> Hi Ted,
>
> I have 3 boxes in my pipeline: the 1st and 2nd boxes contain a source and an
> S3 sink, and the 3rd box is a window operator followed by chained operators
> and an S3 sink.
>
> In the More Details section I can see that the S3 sink is taking a long time
> to acknowledge, and it is not even going to the window operator chain.
>
> But as shown in the snapshot, checkpoint id 19 did not get any
> acknowledgement. I am not sure what is causing the issue.
>
> Regards,
> Vinay Patil
>
> On Wed, Feb 15, 2017 at 5:51 PM, Ted Yu [via Apache Flink User Mailing
> List archive.] <ml...@n4.nabble.com> wrote:
>
>> What did the More Details link say?
>>
>> Thanks

Re: Checkpointing with RocksDB as statebackend

Posted by vinay patil <vi...@gmail.com>.

Hi Ted,

I have 3 boxes in my pipeline: the 1st and 2nd boxes contain a source and an
S3 sink, and the 3rd box is a window operator followed by chained operators
and an S3 sink.

In the More Details section I can see that the S3 sink is taking a long time
to acknowledge, and it is not even going to the window operator chain.

But as shown in the snapshot, checkpoint id 19 did not get any
acknowledgement. I am not sure what is causing the issue.

Regards,
Vinay Patil

On Wed, Feb 15, 2017 at 5:51 PM, Ted Yu [via Apache Flink User Mailing List
archive.] <ml...@n4.nabble.com> wrote:

> What did the More Details link say?
>
> Thanks





Re: Checkpointing with RocksDB as statebackend

Posted by Ted Yu <yu...@gmail.com>.

What did the More Details link say?

Thanks

> On Feb 15, 2017, at 3:11 AM, vinay patil <vi...@gmail.com> wrote:
> 
> Hi,
> 
> I have set the checkpointing interval to 6 secs and the minimum pause between
> checkpoints to 5 secs. While testing the pipeline I have observed that some
> checkpoints take a long time: as you can see in the attached snapshot,
> checkpoint id 19 took the maximum time before it failed, although it had not
> received any acknowledgements. During these 10 minutes the entire pipeline
> did not make any progress and no data was getting processed. (For example:
> in 13 minutes 20M records were processed, and when the checkpoint took time
> there was no progress for the next 10 minutes.)
> 
> I have even tried to set the max checkpoint timeout to 3 min, but in that
> case as well multiple checkpoints were failing.
> 
> I have set RocksDB FLASH_SSD_OPTION.
> What could be the issue?
> 
> P.S. I am writing to 3 S3 sinks.
> 
> checkpointing_issue.PNG
> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/n11640/checkpointing_issue.PNG>  
> 
> 
> 
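
For readers following the thread, the settings described in the original post (6-second checkpoint interval, 5-second minimum pause, 3-minute timeout, RocksDB with the flash-SSD predefined options) might look roughly like the sketch below. This is not the poster's actual code. It assumes the Flink 1.2-era Java API with the flink-statebackend-rocksdb dependency on the classpath, it assumes that "FLASH_SSD_OPTION" refers to PredefinedOptions.FLASH_SSD_OPTIMIZED, and the class name and checkpoint URI are made up for illustration.

```java
import org.apache.flink.contrib.streaming.state.PredefinedOptions;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical class name; only the checkpoint-related setup is shown.
public class CheckpointSettingsSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 6 seconds (the interval is in milliseconds).
        env.enableCheckpointing(6_000L);

        CheckpointConfig checkpoints = env.getCheckpointConfig();
        // Leave at least 5 seconds between the end of one checkpoint
        // and the start of the next.
        checkpoints.setMinPauseBetweenCheckpoints(5_000L);
        // Declare a checkpoint failed if it has not completed within 3 minutes.
        checkpoints.setCheckpointTimeout(180_000L);

        // Assumption: the checkpoint URI below is a placeholder, not taken
        // from the thread.
        RocksDBStateBackend backend =
                new RocksDBStateBackend("s3://example-bucket/flink/checkpoints");
        // Assumption: this is the option the poster calls "FLASH_SSD_OPTION".
        backend.setPredefinedOptions(PredefinedOptions.FLASH_SSD_OPTIMIZED);
        env.setStateBackend(backend);

        // The sources, window operator, S3 sinks and env.execute(...) would
        // follow here; they are omitted because the thread does not show them.
    }
}
```

With a 6-second interval and a 5-second minimum pause, a new checkpoint is attempted almost as soon as the previous one finishes, so a slow acknowledgement from one sink can keep the job in checkpointing most of the time. Whether that is what happened here cannot be determined from the thread alone.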