Posted to user@flink.apache.org by Stephan Ewen <se...@apache.org> on 2017/03/01 11:10:46 UTC

Re: Checkpointing with RocksDB as statebackend

@vinay Can you try not setting the buffer timeout at all? I am actually not
sure what the effect of setting it to a negative value would be; that could
be a cause of problems...
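
For reference, a minimal sketch of the setting under discussion (the values
are for illustration only):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();

env.setBufferTimeout(100);    // default: flush buffers when full or every 100 ms
// env.setBufferTimeout(0);   // flush after every record (lowest latency)
// env.setBufferTimeout(-1);  // flush only when buffers are full -- the setting in question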


On Mon, Feb 27, 2017 at 7:44 PM, Seth Wiesman <sw...@mediamath.com>
wrote:

> Vinay,
>
>
>
> The bucketing sink performs rename operations during the checkpoint and if
> it tries to rename a file that is not yet consistent that would cause a
> FileNotFound exception which would fail the checkpoint.
>
>
>
> Stephan,
>
>
>
> Currently my aws fork contains some very specific assumptions about the
> pipeline that will in general only hold for my pipeline. This is because
> there were still some open questions I had about how to solve
> consistency issues in the general case. I will comment on the Jira issue
> with more specifics.
>
>
>
> Seth Wiesman
>
>
>
> *From: *vinay patil <vi...@gmail.com>
> *Reply-To: *"user@flink.apache.org" <us...@flink.apache.org>
> *Date: *Monday, February 27, 2017 at 1:05 PM
> *To: *"user@flink.apache.org" <us...@flink.apache.org>
>
> *Subject: *Re: Checkpointing with RocksDB as statebackend
>
>
>
> Hi Seth,
>
> Thank you for your suggestion.
>
> But if the issue is only related to S3, then why does this happen when I
> replace the S3 sink with HDFS as well? (For checkpointing I am using HDFS
> only.)
>
> Stephan,
>
> Another issue I see is that when I set env.setBufferTimeout(-1) and keep the
> checkpoint interval at 10 minutes, nothing gets written to the sink (tried
> with S3 as well as HDFS); at the least I was expecting pending files here.
>
> This issue gets worse when checkpointing is disabled, as nothing is
> written at all.
>
>
>
>
> Regards,
>
> Vinay Patil
>
>
>
> On Mon, Feb 27, 2017 at 10:55 PM, Stephan Ewen [via Apache Flink User
> Mailing List archive.] <[hidden email]> wrote:
>
> Hi Seth!
>
>
>
> Wow, that is an awesome approach.
>
>
>
> We have actually seen these issues as well and we are looking to
> eventually implement our own S3 file system (and circumvent Hadoop's S3
> connector that Flink currently relies on):
> https://issues.apache.org/jira/browse/FLINK-5706
>
>
>
> Do you think your patch would be a good starting point for that and would
> you be willing to share it?
>
>
>
> The Amazon AWS SDK for Java is Apache 2 licensed, so it is possible to
> fork it officially, if necessary...
>
>
>
> Greetings,
>
> Stephan
>
>
>
>
>
>
>
> On Mon, Feb 27, 2017 at 5:15 PM, Seth Wiesman <[hidden email]> wrote:
>
> Just wanted to throw in my 2 cents.
>
>
>
> I’ve been running pipelines with similar state size using RocksDB, which
> externalize checkpoints to S3 and bucket to S3. I was getting stalls like
> this and ended up tracing the problem to S3 and the bucketing sink. The
> solution was twofold:
>
>
>
> 1)       I forked hadoop-aws and have it treat Flink as the source of
> truth. EMR uses a DynamoDB table to determine if S3 is inconsistent.
> Instead, I say that if Flink believes that a file exists on S3 and we don’t
> see it, then I am going to trust that Flink is in a consistent state and S3
> is not. In this case, various operations will perform a backoff and retry
> up to a certain number of times.
>
>
>
> 2)       The bucketing sink performs multiple renames over the lifetime
> of a file, occurring when a checkpoint starts and then again on
> notification after it completes. Due to S3’s consistency guarantees the
> second rename of a file can never be assured to work and will eventually
> fail either during or after a checkpoint. Because there is no upper bound
> on the time it will take for a file on S3 to become consistent, retries
> cannot solve this specific problem, as it could take upwards of many
> minutes to rename, which would stall the entire pipeline. The only viable
> solution I could find was to write a custom sink which understands S3. Each
> writer will write the file locally and then copy it to S3 on checkpoint. By
> only interacting with S3 once per file, it can circumvent consistency
> issues altogether.
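>
> For illustration, a minimal sketch of that approach (not my actual fork's
> code; it assumes the AWS SDK v1 client and a hypothetical bucket name, and
> it omits error handling and failure recovery):
>
> import com.amazonaws.services.s3.AmazonS3;
> import com.amazonaws.services.s3.AmazonS3ClientBuilder;
> import org.apache.flink.runtime.state.FunctionInitializationContext;
> import org.apache.flink.runtime.state.FunctionSnapshotContext;
> import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
> import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
>
> import java.io.File;
> import java.io.FileWriter;
> import java.util.UUID;
>
> public class LocalThenS3Sink extends RichSinkFunction<String>
>         implements CheckpointedFunction {
>
>     private transient AmazonS3 s3;
>     private transient File localFile;
>     private transient FileWriter writer;
>
>     @Override
>     public void initializeState(FunctionInitializationContext ctx) throws Exception {
>         s3 = AmazonS3ClientBuilder.defaultClient();
>         openNewLocalFile();
>     }
>
>     @Override
>     public void invoke(String value) throws Exception {
>         // Records only touch the local disk, never S3.
>         writer.write(value);
>         writer.write('\n');
>     }
>
>     @Override
>     public void snapshotState(FunctionSnapshotContext ctx) throws Exception {
>         // The single S3 interaction per file: close the local part file,
>         // upload it, and start a new one.
>         writer.close();
>         s3.putObject("my-bucket", "output/" + localFile.getName(), localFile);
>         localFile.delete();
>         openNewLocalFile();
>     }
>
>     private void openNewLocalFile() throws Exception {
>         localFile = File.createTempFile("part-" + UUID.randomUUID(), ".tmp");
>         writer = new FileWriter(localFile);
>     }
> }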
>
>
>
> Hope this helps,
>
>
>
> Seth Wiesman
>
>
>
> *From: *vinay patil <[hidden email]>
> *Reply-To: *"[hidden email]" <[hidden email]>
> *Date: *Saturday, February 25, 2017 at 10:50 AM
> *To: *"[hidden email]" <[hidden email]>
> *Subject: *Re: Checkpointing with RocksDB as statebackend
>
>
>
> Hi Stephan,
>
> Just to avoid confusion here: I am using the S3 sink for writing the data
> and HDFS for storing checkpoints.
>
> There are two core nodes (HDFS) and two task nodes on EMR.
>
>
> I replaced the S3 sink with HDFS for writing data in my last test.
>
> Let's say the checkpoint interval is 5 minutes. Now, within 5 minutes of the
> run, the state size grows to 30GB. After checkpointing, the 30GB state that
> is maintained in RocksDB has to be copied to HDFS, right? Is this causing
> the pipeline to stall?
>
>
> Regards,
>
> Vinay Patil
>
>
>
> On Sat, Feb 25, 2017 at 12:22 AM, Vinay Patil <[hidden email]> wrote:
>
> Hi Stephan,
>
> To verify if S3 is making the pipeline stall, I have replaced the S3 sink
> with HDFS and kept the minimum pause between checkpoints at 5 minutes;
> still I see the same issue with checkpoints failing.
>
> If I keep the pause time at 20 seconds, all checkpoints complete;
> however, there is a hit in overall throughput.
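>
> For reference, a minimal sketch of this configuration (the intervals are
> the ones mentioned above):
>
> import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>
> StreamExecutionEnvironment env =
>         StreamExecutionEnvironment.getExecutionEnvironment();
> env.enableCheckpointing(5 * 60 * 1000); // checkpoint every 5 minutes
> env.getCheckpointConfig().setMinPauseBetweenCheckpoints(5 * 60 * 1000);
> // with a 20-second pause instead, all checkpoints complete, but throughput drops:
> // env.getCheckpointConfig().setMinPauseBetweenCheckpoints(20 * 1000);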
>
>
>
>
> Regards,
>
> Vinay Patil
>
>
>
> On Fri, Feb 24, 2017 at 10:09 PM, Stephan Ewen [via Apache Flink User
> Mailing List archive.] <[hidden email]> wrote:
>
> Flink's state backends currently do a good number of "make sure this
> exists" operations on the file systems. Through Hadoop's S3 filesystem,
> that translates to S3 bucket list operations, where there is a limit on how
> many operations may happen per time interval. After that, S3 blocks.
>
>
>
> It seems that operations that are totally cheap on HDFS are hellishly
> expensive (and limited) on S3. It may be that you are affected by that.
>
>
>
> We are gradually trying to improve the behavior there and be more S3 aware.
>
>
>
> Both 1.3-SNAPSHOT and 1.2-SNAPSHOT already contain improvements there.
>
>
>
> Best,
>
> Stephan
>
>
>
>
>
> On Fri, Feb 24, 2017 at 4:42 PM, vinay patil <[hidden email]> wrote:
>
> Hi Stephan,
>
> So do you mean that S3 is causing the stall? As I have mentioned in my
> previous mail, I could not see any progress for 16 minutes as checkpoints
> were failing continuously.
>
>
>
> On Feb 24, 2017 8:30 PM, "Stephan Ewen [via Apache Flink User Mailing List
> archive.]" <[hidden email]> wrote:
>
> Hi Vinay!
>
>
>
> True, the operator state (like Kafka) is currently not asynchronously
> checkpointed.
>
>
>
> While it is rather small state, we have seen before that on S3 it can
> cause trouble, because S3 frequently stalls uploads of even small data
> amounts, as low as kilobytes, due to its throttling policies.
>
>
>
> That would be a super important fix to add!
>
>
>
> Best,
>
> Stephan
>
>
>
>
>
> On Fri, Feb 24, 2017 at 2:58 PM, vinay patil <[hidden email]> wrote:
>
> Hi,
>
> I have attached a snapshot for reference:
> As you can see, all 3 checkpoints failed; for checkpoint IDs 2 and 3 it
> is stuck at the Kafka source after 50%.
> (The data sent so far by Kafka source 1 is 65GB and by source 2 is 15GB.)
>
> Within 10 minutes 15M records were processed, and for the next 16 minutes
> the pipeline was stuck; I don't see any progress beyond 15M because of
> checkpoints failing consistently.
>
> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/n11882/Checkpointing_Failed.png>
>

Re: Checkpointing with RocksDB as statebackend

Posted by vinay patil <vi...@gmail.com>.
@Stephan,

I am not using an explicit Evictor in my code. I will try using the fold
function if it does not break my existing functionality :)

@Robert: Thank you for your answer. Yes, I have already tried setting G1GC
this morning using env.java.opts, and it works.
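
For reference, this is roughly what I set (a minimal example, in
flink-conf.yaml):

env.java.opts: -XX:+UseG1GC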
Which GC is recommended for a streaming application (running on YARN on
EMR)?

Regards,
Vinay Patil

On Thu, Mar 16, 2017 at 6:36 PM, rmetzger0 [via Apache Flink User Mailing
List archive.] <ml...@n4.nabble.com> wrote:

> Yes, you can change the GC using the env.java.opts parameter.
> We are not setting any GC on YARN.
>
> On Thu, Mar 16, 2017 at 1:50 PM, Stephan Ewen <[hidden email]> wrote:
>
>> The only immediate workaround is to use windows with "reduce" or "fold"
>> or "aggregate" and not "apply". And to not use an evictor.
>>
>> The good news is that I think we have a good way of fixing this soon,
>> making an adjustment in RocksDB.
>>
>> For the YARN / G1GC question: not 100% sure about that - you can check if
>> it used G1GC. If not, you may be able to pass this through the
>> "env.java.opts" parameter. (cc Robert for confirmation)
>>
>> Stephan
>>
>>
>>
>> On Thu, Mar 16, 2017 at 8:31 AM, vinay patil <[hidden email]> wrote:
>>
>>> Hi Stephan,
>>>
>>> What can be the workaround for this ?
>>>
>>> Also, I need one confirmation: is G1 GC used by default when running the
>>> pipeline on YARN? (I see a thread from 2015 where G1 is used by default
>>> for Java 8.)
>>>
>>>
>>>
>>> Regards,
>>> Vinay Patil
>>>
>>> On Wed, Mar 15, 2017 at 10:32 PM, Stephan Ewen [via Apache Flink User
>>> Mailing List archive.] <[hidden email]> wrote:
>>>
>>>> Hi Vinay!
>>>>
>>>> Savepoints also call the same problematic RocksDB function,
>>>> unfortunately.
>>>>
>>>> We will have a fix next month. We either (1) get a patched RocksDB
>>>> version or we (2) implement a different pattern for ListState in Flink.
>>>>
>>>> (1) would be the better solution, so we are waiting for a response from
>>>> the RocksDB folks. (2) is always possible if we cannot get a fix from
>>>> RocksDB.
>>>>
>>>> Stephan
>>>>
>>>>
>>>> On Wed, Mar 15, 2017 at 5:53 PM, vinay patil <[hidden email]> wrote:
>>>>
>>>>> Hi Stephan,
>>>>>
>>>>> Thank you for making me aware of this.
>>>>>
>>>>> Yes, I am using a window without a reduce function (an apply function).
>>>>> The discussion happening on the JIRA issue is exactly what I am
>>>>> observing: consistent failure of checkpoints after some time, and then
>>>>> the stream halts.
>>>>>
>>>>> We want to go live next month; I am not sure how this will affect us in
>>>>> production, as we are going to get above 200 million records.
>>>>>
>>>>> As a workaround, can I take a savepoint while the pipeline is running?
>>>>> Let's say I take a savepoint every 30 minutes; will that work?
>>>>>
>>>>>
>>>>>
>>>>> Regards,
>>>>> Vinay Patil
>>>>>
>>>>> On Tue, Mar 14, 2017 at 10:02 PM, Stephan Ewen [via Apache Flink User
>>>>> Mailing List archive.] <[hidden email]> wrote:
>>>>>
>>>>>> The issue in Flink is
>>>>>> https://issues.apache.org/jira/browse/FLINK-5756
>>>>>>
>>>>>> On Tue, Mar 14, 2017 at 3:40 PM, Stefan Richter <[hidden email]> wrote:
>>>>>>
>>>>>>> Hi Vinay,
>>>>>>>
>>>>>>> I think the issue is tracked here:
>>>>>>> https://github.com/facebook/rocksdb/issues/1988
>>>>>>>
>>>>>>> Best,
>>>>>>> Stefan
>>>>>>>
>>>>>>> On 14.03.2017 at 15:31, Vishnu Viswanath <[hidden email]> wrote:
>>>>>>>
>>>>>>> Hi Stephan,
>>>>>>>
>>>>>>> Is there a ticket number/link to track this? My job has all the
>>>>>>> conditions you mentioned.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Vishnu
>>>>>>>
>>>>>>> On Tue, Mar 14, 2017 at 7:13 AM, Stephan Ewen <[hidden email]> wrote:
>>>>>>>
>>>>>>>> Hi Vinay!
>>>>>>>>
>>>>>>>> We just discovered a bug in RocksDB. The bug affects windows
>>>>>>>> without reduce() or fold(), windows with evictors, and ListState.
>>>>>>>>
>>>>>>>> A certain access pattern in RocksDB starts being so slow after a
>>>>>>>> certain size-per-key that it basically brings down the streaming program
>>>>>>>> and the snapshots.
>>>>>>>>
>>>>>>>> We are reaching out to the RocksDB folks and looking for
>>>>>>>> workarounds in Flink.
>>>>>>>>
>>>>>>>> Greetings,
>>>>>>>> Stephan

Re: Checkpointing with RocksDB as statebackend

Posted by Aljoscha Krettek <al...@apache.org>.
That’s great to hear!

Maybe it would make sense to add these defaults to Flink, if they don’t otherwise degrade performance.

Best,
Aljoscha



Re: Checkpointing with RocksDB as statebackend

Posted by Vinay Patil <vi...@gmail.com>.
Hi Guys,

I am able to overcome the physical memory consumption issue by setting the
following options of RocksDB:

*DBOptions:*
     along with the FLASH_SSD_OPTIONS, I added the following:
     maxBackgroundCompactions(4)

*ColumnFamilyOptions:*
  max_buffer_size : 512 MB
  block_cache_size : 128 MB
  max_write_buffer_number : 5
  minimum_buffer_number_to_merge : 2
  cacheIndexAndFilterBlocks : true
  optimizeFilterForHits : true
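
For reference, a minimal sketch of how such options can be wired in through
Flink's OptionsFactory (the RocksDB setter names are my best mapping of the
list above; the checkpoint path is just an example):

import org.apache.flink.contrib.streaming.state.OptionsFactory;
import org.apache.flink.contrib.streaming.state.PredefinedOptions;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.DBOptions;

RocksDBStateBackend backend =
        new RocksDBStateBackend("hdfs:///flink/checkpoints");
backend.setPredefinedOptions(PredefinedOptions.FLASH_SSD_OPTIMIZED);
backend.setOptions(new OptionsFactory() {
    @Override
    public DBOptions createDBOptions(DBOptions opts) {
        // createStatistics() / setStatsDumpPeriodSec(...) would also go
        // here for the statistics question below
        return opts.setMaxBackgroundCompactions(4);
    }

    @Override
    public ColumnFamilyOptions createColumnOptions(ColumnFamilyOptions opts) {
        return opts
                .setWriteBufferSize(512 * 1024 * 1024)      // "max_buffer_size"
                .setMaxWriteBufferNumber(5)
                .setMinWriteBufferNumberToMerge(2)
                .setOptimizeFiltersForHits(true)
                .setTableFormatConfig(new BlockBasedTableConfig()
                        .setBlockCacheSize(128 * 1024 * 1024)
                        .setCacheIndexAndFilterBlocks(true));
    }
});
env.setStateBackend(backend); // env is the StreamExecutionEnvironment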

I am going to modify these options to get the desired performance.

I have set DbLogDir("/var/tmp/") and StatsDumpPeriodSec, but I only got
the configuration settings in the log file present in /var/tmp/.

Where will I get the RocksDB statistics if I set createStatistics?

Let me know if any other configurations will help to get better
performance. The physical memory is now increasing slowly, and I can see
drops in the graph (this means flushing is taking place at regular
intervals).

Regards,
Vinay Patil


Re: Checkpointing with RocksDB as statebackend

Posted by Vinay Patil <vi...@gmail.com>.
The state size is not that huge. On the Flink UI, when it showed the data
sent as 4GB, the physical memory usage was close to 90GB.

I will re-run after setting the flushing options of RocksDB, because I am
facing this issue on 1.2.0 as well.

Regards,
Vinay Patil


Re: Checkpointing with RocksDB as statebackend

Posted by Aljoscha Krettek <al...@apache.org>.
Yup, I got that. I’m just wondering whether this occurs only with enabled checkpointing or also when checkpointing is disabled.


Re: Checkpointing with RocksDB as statebackend

Posted by Vinay Patil <vi...@gmail.com>.
Hi Aljoscha,

Yes, I have tried with 1.2.1 and 1.3.0 and face the same issue.

The issue is not with heap memory; it is the off-heap memory that keeps
getting used (please refer to the earlier snapshot I attached, in which
the graph keeps growing).


Regards,
Vinay Patil


Re: Checkpointing with RocksDB as statebackend

Posted by Aljoscha Krettek <al...@apache.org>.
Just a quick remark: Flink 1.3.0 and 1.2.1 always use FRocksDB, you shouldn’t manually specify that.



Re: Checkpointing with RocksDB as statebackend

Posted by vinay patil <vi...@gmail.com>.
Hi Gerry,

I have faced this issue on 1.3.0 as well, even when using FRocksDB and
enabling incremental checkpointing.

You can add the FRocksDB dependency as shown here:
https://github.com/apache/flink/pull/3704

We will have to set some RocksDB parameters to get this working.

@Stefan or @Stephan: can you please help in resolving this issue?

Regards,
Vinay Patil

On Thu, Jun 29, 2017 at 6:01 PM, gerryzhou [via Apache Flink User Mailing
List archive.] <ml...@n4.nabble.com> wrote:

> Hi, Vinay,
>      I observed a similar problem in flink 1.3.0 with rocksdb. I wonder
> how to use FRocksDB as you mentioned above. Thanks.


Re: Checkpointing with RocksDB as statebackend

Posted by Aljoscha Krettek <al...@apache.org>.
Hi Vinay,

When you say HDFS usage is low and nothing is getting flushed to disk, what do you mean by that? RocksDB does not flush to HDFS; only checkpoints are written to HDFS, and there you can check how big the checkpointed state actually is.

Have you tried running with this newer version of Flink without checkpointing? I.e. do you also see the growing heap memory there?

Best,
Aljoscha
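
A quick way to do this check: "hdfs dfs -du -h /flink/checkpoints" on the
command line, or the equivalent with the Hadoop FileSystem API. A minimal
sketch, with a placeholder path and exception handling omitted:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sum the size of everything under the checkpoint directory.
    FileSystem fs = FileSystem.get(URI.create("hdfs:///flink/checkpoints"),
                                   new Configuration());
    long bytes = fs.getContentSummary(new Path("/flink/checkpoints")).getLength();
    System.out.println("Checkpointed state on HDFS: " + (bytes >> 20) + " MB");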

> On 29. Jun 2017, at 14:31, gerryzhou <su...@163.com> wrote:
> 
> Hi, Vinay,
>     I observed a similar problem in flink 1.3.0 with rocksdb. I wonder how
> to use FRocksDB as you mentioned above. Thanks.


Re: Checkpointing with RocksDB as statebackend

Posted by gerryzhou <su...@163.com>.
Hi, Vinay,
     I observed a similar problem in Flink 1.3.0 with RocksDB. I wonder how
to use FRocksDB as you mentioned above. Thanks.




Re: Checkpointing with RocksDB as statebackend

Posted by Vinay Patil <vi...@gmail.com>.
Hi Xiaogang,

Yes, I have set that, and I still get the same issue; I don't see the graph
coming down. I also checked the HDFS usage: only 3GB is being used, which
means nothing is getting flushed to disk.

I think the parameters are not getting set properly. I am using FRocksDB;
is it causing this error?


Regards,
Vinay Patil


Re: Checkpointing with RocksDB as statebackend

Posted by SHI Xiaogang <sh...@gmail.com>.
Hi Vinay,

We observed a similar problem before. We found that RocksDB keeps a lot of
index and filter blocks in memory. With the growth in state size (in our
cases, most states are only cleared in windowed streams), these blocks will
occupy much more memory.

We now let RocksDB put these blocks into the block cache (via
setCacheIndexAndFilterBlocks), and limit the memory usage of RocksDB via
the block cache size. Performance may be degraded, but the TMs can avoid
being killed by YARN for overusing memory.

This may not be the cause of your problem, but it may be helpful.

Regards,
Xiaogang
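
For reference, a minimal sketch of the configuration described above,
wiring an OptionsFactory into the RocksDB backend so that index/filter
blocks live in a bounded block cache. This is illustrative only: the
checkpoint path and the 256MB cache size are placeholder values, and
exception handling is omitted.

    import org.apache.flink.contrib.streaming.state.OptionsFactory;
    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.rocksdb.BlockBasedTableConfig;
    import org.rocksdb.ColumnFamilyOptions;
    import org.rocksdb.DBOptions;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    RocksDBStateBackend backend = new RocksDBStateBackend("hdfs:///flink/checkpoints");
    backend.setOptions(new OptionsFactory() {
        @Override
        public DBOptions createDBOptions(DBOptions currentOptions) {
            return currentOptions; // keep Flink's defaults for DB-level options
        }

        @Override
        public ColumnFamilyOptions createColumnOptions(ColumnFamilyOptions currentOptions) {
            BlockBasedTableConfig tableConfig = new BlockBasedTableConfig();
            // Keep index and filter blocks inside the block cache so they
            // count against its capacity instead of growing with state size.
            tableConfig.setCacheIndexAndFilterBlocks(true);
            // Bound the block cache itself; tune this to the container budget.
            tableConfig.setBlockCacheSize(256 * 1024 * 1024L);
            return currentOptions.setTableFormatConfig(tableConfig);
        }
    });
    env.setStateBackend(backend);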






>

Re: Checkpointing with RocksDB as statebackend

Posted by Vinay Patil <vi...@gmail.com>.
Hi Aljoscha,

I am using an event-time tumbling window where allowedLateness is set to
Long.MAX_VALUE, and I have a custom trigger similar to the 1.0.3 behaviour
where Flink was not discarding late elements (we have discussed this
scenario before).

The watermark is working correctly because I have validated the records
earlier.

I suspected that the RocksDB statebackend was not set, but in the logs I
can clearly see that RocksDB is initialized successfully, so that should
not be the issue.

I have also not changed any major code since the last performance test I
did.

The snapshot I attached is of off-heap memory; I have only assigned 12GB
of heap memory per TM.


Regards,
Vinay Patil
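
For illustration, the job described above has roughly the following shape.
The element type, key, MyCustomTrigger, and MyWindowFunction are
hypothetical stand-ins, not code from this thread. Note that a window
using apply() without a reduce/fold pre-aggregation buffers every element
in RocksDB ListState, which is exactly the access pattern flagged later in
the thread; switching to reduce()/fold()/aggregate() is the suggested
workaround.

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    DataStream<Tuple2<String, Long>> events = ...; // source with timestamps/watermarks assigned

    events
        .keyBy(0)
        .window(TumblingEventTimeWindows.of(Time.minutes(10)))
        // effectively never drop late elements, as described above
        .allowedLateness(Time.milliseconds(Long.MAX_VALUE))
        .trigger(new MyCustomTrigger())    // hypothetical 1.0.x-style trigger
        .apply(new MyWindowFunction());    // buffers elements in ListState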


Re: Checkpointing with RocksDB as statebackend

Posted by Aljoscha Krettek <al...@apache.org>.
Hi,

Just a quick question, because I’m not sure whether this came up in the discussion so far: what kind of windows are you using? Processing time/event time? Sliding Windows/Tumbling Windows? Allowed lateness? How is the watermark behaving?

Also, the latest memory usage graph you sent, is that heap memory or off-heap memory or both?

Best,
Aljoscha



Re: Checkpointing with RocksDB as statebackend

Posted by vinay patil <vi...@gmail.com>.
Hi Stephan,

I am observing a similar issue with Flink 1.2.1.

The memory is continuously increasing and data is not getting flushed to
disk.

I have attached the snapshot for reference.

Also, the data processed till now is only 17GB, yet above 120GB of memory
is getting used.

Is there any change w.r.t. the RocksDB configurations?

<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/n14013/TM_Memory_Usage.png> 

Regards,
Vinay Patil




Re: Checkpointing with RocksDB as statebackend

Posted by Vinay Patil <vi...@gmail.com>.
Hi Stephan,

I have upgraded to Flink 1.3.0 to test RocksDB with incremental
checkpointing (the PredefinedOptions used is FLASH_SSD_OPTIMIZED).

I am currently creating a YARN session and running the job on EMR with
r3.4xlarge instances (122GB of memory), and I have observed that it is
utilizing almost all of the memory. This was not happening with the
previous version; at most 30GB was utilized.

Because of this issue the job manager was killed and the job failed.

Are there any other configurations I have to set?

P.S. I am currently using FRocksDB.


Regards,
Vinay Patil
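
For context, the setup being described corresponds roughly to the
following sketch: the RocksDB backend on Flink 1.3.0 with incremental
checkpointing enabled and the FLASH_SSD_OPTIMIZED predefined options.
Illustrative only; the checkpoint path and interval are placeholders, and
exception handling is omitted.

    import org.apache.flink.contrib.streaming.state.PredefinedOptions;
    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // The second constructor argument enables incremental checkpoints (Flink 1.3+).
    RocksDBStateBackend backend =
        new RocksDBStateBackend("hdfs:///flink/checkpoints", true);
    backend.setPredefinedOptions(PredefinedOptions.FLASH_SSD_OPTIMIZED);

    env.setStateBackend(backend);
    env.enableCheckpointing(10 * 60 * 1000); // e.g. a 10-minute checkpoint interval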

On Fri, May 5, 2017 at 1:01 PM, Vinay Patil <vi...@gmail.com> wrote:

> Hi Stephan,
>
> I tested the pipeline with the FRocksDB dependency (with the SSD_OPTIMIZED
> option), and none of the checkpoints failed.
>
> For checkpointing 10GB of state it took 45secs which is better than the
> previous results.
>
> Let me know if there are any other configurations which will help to get
> better results.
>
> Regards,
> Vinay Patil
>
> On Thu, May 4, 2017 at 10:05 PM, Vinay Patil <vi...@gmail.com>
> wrote:
>
>> Hi Stephan,
>>
>> I see that the RocksDb issue is solved by having a separate FRocksDB
>> dependency.
>>
>> I have added this dependency as discussed on the JIRA. Is that the only
>> thing we have to do, or do we have to change the code for setting the
>> RocksDB state backend as well?
>>
>>
>>
>> Regards,
>> Vinay Patil
>>
>>>>>>>>>>>>>> On Mon, Feb 27, 2017 at 5:15 PM, Seth Wiesman <[hidden email]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Just wanted to throw in my 2cts.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I’ve been running pipelines with similar state size using
>>>>>>>>>>>>>> rocksdb which externalize to S3 and bucket to S3. I was getting stalls like
>>>>>>>>>>>>>> this and ended up tracing the problem to S3 and the bucketing sink. The
>>>>>>>>>>>>>> solution was two fold:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1)       I forked hadoop-aws and have it treat flink as a
>>>>>>>>>>>>>> source of truth. Emr uses a dynamodb table to determine if S3 is
>>>>>>>>>>>>>> inconsistent. Instead I say that if flink believes that a file exists on S3
>>>>>>>>>>>>>> and we don’t see it then I am going to trust that flink is in a consistent
>>>>>>>>>>>>>> state and S3 is not. In this case, various operations will perform a back
>>>>>>>>>>>>>> off and retry up to a certain number of times.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2)       The bucketing sink performs multiple renames over
>>>>>>>>>>>>>> the lifetime of a file, occurring when a checkpoint starts and then again
>>>>>>>>>>>>>> on notification after it completes. Due to S3’s consistency guarantees the
>>>>>>>>>>>>>> second rename of file can never be assured to work and will eventually fail
>>>>>>>>>>>>>> either during or after a checkpoint. Because there is no upper bound on the
>>>>>>>>>>>>>> time it will take for a file on S3 to become consistent, retries cannot
>>>>>>>>>>>>>> solve this specific problem as it could take upwards of many minutes to
>>>>>>>>>>>>>> rename which would stall the entire pipeline. The only viable solution I
>>>>>>>>>>>>>> could find was to write a custom sink which understands S3. Each writer
>>>>>>>>>>>>>> will write file locally and then copy it to S3 on checkpoint. By only
>>>>>>>>>>>>>> interacting with S3 once per file it can circumvent consistency issues all
>>>>>>>>>>>>>> together.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hope this helps,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Seth Wiesman
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *From: *vinay patil <[hidden email]>
>>>>>>>>>>>>>> *Date: *Saturday, February 25, 2017 at 10:50 AM
>>>>>>>>>>>>>> *Subject: *Re: Checkpointing with RocksDB as statebackend
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Stephan,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Just to avoid the confusion here, I am using S3 sink for
>>>>>>>>>>>>>> writing the data, and using HDFS for storing checkpoints.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There are 2 core nodes (HDFS) and two task nodes on EMR
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I replaced the S3 sink with HDFS for writing data in my last test.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Let's say the checkpoint interval is 5 minutes, and within 5 minutes of
>>>>>>>>>>>>>> the run the state size grows to 30GB. After a checkpoint is triggered,
>>>>>>>>>>>>>> the 30GB of state maintained in RocksDB has to be copied to HDFS, right?
>>>>>>>>>>>>>> Is this what is causing the pipeline to stall?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Vinay Patil
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, Feb 25, 2017 at 12:22 AM, Vinay Patil <[hidden
>>>>>>>>>>>>>> email]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Stephan,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> To verify if S3 is making the pipeline stall, I have replaced
>>>>>>>>>>>>>> the S3 sink with HDFS and kept the minimum pause between checkpoints at
>>>>>>>>>>>>>> 5 minutes; still I see the same issue with checkpoints getting failed.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If I keep the pause time at 20 seconds, all checkpoints are
>>>>>>>>>>>>>> completed; however, there is a hit in overall throughput.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Vinay Patil
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Feb 24, 2017 at 10:09 PM, Stephan Ewen [via Apache
>>>>>>>>>>>>>> Flink User Mailing List archive.] <[hidden email]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Flink's state backends currently do a good number of "make
>>>>>>>>>>>>>> sure this exists" operations on the file systems. Through Hadoop's S3
>>>>>>>>>>>>>> filesystem, that translates to S3 bucket list operations, where there is a
>>>>>>>>>>>>>> limit on how many operations may happen per time interval. After that, S3
>>>>>>>>>>>>>> blocks.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It seems that operations that are totally cheap on HDFS are
>>>>>>>>>>>>>> hellishly expensive (and limited) on S3. It may be that you are affected by
>>>>>>>>>>>>>> that.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We are gradually trying to improve the behavior there and be
>>>>>>>>>>>>>> more S3 aware.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Both 1.3-SNAPSHOT and 1.2-SNAPSHOT already contain
>>>>>>>>>>>>>> improvements there.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Stephan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Feb 24, 2017 at 4:42 PM, vinay patil <[hidden email]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Stephan,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So do you mean that S3 is causing the stall? As I
>>>>>>>>>>>>>> mentioned in my previous mail, I could not see any progress for 16 minutes,
>>>>>>>>>>>>>> as checkpoints were failing continuously.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Feb 24, 2017 8:30 PM, "Stephan Ewen [via Apache Flink User
>>>>>>>>>>>>>> Mailing List archive.]" <[hidden email]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Vinay!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> True, the operator state (like Kafka) is currently not
>>>>>>>>>>>>>> asynchronously checkpointed.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> While it is rather small state, we have seen before that on
>>>>>>>>>>>>>> S3 it can cause trouble, because S3 frequently stalls uploads of even data
>>>>>>>>>>>>>> amounts as low as kilobytes due to its throttling policies.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That would be a super important fix to add!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Stephan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Feb 24, 2017 at 2:58 PM, vinay patil <[hidden email]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have attached a snapshot for reference:
>>>>>>>>>>>>>> As you can see, all 3 checkpoints failed; for
>>>>>>>>>>>>>> checkpoint IDs 2 and 3 it
>>>>>>>>>>>>>> is stuck at the Kafka source after 50%
>>>>>>>>>>>>>> (The data sent till now by Kafka source 1 is 65GB and sent by
>>>>>>>>>>>>>> source 2 is
>>>>>>>>>>>>>> 15GB )
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Within 10 minutes, 15M records were processed; for the next
>>>>>>>>>>>>>> 16 minutes the
>>>>>>>>>>>>>> pipeline was stuck, and I didn't see any progress beyond 15M
>>>>>>>>>>>>>> because of
>>>>>>>>>>>>>> checkpoints failing consistently.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> <http://apache-flink-user-mailing-list-archive.2336050.n4.na
>>>>>>>>>>>>>> bble.com/file/n11882/Checkpointing_Failed.png>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
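
The checkpoint pacing debated in the quoted messages above (interval vs.
minimum pause) and the buffer timeout from earlier in the thread map to
the following settings. A minimal sketch with illustrative values, not a
recommendation:

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Trigger a checkpoint every 10 minutes...
    env.enableCheckpointing(10 * 60 * 1000);
    // ...but guarantee some quiet time in between, so the job can make
    // progress even when snapshots are slow (the 20-second vs. 5-minute
    // trade-off discussed above).
    env.getCheckpointConfig().setMinPauseBetweenCheckpoints(20 * 1000);

    // Per Stephan's advice in the thread, leave the network buffer timeout
    // at its default (100 ms) rather than setting it to -1.
    // env.setBufferTimeout(100);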

Re: Checkpointing with RocksDB as statebackend

Posted by Stefan Richter <s....@data-artisans.com>.
Hi,

I was able to come up with a custom build of RocksDB yesterday that seems to fix the problems. I still have to build the native code for different platforms and then test it. I cannot make promises about the 1.2.1 release, but I would be optimistic that this will make it in.

Best,
Stefan

> On 27.03.2017, at 19:12, vinay patil <vi...@gmail.com> wrote:
> 
> Hi Stephan,
> 
> Just an update: last week I did a run with state size close to 18GB, and I did not observe the pipeline getting stopped in between with G1GC enabled.
> 
> I had observed checkpoint failures when the state size was close to 38GB (but in that case G1GC was not enabled).
> 
> Is it possible to get the RocksDB fix into 1.2.1 so that I can test it out?
> 
> 
> Regards,
> Vinay Patil
> 
> On Sat, Mar 18, 2017 at 12:25 AM, Stephan Ewen [via Apache Flink User Mailing List archive.] <[hidden email]> wrote:
> @vinay Let's see how fast we get this fix in - I hope yes. It may depend also a bit on the RocksDB community.
> 
> In any case, if it does not make it in, we can do a 1.2.2 release immediately after (I think the problem is big enough to warrant that), or at least release a custom version of the RocksDB state backend that includes the fix.
> 
> Stephan
> 
> 
> On Fri, Mar 17, 2017 at 5:51 PM, vinay patil <[hidden email]> wrote:
> Hi Stephan,
> 
> Is the performance-related change of RocksDB going to be part of Flink 1.2.1?
> 
> Regards,
> Vinay Patil
> 
> On Thu, Mar 16, 2017 at 6:13 PM, Stephan Ewen [via Apache Flink User Mailing List archive.] <[hidden email]> wrote:
> The only immediate workaround is to use windows with "reduce" or "fold" or "aggregate" and not "apply". And to not use an evictor.
> 
> The good news is that I think we have a good way of fixing this soon, making an adjustment in RocksDB.
> 
> For the Yarn / g1gc question: Not 100% sure about that - you can check if it used g1gc. If not, you may be able to pass this through the "env.java.opts" parameter. (cc robert for confirmation)
> 
> Stephan
> 
> 
> 
> On Thu, Mar 16, 2017 at 8:31 AM, vinay patil <[hidden email]> wrote:
> Hi Stephan,
> 
> What can be the workaround for this?
> 
> Also, I need one confirmation: is G1 GC used by default when running the pipeline on YARN? (I see a thread from 2015 where G1 is used by default for Java 8.)
> 
> 
> 
> Regards,
> Vinay Patil
> 
> On Wed, Mar 15, 2017 at 10:32 PM, Stephan Ewen [via Apache Flink User Mailing List archive.] <[hidden email]> wrote:
> Hi Vinay!
> 
> Savepoints also call the same problematic RocksDB function, unfortunately.
> 
> We will have a fix next month. We either (1) get a patched RocksDB version or we (2) implement a different pattern for ListState in Flink.
> 
> (1) would be the better solution, so we are waiting for a response from the RocksDB folks. (2) is always possible if we cannot get a fix from RocksDB.
> 
> Stephan
> 
> 
> On Wed, Mar 15, 2017 at 5:53 PM, vinay patil <[hidden email]> wrote:
> Hi Stephan,
> 
> Thank you for making me aware of this.
> 
> Yes, I am using a window without a reduce function (an apply function). The discussion happening on JIRA is exactly what I am observing: consistent failure of checkpoints after some time, and then the stream halts.
> 
> We want to go live next month; I am not sure how this will affect us in production, as we are going to get above 200 million records.
> 
> As a workaround, can I take a savepoint while the pipeline is running? Let's say I take a savepoint every 30 minutes; will it work?
> 
> 
> 
> Regards,
> Vinay Patil
> 
> On Tue, Mar 14, 2017 at 10:02 PM, Stephan Ewen [via Apache Flink User Mailing List archive.] <[hidden email]> wrote:
> The issue in Flink is https://issues.apache.org/jira/browse/FLINK-5756
> 
> On Tue, Mar 14, 2017 at 3:40 PM, Stefan Richter <[hidden email]> wrote:
> Hi Vinay,
> 
> I think the issue is tracked here: https://github.com/facebook/rocksdb/issues/1988 <https://github.com/facebook/rocksdb/issues/1988>.
> 
> Best,
> Stefan
> 
>> Am 14.03.2017 um 15:31 schrieb Vishnu Viswanath <[hidden email] <http://user/SendEmail.jtp?type=node&node=12209&i=1>>:
>> 
>> Hi Stephan,
>> 
>> Is there a ticket number/link to track this, My job has all the conditions you mentioned.
>> 
>> Thanks,
>> Vishnu
>> 
>> On Tue, Mar 14, 2017 at 7:13 AM, Stephan Ewen <[hidden email] <http://user/SendEmail.jtp?type=node&node=12209&i=2>> wrote:
>> Hi Vinay!
>> 
>> We just discovered a bug in RocksDB. The bug affects windows without reduce() or fold(), windows with evictors, and ListState.
>> 
>> A certain access pattern in RocksDB starts being so slow after a certain size-per-key that it basically brings down the streaming program and the snapshots.
>> 
>> We are reaching out to the RocksDB folks and looking for workarounds in Flink.
>> 
>> Greetings,
>> Stephan
>> 
>> 
>> On Wed, Mar 1, 2017 at 12:10 PM, Stephan Ewen <[hidden email] <http://user/SendEmail.jtp?type=node&node=12209&i=3>> wrote:
>> @vinay  Can you try to not set the buffer timeout at all? I am actually not sure what would be the effect of setting it to a negative value, that can be a cause of problems...
>> 
>> 
>> On Mon, Feb 27, 2017 at 7:44 PM, Seth Wiesman <[hidden email] <http://user/SendEmail.jtp?type=node&node=12209&i=4>> wrote:
>> Vinay, 
>> 
>>  
>> 
>> The bucketing sink performs rename operations during the checkpoint and if it tries to rename a file that is not yet consistent that would cause a FileNotFound exception which would fail the checkpoint. 
>> 
>>  
>> 
>> Stephan, 
>> 
>>  
>> 
>> Currently my aws fork contains some very specific assumptions about the pipeline that will in general only hold for my pipeline. This is because there were still some open questions that  I had about how to solve consistency issues in the general case. I will comment on the Jira issue with more specific.
>> 
>>  
>> 
>> Seth Wiesman
>> 
>>  
>> 
>> From: vinay patil <[hidden email] <http://user/SendEmail.jtp?type=node&node=12209&i=5>>
>> Reply-To: "[hidden email] <http://user/SendEmail.jtp?type=node&node=12209&i=6>" <[hidden email] <http://user/SendEmail.jtp?type=node&node=12209&i=7>>
>> Date: Monday, February 27, 2017 at 1:05 PM
>> 
>> To: "[hidden email] <http://user/SendEmail.jtp?type=node&node=12209&i=8>" <[hidden email] <http://user/SendEmail.jtp?type=node&node=12209&i=9>>
>> 
>> 
>> Subject: Re: Checkpointing with RocksDB as statebackend
>> 
>>  
>> 
>> Hi Seth,
>> 
>> Thank you for your suggestion.
>> 
>> But if the issue is only related to S3, then why does this happen when I replace the S3 sink  to HDFS as well (for checkpointing I am using HDFS only )
>> 
>> Stephan,
>> 
>> Another issue I see is when I set env.setBufferTimeout(-1) , and keep the checkpoint interval to 10minutes, I have observed that nothing gets written to sink (tried with S3 as well as HDFS), atleast I was expecting pending files here.
>> 
>> This issue gets worst when checkpointing is disabled  as nothing is written.
>> 
>>  
>> 
>> 
>> 
>> Regards,
>> 
>> Vinay Patil
>> 
>>  
>> 
>> On Mon, Feb 27, 2017 at 10:55 PM, Stephan Ewen [via Apache Flink User Mailing List archive.] <[hidden email] <>> wrote:
>> 
>> Hi Seth! 
>> 
>>  
>> 
>> Wow, that is an awesome approach.
>> 
>>  
>> 
>> We have actually seen these issues as well and we are looking to eventually implement our own S3 file system (and circumvent Hadoop's S3 connector that Flink currently relies on): https://issues.apache.org/jira/browse/FLINK-5706 <https://issues.apache.org/jira/browse/FLINK-5706>
>>  
>> 
>> Do you think your patch would be a good starting point for that and would you be willing to share it?
>> 
>>  
>> 
>> The Amazon AWS SDK for Java is Apache 2 licensed, so that is possible to fork officially, if necessary...
>> 
>>  
>> 
>> Greetings,
>> 
>> Stephan
>> 
>>  
>> 
>>  
>> 
>>  
>> 
>> On Mon, Feb 27, 2017 at 5:15 PM, Seth Wiesman <[hidden email] <http://user/SendEmail.jtp?type=node&node=11943&i=0>> wrote:
>> 
>> Just wanted to throw in my 2cts.  
>> 
>>  
>> 
>> I’ve been running pipelines with similar state size using rocksdb which externalize to S3 and bucket to S3. I was getting stalls like this and ended up tracing the problem to S3 and the bucketing sink. The solution was two fold:
>> 
>>  
>> 
>> 1)       I forked hadoop-aws and have it treat flink as a source of truth. Emr uses a dynamodb table to determine if S3 is inconsistent. Instead I say that if flink believes that a file exists on S3 and we don’t see it then I am going to trust that flink is in a consistent state and S3 is not. In this case, various operations will perform a back off and retry up to a certain number of times.
>> 
>>  
>> 
>> 2)       The bucketing sink performs multiple renames over the lifetime of a file, occurring when a checkpoint starts and then again on notification after it completes. Due to S3’s consistency guarantees the second rename of file can never be assured to work and will eventually fail either during or after a checkpoint. Because there is no upper bound on the time it will take for a file on S3 to become consistent, retries cannot solve this specific problem as it could take upwards of many minutes to rename which would stall the entire pipeline. The only viable solution I could find was to write a custom sink which understands S3. Each writer will write file locally and then copy it to S3 on checkpoint. By only interacting with S3 once per file it can circumvent consistency issues all together. 
>> 
>>  
>> 
>> Hope this helps,
>> 
>>  
>> 
>> Seth Wiesman
>> 
>>  
>> 
>> From: vinay patil <[hidden email] <http://user/SendEmail.jtp?type=node&node=11943&i=1>>
>> Reply-To: "[hidden email] <http://user/SendEmail.jtp?type=node&node=11943&i=2>" <[hidden email] <http://user/SendEmail.jtp?type=node&node=11943&i=3>>
>> Date: Saturday, February 25, 2017 at 10:50 AM
>> To: "[hidden email] <http://user/SendEmail.jtp?type=node&node=11943&i=4>" <[hidden email] <http://user/SendEmail.jtp?type=node&node=11943&i=5>>
>> Subject: Re: Checkpointing with RocksDB as statebackend
>> 
>>  
>> 
>> HI Stephan,
>> 
>> Just to avoid the confusion here, I am using S3 sink for writing the data, and using HDFS for storing checkpoints.
>> 
>> There are 2 core nodes (HDFS) and two task nodes on EMR
>> 
>> 
>> I replaced s3 sink with HDFS for writing data in my last test.
>> 
>> Let's say the checkpoint interval is 5 minutes, now within 5minutes of run the state size grows to 30GB ,  after checkpointing the 30GB state that is maintained in rocksDB has to be copied to HDFS, right ?  is this causing the pipeline to stall ?
>> 
>> 
>> 
>> Regards,
>> 
>> Vinay Patil
>> 
>>  
>> 
>> On Sat, Feb 25, 2017 at 12:22 AM, Vinay Patil <[hidden email]> wrote:
>> 
>> Hi Stephan,
>> 
>> To verify if S3 is making teh pipeline stall, I have replaced the S3 sink with HDFS and kept minimum pause between checkpoints to 5minutes, still I see the same issue with checkpoints getting failed.
>> 
>> If I keep the  pause time to 20 seconds, all checkpoints are completed , however there is a hit in overall throughput.
>> 
>>  
>> 
>> 
>> 
>> Regards,
>> 
>> Vinay Patil
>> 
>>  
>> 
>> On Fri, Feb 24, 2017 at 10:09 PM, Stephan Ewen [via Apache Flink User Mailing List archive.] <[hidden email]> wrote:
>> 
>> Flink's state backends currently do a good number of "make sure this exists" operations on the file systems. Through Hadoop's S3 filesystem, that translates to S3 bucket list operations, where there is a limit in how many operation may happen per time interval. After that, S3 blocks.
>> 
>>  
>> 
>> It seems that operations that are totally cheap on HDFS are hellishly expensive (and limited) on S3. It may be that you are affected by that.
>> 
>>  
>> 
>> We are gradually trying to improve the behavior there and be more S3 aware.
>> 
>>  
>> 
>> Both 1.3-SNAPSHOT and 1.2-SNAPSHOT already contain improvements there.
>> 
>>  
>> 
>> Best,
>> 
>> Stephan
>> 
>>  
>> 
>>  
>> 
>> On Fri, Feb 24, 2017 at 4:42 PM, vinay patil <[hidden email] <http://user/SendEmail.jtp?type=node&node=11891&i=0>> wrote:
>> 
>> Hi Stephan,
>> 
>> So do you mean that S3 is causing the stall , as I have mentioned in my previous mail, I could not see any progress for 16minutes as checkpoints were getting failed continuously.
>> 
>>  
>> 
>> On Feb 24, 2017 8:30 PM, "Stephan Ewen [via Apache Flink User Mailing List archive.]" <[hidden email] <http://user/SendEmail.jtp?type=node&node=11887&i=0>> wrote:
>> 
>> Hi Vinay!
>> 
>>  
>> 
>> True, the operator state (like Kafka) is currently not asynchronously checkpointed.
>> 
>>  
>> 
>> While it is rather small state, we have seen before that on S3 it can cause trouble, because S3 frequently stalls uploads of even data amounts as low as kilobytes due to its throttling policies.
>> 
>>  
>> 
>> That would be a super important fix to add!
>> 
>>  
>> 
>> Best,
>> 
>> Stephan
>> 
>>  
>> 
>>  
>> 
>> On Fri, Feb 24, 2017 at 2:58 PM, vinay patil <[hidden email] <http://user/SendEmail.jtp?type=node&node=11885&i=0>> wrote:
>> 
>> Hi,
>> 
>> I have attached a snapshot for reference:
>> As you can see all the 3 checkpointins failed , for checkpoint ID 2 and 3 it
>> is stuck at the Kafka source after 50%
>> (The data sent till now by Kafka source 1 is 65GB and sent by source 2 is
>> 15GB )
>> 
>> Within 10minutes 15M records were processed, and for the next 16minutes the
>> pipeline is stuck , I don't see any progress beyond 15M because of
>> checkpoints getting failed consistently.
>> 
>> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/n11882/Checkpointing_Failed.png <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/n11882/Checkpointing_Failed.png>>
>> 
>> 
>> 
>> --
>> View this message in context:http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Checkpointing-with-RocksDB-as-statebackend-tp11752p11882.html <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Checkpointing-with-RocksDB-as-statebackend-tp11752p11882.html>
>> Sent from the Apache Flink User Mailing List archive. mailing list archive atNabble.com <http://nabble.com/>.
>> 
>>  
>> 
>>  
>> 
>> If you reply to this email, your message will be added to the discussion below:
>> 
>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Checkpointing-with-RocksDB-as-statebackend-tp11752p11885.html <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Checkpointing-with-RocksDB-as-statebackend-tp11752p11885.html>
>> To start a new topic under Apache Flink User Mailing List archive., email[hidden email] <http://user/SendEmail.jtp?type=node&node=11887&i=1>
>> To unsubscribe from Apache Flink User Mailing List archive., click here.
>> NAML <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>
>>  
>> 
>> View this message in context: Re: Checkpointing with RocksDB as statebackend <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Checkpointing-with-RocksDB-as-statebackend-tp11752p11887.html>
>> Sent from the Apache Flink User Mailing List archive. mailing list archive <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/> at Nabble.com <http://nabble.com/>.
>> 
>>  
>> 
>>  
>> 
>> If you reply to this email, your message will be added to the discussion below:
>> 
>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Checkpointing-with-RocksDB-as-statebackend-tp11752p11891.html <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Checkpointing-with-RocksDB-as-statebackend-tp11752p11891.html>
>> To start a new topic under Apache Flink User Mailing List archive., email [hidden email] 
>> To unsubscribe from Apache Flink User Mailing List archive., click here.
>> NAML <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>
>>  
>> 
>>  
>> 
>>  
>> 
>> View this message in context: Re: Checkpointing with RocksDB as statebackend <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Checkpointing-with-RocksDB-as-statebackend-tp11752p11913.html>
>> Sent from the Apache Flink User Mailing List archive. mailing list archive <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/> at Nabble.com <http://nabble.com/>.
>> 
>>  
>> 
>>  
>> 
>> If you reply to this email, your message will be added to the discussion below:
>> 
>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Checkpointing-with-RocksDB-as-statebackend-tp11752p11943.html <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Checkpointing-with-RocksDB-as-statebackend-tp11752p11943.html>
>> To start a new topic under Apache Flink User Mailing List archive., email [hidden email] <> 
>> To unsubscribe from Apache Flink User Mailing List archive., click here <>.
>> NAML <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>
>>  
>> 
>>  
>> 
>> View this message in context: Re: Checkpointing with RocksDB as statebackend <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Checkpointing-with-RocksDB-as-statebackend-tp11752p11949.html>
>> Sent from the Apache Flink User Mailing List archive. mailing list archive <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/> at Nabble.com <http://nabble.com/>.
>> 
>> 
>> 
>> 
>> 
> 
> 
> 
> 
> If you reply to this email, your message will be added to the discussion below:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Checkpointing-with-RocksDB-as-statebackend-tp11752p12209.html <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Checkpointing-with-RocksDB-as-statebackend-tp11752p12209.html>
> To start a new topic under Apache Flink User Mailing List archive., email [hidden email] <http://user/SendEmail.jtp?type=node&node=12224&i=1> 
> To unsubscribe from Apache Flink User Mailing List archive., click here <>.
> NAML <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>
> 
> View this message in context: Re: Checkpointing with RocksDB as statebackend <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Checkpointing-with-RocksDB-as-statebackend-tp11752p12224.html>
> Sent from the Apache Flink User Mailing List archive. mailing list archive <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/> at Nabble.com.
> 
> 
> 
> If you reply to this email, your message will be added to the discussion below:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Checkpointing-with-RocksDB-as-statebackend-tp11752p12225.html <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Checkpointing-with-RocksDB-as-statebackend-tp11752p12225.html>
> To start a new topic under Apache Flink User Mailing List archive., email [hidden email] <http://user/SendEmail.jtp?type=node&node=12234&i=1> 
> To unsubscribe from Apache Flink User Mailing List archive., click here <>.
> NAML <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>
> 
> View this message in context: Re: Checkpointing with RocksDB as statebackend <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Checkpointing-with-RocksDB-as-statebackend-tp11752p12234.html>
> Sent from the Apache Flink User Mailing List archive. mailing list archive <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/> at Nabble.com.
> 
> 
> 
> If you reply to this email, your message will be added to the discussion below:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Checkpointing-with-RocksDB-as-statebackend-tp11752p12243.html <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Checkpointing-with-RocksDB-as-statebackend-tp11752p12243.html>
> To start a new topic under Apache Flink User Mailing List archive., email [hidden email] <http://user/SendEmail.jtp?type=node&node=12274&i=1> 
> To unsubscribe from Apache Flink User Mailing List archive., click here <>.
> NAML <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>
> 
> View this message in context: Re: Checkpointing with RocksDB as statebackend <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Checkpointing-with-RocksDB-as-statebackend-tp11752p12274.html>
> Sent from the Apache Flink User Mailing List archive. mailing list archive <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/> at Nabble.com.
> 
> 
> 
> If you reply to this email, your message will be added to the discussion below:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Checkpointing-with-RocksDB-as-statebackend-tp11752p12276.html <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Checkpointing-with-RocksDB-as-statebackend-tp11752p12276.html>
> To start a new topic under Apache Flink User Mailing List archive., email [hidden email] <x-msg://1/user/SendEmail.jtp?type=node&node=12425&i=1> 
> To unsubscribe from Apache Flink User Mailing List archive., click here <applewebdata://80CCD40D-2F48-4D8E-B133-23B863524030>.
> NAML <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>
> 
> View this message in context: Re: Checkpointing with RocksDB as statebackend <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Checkpointing-with-RocksDB-as-statebackend-tp11752p12425.html>
> Sent from the Apache Flink User Mailing List archive. mailing list archive <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/> at Nabble.com <http://nabble.com/>.


Re: Checkpointing with RocksDB as statebackend

Posted by vinay patil <vi...@gmail.com>.
Hi Stephan,

Just an update: last week I did a run with a state size close to 18 GB and,
with G1GC enabled, I did not observe the pipeline stalling in between.

I had observed checkpoint failures when the state size was close to 38 GB
(but in that case G1GC was not enabled).
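
In case it helps others, here is a minimal sketch of how G1GC can be
enabled through the "env.java.opts" parameter mentioned earlier in the
thread (the flag below is the standard JVM switch; the jcmd check is just
one way to confirm it took effect, not something EMR-specific):

    # flink-conf.yaml
    env.java.opts: -XX:+UseG1GC

    # confirm on a running TaskManager JVM (substitute the real pid)
    jcmd <taskmanager-pid> VM.flags | grep UseG1GC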

Is it possible to get the RocksDB fix into 1.2.1 so that I can test it out?


Regards,
Vinay Patil

On Sat, Mar 18, 2017 at 12:25 AM, Stephan Ewen [via Apache Flink User
Mailing List archive.] <ml...@n4.nabble.com> wrote:

> @vinay Let's see how fast we can get this fix in - I hope it makes 1.2.1.
> It may also depend a bit on the RocksDB community.
>
> In any case, if it does not make it in, we can do a 1.2.2 release
> immediately after (I think the problem is big enough to warrant that), or
> at least release a custom version of the RocksDB state backend that
> includes the fix.
>
> Stephan
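
For context, a minimal sketch of where the RocksDB state backend is
plugged into a job (the checkpoint path is a placeholder, and the interval
just echoes the 10-minute figure discussed earlier in the thread); a
patched backend version, as suggested above, could be swapped in here
without other pipeline changes:

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    // inside e.g. main(String[] args) throws Exception
    StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();
    // checkpoints go to HDFS, as in the setup described in this thread
    env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"));
    env.enableCheckpointing(10 * 60 * 1000); // 10-minute interval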

Re: Checkpointing with RocksDB as statebackend

Posted by Stephan Ewen <se...@apache.org>.
@vinay Let's see how fast we can get this fix in - I hope it makes 1.2.1.
It may also depend a bit on the RocksDB community.

In any case, if it does not make it in, we can do a 1.2.2 release
immediately after (I think the problem is big enough to warrant that), or
at least release a custom version of the RocksDB state backend that
includes the fix.

Stephan
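
For anyone hitting this before the fix lands, a minimal, self-contained
sketch of the workaround described earlier in the thread: pre-aggregate in
the window with reduce() (or fold()) instead of a plain apply(), so the
per-key value in RocksDB stays small. The tuple type, window size, and
class name are made up for illustration:

    import org.apache.flink.api.common.functions.ReduceFunction;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class ReduceInsteadOfApply {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

            // stand-in for the real (unbounded) Kafka source
            env.fromElements(Tuple2.of("a", 1L), Tuple2.of("a", 2L), Tuple2.of("b", 3L))
                .keyBy(0)
                .timeWindow(Time.minutes(5))
                // reduce() keeps one running aggregate per key and window,
                // while a bare apply() buffers every element in ListState
                .reduce(new ReduceFunction<Tuple2<String, Long>>() {
                    @Override
                    public Tuple2<String, Long> reduce(Tuple2<String, Long> a,
                                                       Tuple2<String, Long> b) {
                        return new Tuple2<>(a.f0, a.f1 + b.f1);
                    }
                })
                .print();

            env.execute("reduce-instead-of-apply");
        }
    }

With an incrementally aggregated window the state per key never grows with
the number of elements, which avoids the RocksDB access pattern that
triggers the slowdown.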


On Fri, Mar 17, 2017 at 5:51 PM, vinay patil <vi...@gmail.com>
wrote:

> Hi Stephan,
>
> Is the performance-related change of RocksDB going to be part of Flink
> 1.2.1?
>
> Regards,
> Vinay Patil
>
> On Thu, Mar 16, 2017 at 6:13 PM, Stephan Ewen [via Apache Flink User
> Mailing List archive.] <[hidden email]
> <http:///user/SendEmail.jtp?type=node&node=12274&i=0>> wrote:
>
>> The only immediate workaround is to use windows with "reduce" or "fold"
>> or "aggregate" and not "apply". And to not use an evictor.
>>
>> The good news is that I think we have a good way of fixing this soon,
>> making an adjustment in RocksDB.
>>
>> For the Yarn / g1gc question: Not 100% sure about that - you can check if
>> it used g1gc. If not, you may be able to pass this through the
>> "env.java.opts" parameter. (cc robert for confirmation)
>>
>> Stephan
>>
>>
>>
>> On Thu, Mar 16, 2017 at 8:31 AM, vinay patil <[hidden email]
>> <http:///user/SendEmail.jtp?type=node&node=12243&i=0>> wrote:
>>
>>> Hi Stephan,
>>>
>>> What can be the workaround for this ?
>>>
>>> Also need one confirmation : Is G1 GC used by default when running the
>>> pipeline on YARN. (I see a thread of 2015 where G1 is used by default for
>>> JAVA8)
>>>
>>>
>>>
>>> Regards,
>>> Vinay Patil
>>>
>>> On Wed, Mar 15, 2017 at 10:32 PM, Stephan Ewen [via Apache Flink User
>>> Mailing List archive.] <[hidden email]
>>> <http:///user/SendEmail.jtp?type=node&node=12234&i=0>> wrote:
>>>
>>>> Hi Vinay!
>>>>
>>>> Savepoints also call the same problematic RocksDB function,
>>>> unfortunately.
>>>>
>>>> We will have a fix next month. We either (1) get a patched RocksDB
>>>> version or we (2) implement a different pattern for ListState in Flink.
>>>>
>>>> (1) would be the better solution, so we are waiting for a response from
>>>> the RocksDB folks. (2) is always possible if we cannot get a fix from
>>>> RocksDB.
>>>>
>>>> Stephan
>>>>
>>>>
>>>> On Wed, Mar 15, 2017 at 5:53 PM, vinay patil <[hidden email]
>>>> <http:///user/SendEmail.jtp?type=node&node=12225&i=0>> wrote:
>>>>
>>>>> Hi Stephan,
>>>>>
>>>>> Thank you for making me aware of this.
>>>>>
>>>>> Yes I am using a window without reduce function (Apply function). The
>>>>> discussion happening on JIRA is exactly what I am observing, consistent
>>>>> failure of checkpoints after some time and the stream halts.
>>>>>
>>>>> We want to go live in next month, not sure how this will affect in
>>>>> production as we are going to get above 200 million data.
>>>>>
>>>>> As a workaround can I take the savepoint while the pipeline is running
>>>>> ? Let's say if I take savepoint after every 30minutes, will it work ?
>>>>>
>>>>>
>>>>>
>>>>> Regards,
>>>>> Vinay Patil
>>>>>
>>>>> On Tue, Mar 14, 2017 at 10:02 PM, Stephan Ewen [via Apache Flink User
>>>>> Mailing List archive.] <[hidden email]
>>>>> <http:///user/SendEmail.jtp?type=node&node=12224&i=0>> wrote:
>>>>>
>>>>>> The issue in Flink is https://issues.apache.org/j
>>>>>> ira/browse/FLINK-5756
>>>>>>
>>>>>> On Tue, Mar 14, 2017 at 3:40 PM, Stefan Richter <[hidden email]
>>>>>> <http:///user/SendEmail.jtp?type=node&node=12209&i=0>> wrote:
>>>>>>
>>>>>>> Hi Vinay,
>>>>>>>
>>>>>>> I think the issue is tracked here: https://github.com/faceb
>>>>>>> ook/rocksdb/issues/1988.
>>>>>>>
>>>>>>> Best,
>>>>>>> Stefan

Re: Checkpointing with RocksDB as statebackend

Posted by vinay patil <vi...@gmail.com>.
Hi Stephan,

Is the performance-related RocksDB change going to be part of Flink 1.2.1?

Regards,
Vinay Patil

On Thu, Mar 16, 2017 at 6:13 PM, Stephan Ewen [via Apache Flink User
Mailing List archive.] <ml...@n4.nabble.com> wrote:

> The only immediate workaround is to use windows with "reduce" or "fold" or
> "aggregate" and not "apply". And to not use an evictor.
>
> The good news is that I think we have a good way of fixing this soon,
> making an adjustment in RocksDB.
>
> For the YARN / G1GC question: not 100% sure about that - you can check if
> it uses G1GC. If not, you may be able to pass this through the
> "env.java.opts" parameter. (cc Robert for confirmation)
>
> Stephan
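
To make that workaround concrete, here is a minimal sketch under stated
assumptions: the Event type, the key field, and the 5-minute window are
hypothetical placeholders, not details from this thread. The point is that
an incremental ReduceFunction keeps a single aggregated value per key and
window in RocksDB, instead of the per-element list that a plain
apply()/WindowFunction (or an evictor) has to buffer.

    import org.apache.flink.api.common.functions.ReduceFunction;
    import org.apache.flink.api.java.functions.KeySelector;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class ReduceInsteadOfApply {

        // Hypothetical event type, only for illustration.
        public static class Event {
            public String key;
            public long count;
            public Event() {}
            public Event(String key, long count) { this.key = key; this.count = count; }
        }

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            env.fromElements(new Event("a", 1), new Event("a", 2), new Event("b", 3))
                .keyBy(new KeySelector<Event, String>() {
                    @Override
                    public String getKey(Event e) { return e.key; }
                })
                .timeWindow(Time.minutes(5))
                // reduce() stores one pre-aggregated record per key and window,
                // so RocksDB never materializes a large list per key the way
                // a window with apply() alone does.
                .reduce(new ReduceFunction<Event>() {
                    @Override
                    public Event reduce(Event a, Event b) {
                        return new Event(a.key, a.count + b.count);
                    }
                })
                .print();

            env.execute("reduce-instead-of-apply");
        }
    }

fold() works the same way for a different result type, and aggregate()
(available from Flink 1.3 on) generalizes both with an AggregateFunction.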

Re: Checkpointing with RocksDB as statebackend

Posted by Robert Metzger <rm...@apache.org>.
Yes, you can change the GC using the env.java.opts parameter.
We are not setting any GC explicitly on YARN, so the JVM default is used.
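
As a hedged illustration (these are standard HotSpot flags, not something
this thread prescribes), selecting G1 explicitly and then verifying it on a
running TaskManager could look like this:

    # flink-conf.yaml -- env.java.opts is forwarded to the JobManager and
    # TaskManager JVMs, on YARN as well as in standalone mode
    env.java.opts: -XX:+UseG1GC

    # verify on a running TaskManager host (<pid> is the JVM process id):
    #   jinfo -flag UseG1GC <pid>    # prints -XX:+UseG1GC or -XX:-UseG1GC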

On Thu, Mar 16, 2017 at 1:50 PM, Stephan Ewen <se...@apache.org> wrote:

> The only immediate workaround is to use windows with "reduce" or "fold" or
> "aggregate" and not "apply". And to not use an evictor.
>
> The good news is that I think we have a good way of fixing this soon,
> making an adjustment in RocksDB.
>
> For the YARN / G1GC question: not 100% sure about that - you can check if
> it uses G1GC. If not, you may be able to pass this through the
> "env.java.opts" parameter. (cc Robert for confirmation)
>
> Stephan
>
>
>
> On Thu, Mar 16, 2017 at 8:31 AM, vinay patil <vi...@gmail.com>
> wrote:
>
>> Hi Stephan,
>>
>> What can be the workaround for this ?
>>
>> Also need one confirmation : Is G1 GC used by default when running the
>> pipeline on YARN. (I see a thread of 2015 where G1 is used by default for
>> JAVA8)
>>
>>
>>
>> Regards,
>> Vinay Patil
>>
>> On Wed, Mar 15, 2017 at 10:32 PM, Stephan Ewen [via Apache Flink User
>> Mailing List archive.] <[hidden email]
>> <http:///user/SendEmail.jtp?type=node&node=12234&i=0>> wrote:
>>
>>> Hi Vinay!
>>>
>>> Savepoints also call the same problematic RocksDB function,
>>> unfortunately.
>>>
>>> We will have a fix next month. We either (1) get a patched RocksDB
>>> version or we (2) implement a different pattern for ListState in Flink.
>>>
>>> (1) would be the better solution, so we are waiting for a response from
>>> the RocksDB folks. (2) is always possible if we cannot get a fix from
>>> RocksDB.
>>>
>>> Stephan
>>>
>>>
>>> On Wed, Mar 15, 2017 at 5:53 PM, vinay patil <[hidden email]
>>> <http:///user/SendEmail.jtp?type=node&node=12225&i=0>> wrote:
>>>
>>>> Hi Stephan,
>>>>
>>>> Thank you for making me aware of this.
>>>>
>>>> Yes I am using a window without reduce function (Apply function). The
>>>> discussion happening on JIRA is exactly what I am observing, consistent
>>>> failure of checkpoints after some time and the stream halts.
>>>>
>>>> We want to go live in next month, not sure how this will affect in
>>>> production as we are going to get above 200 million data.
>>>>
>>>> As a workaround can I take the savepoint while the pipeline is running
>>>> ? Let's say if I take savepoint after every 30minutes, will it work ?
>>>>
>>>>
>>>>
>>>> Regards,
>>>> Vinay Patil
>>>>
>>>> On Tue, Mar 14, 2017 at 10:02 PM, Stephan Ewen [via Apache Flink User
>>>> Mailing List archive.] <[hidden email]
>>>> <http:///user/SendEmail.jtp?type=node&node=12224&i=0>> wrote:
>>>>
>>>>> The issue in Flink is https://issues.apache.org/jira/browse/FLINK-5756
>>>>>
>>>>> On Tue, Mar 14, 2017 at 3:40 PM, Stefan Richter <[hidden email]
>>>>> <http:///user/SendEmail.jtp?type=node&node=12209&i=0>> wrote:
>>>>>
>>>>>> Hi Vinay,
>>>>>>
>>>>>> I think the issue is tracked here: https://github.com/faceb
>>>>>> ook/rocksdb/issues/1988.
>>>>>>
>>>>>> Best,
>>>>>> Stefan
>>>>>>
>>>>>> Am 14.03.2017 um 15:31 schrieb Vishnu Viswanath <[hidden email]
>>>>>> <http:///user/SendEmail.jtp?type=node&node=12209&i=1>>:
>>>>>>
>>>>>> Hi Stephan,
>>>>>>
>>>>>> Is there a ticket number/link to track this, My job has all the
>>>>>> conditions you mentioned.
>>>>>>
>>>>>> Thanks,
>>>>>> Vishnu
>>>>>>
>>>>>> On Tue, Mar 14, 2017 at 7:13 AM, Stephan Ewen <[hidden email]
>>>>>> <http:///user/SendEmail.jtp?type=node&node=12209&i=2>> wrote:
>>>>>>
>>>>>>> Hi Vinay!
>>>>>>>
>>>>>>> We just discovered a bug in RocksDB. The bug affects windows without
>>>>>>> reduce() or fold(), windows with evictors, and ListState.
>>>>>>>
>>>>>>> A certain access pattern in RocksDB starts being so slow after a
>>>>>>> certain size-per-key that it basically brings down the streaming program
>>>>>>> and the snapshots.
>>>>>>>
>>>>>>> We are reaching out to the RocksDB folks and looking for workarounds
>>>>>>> in Flink.
>>>>>>>
>>>>>>> Greetings,
>>>>>>> Stephan
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Mar 1, 2017 at 12:10 PM, Stephan Ewen <[hidden email]
>>>>>>> <http:///user/SendEmail.jtp?type=node&node=12209&i=3>> wrote:
>>>>>>>
>>>>>>>> @vinay  Can you try to not set the buffer timeout at all? I am
>>>>>>>> actually not sure what would be the effect of setting it to a negative
>>>>>>>> value, that can be a cause of problems...
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Feb 27, 2017 at 7:44 PM, Seth Wiesman <[hidden email]
>>>>>>>> <http:///user/SendEmail.jtp?type=node&node=12209&i=4>> wrote:
>>>>>>>>
>>>>>>>>> Vinay,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The bucketing sink performs rename operations during the
>>>>>>>>> checkpoint and if it tries to rename a file that is not yet consistent that
>>>>>>>>> would cause a FileNotFound exception which would fail the checkpoint.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Stephan,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Currently my aws fork contains some very specific assumptions
>>>>>>>>> about the pipeline that will in general only hold for my pipeline. This is
>>>>>>>>> because there were still some open questions that  I had about how to solve
>>>>>>>>> consistency issues in the general case. I will comment on the Jira issue
>>>>>>>>> with more specific.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Seth Wiesman
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *From: *vinay patil <[hidden email]
>>>>>>>>> <http:///user/SendEmail.jtp?type=node&node=12209&i=5>>
>>>>>>>>> *Reply-To: *"[hidden email]
>>>>>>>>> <http:///user/SendEmail.jtp?type=node&node=12209&i=6>" <[hidden
>>>>>>>>> email] <http:///user/SendEmail.jtp?type=node&node=12209&i=7>>
>>>>>>>>> *Date: *Monday, February 27, 2017 at 1:05 PM
>>>>>>>>> *To: *"[hidden email]
>>>>>>>>> <http:///user/SendEmail.jtp?type=node&node=12209&i=8>" <[hidden
>>>>>>>>> email] <http:///user/SendEmail.jtp?type=node&node=12209&i=9>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *Subject: *Re: Checkpointing with RocksDB as statebackend
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi Seth,
>>>>>>>>>
>>>>>>>>> Thank you for your suggestion.
>>>>>>>>>
>>>>>>>>> But if the issue is only related to S3, then why does this happen
>>>>>>>>> when I replace the S3 sink  to HDFS as well (for checkpointing I am using
>>>>>>>>> HDFS only )
>>>>>>>>>
>>>>>>>>> Stephan,
>>>>>>>>>
>>>>>>>>> Another issue I see is when I set env.setBufferTimeout(-1) , and
>>>>>>>>> keep the checkpoint interval to 10minutes, I have observed that nothing
>>>>>>>>> gets written to sink (tried with S3 as well as HDFS), atleast I was
>>>>>>>>> expecting pending files here.
>>>>>>>>>
>>>>>>>>> This issue gets worst when checkpointing is disabled  as nothing
>>>>>>>>> is written.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Vinay Patil
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Feb 27, 2017 at 10:55 PM, Stephan Ewen [via Apache Flink
>>>>>>>>> User Mailing List archive.] <[hidden email]> wrote:
>>>>>>>>>
>>>>>>>>> Hi Seth!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Wow, that is an awesome approach.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> We have actually seen these issues as well and we are looking to
>>>>>>>>> eventually implement our own S3 file system (and circumvent Hadoop's S3
>>>>>>>>> connector that Flink currently relies on):
>>>>>>>>> https://issues.apache.org/jira/browse/FLINK-5706
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Do you think your patch would be a good starting point for that
>>>>>>>>> and would you be willing to share it?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The Amazon AWS SDK for Java is Apache 2 licensed, so that is
>>>>>>>>> possible to fork officially, if necessary...
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Greetings,
>>>>>>>>>
>>>>>>>>> Stephan
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Feb 27, 2017 at 5:15 PM, Seth Wiesman <[hidden email]
>>>>>>>>> <http://user/SendEmail.jtp?type=node&node=11943&i=0>> wrote:
>>>>>>>>>
>>>>>>>>> Just wanted to throw in my 2cts.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I’ve been running pipelines with similar state size using rocksdb
>>>>>>>>> which externalize to S3 and bucket to S3. I was getting stalls like this
>>>>>>>>> and ended up tracing the problem to S3 and the bucketing sink. The solution
>>>>>>>>> was two fold:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 1)       I forked hadoop-aws and have it treat flink as a source
>>>>>>>>> of truth. Emr uses a dynamodb table to determine if S3 is inconsistent.
>>>>>>>>> Instead I say that if flink believes that a file exists on S3 and we don’t
>>>>>>>>> see it then I am going to trust that flink is in a consistent state and S3
>>>>>>>>> is not. In this case, various operations will perform a back off and retry
>>>>>>>>> up to a certain number of times.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2)       The bucketing sink performs multiple renames over the
>>>>>>>>> lifetime of a file, occurring when a checkpoint starts and then again on
>>>>>>>>> notification after it completes. Due to S3’s consistency guarantees the
>>>>>>>>> second rename of file can never be assured to work and will eventually fail
>>>>>>>>> either during or after a checkpoint. Because there is no upper bound on the
>>>>>>>>> time it will take for a file on S3 to become consistent, retries cannot
>>>>>>>>> solve this specific problem as it could take upwards of many minutes to
>>>>>>>>> rename which would stall the entire pipeline. The only viable solution I
>>>>>>>>> could find was to write a custom sink which understands S3. Each writer
>>>>>>>>> will write file locally and then copy it to S3 on checkpoint. By only
>>>>>>>>> interacting with S3 once per file it can circumvent consistency issues all
>>>>>>>>> together.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hope this helps,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Seth Wiesman

Re: Checkpointing with RocksDB as statebackend

Posted by Stephan Ewen <se...@apache.org>.
The only immediate workaround is to use windows with "reduce", "fold", or
"aggregate" instead of "apply", and to not use an evictor.

The good news is that I think we have a good way of fixing this soon,
making an adjustment in RocksDB.

For the YARN / G1 GC question: not 100% sure about that - you can check
whether it uses G1 GC. If not, you may be able to pass this through the
"env.java.opts" parameter. (cc Robert for confirmation)

Stephan



On Thu, Mar 16, 2017 at 8:31 AM, vinay patil <vi...@gmail.com>
wrote:

> Hi Stephan,
>
> What can be the workaround for this?
>
> Also, I need one confirmation: is G1 GC used by default when running the
> pipeline on YARN? (I saw a thread from 2015 where G1 is the default for
> Java 8.)
>
>
>
> Regards,
> Vinay Patil
>

Re: Checkpointing with RocksDB as statebackend

Posted by vinay patil <vi...@gmail.com>.
Hi Stephan,

What can be the workaround for this?

Also, I need one confirmation: is G1 GC used by default when running the
pipeline on YARN? (I saw a thread from 2015 where G1 is the default for
Java 8.)



Regards,
Vinay Patil

On Wed, Mar 15, 2017 at 10:32 PM, Stephan Ewen [via Apache Flink User
Mailing List archive.] <ml...@n4.nabble.com> wrote:

> Hi Vinay!
>
> Savepoints also call the same problematic RocksDB function, unfortunately.
>
> We will have a fix next month. We either (1) get a patched RocksDB version
> or we (2) implement a different pattern for ListState in Flink.
>
> (1) would be the better solution, so we are waiting for a response from
> the RocksDB folks. (2) is always possible if we cannot get a fix from
> RocksDB.
>
> Stephan
>

Re: Checkpointing with RocksDB as statebackend

Posted by vinay patil <vi...@gmail.com>.
@Stephan,

I am not using an explicit Evictor in my code. I will try using the fold
function, if it does not break my existing functionality :)
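
A minimal sketch of the fold-based window (the stream type and field names
are illustrative):

import org.apache.flink.api.common.functions.FoldFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.time.Time;

// events: an illustrative input of type DataStream<Tuple2<String, Long>>
DataStream<Long> sums = events
        .keyBy(0)
        .timeWindow(Time.minutes(5))
        // fold() updates the accumulator element by element, so the window
        // state stays one value per key instead of a buffered element list
        .fold(0L, new FoldFunction<Tuple2<String, Long>, Long>() {
            @Override
            public Long fold(Long acc, Tuple2<String, Long> event) {
                return acc + event.f1;
            }
        });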

@Robert: Thank you for your answer. Yes, I have already tried setting G1 GC
this morning using env.java.opts, and it works.
Which is the recommended GC for a streaming application (running on YARN on
EMR)?

Regards,
Vinay Patil

On Thu, Mar 16, 2017 at 6:36 PM, rmetzger0 [via Apache Flink User Mailing
List archive.] <ml...@n4.nabble.com> wrote:

> Yes, you can change the GC using the env.java.opts parameter.
> We are not setting any GC on YARN.
>
> On Thu, Mar 16, 2017 at 1:50 PM, Stephan Ewen <[hidden email]
> <http:///user/SendEmail.jtp?type=node&node=12244&i=0>> wrote:
>
>> The only immediate workaround is to use windows with "reduce" or "fold"
>> or "aggregate" and not "apply". And to not use an evictor.
>>
>> The good news is that I think we have a good way of fixing this soon,
>> making an adjustment in RocksDB.
>>
>> For the Yarn / g1gc question: Not 100% sure about that - you can check if
>> it used g1gc. If not, you may be able to pass this through the
>> "env.java.opts" parameter. (cc robert for confirmation)
>>
>> Stephan

Re: Checkpointing with RocksDB as statebackend

Posted by Stephan Ewen <se...@apache.org>.
Hi Vinay!

Savepoints also call the same problematic RocksDB function, unfortunately.

We will have a fix next month: either (1) we get a patched RocksDB version,
or (2) we implement a different pattern for ListState in Flink.

(1) would be the better solution, so we are waiting for a response from the
RocksDB folks. (2) is always possible if we cannot get a fix from RocksDB.

Stephan
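
For context, the affected access pattern looks roughly like this (a
hypothetical illustration, not Flink's internal code): the RocksDB backend
implements ListState.add() as a RocksDB merge, so each element adds one
merge operand, and reading the state back must re-merge all of them.

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;

// Inside a rich function on a keyed stream; MyEvent is a placeholder type.
ListState<MyEvent> buffer = getRuntimeContext().getListState(
    new ListStateDescriptor<>("window-buffer", MyEvent.class));

buffer.add(event);                     // cheap: appends one merge operand
Iterable<MyEvent> all = buffer.get();  // degrades as operands accumulate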


On Wed, Mar 15, 2017 at 5:53 PM, vinay patil <vi...@gmail.com>
wrote:

> Hi Stephan,
>
> Thank you for making me aware of this.
>
> Yes, I am using a window without a reduce function (an apply function). The
> discussion on JIRA is exactly what I am observing: consistent failure of
> checkpoints after some time, and then the stream halts.
>
> We want to go live next month, and I am not sure how this will affect
> production, as we expect to process more than 200 million records.
>
> As a workaround, can I take a savepoint while the pipeline is running?
> Say I take a savepoint every 30 minutes; will that work?
>
>
>
> Regards,
> Vinay Patil

Re: Checkpointing with RocksDB as statebackend

Posted by vinay patil <vi...@gmail.com>.
Hi Stephan,

Thank you for making me aware of this.

Yes, I am using a window without a reduce function (an apply function). The
discussion on JIRA is exactly what I am observing: consistent failure of
checkpoints after some time, and then the stream halts.

We want to go live next month, and I am not sure how this will affect
production, as we expect to process more than 200 million records.

As a workaround, can I take a savepoint while the pipeline is running?
Say I take a savepoint every 30 minutes; will that work?



Regards,
Vinay Patil
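
For reference, what I have in mind is roughly the following (a sketch;
<jobId> and the HDFS target directory are placeholders):

# Trigger a savepoint on the running job every 30 minutes.
while true; do
  bin/flink savepoint <jobId> hdfs:///flink/savepoints
  sleep 1800
done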

On Tue, Mar 14, 2017 at 10:02 PM, Stephan Ewen [via Apache Flink User
Mailing List archive.] <ml...@n4.nabble.com> wrote:

> The issue in Flink is https://issues.apache.org/jira/browse/FLINK-5756

Re: Checkpointing with RocksDB as statebackend

Posted by Stephan Ewen <se...@apache.org>.
The issue in Flink is https://issues.apache.org/jira/browse/FLINK-5756
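FLINK-5756 tracks poor RocksDBStateBackend performance when many values
accumulate under one key in ListState. As a minimal, hypothetical sketch of
the state shape the ticket covers (the class and state names below are
illustrative, not taken from the ticket): every add() becomes one RocksDB
merge operand under the current key, and reading the list back gets
expensive once a key holds many operands.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Hypothetical example: keyed ListState that grows without bound for hot keys.
public class BufferingFunction extends RichFlatMapFunction<String, String> {

    private transient ListState<String> buffer;

    @Override
    public void open(Configuration parameters) {
        buffer = getRuntimeContext().getListState(
                new ListStateDescriptor<>("buffer", String.class));
    }

    @Override
    public void flatMap(String value, Collector<String> out) throws Exception {
        buffer.add(value);              // cheap: appends one merge operand
        for (String s : buffer.get()) { // slow path once many operands exist
            out.collect(s);
        }
    }
}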


Re: Checkpointing with RocksDB as statebackend

Posted by Stefan Richter <s....@data-artisans.com>.
Hi Vinay,

I think the issue is tracked here: https://github.com/facebook/rocksdb/issues/1988.

Best,
Stefan
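
The linked RocksDB issue concerns Get() degrading once a key has accumulated
many merge operands. A small, hypothetical RocksJava sketch (the path, key,
and counts are made up) that reproduces the shape of that access pattern:

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class MergeOperandDegradation {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        try (Options options = new Options()
                     .setCreateIfMissing(true)
                     .setMergeOperatorName("stringappend");
             RocksDB db = RocksDB.open(options, "/tmp/rocksdb-merge-test")) {
            byte[] key = "window-contents".getBytes();
            // Pile up merge operands under a single key, the way Flink's
            // RocksDB ListState does for window contents.
            for (int i = 0; i < 100_000; i++) {
                db.merge(key, ("element-" + i).getBytes());
            }
            // This read has to resolve all accumulated operands; in the
            // affected RocksDB versions its cost grows super-linearly with
            // the operand count.
            long start = System.nanoTime();
            db.get(key);
            System.out.printf("get() took %d ms%n",
                    (System.nanoTime() - start) / 1_000_000);
        }
    }
}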



Re: Checkpointing with RocksDB as statebackend

Posted by Vishnu Viswanath <vi...@gmail.com>.
Hi Stephan,

Is there a ticket number or link to track this? My job has all the
conditions you mentioned.

Thanks,
Vishnu


Re: Checkpointing with RocksDB as statebackend

Posted by Stephan Ewen <se...@apache.org>.
Hi Vinay!

We just discovered a bug in RocksDB. The bug affects windows without
reduce() or fold(), windows with evictors, and ListState.

A certain access pattern in RocksDB becomes so slow beyond a certain state
size per key that it effectively stalls both the streaming program and the
snapshots.

We are reaching out to the RocksDB folks and looking for workarounds in
Flink.

Greetings,
Stephan
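
One way to sidestep the affected pattern, sketched under the assumption that
the window aggregation is incremental in nature (the example types, key, and
window size below are made up): use reduce() on the window, so RocksDB keeps
a single accumulated value per key and window instead of one ListState merge
operand per element.

import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

public class IncrementalWindowSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Tuple2<String, Long>> input =
                env.fromElements(Tuple2.of("key", 1L), Tuple2.of("key", 2L));

        // reduce() folds each element into one stored value per key/window,
        // so the window contents never become a long RocksDB merge chain.
        input.keyBy(0)
             .timeWindow(Time.minutes(10))
             .reduce(new ReduceFunction<Tuple2<String, Long>>() {
                 @Override
                 public Tuple2<String, Long> reduce(Tuple2<String, Long> a,
                                                    Tuple2<String, Long> b) {
                     return Tuple2.of(a.f0, a.f1 + b.f1);
                 }
             })
             .print();

        env.execute("incremental-window-sketch");
    }
}

Note that windows with evictors cannot take this shortcut: the evictor forces
Flink to retain every element of the window, which is exactly the buffering
pattern the bug affects.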

