Posted to user@flink.apache.org by "Marchant, Hayden " <ha...@citi.com> on 2017/10/02 10:46:08 UTC

In-memory cache

We have an operator in our streaming application that needs to access 'reference data' that is updated by another Flink streaming application. This reference data has roughly 10,000 entries, has a small footprint, and needs to be updated about every 100 ms. The required latency for this application is extremely low (a couple of milliseconds), so we are wary of paying the I/O cost of accessing the reference data remotely. We are currently examining 3 different options for accessing this reference data:

1. Expose the reference data as QueryableState and access it directly from the 'client' streaming operator using the QueryableState API (a rough sketch of such a lookup follows this list)
2. Same as #1, but create an in-memory Java cache of the reference data within the operator that is asynchronously updated at a scheduled frequency using the QueryableState API
3. Output the reference data to Redis, and create an in-memory Java cache of the reference data within the operator that is asynchronously updated at a scheduled frequency using the Redis API.
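
As a rough illustration of the lookup in options 1 and 2, here is a minimal sketch assuming the newer QueryableStateClient API (flink-queryable-state-client-java, Flink 1.4+); the proxy host/port, job id, state name and String value type are all placeholders:

import java.util.concurrent.CompletableFuture;

import org.apache.flink.api.common.JobID;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.queryablestate.client.QueryableStateClient;

public class ReferenceDataLookup {

    public static void main(String[] args) throws Exception {
        // connects to the queryable state proxy serving the reference-data job
        QueryableStateClient client = new QueryableStateClient("qs-proxy-host", 9069);

        // must match the descriptor under which the reference-data job exposed its state
        ValueStateDescriptor<String> descriptor =
                new ValueStateDescriptor<>("reference-data", String.class);

        CompletableFuture<ValueState<String>> future = client.getKvState(
                JobID.fromHexString("00000000000000000000000000000000"), // job id of the reference-data job (placeholder)
                "reference-data",                                        // queryable state name
                "someKey",                                               // the reference-data key to look up
                BasicTypeInfo.STRING_TYPE_INFO,
                descriptor);

        // blocking get just for illustration; in an operator you would handle the future asynchronously
        String value = future.get().value();
        System.out.println("someKey -> " + value);
    }
}

The producing job has to make the state queryable under that name first, e.g. via KeyedStream#asQueryableState("reference-data", descriptor).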

My understanding is that one of the cons of using QueryableState is that if the Flink application that generates the reference data is unavailable, the queryable state will not exist - is that correct?

If we were to use an asynchronously scheduled 'read' from the distributed cache, where should it be done? I was thinking of using ScheduledExecutorService from within the open method of the Flink operator.
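
A minimal sketch of that pattern for option 3, assuming Jedis as the Redis client; the Redis host/port, the hash key "reference-data" and the String-to-String enrichment are placeholders, and the scheduled task could just as well call a QueryableStateClient for option 2:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

import redis.clients.jedis.Jedis;

public class ReferenceDataEnricher extends RichMapFunction<String, String> {

    private transient volatile Map<String, String> referenceData;
    private transient ScheduledExecutorService refresher;
    private transient Jedis jedis;

    @Override
    public void open(Configuration parameters) {
        referenceData = new ConcurrentHashMap<>();
        jedis = new Jedis("redis-host", 6379);   // placeholder host/port
        refresher = Executors.newSingleThreadScheduledExecutor();
        refresher.scheduleAtFixedRate(() -> {
            try {
                // ~10,000 entries, so reloading the whole hash every 100 ms is cheap;
                // only this background thread touches the Jedis connection
                referenceData = new ConcurrentHashMap<>(jedis.hgetAll("reference-data"));
            } catch (Exception e) {
                // a failed refresh keeps the previous snapshot; an uncaught exception
                // would otherwise cancel the scheduled task
            }
        }, 0, 100, TimeUnit.MILLISECONDS);
    }

    @Override
    public String map(String key) {
        // purely in-memory lookup, no remote I/O on the record path
        String ref = referenceData.get(key);
        return key + " -> " + (ref != null ? ref : "<no reference yet>");
    }

    @Override
    public void close() {
        if (refresher != null) refresher.shutdownNow();
        if (jedis != null) jedis.close();
    }
}

Swapping in a freshly built map through a volatile field keeps the per-record lookup lock-free, and keeping Jedis on the single refresh thread matters because Jedis instances are not thread-safe.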

What is the best way to get this done?

Regards,
Hayden Marchant


RE: In-memory cache

Posted by "Marchant, Hayden " <ha...@citi.com>.
Nice idea. Actually, we are looking at connect for other parts of our solution where the latency is less critical.

A few considerations for not using 'connect' in this case were:


1. To isolate the two streams from each other to reduce complexity, simplify debugging, etc. Since we are newbies at Flink, I was thinking it is beneficial to keep each stream as simple as possible and, if need be, interface between them to 'exchange data'.

2. The reference data, even though quite small, is updated every 100 ms. Since we would need this reference data on each 'consuming' operator instance, we would essentially nearly double the number of tuples coming through the operator. Since low latency is key here, this was a concern, the assumption being that the two sides of the 'connect' share the same resources, whereas a background thread updating a 'map' would not be competing with the incoming tuples.

I realize that structurally, connect is a neater solution.

If I can be convinced that my above concerns are unfounded, I’ll be happy to try that direction.

Thanks
Hayden

From: Stavros Kontopoulos [mailto:st.kontopoulos@gmail.com]
Sent: Monday, October 02, 2017 2:24 PM
To: Marchant, Hayden [ICG-IT]
Cc: user@flink.apache.org
Subject: Re: In-memory cache

How about connecting the two streams of data, one from the reference data and one from the main data (I assume you are using keyed streams, since you mention QueryableState), and keeping state locally within the operator?
The idea is to have a local sub-copy of the reference data within the operator that is updated from the source of the reference data. The reference data is still updated externally by that low-latency Flink app. Here is a relevant question: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Accessing-state-in-connected-streams-td8727.html. Would that help?

Stavros





Re: In-memory cache

Posted by Stavros Kontopoulos <st...@gmail.com>.
How about connecting the two streams of data, one from the reference data and one from the main data (I assume you are using keyed streams, since you mention QueryableState), and keeping state locally within the operator?
The idea is to have a local sub-copy of the reference data within the operator that is updated from the source of the reference data. The reference data is still updated externally by that low-latency Flink app. Here is a relevant question: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Accessing-state-in-connected-streams-td8727.html. Would that help?

Stavros
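
For completeness, a minimal sketch of the connected-streams approach described above, under the assumption that both the reference-data stream and the main stream can be keyed by the same reference key; the Tuple2<String, String> element types and all names are illustrative:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

public class ConnectedReferenceDataExample {

    // refStream:  (key, referenceValue) updates, roughly every 100 ms
    // mainStream: (key, payload) events to be enriched
    public static DataStream<String> enrich(DataStream<Tuple2<String, String>> refStream,
                                            DataStream<Tuple2<String, String>> mainStream) {
        return refStream
                .keyBy(t -> t.f0)
                .connect(mainStream.keyBy(t -> t.f0))
                .flatMap(new EnrichFunction());
    }

    public static class EnrichFunction
            extends RichCoFlatMapFunction<Tuple2<String, String>, Tuple2<String, String>, String> {

        private transient ValueState<String> referenceValue;

        @Override
        public void open(Configuration parameters) {
            referenceValue = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("reference-value", String.class));
        }

        // reference-data side: just update the locally held state for this key
        @Override
        public void flatMap1(Tuple2<String, String> refUpdate, Collector<String> out) throws Exception {
            referenceValue.update(refUpdate.f1);
        }

        // main-data side: enrich from the locally held reference value, no remote call
        @Override
        public void flatMap2(Tuple2<String, String> event, Collector<String> out) throws Exception {
            String ref = referenceValue.value();
            out.collect(event.f1 + " enriched with " + (ref != null ? ref : "<no reference yet>"));
        }
    }
}

Each parallel instance then only holds the slice of the reference data for its own keys, and the lookup in flatMap2 is a local state access rather than a remote call.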


