You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@flink.apache.org by Timo Walther <tw...@apache.org> on 2017/10/02 10:07:56 UTC

Re: Enriching data from external source with cache

Hi Derek,

maybe the following talk can inspire you, how to do this with joins and 
async IO: https://www.youtube.com/watch?v=Do7C4UJyWCM (around the 17th 
min). Basically, you split the stream and wait for an Async IO result in 
a downstream operator.

But I think having a transient guava cache is not a bad idea, since it 
is only a cache it does not need to be checkpointed and can be recovered 
at any time.

Implementing you own logic in a ProcessFunction is always a way, but 
might require more implementation effort.

Btw. if you feel brave enough, you could also think of contributing a 
stateful async IO. It should not be too much effort to make this work.

Regards,
Timo



Am 9/29/17 um 8:39 PM schrieb Derek VerLee:
> My basic problem will sound familiar I think, I need to enrich 
> incoming data using a REST call to an external system for slowly 
> evolving metadata. and some cache based lag is acceptable, so to 
> reduce load on the external system and to process more efficiently, I 
> would like to implement a cache.  The cache would by key, and I am 
> already doing a keyBy for the same key in the job.
>
> Please correct me if I'm wrong:
> * Keyed State would be great to store my metadata "cache", Async I/O 
> is ideal for pulling from the external system,
> but AsyncFunction can not access keyed state ( "Exception: State is 
> not supported in rich async functions.") and operators can not share 
> state between them.
>
> This leaves me wondering, since side inputs are not here yet, what the 
> best (and perhaps most idiomatic) way to approach my problem?
>
> I'd rather keep changes to existing systems minimal for this iteration 
> and just minimize impact on them during peaks best I can... systemic 
> refactoring and re-architecture will be coming soon (so I'm happy to 
> hear thoughts on that as well).
>
> Approaches considered:
>
> 1. AsyncFunction with a transient guava cache.  Not ideal ... but 
> maybe good enough to get by
> 2. Using compound message types (oh, if only java had real algebraic 
> data types...) and send cache miss messages from some 
> CacheEnrichmentMapper (keyed) to some AsyncCacheLoader (not keyed) 
> which then backfeeds cache updates to the former via iteration ... i 
> don't know why this couldn't work but it feels like a hot mess unless 
> there is some way I am not thinking of to do it cleanly
> 3. One user mentioned on a similar thread loading the data in as 
> another DataStream and then using joins, but I'm confused about how 
> this would work, it seems to me that joins happen on windows, windows 
> pertain to (some notion of) time, what would be my notion of time for 
> the slow (maybe years old in some cases) meta-data?
> 4. Forget about async I/O
> 5. implement my own "async i/o" in using a process function or 
> similar  .. is this a valid pattern