Posted to user@flink.apache.org by Timo Walther <tw...@apache.org> on 2017/10/02 10:07:56 UTC
Re: Enriching data from external source with cache
Hi Derek,
maybe the following talk can give you some ideas on how to do this with
joins and async IO: https://www.youtube.com/watch?v=Do7C4UJyWCM (around
the 17th minute). Basically, you split the stream and wait for an Async
IO result in a downstream operator.
But I think having a transient guava cache is not a bad idea. Since it
is only a cache, it does not need to be checkpointed and can be rebuilt
at any time.
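As a rough illustration of the transient cache idea, here is a minimal sketch. It uses plain JDK classes as a stand-in for the guava cache so that it has no dependencies, and it is not Flink code: the class name TransientLookupCache, the loader function, and the TTL parameter are all made up for this example.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical helper: a non-checkpointed cache in front of an async lookup. */
class TransientLookupCache<K, V> {

    /** A cached (possibly still pending) lookup plus the time it was started. */
    private static final class Entry<V> {
        final CompletableFuture<V> future;
        final long loadedAtMillis;

        Entry(CompletableFuture<V> future, long loadedAtMillis) {
            this.future = future;
            this.loadedAtMillis = loadedAtMillis;
        }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final Function<K, CompletableFuture<V>> loader; // assumed to be non-blocking
    private final long ttlMillis;

    TransientLookupCache(Function<K, CompletableFuture<V>> loader, long ttlMillis) {
        this.loader = loader;
        this.ttlMillis = ttlMillis;
    }

    /** Returns the cached future, or starts a new lookup if missing, expired, or failed. */
    CompletableFuture<V> get(K key) {
        return get(key, System.currentTimeMillis());
    }

    /** Same as get(key), with an explicit clock value to make expiry easy to test. */
    CompletableFuture<V> get(K key, long nowMillis) {
        Entry<V> entry = entries.compute(key, (k, old) -> {
            boolean usable = old != null
                    && nowMillis - old.loadedAtMillis < ttlMillis
                    && !old.future.isCompletedExceptionally();
            // Concurrent requests for the same key share one pending lookup.
            return usable ? old : new Entry<>(loader.apply(k), nowMillis);
        });
        return entry.future;
    }
}
```

In an async function you would presumably keep such an object in a transient field, create it in open(), and call get(key) from asyncInvoke, handing the result over once the future completes. Each parallel instance then has its own cache, which is simply empty again after a restart.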
Implementing your own logic in a ProcessFunction is always an option, but
might require more implementation effort.
Btw. if you feel brave enough, you could also think of contributing a
stateful async IO. It should not be too much effort to make this work.
Regards,
Timo
On 9/29/17 at 8:39 PM, Derek VerLee wrote:
> My basic problem will sound familiar, I think: I need to enrich
> incoming data using a REST call to an external system for slowly
> evolving metadata. Some cache-based lag is acceptable, so to
> reduce load on the external system and to process more efficiently, I
> would like to implement a cache. The cache would be by key, and I am
> already doing a keyBy for the same key in the job.
>
> Please correct me if I'm wrong:
> * Keyed State would be great to store my metadata "cache", Async I/O
> is ideal for pulling from the external system,
> but AsyncFunction can not access keyed state ( "Exception: State is
> not supported in rich async functions.") and operators can not share
> state between them.
>
> This leaves me wondering, since side inputs are not here yet, what the
> best (and perhaps most idiomatic) way to approach my problem is.
>
> I'd rather keep changes to existing systems minimal for this iteration
> and just minimize impact on them during peaks best I can... systemic
> refactoring and re-architecture will be coming soon (so I'm happy to
> hear thoughts on that as well).
>
> Approaches considered:
>
> 1. AsyncFunction with a transient guava cache. Not ideal ... but
> maybe good enough to get by
> 2. Using compound message types (oh, if only java had real algebraic
> data types...) and sending cache miss messages from some
> CacheEnrichmentMapper (keyed) to some AsyncCacheLoader (not keyed),
> which then backfeeds cache updates to the former via iteration ... I
> don't know why this couldn't work, but it feels like a hot mess unless
> there is some way I am not thinking of to do it cleanly
> 3. One user mentioned on a similar thread loading the data in as
> another DataStream and then using joins, but I'm confused about how
> this would work, it seems to me that joins happen on windows, windows
> pertain to (some notion of) time, what would be my notion of time for
> the slow (maybe years old in some cases) meta-data?
> 4. Forget about async I/O
> 5. Implement my own "async i/o" using a process function or
> similar ... is this a valid pattern?