Posted to user@flink.apache.org by John Smith <ja...@gmail.com> on 2019/10/01 15:22:04 UTC

Re: Best way to link static data to event data?

Or what are the advantages/disadvantages?

On Mon., Sep. 30, 2019, 3:12 p.m. John Smith, <ja...@gmail.com>
wrote:

> Hi, so here is what I have done...
>
> 1- I load my CSV using the CSV table source.
> 2- I set up a Kafka stream to read my incoming events.
> 3- I map my events to a POJO.
> 4- I join the two tables.
> 5- I push the joined result to Elasticsearch.
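>
> In code it looks roughly like this (a sketch, not my exact code: the
> topic, path, field names, PhoneEvent, kafkaProps and the Elasticsearch
> sink are placeholders, on the 1.8-era Table API):
>
> StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
> StreamTableEnvironment tableEnv = TableEnvironment.getTableEnvironment(env);
>
> // 1- CSV table source
> tableEnv.registerTableSource("AreaCodes", CsvTableSource.builder()
>         .path("/data/area-codes.csv")
>         .field("areaCode", Types.STRING)
>         .field("city", Types.STRING)
>         .build());
>
> // 2- and 3- Kafka stream, mapped to a POJO
> DataStream<PhoneEvent> events = env
>         .addSource(new FlinkKafkaConsumer<>("phone-events",
>                 new SimpleStringSchema(), kafkaProps))
>         .map(PhoneEvent::fromJson);
> tableEnv.registerDataStream("Events", events);
>
> // 4- join the two tables
> Table joined = tableEnv.sqlQuery(
>         "SELECT e.phoneNumber, a.city FROM Events e " +
>         "JOIN AreaCodes a ON e.areaCode = a.areaCode");
>
> // 5- push the joined result to Elasticsearch
> tableEnv.toAppendStream(joined, Row.class)
>         .addSink(elasticsearchSink); // built with ElasticsearchSink.Builder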
>
> This works absolutely fine. So what's the difference between this and the
> proposed solutions above?
>
>
> On Mon, 30 Sep 2019 at 13:35, John Smith <ja...@gmail.com> wrote:
>
>> OK, thanks. It's basically telephone area codes; they barely ever change.
>>
>> On Mon, 30 Sep 2019 at 06:03, Gaël Renoux <ga...@datadome.co>
>> wrote:
>>
>>> Hi John,
>>>
>>> I've had a similar requirement, and I resorted to simply using a static
>>> cache (I'm coding in Scala, so that's a lazy value on a singleton object -
>>> in Java that would be a static value on some utility class, with a
>>> synchronized lazy-loading getter). The value is reloaded after some
>>> duration, which adds a small latency at regular intervals. Keep in mind
>>> that one instance of that value will be loaded on each task manager
>>> (provided that at least one task running on that task manager calls the
>>> getter).
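>>>
>>> A minimal sketch of that Java variant (class and method names are just
>>> for illustration, and loadCsv() stands in for your own parsing code):
>>>
>>> import java.util.HashMap;
>>> import java.util.Map;
>>>
>>> public final class AreaCodeCache {
>>>     private static final long TTL_MS = 60 * 60 * 1000; // reload hourly
>>>     private static Map<String, String> cache;
>>>     private static long loadedAt;
>>>
>>>     // synchronized lazy-loading getter: loads on first call and
>>>     // reloads once the TTL has expired
>>>     public static synchronized Map<String, String> get() {
>>>         long now = System.currentTimeMillis();
>>>         if (cache == null || now - loadedAt > TTL_MS) {
>>>             cache = loadCsv();
>>>             loadedAt = now;
>>>         }
>>>         return cache;
>>>     }
>>>
>>>     private static Map<String, String> loadCsv() {
>>>         return new HashMap<>(); // replace with your CSV parsing
>>>     }
>>> }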
>>>
>>> If you're OK with restarting the job when your data changes, it would be
>>> better to load it on start (no need to synchronize stuff). Just load it
>>> inside your job initialization code (it will be executed within the job
>>> manager) and pass that data as a parameter to your operator's constructor.
>>> The data format must be serializable.
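>>>
>>> For example (a sketch; the mapper name and loadCsv() are placeholders):
>>>
>>> import java.util.HashMap;
>>> import org.apache.flink.api.common.functions.MapFunction;
>>>
>>> public class AreaCodeMapper implements MapFunction<String, String> {
>>>     // serialized with the operator and shipped to the task managers,
>>>     // which is why the data format must be serializable
>>>     private final HashMap<String, String> areaCodeToCity;
>>>
>>>     public AreaCodeMapper(HashMap<String, String> areaCodeToCity) {
>>>         this.areaCodeToCity = areaCodeToCity;
>>>     }
>>>
>>>     @Override
>>>     public String map(String areaCode) {
>>>         return areaCodeToCity.getOrDefault(areaCode, "unknown");
>>>     }
>>> }
>>>
>>> // in the job initialization code:
>>> stream.map(new AreaCodeMapper(loadCsv()));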
>>>
>>> Gaël
>>>
>>>
>>> On Sat, Sep 28, 2019 at 2:18 AM Sameer Wadkar <sa...@axiomine.com>
>>> wrote:
>>>
>>>> The main consideration in these types of scenarios is not the type of
>>>> source function you use. The key point is how the event operator gets the
>>>> slow-moving master data and caches it, and then recovers it if it fails
>>>> and restarts.
>>>>
>>>> It does not matter that the CSV file does not change often. It is
>>>> possible that the event operator may fail and restart. The CSV data needs
>>>> to be made available to it again.
>>>>
>>>> In that scenario the initial suggestion I made, to pass the CSV data in
>>>> the constructor, is not adequate by itself. You need to store it in the
>>>> operator state, which allows the operator to recover it when it restarts
>>>> on failure.
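>>>>
>>>> Roughly what I mean, using operator state via CheckpointedFunction (the
>>>> class and state names are made up for illustration):
>>>>
>>>> import java.util.HashMap;
>>>> import java.util.Map;
>>>> import org.apache.flink.api.common.functions.RichFlatMapFunction;
>>>> import org.apache.flink.api.common.state.ListState;
>>>> import org.apache.flink.api.common.state.ListStateDescriptor;
>>>> import org.apache.flink.api.common.typeinfo.TypeHint;
>>>> import org.apache.flink.api.common.typeinfo.TypeInformation;
>>>> import org.apache.flink.runtime.state.FunctionInitializationContext;
>>>> import org.apache.flink.runtime.state.FunctionSnapshotContext;
>>>> import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
>>>> import org.apache.flink.util.Collector;
>>>>
>>>> public class CsvHoldingFunction<T> extends RichFlatMapFunction<T, T>
>>>>         implements CheckpointedFunction {
>>>>
>>>>     private Map<String, String> csvData = new HashMap<>();
>>>>     private transient ListState<Map<String, String>> checkpointed;
>>>>
>>>>     @Override
>>>>     public void initializeState(FunctionInitializationContext ctx) throws Exception {
>>>>         checkpointed = ctx.getOperatorStateStore().getListState(
>>>>                 new ListStateDescriptor<>("csv-data",
>>>>                         TypeInformation.of(new TypeHint<Map<String, String>>() {})));
>>>>         if (ctx.isRestored()) {
>>>>             // on restart after a failure, recover the CSV data from
>>>>             // state instead of relying on the constructor argument
>>>>             for (Map<String, String> m : checkpointed.get()) {
>>>>                 csvData = m;
>>>>             }
>>>>         }
>>>>     }
>>>>
>>>>     @Override
>>>>     public void snapshotState(FunctionSnapshotContext ctx) throws Exception {
>>>>         checkpointed.clear();
>>>>         checkpointed.add(csvData);
>>>>     }
>>>>
>>>>     @Override
>>>>     public void flatMap(T value, Collector<T> out) {
>>>>         // enrich the event using csvData here
>>>>         out.collect(value);
>>>>     }
>>>> }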
>>>>
>>>> As long as the above takes place you have resiliency, and you can use
>>>> any suitable method or source. I have not used the Table source as much,
>>>> but connected streams and operator state have worked out for me in similar
>>>> scenarios.
>>>>
>>>> Sameer
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Sep 27, 2019, at 4:38 PM, John Smith <ja...@gmail.com> wrote:
>>>>
>>>> It's a fairly small static file that may update once in a blue moon lol
>>>> But I'm hoping to use existing functions. Why can't I just use the CSV
>>>> table source?
>>>>
>>>> Why should I have to either write my own CSV parser or look for a 3rd
>>>> party one, then put it in a Java Map and look up that map? I'm finding
>>>> Flink to be a bit of death by 1000 paper cuts lol
>>>>
>>>> If I put the CSV in a table, I can then use it to join with the events,
>>>> no?
>>>>
>>>> On Fri, 27 Sep 2019 at 16:25, Sameer W <sa...@axiomine.com> wrote:
>>>>
>>>>> Connected streams are one option, but they may be overkill in your
>>>>> scenario if your CSV does not refresh. If your CSV is small enough (in
>>>>> number of records), you could parse it, load it into a serializable
>>>>> object, and pass it to the constructor of the operator where you will be
>>>>> streaming the data.
>>>>>
>>>>> If the CSV can be made available via a shared network folder (or S3 in
>>>>> the case of AWS), you could also read it in the open function (if you use
>>>>> the Rich versions of the operators).
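>>>>>
>>>>> Something along these lines (a sketch; the path and the CSV layout are
>>>>> assumptions):
>>>>>
>>>>> import java.io.BufferedReader;
>>>>> import java.io.FileReader;
>>>>> import java.util.HashMap;
>>>>> import java.util.Map;
>>>>> import org.apache.flink.api.common.functions.RichMapFunction;
>>>>> import org.apache.flink.configuration.Configuration;
>>>>>
>>>>> public class EnrichWithCsv extends RichMapFunction<String, String> {
>>>>>     private transient Map<String, String> areaCodes;
>>>>>
>>>>>     @Override
>>>>>     public void open(Configuration parameters) throws Exception {
>>>>>         // runs once per parallel instance, on the task manager,
>>>>>         // before any records are processed
>>>>>         areaCodes = new HashMap<>();
>>>>>         try (BufferedReader reader = new BufferedReader(
>>>>>                 new FileReader("/shared/area-codes.csv"))) {
>>>>>             String line;
>>>>>             while ((line = reader.readLine()) != null) {
>>>>>                 String[] parts = line.split(",");
>>>>>                 areaCodes.put(parts[0], parts[1]); // areaCode -> city
>>>>>             }
>>>>>         }
>>>>>     }
>>>>>
>>>>>     @Override
>>>>>     public String map(String areaCode) {
>>>>>         return areaCodes.getOrDefault(areaCode, "unknown");
>>>>>     }
>>>>> }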
>>>>>
>>>>> The real problem, I guess, is how frequently the CSV updates. If you
>>>>> want the updates to propagate in near real time (or on a schedule), option
>>>>> 1 (parse in the driver and send it via the constructor) does not work.
>>>>> Also, in the second option you are responsible for refreshing the file
>>>>> read from the shared folder.
>>>>>
>>>>> In that case, use connected streams, where one stream periodically
>>>>> re-reads the file and sends it down the stream while the other stream
>>>>> reads the events. The refresh interval is your tolerance for stale data
>>>>> in the CSV.
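>>>>>
>>>>> The file-reading side can be a simple custom source along these lines
>>>>> (a sketch; the class name, path and CSV layout are made up):
>>>>>
>>>>> import java.nio.file.Files;
>>>>> import java.nio.file.Paths;
>>>>> import java.util.HashMap;
>>>>> import java.util.Map;
>>>>> import org.apache.flink.streaming.api.functions.source.SourceFunction;
>>>>>
>>>>> public class CsvRefreshSource implements SourceFunction<Map<String, String>> {
>>>>>     private final String path;
>>>>>     private final long intervalMs; // your staleness tolerance
>>>>>     private volatile boolean running = true;
>>>>>
>>>>>     public CsvRefreshSource(String path, long intervalMs) {
>>>>>         this.path = path;
>>>>>         this.intervalMs = intervalMs;
>>>>>     }
>>>>>
>>>>>     @Override
>>>>>     public void run(SourceContext<Map<String, String>> ctx) throws Exception {
>>>>>         while (running) {
>>>>>             Map<String, String> snapshot = new HashMap<>();
>>>>>             for (String line : Files.readAllLines(Paths.get(path))) {
>>>>>                 String[] parts = line.split(",");
>>>>>                 snapshot.put(parts[0], parts[1]);
>>>>>             }
>>>>>             ctx.collect(snapshot); // re-emit the whole file each time
>>>>>             Thread.sleep(intervalMs);
>>>>>         }
>>>>>     }
>>>>>
>>>>>     @Override
>>>>>     public void cancel() { running = false; }
>>>>> }
>>>>>
>>>>> You would then connect this stream (broadcast) with the event stream
>>>>> and keep the latest map in a CoFlatMapFunction.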
>>>>>
>>>>> On Fri, Sep 27, 2019 at 3:49 PM John Smith <ja...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I don't think I need state for this...
>>>>>>
>>>>>> I need to load a CSV, I'm guessing as a table, and then filter my
>>>>>> events, parse the number, transform the event into geolocation data, and
>>>>>> sink that to the downstream data source.
>>>>>>
>>>>>> So I'm guessing I need a CSV source and my Kafka source, and somehow
>>>>>> join those and transform the event...
>>>>>>
>>>>>> On Fri, 27 Sep 2019 at 14:43, Oytun Tez <oy...@motaword.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> You should look at the broadcast state pattern in the Flink docs.
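>>>>>>>
>>>>>>> The gist of it is something like this (a sketch only; the csvRows and
>>>>>>> phoneNumbers streams, the (areaCode, city) tuple layout and the
>>>>>>> 3-digit prefix parsing are assumptions):
>>>>>>>
>>>>>>> MapStateDescriptor<String, String> desc = new MapStateDescriptor<>(
>>>>>>>         "areaCodes", Types.STRING, Types.STRING);
>>>>>>>
>>>>>>> // csvRows is a DataStream<Tuple2<String, String>> of (areaCode, city)
>>>>>>> BroadcastStream<Tuple2<String, String>> lookup = csvRows.broadcast(desc);
>>>>>>>
>>>>>>> phoneNumbers.connect(lookup).process(
>>>>>>>         new BroadcastProcessFunction<String, Tuple2<String, String>, String>() {
>>>>>>>             @Override
>>>>>>>             public void processElement(String phone, ReadOnlyContext ctx,
>>>>>>>                     Collector<String> out) throws Exception {
>>>>>>>                 String area = phone.substring(0, 3);
>>>>>>>                 out.collect(phone + " -> " + ctx.getBroadcastState(desc).get(area));
>>>>>>>             }
>>>>>>>
>>>>>>>             @Override
>>>>>>>             public void processBroadcastElement(Tuple2<String, String> row,
>>>>>>>                     Context ctx, Collector<String> out) throws Exception {
>>>>>>>                 // every parallel instance receives each CSV row
>>>>>>>                 ctx.getBroadcastState(desc).put(row.f0, row.f1);
>>>>>>>             }
>>>>>>>         });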
>>>>>>>
>>>>>>> ---
>>>>>>> Oytun Tez
>>>>>>>
>>>>>>> *M O T A W O R D*
>>>>>>> The World's Fastest Human Translation Platform.
>>>>>>> oytun@motaword.com — www.motaword.com
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Sep 27, 2019 at 2:42 PM John Smith <ja...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Using 1.8
>>>>>>>>
>>>>>>>> I have a list of phone area codes, cities and their geo locations in
>>>>>>>> a CSV file. And my events from Kafka contain phone numbers.
>>>>>>>>
>>>>>>>> I want to parse the phone number, get its area code, and then
>>>>>>>> associate the phone number with a city and geo location, as well as
>>>>>>>> count how many numbers are in that city/geo location.
>>>>>>>>
>>>>>>>
>>>
>>> --
>>> Gaël Renoux
>>> Senior R&D Engineer, DataDome
>>> M +33 6 76 89 16 52
>>> E gael.renoux@datadome.co
>>> W www.datadome.co
>>>
>>