Posted to user@beam.apache.org by Luke Cwik <lc...@google.com> on 2022/04/04 21:46:36 UTC

Re: Need help with designing a beam pipeline for data enrichment

What order of magnitude of keys do we have in total, millions, billions?
How much data do you need to store per key?
Are point queries to the data store efficient?
Does the data store support batch queries efficiently (e.g. select * from
table where key in (...))?
What percentage of keys are unique each day that weren't processed in the
past day, 3 days, 7 days, ...?

My first thought would be to have a streaming pipeline with a stateful DoFn
storing what you need per key, but this really depends on the details of
the pipeline.
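
For illustration, a minimal sketch of what such a stateful DoFn could look
like with the Java SDK - the Entry type and its merge() method are just
placeholders for whatever your records and merge logic are:

    import org.apache.beam.sdk.state.StateSpec;
    import org.apache.beam.sdk.state.StateSpecs;
    import org.apache.beam.sdk.state.ValueState;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;

    // Keeps the latest merged record per key so later elements for the same
    // key can be enriched without querying external storage.
    class EnrichWithHistoryFn extends DoFn<KV<String, Entry>, KV<String, Entry>> {

      @StateId("lastSeen")
      private final StateSpec<ValueState<Entry>> lastSeenSpec = StateSpecs.value();

      @ProcessElement
      public void process(
          @Element KV<String, Entry> element,
          @StateId("lastSeen") ValueState<Entry> lastSeen,
          OutputReceiver<KV<String, Entry>> out) {
        Entry previous = lastSeen.read();                // null if key never seen
        Entry merged = (previous == null)
            ? element.getValue()
            : Entry.merge(previous, element.getValue()); // placeholder merge logic
        lastSeen.write(merged);                          // remember for next time
        out.output(KV.of(element.getKey(), merged));
      }
    }

Keep in mind that state is scoped per key and window, so for a relationship
spanning several years you would run this in the global window and think
about how to garbage collect keys you no longer care about (e.g. with a
timer).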



On Thu, Mar 31, 2022 at 10:25 AM Alexey Romanenko <ar...@gmail.com>
wrote:

> Hi Johannes,
>
> Agree, it looks like a general data processing problem where you need to
> enrich a small dataset with data from a large one. There can be
> different solutions depending on your environment and available resources.
>
> At first glance, your proposal sounds reasonable to me - so it’s mostly a
> question of how to optimise the number of requests needed to fetch the
> required processed entries. In this case, if you have some shard
> information, you can use it to group your input entries into bundles and
> issue one request per bundle, routed to the relevant shard, instead of
> one request per input record.
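>
> For example (just a sketch with the Java SDK; Entry, shardOf(), fetchBatch()
> and merge() below are hypothetical stand-ins for your own record type,
> sharding logic and storage client):
>
>     import java.util.Map;
>     import org.apache.beam.sdk.transforms.DoFn;
>     import org.apache.beam.sdk.transforms.GroupByKey;
>     import org.apache.beam.sdk.transforms.ParDo;
>     import org.apache.beam.sdk.transforms.WithKeys;
>     import org.apache.beam.sdk.values.KV;
>     import org.apache.beam.sdk.values.PCollection;
>     import org.apache.beam.sdk.values.TypeDescriptors;
>
>     // Group the new entries by shard, then issue one batched lookup per
>     // shard group instead of one query per record.
>     PCollection<Entry> enriched = newEntries
>         .apply(WithKeys.of((Entry e) -> shardOf(e))      // hypothetical shard function
>             .withKeyType(TypeDescriptors.strings()))
>         .apply(GroupByKey.create())
>         .apply(ParDo.of(new DoFn<KV<String, Iterable<Entry>>, Entry>() {
>           @ProcessElement
>           public void process(@Element KV<String, Iterable<Entry>> shardGroup,
>                               OutputReceiver<Entry> out) {
>             // One request per shard group instead of one per record.
>             Map<String, Entry> previous =
>                 fetchBatch(shardGroup.getKey(), shardGroup.getValue()); // hypothetical client call
>             for (Entry e : shardGroup.getValue()) {
>               out.output(merge(previous.get(e.getId()), e));            // hypothetical merge
>             }
>           }
>         }));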
>
> Additionally, you may consider using a Bloom filter to pre-check whether a
> needed record was already processed. The advantage is that this data
> structure stays quite small even for a large number of keys and can easily
> be loaded beforehand. Though, obviously, you would need to update it
> with every batch run.
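>
> For instance, with Guava's BloomFilter (the expected size and error rate
> below are just placeholders):
>
>     import com.google.common.hash.BloomFilter;
>     import com.google.common.hash.Funnels;
>     import java.nio.charset.StandardCharsets;
>
>     // Built once per batch run from the set of already processed keys,
>     // then used to skip lookups for keys with no prior record.
>     BloomFilter<String> seenKeys = BloomFilter.create(
>         Funnels.stringFunnel(StandardCharsets.UTF_8),
>         50_000_000,   // expected number of distinct keys (placeholder)
>         0.01);        // acceptable false positive rate (placeholder)
>
>     for (String key : alreadyProcessedKeys) {   // hypothetical source of known keys
>       seenKeys.put(key);
>     }
>
>     // Per new entry:
>     if (seenKeys.mightContain(entry.getId())) {
>       // Possibly processed before (false positives happen) - do the lookup.
>     } else {
>       // Definitely never processed - skip the query entirely.
>     }
>
> A false positive only costs an unnecessary lookup, never a missed merge,
> so the error rate can be tuned purely for cost.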
>
> What kind of storage do you use to keep the already processed data? Does it
> provide O(1) access to a specific record or a group of records?
> What are the approximate sizes of your datasets?
>
> —
> Alexey
>
> > On 31 Mar 2022, at 11:57, Johannes Frey <jf...@data-maniacs.ai> wrote:
> >
> > Hi Everybody,
> >
> > I'm currently facing an issue where I'm not sure how to design it
> > using Apache Beam.
> > I'm batch processing data - about 300k entries per day. After
> > doing some aggregations the results are about 60k entries.
> >
> > The issue that I'm facing now is that the entries from that batch may
> > be related to entries already processed at some time in the past and
> > if they are, I would need to fetch the already processed record from
> > the past and merge it with the new record.
> >
> > To make matters worse, the "window" of that relationship might be
> > several years, so I can't just sideload the last few days' worth of
> > data and catch all the relationships. I would need to load all the
> > already processed entries on each batch run, which doesn't seem like
> > a good idea ;-)
> >
> > I also think that issuing 60k queries to always fetch the relevant
> > related entry from the past for each new entry is not a good idea. I
> > could try to "window" it though and group them by, let's say, 100
> > entries and fire a query to fetch the 100 old entries for the current
> > 100 processed entries... that would at least reduce the number of
> > queries to 60k/100.
> >
> > Are there any other good ways to solve issues like that? I would
> > imagine that those situations should be quite common. Maybe there are
> > some best practices around this issue.
> >
> > It's basically enriching already processed entries with information
> > from new entries.
> >
> > Would be great if someone could point me in the right direction or
> > give me some more keywords that I can google.
> >
> > Thanks and regards
> > Jo
>
>