Posted to users@datasketches.apache.org by Will Lauer <wl...@verizonmedia.com> on 2021/04/08 15:55:54 UTC

Re: [E] Re: Choice of Flink vs Spark for using DataSketches with streaming data

Last time I looked at the Flink API for implementing aggregators, it looked
like it required a "decrement" function to remove entries from the
aggregate in addition to the standard "aggregate" function to add entries
to the aggregate. The documentation was unclear, but it looked like this
was a requirement for getting streaming, windowed operations to work. I
don't know if you are going to need that functionality for your Kappa
architecture, but you should double check, as implementing such
functionality with sketches may be impossible, and might pose a blocker for
you.
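[Editor's note: to illustrate why a decrement is generally impossible on sketches, here is a self-contained toy example. It uses a Bloom-filter-style bitset, not DataSketches code; the class and its two "hash" functions are invented for the demo. Once items are folded into a lossy summary, bits are shared between items, so clearing one item's bits corrupts others.]

```java
import java.util.BitSet;

// Toy Bloom filter: add() sets bits derived from an item.
// There is no sound delete(): clearing one item's bits can also
// clear bits that another item depends on, corrupting the summary.
class ToyBloomFilter {
    private final BitSet bits = new BitSet(16);

    // Two simple, deterministic "hash" functions for the demo only.
    private int[] positions(String item) {
        return new int[] { item.length() % 16, item.charAt(0) % 16 };
    }

    void add(String item) {
        for (int p : positions(item)) bits.set(p);
    }

    boolean mightContain(String item) {
        for (int p : positions(item)) if (!bits.get(p)) return false;
        return true;
    }

    // A naive "decrement" -- clearing the item's bits -- is unsound,
    // because those bits may be shared with other items.
    void naiveRemove(String item) {
        for (int p : positions(item)) bits.clear(p);
    }
}
```

Here "ab" and "xy" share the length-derived bit, so naively removing "ab" also makes the filter forget "xy".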

It's possible that this has been resolved now. It's equally possible that I
completely misunderstood the Flink docs, but it's something that jumped out
at me and made me very nervous about reimplementing our current Pig-based
pipeline in Flink due to our use of sketches.

Will



Will Lauer

Senior Principal Architect, Audience & Advertising Reporting
Data Platforms & Systems Engineering

M 508 561 6427
1908 S. First St
Champaign, IL 61822




On Thu, Apr 8, 2021 at 3:43 AM Alex Garland <ag...@expediagroup.com>
wrote:

> Thanks all very much for the responses so far. Definitely useful but I
> think it might help to narrow focus if I explain a little more context of
> what we are trying to do.
>
>
>
> Firstly, we want to emit the profile metrics as a stream (Kafka topic) as
> well, which I assume would mean we wouldn’t want to use Druid (which is
> more in the spirit of a next-gen/ low-latency analytics DB if I understand
> correctly?)
>
>
>
> We are definitely interested in Flink as it looks like this may be a good
> route to create a Kappa architecture with a single set of code handling
> profiling of batch and stream data sets. Appreciate some of the following
> may be a bit more about Flink than DataSketches per se but will post for
> the record.
>
>
>
> I started looking at the Table/ SQL API as this seems to be something that
> is being encouraged for Kappa use cases. It looked like the required
> interface for user-defined aggregate functions in Flink SQL should allow
> wrapping of the Sketch objects as accumulators, but when we tried this in
> practice we hit issues: Flink can't extract a data type for CpcSketch, at
> least partly due to it having private fields (e.g. the seed).
>
>
>
> We’re next looking at whether this is easier using the DataStreams API, if
> anyone can confirm the following it would be useful:
>
>    - Would I be right in thinking that where other people have integrated
>    Flink and DataSketches it has been using DataStreams API?
>    - Are there any good code examples publicly available (GitHub?) that
>    might help steer/ validate our approach?
>
>
>
> In the longer term (later this year), one option we might consider is
> creating an OSS configurable library/ framework for running checks based on
> DataSketches in Flink (we also need to see whether for example Bullet
> already covers a lot of what we need in terms of setting up stream
> queries). If anyone else feels there is a gap and might be interested in
> collaborating, please let me know and I can publish more details of what
> we’re proposing if and when that evolves.
>
>
>
> Many thanks
>
>
>
>
>
> *From: *Marko Mušnjak <ma...@gmail.com>
> *Date: *Tuesday, 6 April 2021 at 20:21
> *To: *users@datasketches.apache.org <us...@datasketches.apache.org>
> *Subject: *[External] Re: Choice of Flink vs Spark for using DataSketches
> with streaming data
>
> Hi,
>
> I've implemented jobs using DataSketches in Kafka Streams, Flink
> streaming, and in Spark batch (through the Hive UDFs provided). Things went
> smoothly in all setups, with the gotcha that the Hive UDFs represent incoming
> strings as UTF-8 byte arrays (or something like that; I forget by now), so
> if you're mixing sketches from two sources (Kafka Streams + Spark batch in
> my case) you have to take care to cast the input items to the proper types
> before adding them to sketches.
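[Editor's note: the encoding gotcha above generalizes. The snippet below is a stdlib illustration, not the DataSketches hashing code itself: the same logical string yields different byte arrays under different encodings, so a sketch fed raw bytes by one pipeline and differently-encoded values by another sees two distinct items and over-counts.]

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EncodingGotcha {
    // The raw bytes a downstream sketch would be updated with.
    static byte[] asBytes(String s, Charset cs) {
        return s.getBytes(cs);
    }

    // True only if two pipelines using these two encodings would feed the
    // sketch identical bytes for the same logical value.
    static boolean sameUpdate(String s) {
        return Arrays.equals(asBytes(s, StandardCharsets.UTF_8),
                             asBytes(s, StandardCharsets.ISO_8859_1));
    }
}
```

Pure-ASCII values agree across these encodings; anything outside ASCII (accents, non-Latin scripts) does not, which is exactly when mixed-source sketches silently diverge.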
>
>
>
> A mailing list thread concerning that issue:
> http://mail-archives.apache.org/mod_mbox/datasketches-users/202008.mbox/browser
> (thread continues into September)
>
>
>
> Regards,
>
> Marko
>
>
>
> On Tue, 6 Apr 2021 at 20:55, Jon Malkin <jm...@apache.org> wrote:
>
> I'll echo what Ben said -- if a pre-existing solution does what you need,
> certainly use that.
>
>
>
> Having said that, I want to revisit Frequent Directions in light of the
> work Charlie did on using it for ridge regression. And when I asked
> internally I was told that Flink is where at least my company seems to be
> going for such jobs. So when I get a chance to dive into that, I'll be
> learning how to do it in Flink.
>
>
>
>   jon
>
>
>
> On Tue, Apr 6, 2021 at 11:26 AM Ben Krug <be...@imply.io> wrote:
>
> I can't answer about Spark or Flink, but as a Druid person, I'll put in a
> plug for Druid for the "if necessary" case.  It can ingest from Kafka and
> aggregate and do sketches during ingestion.  (It's a whole new ballpark,
> though, if you're not already using it.)
>
>
>
> On Tue, Apr 6, 2021 at 9:56 AM Alex Garland <ag...@expediagroup.com>
> wrote:
>
> Hi
>
>
>
> New to DataSketches and looking forward to using it; it seems like a great
> library.
>
>
>
> My team are evaluating it to profile streaming data (in Kafka) in 5-minute
> windows. The obvious options for stream processing (given experience within
> our org) would be either Flink or Spark Streaming.
>
>
>
> Two questions:
>
>    - Would I be right in thinking that there are not existing
>    integrations as libraries for either of these platforms? Absolutely fine if
>    not, just confirming understanding.
>    - Is there any view (from either the maintainers or the wider
>    community) on whether either of those two are easier to integrate with
>    DataSketches? We would also consider other streaming platforms if
>    necessary, but as mentioned, wider usage within the org would steer us
>    away from that if at all possible.
>
>
>
> Many thanks
>
>

Re: [External] Re: [E] Re: Choice of Flink vs Spark for using DataSketches with streaming data

Posted by Alex Garland <ag...@expediagroup.com>.
Thanks Will and Marko

I don’t think we need to decrement/retract values for any reason, and our requirements, were we to use Flink SQL, would not currently involve the OVER syntax.

As of today, we’ve managed to get the DataSketches CPC sketch integrated with an aggregate function in the vanilla Java (DataStreams) Flink API, which at least gets things working for streaming use cases, if not completely smoothly for Kappa. We will be doing some further testing to confirm that when we properly distribute the process (versus a local integration test), serialization, for example, doesn’t throw up any surprises.

Given that the above seems to work, I think (still TBC) the issue with using it in Flink SQL may be around missing TypeInformation and/or hard constraints on what types of classes can be used as accumulators in the SQL abstraction.
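[Editor's note: one workaround sometimes used for this kind of type-extraction problem (an assumption here, not something this thread confirms) is to keep the SQL accumulator as a plain byte[], a type any engine can handle, and (de)serialize the sketch around each call. The stand-in below uses Java serialization of a HashSet in place of CpcSketch.toByteArray()/CpcSketch.heapify() so it runs without Flink or DataSketches on the classpath; the class and method names are invented for the sketch.]

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.HashSet;

// "Serialize the sketch into a byte[] accumulator" pattern: the engine only
// ever sees byte[], so it never needs TypeInformation for the sketch class.
class BytesAccumulator {
    static byte[] empty() { return toBytes(new HashSet<String>()); }

    static byte[] add(byte[] acc, String value) {
        HashSet<String> s = fromBytes(acc);   // stand-in for CpcSketch.heapify(acc)
        s.add(value);                         // stand-in for sketch.update(value)
        return toBytes(s);                    // stand-in for sketch.toByteArray()
    }

    static long result(byte[] acc) { return fromBytes(acc).size(); }

    static byte[] toBytes(HashSet<String> s) {
        try (ByteArrayOutputStream bos = new ByteArrayOutputStream();
             ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(s);
            oos.flush();
            return bos.toByteArray();
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    @SuppressWarnings("unchecked")
    static HashSet<String> fromBytes(byte[] b) {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(b))) {
            return (HashSet<String>) ois.readObject();
        } catch (IOException | ClassNotFoundException e) { throw new RuntimeException(e); }
    }
}
```

The trade-off is a serialize/deserialize round trip per row, which may matter at high throughput.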

I will try to find time to update this thread with further testing outcomes (and also if we do eventually get the SQL approach working).


From: Marko Mušnjak <ma...@gmail.com>
Date: Thursday, 8 April 2021 at 17:35
To: users@datasketches.apache.org <us...@datasketches.apache.org>
Subject: [External] Re: [E] Re: Choice of Flink vs Spark for using DataSketches with streaming data
The basic streaming windowed aggregations (in the Java/Scala API, https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/stream/operators/windows.html#aggregatefunction) don't require the retract method, but it looks like the SQL/Table API requires retract support for aggregate functions, if they need to be used in OVER windows: https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/functions/udfs.html#mandatory-and-optional-methods

Marko


Re: [E] Re: Choice of Flink vs Spark for using DataSketches with streaming data

Posted by Marko Mušnjak <ma...@gmail.com>.
The basic streaming windowed aggregations (in the Java/Scala API,
https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/stream/operators/windows.html#aggregatefunction)
don't require the retract method, but it looks like the SQL/Table API
requires retract support for aggregate functions, if they need to be used
in OVER windows:
https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/functions/udfs.html#mandatory-and-optional-methods
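[Editor's note: the distinction Marko points to shows up in the shape of the interface itself. Below is a self-contained mirror of the four-method contract of Flink's org.apache.flink.api.common.functions.AggregateFunction<IN, ACC, OUT>; note there is no retract(). An exact HashSet stands in for a CPC sketch so the example runs without Flink or DataSketches on the classpath; in a real job, createAccumulator() would return a new CpcSketch, add() would call sketch.update(value), and merge() would go through a CpcUnion (an assumption about the wiring, not code from this thread).]

```java
import java.util.HashSet;
import java.util.Set;

// Local mirror of the contract of Flink's AggregateFunction<IN, ACC, OUT>.
// Windowed DataStream aggregation only ever adds to and merges accumulators,
// which is exactly the set of operations sketches support.
interface MiniAggregateFunction<IN, ACC, OUT> {
    ACC createAccumulator();
    ACC add(IN value, ACC acc);
    OUT getResult(ACC acc);
    ACC merge(ACC a, ACC b);
}

// Stand-in implementation: exact distinct count via HashSet instead of a
// CpcSketch, so the example is runnable standalone.
class DistinctCount implements MiniAggregateFunction<String, Set<String>, Long> {
    public Set<String> createAccumulator() { return new HashSet<>(); }
    public Set<String> add(String value, Set<String> acc) { acc.add(value); return acc; }
    public Long getResult(Set<String> acc) { return (long) acc.size(); }
    public Set<String> merge(Set<String> a, Set<String> b) { a.addAll(b); return a; }
}
```

Because merge() only needs to combine two accumulators, this contract maps directly onto sketch unions; a retract() (as the Table/SQL API requires for OVER windows) has no sketch counterpart.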


Marko
