Posted to dev@pinot.apache.org by Sai Boorlagadda <sa...@gmail.com> on 2019/03/26 22:46:08 UTC

Pinot VS Kafka Streams (or Spark SQL)

Hello Devs,

Is there any content or a blog post that explains the difference between
Pinot and Kafka Streams (or Spark SQL)? From what I have read, at a high
level Pinot provides off-the-shelf components that data engineers would
otherwise have to build themselves on top of Kafka Streams. Is that right?

I mean, I could write a consumer that processes events from a stream,
computes the needed analytics/metrics, and stores them in a data store for
serving. My application layer, though, would then have to route each request
to either the streaming analytics data store or the batch analytics data
store.
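To make the do-it-yourself approach above concrete, here is a minimal sketch in Python. The event shape, the in-memory stores, and the combine-on-read routing are all hypothetical stand-ins for a real Kafka consumer, a serving store, and a batch store:

```python
from collections import Counter

# Hypothetical hand-rolled pipeline: consume events, maintain a metric,
# and route queries between a "streaming" store and a "batch" store.

streaming_store = Counter()       # fresh counts folded in from the stream
batch_store = {"page_a": 100}     # pre-computed counts from offline batch jobs

def consume(events):
    """Stand-in for a stream consumer: fold each event into the metric."""
    for event in events:
        streaming_store[event["page"]] += 1

def query(page):
    """The routing logic the application layer has to own itself:
    combine the batch result with the fresh streaming result."""
    return batch_store.get(page, 0) + streaming_store[page]

consume([{"page": "page_a"}, {"page": "page_a"}, {"page": "page_b"}])
print(query("page_a"))  # batch 100 + streaming 2 = 102
```

The point of the sketch is the `query` function: every application that rolls its own pipeline ends up owning that batch-versus-streaming routing code.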

So my understanding is that Pinot abstracts these two stores away from the
application. Is that correct?
Sai

Re: Pinot VS Kafka Streams (or Spark SQL)

Posted by Sai Boorlagadda <sa...@gmail.com>.
Thanks, Seunghyun, for confirming my understanding.


Re: Pinot VS Kafka Streams (or Spark SQL)

Posted by Seunghyun Lee <sn...@apache.org>.
Hi Sai,

Thank you for the mail. Your understanding is correct.

The systems you mentioned (Pinot, Kafka, Spark SQL) are built for different
purposes, and they are all critical components when you need to build a
data analytics pipeline.

Kafka is a pub-sub messaging system that is usually used to build streaming
data pipelines. Spark SQL adds a SQL interface layer on top of Spark; you
can think of it as an offline OLAP query engine, like Hive or Presto, used
for computing complex ad-hoc queries over large data sets.

Pinot aims to support "interactive" analytics use cases (e.g. dashboards and
site-facing reporting, such as "Who's Viewed My Profile" on LinkedIn) where
latency requirements are stricter than in offline reporting use cases.

You mentioned that you could "write a consumer that processes events from a
stream and computes the needed analytic/metric and store it into a data
store to serve". This is basically what Pinot does: it consumes data from a
stream, indexes it for serving, and provides a query interface.
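As an illustration of that single query interface: an application typically sends SQL to a Pinot broker over HTTP. A minimal sketch below, with a hypothetical broker address and table; the `/query/sql` endpoint and `{"sql": ...}` body follow Pinot's SQL query API (older versions used PQL at `/query` instead):

```python
import json

def build_pinot_request(broker_url, sql):
    """Build the HTTP request an application would POST to a Pinot broker.
    Returned as (url, body) so it can be handed to any HTTP client."""
    return (f"{broker_url}/query/sql", json.dumps({"sql": sql}))

url, body = build_pinot_request(
    "http://pinot-broker:8099",  # hypothetical broker address
    "SELECT viewerId, COUNT(*) FROM profileViews "
    "WHERE profileId = 42 GROUP BY viewerId",
)
# The same query works whether the rows were ingested from Kafka in real
# time or pushed from an offline batch job -- the application no longer
# routes between two stores itself.
```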

Below are some references that we have published publicly. Please let us
know if you have any other questions.

Best,
Seunghyun


LinkedIn Blog Posts
https://engineering.linkedin.com/analytics/real-time-analytics-massive-scale-pinot
https://engineering.linkedin.com/blog/2019/03/pinot-joins-apache-incubator

Slides
https://www.slideshare.net/jeanfrancoisim/intro-to-pinot-20160104
https://www.slideshare.net/seunghyunlee1460/pinot-realtime-olap-for-530-million-users-sigmod-2018-107394584
