Posted to user@drill.apache.org by kant kodali <ka...@gmail.com> on 2017/11/09 12:30:30 UTC

Can Apache Drill perform streaming queries?

Hi All,

I am new to Apache Drill and am wondering whether it can perform
streaming queries. For example, I have a constant stream of data over a
24-hour period, and I would like to get updates as soon as the data arrives.

Do I need to have a polling thread that issues a Drill query every second?

Thanks!

Re: Can Apache Drill perform streaming queries?

Posted by kant kodali <ka...@gmail.com>.
Hi Saurabh,

Yes, those concepts do exist in Spark SQL, and Spark in general is awesome,
but what Spark SQL lacks is a REST interface where a user can submit normal
or streaming queries and get the results back. Right now, a user has to
write imperative code to achieve whatever they want, and whenever
requirements change (for example, when new queries are added), one needs to
go back and change the Spark code again. So it's not as simple as submitting
a new query via REST. I don't see any engine that can do this as of today.
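
For comparison, Drill itself accepts ordinary (batch, not streaming) SQL
over its REST API by POSTing a JSON body to /query.json. The following is a
minimal Python sketch of that submission step. The host, the port (8047 is
assumed to be the Drillbit web port), and the helper names are assumptions
for illustration, not part of this thread.

```python
import json
import urllib.request

# Assumption: a Drillbit web server reachable on localhost at port 8047.
DRILL_URL = "http://localhost:8047/query.json"


def build_drill_request(sql, url=DRILL_URL):
    """Build (but do not send) a POST request carrying one SQL query."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit(sql, url=DRILL_URL, timeout=30):
    """Send the query to Drill and return the decoded JSON response."""
    request = build_drill_request(sql, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Changing the query then means changing only the SQL string passed to
submit(); the result is still a one-shot answer, not a continuous stream.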

Thanks!

Re: Can Apache Drill perform streaming queries?

Posted by Saurabh Mahapatra <sa...@gmail.com>.
Hi Anil,

I think the start and end offset feature would be very useful. Interestingly,
KSQL (Kafka SQL) has the concept of a tumbling window that is measured in
seconds.

https://www.confluent.io/product/ksql/

I think we should have that concept as well because an end user will not
know what an offset really means unless they have deep knowledge of the
guts of the stream itself.
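
To make the tumbling-window idea concrete, here is a small Python sketch
that assigns event timestamps to fixed, non-overlapping windows measured in
seconds. It is only an illustration of the concept; the function names are
hypothetical and this is not how KSQL or Drill implement it.

```python
from collections import defaultdict


def tumbling_window_start(event_ts, size_seconds):
    """Return the start (epoch seconds) of the fixed window containing event_ts."""
    return event_ts - (event_ts % size_seconds)


def count_per_window(event_timestamps, size_seconds):
    """Count events per non-overlapping window, keyed by window start."""
    counts = defaultdict(int)
    for ts in event_timestamps:
        counts[tumbling_window_start(ts, size_seconds)] += 1
    return dict(counts)
```

The user only chooses a window size in seconds; no knowledge of stream
offsets is needed to interpret the result.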

Yep, incremental updates with windowing seem to be the right semantics. As
a user, I expect the data (complete or incomplete) to flow through this "SQL
transformer", and I should get a real-time view of the transformed data.
The heavier the SQL workload, the greater the latency between the input and
the transformed output.

Does continuous update mean you will introduce trigger semantics in Drill?

By the way, the above ideas seem to exist in Spark SQL (Structured
Streaming). It also seems to have the concept of an event-time window:

https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#operations-on-streaming-dataframesdatasets

Confluent's claim that they are turning the database inside out through
streams seems self-serving, because from an analytics standpoint the
accuracy of the data depends on whether the data in the stream itself is
complete, i.e. the SQL transformer is a time-based function operating on an
event stream. Data is typically defined as complete by the time it enters
the data warehouse.

Best,
Saurabh



Re: Can Apache Drill perform streaming queries?

Posted by AnilKumar B <ak...@gmail.com>.
You are correct, Kant.

It would be great if you could raise a JIRA to discuss the *feasibility* of
incremental query support in Drill. I can also see this being a very good
requirement for plugins like Kafka, HBase, and Cassandra. Thanks for asking
this question.

Thanks & Regards,
B Anil Kumar.

Re: Can Apache Drill perform streaming queries?

Posted by kant kodali <ka...@gmail.com>.
Hi Anil,

Thanks a lot for your response. It looks like I am indeed looking for
incremental queries. So if I have a thread that polls every second to get
the latest updates, I just have to change the partition values to minimize
the scans, right?
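
As a rough Python sketch of "changing the partition values" on each poll:
given the time of the previous poll and the current time, list every hourly
partition that may contain new data. The date/hour directory naming is an
assumption about how the Spark job lays out HDFS, and the function name is
hypothetical.

```python
from datetime import datetime, timedelta


def partitions_to_scan(last_poll, now):
    """Return (date, hour) directory values for every hour touched since last_poll."""
    cursor = last_poll.replace(minute=0, second=0, microsecond=0)
    parts = []
    while cursor <= now:
        parts.append((cursor.strftime("%Y-%m-%d"), cursor.strftime("%H")))
        cursor += timedelta(hours=1)
    return parts
```

Most polls return a single partition; a poll that straddles an hour boundary
returns two, so rows written just before the boundary are not missed.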

Also, I guess I can build some notification mechanism in case my older
partitions receive an update.

Thanks!




Re: Can Apache Drill perform streaming queries?

Posted by AnilKumar B <ak...@gmail.com>.
Hi Kant,

If I understand your questions properly, you are looking for incremental
queries.

Drill supports predicate pushdown for most data sources. In your case,
suppose you are generating hourly partitions in HDFS using a Spark
application. Drill is then optimized to scan only the specific partitions
that match the query predicates (by using partition pruning); see, for
example, https://issues.apache.org/jira/browse/DRILL-3121.
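
A minimal Python sketch of a query that lets partition pruning apply. It
assumes the files live under <table_path>/<YYYY-MM-DD>/<HH>/, so that
Drill's implicit dir0 and dir1 columns correspond to the date and hour
directories. The table path and the function name are hypothetical.

```python
from datetime import datetime


def hourly_partition_query(table_path, hour):
    """Build a Drill query restricted to one hourly directory.

    Assumes a <table_path>/<YYYY-MM-DD>/<HH>/ layout, so dir0 is the date
    directory and dir1 is the hour directory.
    """
    day = hour.strftime("%Y-%m-%d")
    hh = hour.strftime("%H")
    return (
        f"SELECT * FROM dfs.`{table_path}` "
        f"WHERE dir0 = '{day}' AND dir1 = '{hh}'"
    )
```

Because the predicates are on directory columns, the planner can skip every
other hourly directory rather than scanning the whole table on each poll.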

But Drill will not manage any checkpointing. So if BI/dashboard tools like
Tableau can support this checkpointing, then it's possible to query Drill
incrementally.

Coming to the latest Kafka storage plugin: in the first version we are
targeting batch support. I mean, at query time it will fetch all the
messages from the start to the end offsets of each topic partition and
process the data. Currently it will support JSON, and in the next version
we are targeting Avro support with a schema registry. We are also discussing
the feasibility of specifying start and end offset ranges, so that we can
achieve incremental support by managing checkpointing externally.
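
One way to picture "managing checkpointing externally" is a small store,
kept by the client, that remembers the next offset to read for each topic
partition. The Python sketch below is purely illustrative: the class, its
methods, and the JSON file are hypothetical, and it does not use any Drill
or Kafka API.

```python
import json
import os
import tempfile


class OffsetCheckpoint:
    """Persist the next offset to read for each (topic, partition)."""

    def __init__(self, path):
        self.path = path
        self.offsets = {}
        if os.path.exists(path):
            with open(path) as f:
                self.offsets = json.load(f)

    @staticmethod
    def _key(topic, partition):
        return f"{topic}:{partition}"

    def next_range(self, topic, partition, latest_offset):
        """Return the half-open offset range [start, end) still to be processed."""
        start = self.offsets.get(self._key(topic, partition), 0)
        return start, max(start, latest_offset)

    def commit(self, topic, partition, end_offset):
        """Record end_offset as the new start, replacing the file atomically."""
        self.offsets[self._key(topic, partition)] = end_offset
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(self.offsets, f)
        os.replace(tmp, self.path)
```

A client would call next_range() before each query, restrict the query to
that range, and call commit() only after the results were handled, so a
crash between the two steps re-reads data instead of skipping it.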

Thanks & Regards,
B Anil Kumar.

Re: Can Apache Drill perform streaming queries?

Posted by kant kodali <ka...@gmail.com>.
Can someone elaborate on what happens underneath if I poll every second
(specifically with regard to the questions in my previous email)?

Thanks!

Re: Can Apache Drill perform streaming queries?

Posted by Ted Dunning <te...@gmail.com>.
Confluent has a non-Apache product, I think, for streaming SQL.


Re: Can Apache Drill perform streaming queries?

Posted by Saurabh Mahapatra <sm...@mapr.com>.
Isn't there the new Kafka plugin? What exactly does it do?

Best,
Saurabh

Re: Can Apache Drill perform streaming queries?

Posted by kant kodali <ka...@gmail.com>.
Hi Tug,

It's Parquet data on HDFS, and the data is constantly written to HDFS by
Spark while it consumes from Kafka.

Is polling a common technique for, say, a real-time analytics dashboard?
More importantly, if I poll, does Drill do the scan every time? If the
answer is no, how does it know which data is new, since the data is written
to HDFS constantly as a stream? (The query can be the same; however, the new
data will be appended or updated in HDFS in Parquet format as a stream.)

Thanks!

Re: Can Apache Drill perform streaming queries?

Posted by Tugdual Grall <tu...@gmail.com>.
Hello,


Today Drill cannot run continuous/streaming queries, so, as you mentioned,
you will have to use a polling technique.
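
A minimal Python sketch of such a polling loop. The run_query callable is an
assumption: it stands for any zero-argument function that sends the SQL to
Drill and returns a list of rows. The max_polls and sleep parameters are
hypothetical conveniences so the loop can be bounded and tested.

```python
import time


def poll(run_query, handle_rows, interval_seconds=1.0, max_polls=None, sleep=time.sleep):
    """Call run_query() repeatedly, passing any returned rows to handle_rows.

    run_query is any zero-argument callable returning a list of rows.
    max_polls=None polls forever; the number of polls made is returned.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        rows = run_query()
        if rows:
            handle_rows(rows)
        polls += 1
        sleep(interval_seconds)
    return polls
```

Each iteration is an independent query, so Drill re-plans and re-scans every
time; keeping the scanned data small is up to the query's predicates.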


Just out of curiosity, which data source are you planning to use?

Regards
Tug



