Posted to user@spark.apache.org by Sunita Arvind <su...@gmail.com> on 2016/05/06 16:21:24 UTC

Adhoc queries on Spark 2.0 with Structured Streaming

Hi All,

We are evaluating a few real-time streaming query engines, and Spark is my
personal choice. The addition of adhoc queries is what gets me even more
excited about it; however, the talks I have heard so far only mention the
feature and do not provide details. I need to build a prototype to
ensure it works for our use cases.

Can someone point me to relevant material for this?

regards
Sunita

Re: Adhoc queries on Spark 2.0 with Structured Streaming

Posted by Sunita Arvind <su...@gmail.com>.

Agreed.
Just sharing what I saw,
http://www.slideshare.net/databricks/realtime-spark-from-interactive-queries-to-streaming

http://www.slideshare.net/rxin/the-future-of-realtime-in-spark?next_slideshow=3

The slides claim support for Kafka, files, and databases. However, continuous
SQL will be available only in 2.1 or later.

regards
Sunita


On Fri, May 6, 2016 at 1:06 PM, Michael Malak <mi...@yahoo.com>
wrote:

> At first glance, it looks like the only streaming data sources available
> out of the box from the github master branch are
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
>  and
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/memory.scala .
> Out of the Jira epic for Structured Streaming
> https://issues.apache.org/jira/browse/SPARK-8360 it would seem the
> still-open https://issues.apache.org/jira/browse/SPARK-10815 "API design:
> data sources and sinks" is relevant here.
>
> In short, it would seem the code is not there yet to create a Kafka-fed
> Dataframe/Dataset that can be queried with Structured Streaming; or if it
> is, it's not obvious how to write such code.

Re: Adhoc queries on Spark 2.0 with Structured Streaming

Posted by Sunita Arvind <su...@gmail.com>.

Thanks for the clarification, Michael, and good luck with Spark 2.0. It
really looks promising.

I am especially interested in the adhoc queries aspect. That is probably what
the slides refer to as Continuous SQL. What is the timeframe for the
availability of this functionality?

regards
Sunita

On Fri, May 6, 2016 at 2:24 PM, Michael Armbrust <mi...@databricks.com>
wrote:

> That is a forward-looking design doc and not all of it has been
> implemented yet. With Spark 2.0 the main sources and sinks will be
> file-based, though we hope to quickly expand that now that a lot of
> infrastructure is in place.

Re: Adhoc queries on Spark 2.0 with Structured Streaming

Posted by Michael Armbrust <mi...@databricks.com>.

That is a forward-looking design doc and not all of it has been implemented
yet. With Spark 2.0 the main sources and sinks will be file-based, though
we hope to quickly expand that now that a lot of infrastructure is in place.
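
For readers looking for a starting point, here is a minimal sketch of an adhoc
query over a file-based stream. It assumes the API as it shipped in the Spark
2.0.0 release (readStream / writeStream), which may differ from the master
branch at the time of this thread. The input directory, the schema fields, and
the table name "events" are hypothetical placeholders, not anything taken from
the design doc.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructType}

object AdhocOverFileStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("adhoc-over-file-stream")
      .master("local[2]")
      .getOrCreate()

    // Streaming file sources need an explicit schema.
    // These two fields are hypothetical.
    val schema = new StructType()
      .add("user", StringType)
      .add("bytes", LongType)

    // Treat JSON files arriving in a directory as an unbounded table.
    // "/tmp/events" is a hypothetical directory assumed to exist.
    val events = spark.readStream
      .schema(schema)
      .json("/tmp/events")

    // The memory sink keeps results in an in-memory table whose name
    // is the query name, so it can be queried with ordinary SQL.
    val query = events.writeStream
      .format("memory")
      .queryName("events")
      .outputMode("append")
      .start()

    // Block until the files currently in the directory are processed.
    query.processAllAvailable()

    // Adhoc query against whatever the stream has ingested so far.
    spark.sql("SELECT user, sum(bytes) AS total FROM events GROUP BY user").show()

    query.stop()
    spark.stop()
  }
}
```

The memory sink holds all results on the driver, so this is suited to
prototyping rather than production volumes.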

On Fri, May 6, 2016 at 2:11 PM, Ted Yu <yu...@gmail.com> wrote:

> I was reading
> StructuredStreamingProgrammingAbstractionSemanticsandAPIs-ApacheJIRA.pdf,
> attached to SPARK-8360.
>
> On page 12 there was a mention of .format("kafka"), but I searched the
> codebase and didn't find any occurrence.
>
> FYI

Re: Adhoc queries on Spark 2.0 with Structured Streaming

Posted by Ted Yu <yu...@gmail.com>.

I was reading
StructuredStreamingProgrammingAbstractionSemanticsandAPIs-ApacheJIRA.pdf,
attached to SPARK-8360.

On page 12 there was a mention of .format("kafka"), but I searched the
codebase and didn't find any occurrence.

FYI

On Fri, May 6, 2016 at 1:06 PM, Michael Malak <
michaelmalak@yahoo.com.invalid> wrote:

> At first glance, it looks like the only streaming data sources available
> out of the box from the github master branch are
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
>  and
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/memory.scala .
> Out of the Jira epic for Structured Streaming
> https://issues.apache.org/jira/browse/SPARK-8360 it would seem the
> still-open https://issues.apache.org/jira/browse/SPARK-10815 "API design:
> data sources and sinks" is relevant here.
>
> In short, it would seem the code is not there yet to create a Kafka-fed
> Dataframe/Dataset that can be queried with Structured Streaming; or if it
> is, it's not obvious how to write such code.

Re: Adhoc queries on Spark 2.0 with Structured Streaming

Posted by Michael Malak <mi...@yahoo.com.INVALID>.

At first glance, it looks like the only streaming data sources available
out of the box from the github master branch are
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
and
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/memory.scala .
Out of the Jira epic for Structured Streaming
https://issues.apache.org/jira/browse/SPARK-8360 it would seem the
still-open https://issues.apache.org/jira/browse/SPARK-10815 "API design:
data sources and sinks" is relevant here.

In short, it would seem the code is not there yet to create a Kafka-fed
Dataframe/Dataset that can be queried with Structured Streaming; or if it
is, it's not obvious how to write such code.

From: Anthony May <an...@gmail.com>
To: Deepak Sharma <de...@gmail.com>; Sunita Arvind <su...@gmail.com>
Cc: "user@spark.apache.org" <us...@spark.apache.org>
Sent: Friday, May 6, 2016 11:50 AM
Subject: Re: Adhoc queries on Spark 2.0 with Structured Streaming

Yeah, there isn't even an RC yet, and there is no documentation, but you can work off the code base and test suites:
https://github.com/apache/spark
And this might help:
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/streaming/DataFrameReaderWriterSuite.scala

Re: Adhoc queries on Spark 2.0 with Structured Streaming

Posted by Anthony May <an...@gmail.com>.

Yeah, there isn't even an RC yet, and there is no documentation, but you can
work off the code base and test suites:
https://github.com/apache/spark
And this might help:
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/streaming/DataFrameReaderWriterSuite.scala

On Fri, 6 May 2016 at 11:07 Deepak Sharma <de...@gmail.com> wrote:

> Spark 2.0 has yet to be publicly released.
> I am waiting to get my hands on it as well.
> Please let me know if I can download the source and build Spark 2.0 from
> GitHub.
>
> Thanks
> Deepak

Re: Adhoc queries on Spark 2.0 with Structured Streaming

Posted by Deepak Sharma <de...@gmail.com>.

Spark 2.0 has yet to be publicly released.
I am waiting to get my hands on it as well.
Please let me know if I can download the source and build Spark 2.0 from
GitHub.

Thanks
Deepak

On Fri, May 6, 2016 at 9:51 PM, Sunita Arvind <su...@gmail.com> wrote:

> Hi All,
>
> We are evaluating a few real-time streaming query engines, and Spark is my
> personal choice. The addition of adhoc queries is what gets me even more
> excited about it; however, the talks I have heard so far only mention the
> feature and do not provide details. I need to build a prototype to
> ensure it works for our use cases.
>
> Can someone point me to relevant material for this?
>
> regards
> Sunita
>

-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net