Posted to dev@spark.apache.org by Sam Elamin <hu...@gmail.com> on 2017/02/15 19:34:26 UTC

Structured Streaming Spark Summit Demo - Databricks people

Hey folks

This one is mainly aimed at the Databricks folks. I have been trying to
replicate the CloudTrail demo <https://www.youtube.com/watch?v=IJmFTXvUZgY>
Michael did at Spark Summit. The code for it can be found here
<https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/8599738367597028/2070341989008532/3601578643761083/latest.html>

My question is: how did you get the results to be displayed and updated
continuously in real time?

I am also using Databricks to duplicate it, but I noticed the code link
mentions:

 "If you count the number of rows in the table, you should find the value
increasing over time. Run the following every few minutes."
This leads me to believe that the version of Databricks that Michael was
using for the demo is still not released, or at least the functionality to
display those changes in real time isn't.

Is this the case? Or am I completely wrong?

Can I display the results of a Structured Streaming query in real time using
the Databricks "display" function?


Regards
Sam

Re: Structured Streaming Spark Summit Demo - Databricks people

Posted by Sam Elamin <hu...@gmail.com>.
Fair enough, you're absolutely right.

Thanks for pointing me in the right direction.

On Wed, 15 Feb 2017 at 20:13, Nicholas Chammas <ni...@gmail.com>
wrote:

> I don't think this is the right place for questions about Databricks. I'm
> pretty sure they have their own website with a forum for questions about
> their product.
>
> Maybe this? https://forums.databricks.com/

Re: Structured Streaming Spark Summit Demo - Databricks people

Posted by Chris Fregly <ch...@fregly.com>.
Just be warned: the last time I asked a question about a non-working Databricks Keynote Demo from Spark Summit on the forum mentioned here, they deleted my question! And I’m a major contributor to those forums!!

Oftentimes, those on-stage demos don’t actually work until many months after they’re presented on stage - especially the proprietary demos involving dbutils() and display().

Chris Fregly
Research Scientist @ PipelineIO
Founder @ Advanced Spark and TensorFlow Meetup
San Francisco - Chicago - Washington DC - London

On Feb 15, 2017, 12:14 PM -0800, Nicholas Chammas <ni...@gmail.com>, wrote:
> I don't think this is the right place for questions about Databricks. I'm pretty sure they have their own website with a forum for questions about their product.
>
> Maybe this? https://forums.databricks.com/

Re: Structured Streaming Spark Summit Demo - Databricks people

Posted by Nicholas Chammas <ni...@gmail.com>.
I don't think this is the right place for questions about Databricks. I'm
pretty sure they have their own website with a forum for questions about
their product.

Maybe this? https://forums.databricks.com/

Re: Structured Streaming Spark Summit Demo - Databricks people

Posted by Sam Elamin <hu...@gmail.com>.
Thanks Michael, it really was a great demo.

I figured I needed to add a trigger to display the results. But Buraz from
Databricks mentioned here
<https://forums.databricks.com/questions/10925/structured-streaming-in-real-time.html#comment-10929>
that the display for this functionality won't be available until potentially
the next release of Databricks 2.1-db3.

I'll take your points into account and try to duplicate it.

Apologies if this isn't the forum for the question. I'm happy to take the
question offline, but I genuinely believe the mailing list users might find
it very interesting.

Happy to take the discussion offline though :)



On Thu, Feb 16, 2017 at 8:30 PM, Michael Armbrust <mi...@databricks.com>
wrote:

> Thanks for your interest in Apache Spark Structured Streaming!

Re: Structured Streaming Spark Summit Demo - Databricks people

Posted by Michael Armbrust <mi...@databricks.com>.
Thanks for your interest in Apache Spark Structured Streaming!

There is nothing secret in that demo, though I did make some configuration
changes in order to get the timing right (gotta have some dramatic effect
:) ).  Also, I think the visualizations based on metrics output by the
StreamingQueryListener
<https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/streaming/StreamingQueryListener.html>
are still being rolled out, but should be available everywhere soon.

First, I set two options to make sure that files were read one at a time,
thus allowing us to see incremental results.

spark.readStream
  .option("maxFilesPerTrigger", "1")
  .option("latestFirst", "true")
...

There is more detail on how these options work in this post
<https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html>.
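
For reference, a fuller read might look roughly like the sketch below. Only
the two options come from the message; the object name, the schema fields,
and the choice of a JSON source are assumptions made for illustration
(file-based streaming sources need a schema supplied up front).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

object SlowFileStream {
  // The two options from the message: one file per micro-batch, newest files first.
  val sourceOptions: Map[String, String] = Map(
    "maxFilesPerTrigger" -> "1",
    "latestFirst" -> "true"
  )

  // Hypothetical schema, not the one used in the demo.
  val schema: StructType = new StructType()
    .add("eventName", StringType)
    .add("eventTime", TimestampType)

  // Builds a streaming DataFrame over JSON files under inputPath.
  def read(spark: SparkSession, inputPath: String): DataFrame =
    spark.readStream
      .schema(schema)
      .options(sourceOptions)
      .json(inputPath)
}
```

With one file per trigger, each micro-batch adds only a small amount of data,
which is what makes the incremental updates visible.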

Regarding continually updating the result of a streaming query using
display(df) for streaming DataFrames (i.e. ones created with
spark.readStream), that has worked in Databricks since Spark 2.1.  The
longer-form example we published requires you to rerun the count at the end
of the notebook to see it change because that is not a streaming query.
Instead, it is a batch query over data that has been written out by another
stream.  I'd like to add the ability to run a streaming query from data that
has been written out by the FileSink (tracked here: SPARK-19633
<https://issues.apache.org/jira/browse/SPARK-19633>).
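
The distinction between the streaming write and the batch count can be
sketched as follows. This is an illustration under assumptions, not the
notebook's code: the object name, method names, and path parameters are
hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.StreamingQuery

object ParquetSinkThenBatchCount {
  // Streaming query: keeps appending incoming rows to Parquet files.
  def startSink(events: DataFrame, outputPath: String, checkpointPath: String): StreamingQuery =
    events.writeStream
      .format("parquet")
      .option("path", outputPath)
      .option("checkpointLocation", checkpointPath)
      .start()

  // Batch query: a one-off snapshot of whatever has been written so far.
  // It never updates by itself, so it must be rerun to see the count grow.
  def countSoFar(spark: SparkSession, outputPath: String): Long =
    spark.read.parquet(outputPath).count()
}
```

Calling countSoFar every few minutes while the sink is running should return
an increasing number, which matches the instruction quoted from the notebook.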

In the demo, I started two different streaming queries:
 - one that reads from JSON / Kafka => writes to Parquet
 - one that reads from JSON / Kafka => writes to the memory sink
   <http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks>
   / pushes the latest answer to the JS running in a browser using the
   StreamingQueryListener
   <https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/streaming/StreamingQueryListener.html>.

This is packaged up nicely in display(), but there is nothing stopping you
from building something similar with vanilla Apache Spark.
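
One possible shape for the second query in vanilla Spark is sketched below.
It is not how display() is implemented: the table name, the `publish`
callback (a stand-in for whatever pushes data to the browser), and the
assumption that `counts` is an aggregated streaming DataFrame are all
hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.{StreamingQuery, StreamingQueryListener}
import org.apache.spark.sql.streaming.StreamingQueryListener.{
  QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

object LiveResults {
  // Hypothetical name for both the query and its in-memory table.
  val tableName = "live_counts"

  // Keeps the full aggregated result in an in-memory table queryable by name.
  // "complete" output mode assumes `counts` contains an aggregation.
  def startMemorySink(counts: DataFrame): StreamingQuery =
    counts.writeStream
      .format("memory")
      .queryName(tableName)
      .outputMode("complete")
      .start()

  // Hands the latest result rows (as JSON strings) to `publish` after each
  // progress update of the query named tableName.
  def attachListener(spark: SparkSession, publish: Seq[String] => Unit): Unit =
    spark.streams.addListener(new StreamingQueryListener {
      override def onQueryStarted(event: QueryStartedEvent): Unit = ()
      override def onQueryProgress(event: QueryProgressEvent): Unit =
        if (event.progress.name == tableName) {
          publish(spark.table(tableName).toJSON.collect().toSeq)
        }
      override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
    })
}
```

Collecting the table inside the listener is only reasonable for small result
sets; a larger result would need a different hand-off to the front end.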

Michael

