Posted to user@spark.apache.org by Divya Gehlot <di...@gmail.com> on 2018/01/31 07:26:05 UTC

Spark Structured Streaming for Twitter Streaming data

Hi,
I am exploring Spark Structured Streaming.
When I searched the internet to learn about it, I found that it is mostly
integrated with Kafka or other streaming tools such as Kinesis.
I couldn't find how to use the Spark Streaming API with Twitter streaming
data.
I would be grateful if anybody who has used it, or done some work on this,
could guide me.
Pardon me if I have understood it wrongly.

Thanks,
Divya

Re: Spark Structured Streaming for Twitter Streaming data

Posted by Divya Gehlot <di...@gmail.com>.
Got it. Thanks for the clarification, TD!

On Thu, 1 Feb 2018 at 11:36 AM, Tathagata Das <ta...@gmail.com>
wrote:

> The code uses the format "socket", which is only for text sent over a
> simple socket and is completely different from how the Twitter APIs work.
> So this won't work at all.
> Fundamentally, for Structured Streaming, we have focused only on those
> streaming sources that support record-level offset tracking (e.g. Kafka
> offsets) and replayability, in order to give strong exactly-once
> fault-tolerance guarantees. Hence we have focused on files, Kafka, and
> Kinesis (socket is just for testing, as documented). The Twitter APIs as a
> source do not provide those capabilities, hence we have not focused on
> building one. In general, for such sources (ones that are not perfectly
> replayable), there are two possible solutions.
>
> 1. Build your own source: A quick Google search shows that others in the
> community have attempted to build Structured Streaming sources for Twitter.
> Such a source won't provide the same fault-tolerance guarantees as Kafka,
> etc. However, I don't recommend this now because the DataSource APIs for
> building streaming sources are not public yet and are in flux.
>
> 2. Use Kafka/Kinesis as an intermediate system: Write something simple
> that uses the Twitter APIs directly to read tweets and write them into
> Kafka/Kinesis, and then just read from Kafka/Kinesis.
>
> Hope this helps.
>
> TD

Re: Spark Structured Streaming for Twitter Streaming data

Posted by Tathagata Das <ta...@gmail.com>.
The code uses the format "socket", which is only for text sent over a simple
socket and is completely different from how the Twitter APIs work. So this
won't work at all.
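
To illustrate what the socket source does accept, here is a minimal sketch
along the lines of the word-count example in the Structured Streaming guide.
It assumes a local text server such as "nc -lk 9999" is running; the object
name SocketSourceDemo, the helper words, and the local[2] master are
illustrative choices, not something from the original post.

```scala
import org.apache.spark.sql.SparkSession

object SocketSourceDemo {
  // Splits one line of socket text into words (pure helper, easy to check).
  def words(line: String): Seq[String] =
    line.split("\\s+").filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SocketSourceDemo")
      .master("local[2]") // assumption: a local test run
      .getOrCreate()
    import spark.implicits._

    // The socket source yields one row per line of text received,
    // in a single string column named "value".
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Running word count over everything received so far.
    val counts = lines.as[String]
      .flatMap(line => words(line))
      .groupBy("value")
      .count()

    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```

Nothing here speaks HTTP or OAuth, which is why pointing this source at a
Twitter endpoint cannot work.
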
Fundamentally, for Structured Streaming, we have focused only on those
streaming sources that support record-level offset tracking (e.g. Kafka
offsets) and replayability, in order to give strong exactly-once
fault-tolerance guarantees. Hence we have focused on files, Kafka, and Kinesis
(socket is just for testing, as documented). The Twitter APIs as a source
do not provide those capabilities, hence we have not focused on building one.
In general, for such sources (ones that are not perfectly replayable), there
are two possible solutions.

1. Build your own source: A quick Google search shows that others in the
community have attempted to build Structured Streaming sources for Twitter.
Such a source won't provide the same fault-tolerance guarantees as Kafka, etc.
However, I don't recommend this now because the DataSource APIs for building
streaming sources are not public yet and are in flux.

2. Use Kafka/Kinesis as an intermediate system: Write something simple that
uses the Twitter APIs directly to read tweets and write them into
Kafka/Kinesis, and then just read from Kafka/Kinesis.
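
Option 2 can be sketched roughly as follows. This is only an illustration
under stated assumptions: a Kafka broker at localhost:9092, a topic named
"tweets", the spark-sql-kafka-0-10 package on the classpath, and a tweet feed
already available as an iterator of JSON strings (the Twitter client itself is
left out). The names TweetsViaKafka, producerProps, and forward, and the
checkpoint path, are hypothetical.

```scala
import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.SparkSession

object TweetsViaKafka {
  // Assumptions for illustration: broker address and topic name.
  val Brokers = "localhost:9092"
  val Topic = "tweets"

  // Minimal producer configuration with string keys and values.
  def producerProps(brokers: String): Properties = {
    val props = new Properties()
    props.setProperty("bootstrap.servers", brokers)
    props.setProperty("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.setProperty("value.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props
  }

  // Step 1: forward tweets into Kafka. `tweets` stands in for whatever
  // Twitter client is used; it is assumed to yield one JSON string per tweet.
  def forward(tweets: Iterator[String]): Unit = {
    val producer = new KafkaProducer[String, String](producerProps(Brokers))
    try {
      tweets.foreach { json =>
        producer.send(new ProducerRecord[String, String](Topic, json))
      }
    } finally {
      producer.close()
    }
  }

  // Step 2: read the topic back with Structured Streaming.
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("TweetsViaKafka").getOrCreate()

    val tweets = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", Brokers)
      .option("subscribe", Topic)
      .load()
      .selectExpr("CAST(value AS STRING) AS json")

    val query = tweets.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/tweets-checkpoint") // assumed path
      .start()

    query.awaitTermination()
  }
}
```

Because Kafka keeps offsets and the query checkpoints them, the Spark side can
replay records after a failure even though the Twitter APIs themselves cannot.
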

Hope this helps.

TD

On Wed, Jan 31, 2018 at 7:18 PM, Divya Gehlot <di...@gmail.com>
wrote:

> Hi,
> I see. Does that mean Spark Structured Streaming doesn't work with
> Twitter streams?
> I can see that people have used Kafka or other streaming tools, with Spark
> processing the data in Structured Streaming.
>
> So the code below won't work directly with a Twitter stream until I set up
> Kafka?
>
>> import org.apache.spark.sql.SparkSession
>>
>> val spark = SparkSession
>>   .builder()
>>   .appName("Spark SQL basic example")
>>   .config("spark.some.config.option", "some-value")
>>   .getOrCreate()
>>
>> // For implicit conversions like converting RDDs to DataFrames
>> import spark.implicits._
>>
>> // Read text from socket
>> val socketDF = spark
>>   .readStream
>>   .format("socket")
>>   .option("host", "localhost")
>>   .option("port", 9999)
>>   .load()
>>
>> socketDF.isStreaming    // Returns true for DataFrames that have streaming sources
>>
>> socketDF.printSchema
>
>
> Thanks,
> Divya

Re: Spark Structured Streaming for Twitter Streaming data

Posted by Divya Gehlot <di...@gmail.com>.
Hi,
I see. Does that mean Spark Structured Streaming doesn't work with Twitter
streams?
I can see that people have used Kafka or other streaming tools, with Spark
processing the data in Structured Streaming.

So the code below won't work directly with a Twitter stream until I set up
Kafka?

> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession
>   .builder()
>   .appName("Spark SQL basic example")
>   .config("spark.some.config.option", "some-value")
>   .getOrCreate()
>
> // For implicit conversions like converting RDDs to DataFrames
> import spark.implicits._
>
> // Read text from socket
> val socketDF = spark
>   .readStream
>   .format("socket")
>   .option("host", "localhost")
>   .option("port", 9999)
>   .load()
>
> socketDF.isStreaming    // Returns true for DataFrames that have streaming sources
>
> socketDF.printSchema


Thanks,
Divya

On 1 February 2018 at 10:30, Tathagata Das <ta...@gmail.com>
wrote:

> Hello Divya,
>
> To add further clarification, Apache Bahir does not have any
> Structured Streaming support for Twitter. It only supports Twitter with
> DStreams.
>
> TD

Re: Spark Structured Streaming for Twitter Streaming data

Posted by Tathagata Das <ta...@gmail.com>.
Hello Divya,

To add further clarification, Apache Bahir does not have any Structured
Streaming support for Twitter. It only supports Twitter with DStreams.

TD



On Wed, Jan 31, 2018 at 2:44 AM, vermanurag <an...@fnmathlogic.com>
wrote:

> Twitter functionality is not part of core Spark. We have successfully used
> the following package from Maven Central in the past:
>
> org.apache.bahir:spark-streaming-twitter_2.11:2.2.0
>
> Earlier there used to be a Twitter package under Spark, but I find that it
> has not been updated beyond Spark 1.6:
> org.apache.spark:spark-streaming-twitter_2.11:1.6.0
>
> Anurag
> www.fnmathlogic.com

Re: Spark Structured Streaming for Twitter Streaming data

Posted by vermanurag <an...@fnmathlogic.com>.
Twitter functionality is not part of core Spark. We have successfully used
the following package from Maven Central in the past:

org.apache.bahir:spark-streaming-twitter_2.11:2.2.0

Earlier there used to be a Twitter package under Spark, but I find that it
has not been updated beyond Spark 1.6:
org.apache.spark:spark-streaming-twitter_2.11:1.6.0
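
For reference, a DStream-based sketch using the Bahir package might look like
the following. It assumes the Bahir spark-streaming-twitter artifact listed
above is on the classpath and that twitter4j can find OAuth credentials in its
default configuration (for example the twitter4j.oauth.* system properties).
The object name TwitterDStreamSketch and the hashtags helper are illustrative.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

object TwitterDStreamSketch {
  // Extracts whitespace-separated tokens that start with '#'.
  def hashtags(text: String): Seq[String] =
    text.split("\\s+").filter(_.startsWith("#")).toSeq

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TwitterDStreamSketch")
    val ssc = new StreamingContext(conf, Seconds(10)) // 10-second batches

    // Passing None lets twitter4j pick up credentials from its own
    // default configuration instead of an explicit Authorization object.
    val tweets = TwitterUtils.createStream(ssc, None)

    // Print a sample of the hashtags seen in each batch.
    tweets.flatMap(status => hashtags(status.getText)).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

This uses the DStream API, not Structured Streaming, so it yields a DStream of
twitter4j Status objects and not a streaming DataFrame.
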

Anurag
www.fnmathlogic.com  




--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org