Posted to user@spark.apache.org by tencas <di...@gmail.com> on 2017/04/15 20:12:09 UTC

Join streams Apache Spark

Hi everybody,

 I am using Apache Spark Streaming with a TCP connector to receive data.
I have a Python application that connects to a sensor and creates a TCP
server that waits for a connection from Apache Spark, then sends JSON data
through that socket.

How can I join many independent sensor sources so that they all send data to
the same receiver in Apache Spark?

Thanks.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Join-streams-Apache-Spark-tp28603.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Join streams Apache Spark

Posted by saulshanabrook <s....@gmail.com>.
Actually, I just ran it in a Docker image. But the point is that it doesn't
need to run in the JVM, because it runs as a separate process. Your Java (or
any other client) code sends messages to it over TCP, and it relays them to
Spark.


Re: Join streams Apache Spark

Posted by tencas <di...@gmail.com>.
Yep, I mean the first script you posted. So can you compile it to Java
binaries, for example? OK, I have no idea about Go.





Re: Join streams Apache Spark

Posted by saulshanabrook <s....@gmail.com>.
The script I wrote in Go? No, I don't, but it's very easy to compile it for
whatever platform you are running on! It doesn't need to be written in the
same language as the rest of your code.


Re: Join streams Apache Spark

Posted by tencas <di...@gmail.com>.
There is a Spark Streaming example of the classic word count using the
Apache Kafka connector:

https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java

(maybe you already know it)

The point is: what are the benefits of using Kafka instead of a lighter
solution like yours? Maybe somebody could help us. Anyway, when I try it out,
I'll give you feedback.

On the other hand, do you have, by any chance, the same script written in
Scala, Python or Java?







Re: Join streams Apache Spark

Posted by saulshanabrook <s....@gmail.com>.
I would love to hear how it goes if you try it out. I was also considering
that. I recently switched to using the file-based streaming input. I made
another Go script
<https://github.com/saulshanabrook/ici.recorder/blob/fd8110e490691cc9e98dce1fefbddba973c29deb/server/main.go>
that lets me connect over TCP and writes each line it receives to a new
file in a folder. Then Spark can read them from that folder.
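[Editor's note: a rough Python sketch of the same TCP-to-files idea, for readers who don't want to read Go. This is illustrative only; the directory name and port are placeholders, and Saul's actual implementation is the Go script linked above.]

```python
import os
import socketserver
import time
import uuid

def write_line_to_new_file(line, out_dir):
    """Persist one record as its own file. Write under a hidden temp name
    first and rename when complete, so a directory watcher such as Spark's
    textFileStream never sees a half-written file (Spark's file stream
    ignores names that start with a dot)."""
    os.makedirs(out_dir, exist_ok=True)
    name = "%d-%s" % (time.time_ns(), uuid.uuid4().hex)
    tmp = os.path.join(out_dir, "." + name)
    final = os.path.join(out_dir, name)
    with open(tmp, "w", encoding="utf-8") as f:
        f.write(line + "\n")
    os.rename(tmp, final)
    return final

class LineToFileHandler(socketserver.StreamRequestHandler):
    """One instance per TCP connection; every received line becomes a file."""
    out_dir = "stream-in"  # the folder Spark would watch

    def handle(self):
        for raw in self.rfile:
            line = raw.decode("utf-8").rstrip("\n")
            if line:
                write_line_to_new_file(line, self.out_dir)

# To run the relay (blocks forever):
#   socketserver.ThreadingTCPServer(("0.0.0.0", 9999), LineToFileHandler).serve_forever()
# and on the Spark side:
#   lines = ssc.textFileStream("stream-in")
```

The temp-name-then-rename step matters because Spark's file stream picks up a file as soon as it appears in the directory listing.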


Re: Join streams Apache Spark

Posted by tencas <di...@gmail.com>.
Thanks @saulshanabrook, I'll have a look at it.

I think Apache Kafka could be an alternative solution, but I haven't checked
it yet.





Re: Join streams Apache Spark

Posted by saulshanabrook <s....@gmail.com>.
I wrote a server in Go
<https://gist.github.com/saulshanabrook/ce65baf5d82460b655e1232b9fe796d3>
that accepts many TCP connections for incoming data on one port and writes
each line to the client listening on another port. One environment variable
sets the port clients should connect to in order to send data to Spark (the
sensors in your case), and another sets the port Spark should connect to in
order to listen for data.

If anyone knows a simpler way of doing this, by using some existing
software, I would love to know about it.

If you are interested in this code, I would be happy to clean it up and
release it with some documentation.
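[Editor's note: for the gist of the fan-in pattern without reading Go, here is a minimal Python sketch. It is my own illustration, not Saul's Gist; ports are passed as arguments rather than read from environment variables.]

```python
import queue
import socket
import socketserver
import threading

lines = queue.Queue()  # shared fan-in buffer for all sensor connections

class SensorHandler(socketserver.StreamRequestHandler):
    """Each connected sensor pushes its raw lines onto the shared queue."""
    def handle(self):
        for raw in self.rfile:
            lines.put(raw)

def _pump_to_spark(listener):
    """Accept a single consumer (Spark's socket receiver) and replay every
    queued line to it, whichever sensor it came from."""
    conn, _ = listener.accept()
    while True:
        conn.sendall(lines.get())

def start_relay(sensor_port, spark_port):
    """Listen for many sensors on one port and one Spark client on another."""
    sensors = socketserver.ThreadingTCPServer(("0.0.0.0", sensor_port),
                                              SensorHandler)
    threading.Thread(target=sensors.serve_forever, daemon=True).start()
    spark_listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    spark_listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    spark_listener.bind(("0.0.0.0", spark_port))
    spark_listener.listen(1)
    threading.Thread(target=_pump_to_spark, args=(spark_listener,),
                     daemon=True).start()
    return sensors, spark_listener
```

On the Spark side nothing changes: `ssc.socketTextStream(host, spark_port)` connects to the relay exactly as it would to a single sensor.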





Re: Join streams Apache Spark

Posted by Gourav Sengupta <go...@gmail.com>.
On another note, you might want to try Flume first if you are just in the
exploration phase. The advantage of Flume (using push) is that you do not
need to write any additional program in order to sink or write your data to
any target system. I am not quite sure how well Flume works with Spark
Streaming (theoretically it should).

On the other hand, Kafka and its integration with Spark are described here:
https://docs.databricks.com/spark/latest/structured-streaming/kafka.html


Regards,
Gourav Sengupta

Re: Join streams Apache Spark

Posted by tencas <di...@gmail.com>.
Hi scorpio,

 thanks for your reply.
I don't understand your approach. Is it possible to receive data from
different clients through the same port on Spark?

Surely I'm confused, and I'd appreciate your opinion.

Regarding the word count example from the Spark Streaming documentation,
Spark acts as a client that connects to a remote server in order to receive
data:

// Create a DStream that will connect to hostname:port, like localhost:9999
JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

Then you create a dummy server using nc to accept connection requests from
Spark and to send data:

nc -lk 9999

So, with this implementation, since Spark plays the role of TCP client, you'd
need to manage the join of the external sensor streams (all with the same
schema, by the way) in your own server.
How would you make Spark act as a "sink" that can receive different source
streams through the same port?
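[Editor's note: to make the client role concrete, here is a plain-Python illustration of what `socketTextStream` does conceptually. This is not Spark's actual receiver code; the small server stands in for `nc -lk`.]

```python
import socket
import threading

def start_dummy_server(payload):
    """Stand-in for `nc -lk <port>`: accept one connection and send some
    newline-delimited text. Returns the port it is listening on."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", 0))  # 0 = let the OS pick a free port
    srv.listen(1)

    def run():
        conn, _ = srv.accept()
        conn.sendall(payload)
        conn.close()

    threading.Thread(target=run, daemon=True).start()
    return srv.getsockname()[1]

def receive_lines(host, port, n):
    """What Spark's socket receiver does conceptually: connect as a TCP
    client and read newline-delimited records from the server."""
    with socket.create_connection((host, port)) as s:
        f = s.makefile()
        return [f.readline().rstrip("\n") for _ in range(n)]
```

Because the receiver is the connecting side, the server (your Python application) is the side that has to merge the sensors into the single stream it serves, which is why the join has to happen on the server side.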