Posted to user@spark.apache.org by Nilesh Chakraborty <ni...@nileshc.com> on 2014/06/10 14:05:52 UTC

Performance of Akka or TCP Socket input sources vs HDFS: Data locality in Spark Streaming

Hello!

Spark Streaming supports HDFS as input source, and also Akka actor
receivers, or TCP socket receivers.
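
For context, this is roughly how those three sources are wired up against a
StreamingContext (Spark 1.0-style API; the host, port and paths below are
just placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("input-sources-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // HDFS source: picks up new files appearing in a monitored directory
    val fromHdfs   = ssc.textFileStream("hdfs://namenode:8020/data/incoming")

    // TCP socket source: lines of text read from a listening socket
    val fromSocket = ssc.socketTextStream("some-host", 9999)

    // Akka actor receiver: takes a Props for a receiver actor (sketched below)
    // val fromActors = ssc.actorStream[String](Props[ResultReceiver], "result-receiver")

    ssc.start()
    ssc.awaitTermination()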

For my use case I think it's probably more convenient to read the data
directly from actors, because I already need to set up a multi-node Akka
cluster (on the same nodes that Spark runs on) and write some actors to
perform some parallel operations. Writing actor receivers that consume the
results of my business-logic actors and feed them into Spark is pretty
seamless. Note that the actors generate a large amount of data (a few GBs to
tens of GBs).
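
Something like this is what I have in mind for the receiver side (just a
sketch, assuming the Spark 1.x ActorHelper API; the class and message type
are made up):

    import akka.actor.Actor
    import org.apache.spark.streaming.receiver.ActorHelper

    // Hypothetical receiver: business-logic actors send it their results,
    // and it hands each record over to Spark Streaming via store().
    class ResultReceiver extends Actor with ActorHelper {
      def receive = {
        case result: String => store(result)
      }
    }

    // Hooked up on the Spark side with something like:
    //   val results = ssc.actorStream[String](Props[ResultReceiver], "result-receiver")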

The other option would be to set up HDFS on the same cluster as Spark, write
the data from the actors to HDFS, and then use HDFS as the input source for
Spark Streaming. Does this result in better performance due to data locality
(with HDFS data replication turned on)? I think performance should be almost
the same as with actors, since Spark workers local to the worker actors
should get the data fast, and I assume some optimization like this is
already done?

I suppose the only benefit with HDFS would be better fault tolerance, and
the ability to checkpoint and recover even if the master fails.
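
By checkpointing I just mean pointing the StreamingContext at a
fault-tolerant directory, something like:

    // Checkpoint directory should live on a fault-tolerant filesystem like HDFS
    ssc.checkpoint("hdfs://namenode:8020/spark/checkpoints")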

Cheers,
Nilesh



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Performance-of-Akka-or-TCP-Socket-input-sources-vs-HDFS-Data-locality-in-Spark-Streaming-tp7317.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Performance of Akka or TCP Socket input sources vs HDFS: Data locality in Spark Streaming

Posted by Nilesh Chakraborty <ni...@nileshc.com>.
Hey Michael,

Thanks for the great reply! That clears things up a lot. The idea about
Apache Kafka sounds very interesting; I'll look into it. The multiple
consumers and fault tolerance sound awesome. That's probably what I need.

Cheers,
Nilesh



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Performance-of-Akka-or-TCP-Socket-input-sources-vs-HDFS-Data-locality-in-Spark-Streaming-tp7317p7320.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Performance of Akka or TCP Socket input sources vs HDFS: Data locality in Spark Streaming

Posted by Michael Cutler <mi...@tumra.com>.
Hey Nilesh,

Great to hear you're using Spark Streaming. In my opinion, the crux of your
question comes down to what you want to do with the data in the future
and/or whether there is utility in using it from more than one
Spark/Streaming job.

1). *One-time-use, fire and forget* - as you rightly point out, hooking up
to the Akka actors makes sense if the usefulness of the data is short-lived
and you don't need the ability to readily go back into archived data.

2). *Fault tolerance & multiple uses* - consider using a message queue like
Apache Kafka [1], write messages from your Akka Actors into a Kafka topic
with multiple partitions and replication.  Then use Spark Streaming job(s)
to read from Kafka.  You can tune Kafka to keep the last *N* days' data
online, so if your Spark Streaming job dies it can pick up at the point it
left off (rough sketch below, after point 3).

3). *Keep indefinitely* - files in HDFS, 'nuff said.
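
To make (2) concrete, here is a rough sketch using the Spark 1.x Kafka
receiver (spark-streaming-kafka); the ZooKeeper address, group id and topic
name are placeholders, and the broker-side retention window is controlled
with a setting like log.retention.hours=168:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("kafka-consumer-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Topic created up front with partitions and replication, e.g.
    //   kafka-topics.sh --create --zookeeper zk:2181 \
    //     --partitions 4 --replication-factor 2 --topic events

    // Each Streaming job consumes under its own group id, so Kafka tracks its
    // offsets independently and the job resumes where it left off on restart.
    val messages = KafkaUtils.createStream(
      ssc,
      "zk:2181",          // ZooKeeper quorum
      "clickstream-job",  // consumer group id (one per job)
      Map("events" -> 2)  // topic -> number of receiver threads
    )

    val values = messages.map(_._2)   // (key, value) tuples; keep the values
    values.count().print()

    ssc.start()
    ssc.awaitTermination()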

We're currently using (2) Kafka & (3) HDFS to process around 400M "web
clickstream events" a week.  Everything is written into Kafka and kept
'online' for 7 days, and also written out to HDFS in compressed
date-sequential files.
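
The write-out-to-HDFS part can be as simple as a foreachRDD that names each
batch's output by date. This is only an illustrative sketch (paths are made
up), reusing a DStream of values like the one in the previous snippet:

    import java.text.SimpleDateFormat
    import java.util.Date
    import org.apache.hadoop.io.compress.GzipCodec

    // `values` is a DStream[String], e.g. the Kafka message values above.
    // Each batch is written to a gzip-compressed, date-based directory.
    values.foreachRDD { (rdd, time) =>
      val day = new SimpleDateFormat("yyyy/MM/dd").format(new Date(time.milliseconds))
      rdd.saveAsTextFile(
        s"hdfs://namenode:8020/clickstream/$day/batch-${time.milliseconds}",
        classOf[GzipCodec])
    }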

We use several Spark Streaming jobs to process the real-time events
straight from Kafka.  Kafka supports multiple consumers, so each job sees
its own view of the message queue and all its events.  If any of the
Streaming jobs die or are restarted, they continue consuming from Kafka from
the last processed message without affecting any of the other consumer
processes.

Best,

MC


[1] http://kafka.apache.org/


