You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Mike Trienis <mi...@orcsol.com> on 2015/03/18 19:44:25 UTC

Spark Streaming S3 Performance Implications

Hi All,

I am pushing data from Kinesis stream to S3 using Spark Streaming and
noticed that during testing (i.e. master=local[2]) the batches (1 second
intervals) were falling behind the incoming data stream at about 5-10
events / second. It seems that the rdd.saveAsTextFile(s3n://...) is taking
at a few seconds to complete.

        val saveFunc = (rdd: RDD[String], time: Time) => {

            val count = rdd.count()

            if (count > 0) {

                val s3BucketInterval = time.milliseconds.toString

               rdd.saveAsTextFile(s3n://...)

            }
        }

        dataStream.foreachRDD(saveFunc)


Should I expect the same behaviour in a deployed cluster? Or does the
rdd.saveAsTextFile(s3n://...) distribute the push work to each worker node?

"Write the elements of the dataset as a text file (or set of text files) in
a given directory in the local filesystem, HDFS or any other
Hadoop-supported file system. Spark will call toString on each element to
convert it to a line of text in the file."

Thanks, Mike.

Re: Spark Streaming S3 Performance Implications

Posted by Mike Trienis <mi...@orcsol.com>.
Hey Chris,

Apologies for the delayed reply. Your responses are always insightful and
appreciated :-)

However, I have a few more questions.

"also, it looks like you're writing to S3 per RDD.  you'll want to broaden
that out to write DStream batches"

I assume you mean "dstream.saveAsTextFiles(....)" vs
"rdd.saveAsTextFile(....)". Although looking at the source code
DStream.scala, the saveAsTextFiles is simply wrapping a rdd.saveAsTextFile

  def saveAsTextFiles(prefix: String, suffix: String = "") {
    val saveFunc = (rdd: RDD[T], time: Time) => {
      val file = rddToFileName(prefix, suffix, time)
      rdd.saveAsTextFile(file)
    }
    this.foreachRDD(saveFunc)
  }

So it's not clear to me how it would improve the throughput.

Also, for your comment "expand even further and write window batches (where
the window interval is a multiple of the batch interval)". I still don't
quite understand mechanics underneath, for example, what would be the
difference between extending the batch interval and adding windowed
batches? I presume it has something to do with the processor thread(s)
within a receiver?

By the way, in the near future, I'll be putting together some performance
numbers on a proper deployment, and will be sure to share my findings.

Thanks, Mike!



On Sat, Mar 21, 2015 at 8:09 AM, Chris Fregly <ch...@fregly.com> wrote:

> hey mike!
>
> you'll definitely want to increase your parallelism by adding more shards
> to the stream - as well as spinning up 1 receiver per shard and unioning
> all the shards per the KinesisWordCount example that is included with the
> kinesis streaming package.
>
> you'll need more cores (cluster) or threads (local) to support this -
> equalling at least the number of shards/receivers + 1.
>
> also, it looks like you're writing to S3 per RDD.  you'll want to broaden
> that out to write DStream batches - or expand  even further and write
> window batches (where the window interval is a multiple of the batch
> interval).
>
> this goes for any spark streaming implementation - not just Kinesis.
>
> lemme know if that works for you.
>
> thanks!
>
> -Chris
> _____________________________
> From: Mike Trienis <mi...@orcsol.com>
> Sent: Wednesday, March 18, 2015 2:45 PM
> Subject: Spark Streaming S3 Performance Implications
> To: <us...@spark.apache.org>
>
>
>
>  Hi All,
>
>  I am pushing data from Kinesis stream to S3 using Spark Streaming and
> noticed that during testing (i.e. master=local[2]) the batches (1 second
> intervals) were falling behind the incoming data stream at about 5-10
> events / second. It seems that the rdd.saveAsTextFile(s3n://...) is taking
> at a few seconds to complete.
>
>           val saveFunc = (rdd: RDD[String], time: Time) => {
>
>              val count = rdd.count()
>
>              if (count > 0) {
>
>                  val s3BucketInterval = time.milliseconds.toString
>
>                 rdd.saveAsTextFile(s3n://...)
>
>              }
>          }
>
>          dataStream.foreachRDD(saveFunc)
>
>
>  Should I expect the same behaviour in a deployed cluster? Or does the
> rdd.saveAsTextFile(s3n://...) distribute the push work to each worker node?
>
>  "Write the elements of the dataset as a text file (or set of text files)
> in a given directory in the local filesystem, HDFS or any other
> Hadoop-supported file system. Spark will call toString on each element to
> convert it to a line of text in the file."
>
>  Thanks, Mike.
>
>
>

Re: Spark Streaming S3 Performance Implications

Posted by Ted Yu <yu...@gmail.com>.
Mike:
Once hadoop 2.7.0 is released, you should be able to enjoy the enhanced
performance of s3a.
See HADOOP-11571

Cheers

On Sat, Mar 21, 2015 at 8:09 AM, Chris Fregly <ch...@fregly.com> wrote:

> hey mike!
>
> you'll definitely want to increase your parallelism by adding more shards
> to the stream - as well as spinning up 1 receiver per shard and unioning
> all the shards per the KinesisWordCount example that is included with the
> kinesis streaming package.
>
> you'll need more cores (cluster) or threads (local) to support this -
> equalling at least the number of shards/receivers + 1.
>
> also, it looks like you're writing to S3 per RDD.  you'll want to broaden
> that out to write DStream batches - or expand  even further and write
> window batches (where the window interval is a multiple of the batch
> interval).
>
> this goes for any spark streaming implementation - not just Kinesis.
>
> lemme know if that works for you.
>
> thanks!
>
> -Chris
> _____________________________
> From: Mike Trienis <mi...@orcsol.com>
> Sent: Wednesday, March 18, 2015 2:45 PM
> Subject: Spark Streaming S3 Performance Implications
> To: <us...@spark.apache.org>
>
>
>
>  Hi All,
>
>  I am pushing data from Kinesis stream to S3 using Spark Streaming and
> noticed that during testing (i.e. master=local[2]) the batches (1 second
> intervals) were falling behind the incoming data stream at about 5-10
> events / second. It seems that the rdd.saveAsTextFile(s3n://...) is taking
> at a few seconds to complete.
>
>           val saveFunc = (rdd: RDD[String], time: Time) => {
>
>              val count = rdd.count()
>
>              if (count > 0) {
>
>                  val s3BucketInterval = time.milliseconds.toString
>
>                 rdd.saveAsTextFile(s3n://...)
>
>              }
>          }
>
>          dataStream.foreachRDD(saveFunc)
>
>
>  Should I expect the same behaviour in a deployed cluster? Or does the
> rdd.saveAsTextFile(s3n://...) distribute the push work to each worker node?
>
>  "Write the elements of the dataset as a text file (or set of text files)
> in a given directory in the local filesystem, HDFS or any other
> Hadoop-supported file system. Spark will call toString on each element to
> convert it to a line of text in the file."
>
>  Thanks, Mike.
>
>
>

Re: Spark Streaming S3 Performance Implications

Posted by Chris Fregly <ch...@fregly.com>.
hey mike!
you'll definitely want to increase your parallelism by adding more shards to the stream - as well as spinning up 1 receiver per shard and unioning all the shards per the KinesisWordCount example that is included with the kinesis streaming package. 
you'll need more cores (cluster) or threads (local) to support this - equalling at least the number of shards/receivers + 1.
also, it looks like you're writing to S3 per RDD.  you'll want to broaden that out to write DStream batches - or expand  even further and write window batches (where the window interval is a multiple of the batch interval).
this goes for any spark streaming implementation - not just Kinesis.
lemme know if that works for you.
thanks!
-Chris 
    _____________________________
From: Mike Trienis <mi...@orcsol.com>
Sent: Wednesday, March 18, 2015 2:45 PM
Subject: Spark Streaming S3 Performance Implications
To:  <us...@spark.apache.org>


       Hi All,       
          I am pushing data from Kinesis stream to S3 using Spark Streaming and noticed that during testing (i.e. master=local[2]) the batches (1 second intervals) were falling behind the incoming data stream at about 5-10 events / second. It seems that the rdd.saveAsTextFile(s3n://...) is taking at a few seconds to complete.           
                       val saveFunc = (rdd: RDD[String], time: Time) => {             
                         val count = rdd.count()             
                         if (count > 0) {             
                             val s3BucketInterval = time.milliseconds.toString             
                            rdd.saveAsTextFile(s3n://...)                                       }                     }             
                     dataStream.foreachRDD(saveFunc)              
          
          Should I expect the same behaviour in a deployed cluster? Or does the rdd.saveAsTextFile(s3n://...) distribute the push work to each worker node?          
          "Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file."          
          Thanks, Mike.