Posted to user@spark.apache.org by Daniil Osipov <da...@shazam.com> on 2014/09/02 23:13:18 UTC

Re: Spark Streaming with Kafka, building project with 'sbt assembly' is extremely slow

What version of sbt are you using? There is a bug in early versions of 0.13
that causes assembly to be extremely slow - make sure you're using the
latest one.
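For reference, a minimal sketch of what pinning the build to a newer toolchain can look like; the exact version numbers below are illustrative assumptions, not taken from this thread:

```scala
// project/build.properties -- pin the sbt launcher version:
//   sbt.version=0.13.6

// project/plugins.sbt -- pull in a recent sbt-assembly release:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")
```

With both pinned, `sbt assembly` runs the same plugin everywhere rather than whatever launcher happens to be on the machine.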


On Fri, Aug 29, 2014 at 1:30 PM, Aris <ar...@gmail.com> wrote:

> Hi folks,
>
> I am trying to use Kafka with Spark Streaming, and it appears I cannot do
> the normal 'sbt package' as I do with other Spark applications, such as
> Spark alone or Spark with MLlib. I learned I have to build with the
> sbt-assembly plugin.
>
> OK, so here is my build.sbt file for my extremely simple test Kafka/Spark
> Streaming project. It takes almost 30 minutes to build! This is a CentOS
> Linux machine with SSDs and 4GB of RAM; it's never been slow for me. To
> compare, sbt assembly for the entire Spark project itself takes less than
> 10 minutes.
>
> At the bottom of this file I am playing with the 'cacheOutput' option,
> because I read online that sbt-assembly may be computing SHA-1 hashes for
> all the *.class files in this super JAR.
>
> I also copied the mergeStrategy from Spark contributor TD Spark Streaming
> tutorial from Spark Summit 2014.
>
> Again, is there some better way to build this JAR file, ideally just using
> sbt package? This process is working, but it is very slow.
>
> Any help with speeding up this compilation is really appreciated!!
>
> Aris
>
> -----------------------------------------
>
> import AssemblyKeys._ // put this at the top of the file
>
> name := "streamingKafka"
>
> version := "1.0"
>
> scalaVersion := "2.10.4"
>
> libraryDependencies ++= Seq(
>   "org.apache.spark" %% "spark-core" % "1.0.1" % "provided",
>   "org.apache.spark" %% "spark-streaming" % "1.0.1" % "provided",
>   "org.apache.spark" %% "spark-streaming-kafka" % "1.0.1"
> )
>
> assemblySettings
>
> jarName in assembly := "streamingkafka-assembly.jar"
>
> mergeStrategy in assembly := {
>   case m if m.toLowerCase.endsWith("manifest.mf")          => MergeStrategy.discard
>   case m if m.toLowerCase.matches("meta-inf.*\\.sf$")      => MergeStrategy.discard
>   case "log4j.properties"                                  => MergeStrategy.discard
>   case m if m.toLowerCase.startsWith("meta-inf/services/") => MergeStrategy.filterDistinctLines
>   case "reference.conf"                                    => MergeStrategy.concat
>   case _                                                   => MergeStrategy.first
> }
>
> assemblyOption in assembly ~= { _.copy(cacheOutput = false) }
>
>

Re: Spark Streaming with Kafka, building project with 'sbt assembly' is extremely slow

Posted by Matt Narrell <ma...@gmail.com>.
I came across this:  https://github.com/xerial/sbt-pack

Until I found this, I was simply using the sbt-assembly plugin (sbt clean assembly).

mn
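If you try sbt-pack, a minimal sketch of the wiring (the plugin version is an assumption; check the project README for the current one):

```scala
// project/plugins.sbt
addSbtPlugin("org.xerial.sbt" % "sbt-pack" % "0.6.1")

// build.sbt -- bring in the plugin's settings, then run `sbt pack`:
import xerial.sbt.Pack._
packSettings
```

`sbt pack` copies the dependency jars under target/pack/lib instead of merging everything into one fat jar, which sidesteps the slow merge step entirely.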

On Sep 4, 2014, at 2:46 PM, Aris <ar...@gmail.com> wrote:

> Thanks for answering Daniil - 
> 
> I have SBT version 0.13.5, is that an old version? Seems pretty up-to-date.
> 

Re: Spark Streaming with Kafka, building project with 'sbt assembly' is extremely slow

Posted by Aris <ar...@gmail.com>.
Thanks for answering, Daniil -

I have SBT version 0.13.5, is that an old version? Seems pretty up-to-date.

It turns out I figured out a way around this entire problem: just use 'sbt
package', and when using bin/spark-submit, pass it the "--jars" option and
GIVE IT ALL THE JARS from the local ivy2 cache. Pretty inelegant, but at
least I am able to develop, and when I want to make a super JAR with sbt
assembly I can use the stupidly slow method.

Here is the important snippet for grabbing all the JARs from the local ivy2
cache:

 --jars $(find ~/.ivy2/cache/ -iname '*.jar' | tr '\n' ,)

Here's the entire running command -

bin/spark-submit --master 'local[*]' \
  --jars $(find /home/data/.ivy2/cache/ -iname '*.jar' | tr '\n' ,) \
  --class KafkaStreamConsumer \
  ~/code_host/data/scala/streamingKafka/target/scala-2.10/streamingkafka_2.10-1.0.jar \
  node1:2181 my-consumer-group aris-topic 1

This is fairly bad, but it works around sbt assembly being incredibly slow.
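One wrinkle with that snippet: the unquoted *.jar glob gets expanded by the shell if the current directory happens to contain jar files, and tr '\n' , leaves a trailing comma. A sketch of the same idea with both handled (jar_list is just an illustrative helper name):

```shell
#!/bin/sh
# Join find's output with commas, without the trailing comma tr leaves.
# Quoting '*.jar' keeps the shell from expanding the glob in the
# current directory before find ever sees the pattern.
jar_list() {
  find "$1" -iname '*.jar' | sort | paste -sd, -
}

# Usage against the Ivy2 cache from the thread:
#   bin/spark-submit --master 'local[*]' --jars "$(jar_list ~/.ivy2/cache)" \
#     --class KafkaStreamConsumer \
#     target/scala-2.10/streamingkafka_2.10-1.0.jar \
#     node1:2181 my-consumer-group aris-topic 1
```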


On Tue, Sep 2, 2014 at 2:13 PM, Daniil Osipov <da...@shazam.com>
wrote:

> What version of sbt are you using? There is a bug in early version of 0.13
> that causes assembly to be extremely slow - make sure you're using the
> latest one.