Posted to dev@spark.apache.org by Patrick Wendell <pw...@gmail.com> on 2014/04/29 10:05:38 UTC

Spark 1.0.0 rc3

Hey All,

This is not an official vote, but I wanted to cut an RC so that people can
test against the Maven artifacts, test building with their configuration,
etc. We are still chasing down a few issues and updating docs, etc.

If you have issues or bug reports for this release, please send an e-mail
to the Spark dev list and/or file a JIRA.

Commit: d636772 (v1.0.0-rc3)
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221

Binaries:
http://people.apache.org/~pwendell/spark-1.0.0-rc3/

Docs:
http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/

Repository:
https://repository.apache.org/content/repositories/orgapachespark-1012/

== API Changes ==
If you want to test building against Spark, there are some minor API
changes. We'll get these written up for the final release, but I'm noting a
few here (not comprehensive):

changes to ML vector specification:
http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/mllib-guide.html#from-09-to-10

changes to the Java API:
http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark

coGroup and related functions now return Iterable[T] instead of Seq[T]
==> Call toSeq on the result to restore the old behavior

SparkContext.jarOfClass returns Option[String] instead of Seq[String]
==> Call toSeq on the result to restore the old behavior

Streaming classes have been renamed:
NetworkReceiver -> Receiver
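
Here's a minimal sketch of the coGroup and jarOfClass migrations above (the
object name, app name, and sample data are just placeholders):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // pair-RDD implicits (cogroup, mapValues)

object MigrationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "Migration Sketch")
    val a = sc.parallelize(Seq((1, "x"), (2, "y")))
    val b = sc.parallelize(Seq((1, "z")))

    // cogroup values are now Iterable[String]; call toSeq wherever the old
    // Seq[String] behavior is needed
    val grouped = a.cogroup(b).mapValues { case (xs, ys) => (xs.toSeq, ys.toSeq) }
    grouped.collect().foreach(println)

    // jarOfClass now returns Option[String]; toSeq restores a Seq[String]
    val jars: Seq[String] = SparkContext.jarOfClass(getClass).toSeq
    println(jars)

    sc.stop()
  }
}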

Re: Spark 1.0.0 rc3

Posted by Patrick Wendell <pw...@gmail.com>.
Sorry, got cut off. 0.9.0 and 1.0.0 are not binary compatible, and in a few
cases not source compatible. 1.X will be source compatible. We are also
planning to support binary compatibility in 1.X, but I'm waiting until we
make a few releases to officially promise that, since Scala makes this
pretty tricky.

On Tue, Apr 29, 2014 at 11:47 AM, Patrick Wendell <pw...@gmail.com> wrote:
>> What are the expectations / guarantees on binary compatibility between
>> 0.9 and 1.0?
>
> There are no guarantees.

Re: Spark 1.0.0 rc3

Posted by Patrick Wendell <pw...@gmail.com>.
> What are the expectations / guarantees on binary compatibility between
> 0.9 and 1.0?

There are no guarantees.

Re: Spark 1.0.0 rc3

Posted by Marcelo Vanzin <va...@cloudera.com>.
Hi Patrick,

What are the expectations / guarantees on binary compatibility between
0.9 and 1.0?

You mention some API changes, which kinda hint that binary compatibility
has already been broken, but I just wanted to point out that there are
other cases, e.g.:

Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:236)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:47)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.NoSuchMethodError:
org.apache.spark.SparkContext$.rddToOrderedRDDFunctions(Lorg/apache/spark/rdd/RDD;Lscala/Function1;Lscala/reflect/ClassTag;Lscala/reflect/ClassTag;)Lorg/apache/spark/rdd/OrderedRDDFunctions;

(Compiled against 0.9, run against 1.0.)
Offending code:

      val top10 = counts.sortByKey(false).take(10)

Recompiling fixes the problem.
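
A minimal sketch of the pattern involved (the object name and input path are
placeholders): sortByKey is not defined on RDD itself; it comes from the
implicit conversion rddToOrderedRDDFunctions, and the 0.9-compiled bytecode
references that conversion's exact method signature, which no longer exists
in 1.0. Recompiling simply resolves the call against the new signature.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // brings rddToOrderedRDDFunctions into scope

object TopWords {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "Top Words")
    val counts = sc.textFile("input.txt")            // placeholder input path
      .flatMap(_.split("""\W+"""))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .map { case (word, count) => (count, word) }   // key by count for sorting
    // The compiled call site below is where the implicit conversion's
    // signature gets baked into the bytecode.
    val top10 = counts.sortByKey(false).take(10)
    top10.foreach(println)
    sc.stop()
  }
}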


On Tue, Apr 29, 2014 at 1:05 AM, Patrick Wendell <pw...@gmail.com> wrote:
> Hey All,
>
> This is not an official vote, but I wanted to cut an RC so that people can
> test against the Maven artifacts, test building with their configuration,
> etc. We are still chasing down a few issues and updating docs, etc.
>
> If you have issues or bug reports for this release, please send an e-mail
> to the Spark dev list and/or file a JIRA.
>
> Commit: d636772 (v1.0.0-rc3)
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221
>
> Binaries:
> http://people.apache.org/~pwendell/spark-1.0.0-rc3/
>
> Docs:
> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/
>
> Repository:
> https://repository.apache.org/content/repositories/orgapachespark-1012/
>
> == API Changes ==
> If you want to test building against Spark there are some minor API
> changes. We'll get these written up for the final release but I'm noting a
> few here (not comprehensive):
>
> changes to ML vector specification:
> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/mllib-guide.html#from-09-to-10
>
> changes to the Java API:
> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
>
> coGroup and related functions now return Iterable[T] instead of Seq[T]
> ==> Call toSeq on the result to restore the old behavior
>
> SparkContext.jarOfClass returns Option[String] instead of Seq[String]
> ==> Call toSeq on the result to restore old behavior
>
> Streaming classes have been renamed:
> NetworkReceiver -> Receiver



-- 
Marcelo

Re: Spark 1.0.0 rc3

Posted by Manu Suryavansh <su...@gmail.com>.
Hi,

I tried to build the 1.0.0 rc3 version with Java 8 and got this error:
java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
I am building on a Core i7 (quad core) Windows laptop with 8 GB RAM.

Earlier I had tried to build Spark 0.9.1 with Java 8 and had gotten an error
about comparator.class not being found - which was mentioned today on
another thread - so I am not getting that error now. I have successfully
built Spark 0.9.0 with Java 1.7.

Thanks,
Manu


On Tue, Apr 29, 2014 at 10:43 PM, Patrick Wendell <pw...@gmail.com> wrote:

> That suggestion got lost along the way and IIRC the patch didn't have
> that. It's a good idea though, if nothing else to provide a simple
> means for backwards compatibility.
>
> I created a JIRA for this. It's very straightforward so maybe someone
> can pick it up quickly:
> https://issues.apache.org/jira/browse/SPARK-1677
>
>
> On Tue, Apr 29, 2014 at 2:20 PM, Dean Wampler <de...@gmail.com>
> wrote:
> > Thanks. I'm fine with the logic change, although I was a bit surprised to
> > see Hadoop used for file I/O.
> >
> > Anyway, the jira issue and pull request discussions mention a flag to
> > enable overwrites. That would be very convenient for a tutorial I'm
> > writing, although I wouldn't recommend it for normal use, of course.
> > However, I can't figure out if this actually exists. I found the
> > spark.files.overwrite property, but that doesn't apply.  Does this
> override
> > flag, method call, or method argument actually exist?
> >
> > Thanks,
> > Dean
> >
> >
> > On Tue, Apr 29, 2014 at 1:54 PM, Patrick Wendell <pw...@gmail.com>
> wrote:
> >
> >> Hi Dean,
> >>
> >> We always used the Hadoop libraries here to read and write local
> >> files. In Spark 1.0 we started enforcing the rule that you can't
> >> over-write an existing directory because it can cause
> >> confusing/undefined behavior if multiple jobs output to the directory
> >> (they partially clobber each other's output).
> >>
> >> https://issues.apache.org/jira/browse/SPARK-1100
> >> https://github.com/apache/spark/pull/11
> >>
> >> In the JIRA I actually proposed slightly deviating from Hadoop
> >> semantics and allowing the directory to exist if it is empty, but I
> >> think in the end we decided to just go with the exact same semantics
> >> as Hadoop (i.e. empty directories are a problem).
> >>
> >> - Patrick
> >>
> >> On Tue, Apr 29, 2014 at 9:43 AM, Dean Wampler <de...@gmail.com>
> >> wrote:
> >> > I'm observing one anomalous behavior. With the 1.0.0 libraries, it's
> >> using
> >> > HDFS classes for file I/O, while the same script compiled and running
> >> with
> >> > 0.9.1 uses only the local-mode File IO.
> >> >
> >> > The script is a variation of the Word Count script. Here are the
> "guts":
> >> >
> >> > object WordCount2 {
> >> >   def main(args: Array[String]) = {
> >> >
> >> >     val sc = new SparkContext("local", "Word Count (2)")
> >> >
> >> >     val input = sc.textFile(".../some/local/file").map(line =>
> >> > line.toLowerCase)
> >> >     input.cache
> >> >
> >> >     val wc2 = input
> >> >       .flatMap(line => line.split("""\W+"""))
> >> >       .map(word => (word, 1))
> >> >       .reduceByKey((count1, count2) => count1 + count2)
> >> >
> >> >     wc2.saveAsTextFile("output/some/directory")
> >> >
> >> >     sc.stop()
> >> >
> >> > It works fine compiled and executed with 0.9.1. If I recompile and run
> >> with
> >> > 1.0.0-RC1, where the same output directory still exists, I get this
> >> > familiar Hadoop-ish exception:
> >> >
> >> > [error] (run-main-0)
> org.apache.hadoop.mapred.FileAlreadyExistsException:
> >> > Output directory
> >> >
> >>
> file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc
> >> > already exists
> >> > org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
> >> >
> >>
> file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc
> >> > already exists
> >> >  at
> >> >
> >>
> org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121)
> >> > at
> >> >
> >>
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:749)
> >> >  at
> >> >
> >>
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:662)
> >> > at
> >> >
> >>
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:581)
> >> >  at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1057)
> >> > at spark.activator.WordCount2$.main(WordCount2.scala:42)
> >> >  at spark.activator.WordCount2.main(WordCount2.scala)
> >> > ...
> >> >
> >> > Thoughts?
> >> >
> >> >
> >> > On Tue, Apr 29, 2014 at 3:05 AM, Patrick Wendell <pw...@gmail.com>
> >> wrote:
> >> >
> >> >> Hey All,
> >> >>
> >> >> This is not an official vote, but I wanted to cut an RC so that
> people
> >> can
> >> >> test against the Maven artifacts, test building with their
> >> configuration,
> >> >> etc. We are still chasing down a few issues and updating docs, etc.
> >> >>
> >> >> If you have issues or bug reports for this release, please send an
> >> e-mail
> >> >> to the Spark dev list and/or file a JIRA.
> >> >>
> >> >> Commit: d636772 (v1.0.0-rc3)
> >> >>
> >> >>
> >>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221
> >> >>
> >> >> Binaries:
> >> >> http://people.apache.org/~pwendell/spark-1.0.0-rc3/
> >> >>
> >> >> Docs:
> >> >> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/
> >> >>
> >> >> Repository:
> >> >>
> https://repository.apache.org/content/repositories/orgapachespark-1012/
> >> >>
> >> >> == API Changes ==
> >> >> If you want to test building against Spark there are some minor API
> >> >> changes. We'll get these written up for the final release but I'm
> >> noting a
> >> >> few here (not comprehensive):
> >> >>
> >> >> changes to ML vector specification:
> >> >>
> >> >>
> >>
> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/mllib-guide.html#from-09-to-10
> >> >>
> >> >> changes to the Java API:
> >> >>
> >> >>
> >>
> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
> >> >>
> >> >> coGroup and related functions now return Iterable[T] instead of
> Seq[T]
> >> >> ==> Call toSeq on the result to restore the old behavior
> >> >>
> >> >> SparkContext.jarOfClass returns Option[String] instead of Seq[String]
> >> >> ==> Call toSeq on the result to restore old behavior
> >> >>
> >> >> Streaming classes have been renamed:
> >> >> NetworkReceiver -> Receiver
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > Dean Wampler, Ph.D.
> >> > Typesafe
> >> > @deanwampler
> >> > http://typesafe.com
> >> > http://polyglotprogramming.com
> >>
> >
> >
> >
> > --
> > Dean Wampler, Ph.D.
> > Typesafe
> > @deanwampler
> > http://typesafe.com
> > http://polyglotprogramming.com
>



-- 
Manu Suryavansh

Re: Spark 1.0.0 rc3

Posted by Patrick Wendell <pw...@gmail.com>.
That suggestion got lost along the way, and IIRC the patch didn't include
it. It's a good idea though, if nothing else to provide a simple means of
backwards compatibility.

I created a JIRA for this. It's very straightforward so maybe someone
can pick it up quickly:
https://issues.apache.org/jira/browse/SPARK-1677


On Tue, Apr 29, 2014 at 2:20 PM, Dean Wampler <de...@gmail.com> wrote:
> Thanks. I'm fine with the logic change, although I was a bit surprised to
> see Hadoop used for file I/O.
>
> Anyway, the jira issue and pull request discussions mention a flag to
> enable overwrites. That would be very convenient for a tutorial I'm
> writing, although I wouldn't recommend it for normal use, of course.
> However, I can't figure out if this actually exists. I found the
> spark.files.overwrite property, but that doesn't apply.  Does this override
> flag, method call, or method argument actually exist?
>
> Thanks,
> Dean
>
>
> On Tue, Apr 29, 2014 at 1:54 PM, Patrick Wendell <pw...@gmail.com> wrote:
>
>> Hi Dean,
>>
>> We always used the Hadoop libraries here to read and write local
>> files. In Spark 1.0 we started enforcing the rule that you can't
>> over-write an existing directory because it can cause
>> confusing/undefined behavior if multiple jobs output to the directory
>> (they partially clobber each other's output).
>>
>> https://issues.apache.org/jira/browse/SPARK-1100
>> https://github.com/apache/spark/pull/11
>>
>> In the JIRA I actually proposed slightly deviating from Hadoop
>> semantics and allowing the directory to exist if it is empty, but I
>> think in the end we decided to just go with the exact same semantics
>> as Hadoop (i.e. empty directories are a problem).
>>
>> - Patrick
>>
>> On Tue, Apr 29, 2014 at 9:43 AM, Dean Wampler <de...@gmail.com>
>> wrote:
>> > I'm observing one anomalous behavior. With the 1.0.0 libraries, it's
>> using
>> > HDFS classes for file I/O, while the same script compiled and running
>> with
>> > 0.9.1 uses only the local-mode File IO.
>> >
>> > The script is a variation of the Word Count script. Here are the "guts":
>> >
>> > object WordCount2 {
>> >   def main(args: Array[String]) = {
>> >
>> >     val sc = new SparkContext("local", "Word Count (2)")
>> >
>> >     val input = sc.textFile(".../some/local/file").map(line =>
>> > line.toLowerCase)
>> >     input.cache
>> >
>> >     val wc2 = input
>> >       .flatMap(line => line.split("""\W+"""))
>> >       .map(word => (word, 1))
>> >       .reduceByKey((count1, count2) => count1 + count2)
>> >
>> >     wc2.saveAsTextFile("output/some/directory")
>> >
>> >     sc.stop()
>> >
>> > It works fine compiled and executed with 0.9.1. If I recompile and run
>> with
>> > 1.0.0-RC1, where the same output directory still exists, I get this
>> > familiar Hadoop-ish exception:
>> >
>> > [error] (run-main-0) org.apache.hadoop.mapred.FileAlreadyExistsException:
>> > Output directory
>> >
>> file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc
>> > already exists
>> > org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
>> >
>> file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc
>> > already exists
>> >  at
>> >
>> org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121)
>> > at
>> >
>> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:749)
>> >  at
>> >
>> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:662)
>> > at
>> >
>> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:581)
>> >  at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1057)
>> > at spark.activator.WordCount2$.main(WordCount2.scala:42)
>> >  at spark.activator.WordCount2.main(WordCount2.scala)
>> > ...
>> >
>> > Thoughts?
>> >
>> >
>> > On Tue, Apr 29, 2014 at 3:05 AM, Patrick Wendell <pw...@gmail.com>
>> wrote:
>> >
>> >> Hey All,
>> >>
>> >> This is not an official vote, but I wanted to cut an RC so that people
>> can
>> >> test against the Maven artifacts, test building with their
>> configuration,
>> >> etc. We are still chasing down a few issues and updating docs, etc.
>> >>
>> >> If you have issues or bug reports for this release, please send an
>> e-mail
>> >> to the Spark dev list and/or file a JIRA.
>> >>
>> >> Commit: d636772 (v1.0.0-rc3)
>> >>
>> >>
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221
>> >>
>> >> Binaries:
>> >> http://people.apache.org/~pwendell/spark-1.0.0-rc3/
>> >>
>> >> Docs:
>> >> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/
>> >>
>> >> Repository:
>> >> https://repository.apache.org/content/repositories/orgapachespark-1012/
>> >>
>> >> == API Changes ==
>> >> If you want to test building against Spark there are some minor API
>> >> changes. We'll get these written up for the final release but I'm
>> noting a
>> >> few here (not comprehensive):
>> >>
>> >> changes to ML vector specification:
>> >>
>> >>
>> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/mllib-guide.html#from-09-to-10
>> >>
>> >> changes to the Java API:
>> >>
>> >>
>> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
>> >>
>> >> coGroup and related functions now return Iterable[T] instead of Seq[T]
>> >> ==> Call toSeq on the result to restore the old behavior
>> >>
>> >> SparkContext.jarOfClass returns Option[String] instead of Seq[String]
>> >> ==> Call toSeq on the result to restore old behavior
>> >>
>> >> Streaming classes have been renamed:
>> >> NetworkReceiver -> Receiver
>> >>
>> >
>> >
>> >
>> > --
>> > Dean Wampler, Ph.D.
>> > Typesafe
>> > @deanwampler
>> > http://typesafe.com
>> > http://polyglotprogramming.com
>>
>
>
>
> --
> Dean Wampler, Ph.D.
> Typesafe
> @deanwampler
> http://typesafe.com
> http://polyglotprogramming.com

Re: Spark 1.0.0 rc3

Posted by Dean Wampler <de...@gmail.com>.
Thanks. I'm fine with the logic change, although I was a bit surprised to
see Hadoop used for file I/O.

Anyway, the JIRA issue and pull request discussions mention a flag to
enable overwrites. That would be very convenient for a tutorial I'm
writing, although I wouldn't recommend it for normal use, of course.
However, I can't figure out whether this actually exists. I found the
spark.files.overwrite property, but that doesn't apply. Does this override
flag, method call, or method argument actually exist?

Thanks,
Dean


On Tue, Apr 29, 2014 at 1:54 PM, Patrick Wendell <pw...@gmail.com> wrote:

> Hi Dean,
>
> We always used the Hadoop libraries here to read and write local
> files. In Spark 1.0 we started enforcing the rule that you can't
> over-write an existing directory because it can cause
> confusing/undefined behavior if multiple jobs output to the directory
> (they partially clobber each other's output).
>
> https://issues.apache.org/jira/browse/SPARK-1100
> https://github.com/apache/spark/pull/11
>
> In the JIRA I actually proposed slightly deviating from Hadoop
> semantics and allowing the directory to exist if it is empty, but I
> think in the end we decided to just go with the exact same semantics
> as Hadoop (i.e. empty directories are a problem).
>
> - Patrick
>
> On Tue, Apr 29, 2014 at 9:43 AM, Dean Wampler <de...@gmail.com>
> wrote:
> > I'm observing one anomalous behavior. With the 1.0.0 libraries, it's
> using
> > HDFS classes for file I/O, while the same script compiled and running
> with
> > 0.9.1 uses only the local-mode File IO.
> >
> > The script is a variation of the Word Count script. Here are the "guts":
> >
> > object WordCount2 {
> >   def main(args: Array[String]) = {
> >
> >     val sc = new SparkContext("local", "Word Count (2)")
> >
> >     val input = sc.textFile(".../some/local/file").map(line =>
> > line.toLowerCase)
> >     input.cache
> >
> >     val wc2 = input
> >       .flatMap(line => line.split("""\W+"""))
> >       .map(word => (word, 1))
> >       .reduceByKey((count1, count2) => count1 + count2)
> >
> >     wc2.saveAsTextFile("output/some/directory")
> >
> >     sc.stop()
> >
> > It works fine compiled and executed with 0.9.1. If I recompile and run
> with
> > 1.0.0-RC1, where the same output directory still exists, I get this
> > familiar Hadoop-ish exception:
> >
> > [error] (run-main-0) org.apache.hadoop.mapred.FileAlreadyExistsException:
> > Output directory
> >
> file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc
> > already exists
> > org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
> >
> file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc
> > already exists
> >  at
> >
> org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121)
> > at
> >
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:749)
> >  at
> >
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:662)
> > at
> >
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:581)
> >  at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1057)
> > at spark.activator.WordCount2$.main(WordCount2.scala:42)
> >  at spark.activator.WordCount2.main(WordCount2.scala)
> > ...
> >
> > Thoughts?
> >
> >
> > On Tue, Apr 29, 2014 at 3:05 AM, Patrick Wendell <pw...@gmail.com>
> wrote:
> >
> >> Hey All,
> >>
> >> This is not an official vote, but I wanted to cut an RC so that people
> can
> >> test against the Maven artifacts, test building with their
> configuration,
> >> etc. We are still chasing down a few issues and updating docs, etc.
> >>
> >> If you have issues or bug reports for this release, please send an
> e-mail
> >> to the Spark dev list and/or file a JIRA.
> >>
> >> Commit: d636772 (v1.0.0-rc3)
> >>
> >>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221
> >>
> >> Binaries:
> >> http://people.apache.org/~pwendell/spark-1.0.0-rc3/
> >>
> >> Docs:
> >> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/
> >>
> >> Repository:
> >> https://repository.apache.org/content/repositories/orgapachespark-1012/
> >>
> >> == API Changes ==
> >> If you want to test building against Spark there are some minor API
> >> changes. We'll get these written up for the final release but I'm
> noting a
> >> few here (not comprehensive):
> >>
> >> changes to ML vector specification:
> >>
> >>
> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/mllib-guide.html#from-09-to-10
> >>
> >> changes to the Java API:
> >>
> >>
> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
> >>
> >> coGroup and related functions now return Iterable[T] instead of Seq[T]
> >> ==> Call toSeq on the result to restore the old behavior
> >>
> >> SparkContext.jarOfClass returns Option[String] instead of Seq[String]
> >> ==> Call toSeq on the result to restore old behavior
> >>
> >> Streaming classes have been renamed:
> >> NetworkReceiver -> Receiver
> >>
> >
> >
> >
> > --
> > Dean Wampler, Ph.D.
> > Typesafe
> > @deanwampler
> > http://typesafe.com
> > http://polyglotprogramming.com
>



-- 
Dean Wampler, Ph.D.
Typesafe
@deanwampler
http://typesafe.com
http://polyglotprogramming.com

Re: Spark 1.0.0 rc3

Posted by Patrick Wendell <pw...@gmail.com>.
Hi Dean,

We have always used the Hadoop libraries here to read and write local
files. In Spark 1.0 we started enforcing the rule that you can't
overwrite an existing directory, because that can cause
confusing/undefined behavior if multiple jobs output to the same directory
(they partially clobber each other's output).

https://issues.apache.org/jira/browse/SPARK-1100
https://github.com/apache/spark/pull/11

In the JIRA I actually proposed slightly deviating from Hadoop
semantics and allowing the directory to exist if it is empty, but I
think in the end we decided to just go with the exact same semantics
as Hadoop (i.e. even empty directories are a problem).
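
For a tutorial, one minimal workaround sketch (not an official flag; the
path and names below are placeholders) is to delete the output directory up
front with the same Hadoop filesystem API Spark uses, then save as usual:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext

object OverwriteFriendlySave {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "Overwrite-friendly save")
    val data = sc.parallelize(Seq("a", "b", "c"))         // stand-in for the real RDD

    // Remove any previous run's output so the existence check passes.
    val outputPath = new Path("output/some/directory")    // placeholder path
    val fs = FileSystem.get(outputPath.toUri, sc.hadoopConfiguration)
    if (fs.exists(outputPath)) {
      fs.delete(outputPath, true)                         // recursive delete
    }

    data.saveAsTextFile(outputPath.toString)
    sc.stop()
  }
}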

- Patrick

On Tue, Apr 29, 2014 at 9:43 AM, Dean Wampler <de...@gmail.com> wrote:
> I'm observing one anomalous behavior. With the 1.0.0 libraries, it's using
> HDFS classes for file I/O, while the same script compiled and running with
> 0.9.1 uses only the local-mode File IO.
>
> The script is a variation of the Word Count script. Here are the "guts":
>
> object WordCount2 {
>   def main(args: Array[String]) = {
>
>     val sc = new SparkContext("local", "Word Count (2)")
>
>     val input = sc.textFile(".../some/local/file").map(line =>
> line.toLowerCase)
>     input.cache
>
>     val wc2 = input
>       .flatMap(line => line.split("""\W+"""))
>       .map(word => (word, 1))
>       .reduceByKey((count1, count2) => count1 + count2)
>
>     wc2.saveAsTextFile("output/some/directory")
>
>     sc.stop()
>
> It works fine compiled and executed with 0.9.1. If I recompile and run with
> 1.0.0-RC1, where the same output directory still exists, I get this
> familiar Hadoop-ish exception:
>
> [error] (run-main-0) org.apache.hadoop.mapred.FileAlreadyExistsException:
> Output directory
> file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc
> already exists
> org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
> file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc
> already exists
>  at
> org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121)
> at
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:749)
>  at
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:662)
> at
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:581)
>  at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1057)
> at spark.activator.WordCount2$.main(WordCount2.scala:42)
>  at spark.activator.WordCount2.main(WordCount2.scala)
> ...
>
> Thoughts?
>
>
> On Tue, Apr 29, 2014 at 3:05 AM, Patrick Wendell <pw...@gmail.com> wrote:
>
>> Hey All,
>>
>> This is not an official vote, but I wanted to cut an RC so that people can
>> test against the Maven artifacts, test building with their configuration,
>> etc. We are still chasing down a few issues and updating docs, etc.
>>
>> If you have issues or bug reports for this release, please send an e-mail
>> to the Spark dev list and/or file a JIRA.
>>
>> Commit: d636772 (v1.0.0-rc3)
>>
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221
>>
>> Binaries:
>> http://people.apache.org/~pwendell/spark-1.0.0-rc3/
>>
>> Docs:
>> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/
>>
>> Repository:
>> https://repository.apache.org/content/repositories/orgapachespark-1012/
>>
>> == API Changes ==
>> If you want to test building against Spark there are some minor API
>> changes. We'll get these written up for the final release but I'm noting a
>> few here (not comprehensive):
>>
>> changes to ML vector specification:
>>
>> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/mllib-guide.html#from-09-to-10
>>
>> changes to the Java API:
>>
>> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
>>
>> coGroup and related functions now return Iterable[T] instead of Seq[T]
>> ==> Call toSeq on the result to restore the old behavior
>>
>> SparkContext.jarOfClass returns Option[String] instead of Seq[String]
>> ==> Call toSeq on the result to restore old behavior
>>
>> Streaming classes have been renamed:
>> NetworkReceiver -> Receiver
>>
>
>
>
> --
> Dean Wampler, Ph.D.
> Typesafe
> @deanwampler
> http://typesafe.com
> http://polyglotprogramming.com

Re: Spark 1.0.0 rc3

Posted by Dean Wampler <de...@gmail.com>.
I'm observing one anomalous behavior. With the 1.0.0 libraries, it's using
HDFS classes for file I/O, while the same script compiled and run with
0.9.1 uses only local-mode file I/O.

The script is a variation of the Word Count script. Here are the "guts":

object WordCount2 {
  def main(args: Array[String]) = {

    val sc = new SparkContext("local", "Word Count (2)")

    val input = sc.textFile(".../some/local/file").map(line => line.toLowerCase)
    input.cache

    val wc2 = input
      .flatMap(line => line.split("""\W+"""))
      .map(word => (word, 1))
      .reduceByKey((count1, count2) => count1 + count2)

    wc2.saveAsTextFile("output/some/directory")

    sc.stop()
  }
}

It works fine compiled and executed with 0.9.1. If I recompile and run with
1.0.0-RC1 while the same output directory still exists, I get this familiar
Hadoop-ish exception:

[error] (run-main-0) org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc already exists
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc already exists
  at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121)
  at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:749)
  at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:662)
  at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:581)
  at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1057)
  at spark.activator.WordCount2$.main(WordCount2.scala:42)
  at spark.activator.WordCount2.main(WordCount2.scala)
...

Thoughts?


On Tue, Apr 29, 2014 at 3:05 AM, Patrick Wendell <pw...@gmail.com> wrote:

> Hey All,
>
> This is not an official vote, but I wanted to cut an RC so that people can
> test against the Maven artifacts, test building with their configuration,
> etc. We are still chasing down a few issues and updating docs, etc.
>
> If you have issues or bug reports for this release, please send an e-mail
> to the Spark dev list and/or file a JIRA.
>
> Commit: d636772 (v1.0.0-rc3)
>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221
>
> Binaries:
> http://people.apache.org/~pwendell/spark-1.0.0-rc3/
>
> Docs:
> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/
>
> Repository:
> https://repository.apache.org/content/repositories/orgapachespark-1012/
>
> == API Changes ==
> If you want to test building against Spark there are some minor API
> changes. We'll get these written up for the final release but I'm noting a
> few here (not comprehensive):
>
> changes to ML vector specification:
>
> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/mllib-guide.html#from-09-to-10
>
> changes to the Java API:
>
> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
>
> coGroup and related functions now return Iterable[T] instead of Seq[T]
> ==> Call toSeq on the result to restore the old behavior
>
> SparkContext.jarOfClass returns Option[String] instead of Seq[String]
> ==> Call toSeq on the result to restore old behavior
>
> Streaming classes have been renamed:
> NetworkReceiver -> Receiver
>



-- 
Dean Wampler, Ph.D.
Typesafe
@deanwampler
http://typesafe.com
http://polyglotprogramming.com

Re: Spark 1.0.0 rc3

Posted by Nan Zhu <zh...@gmail.com>.
I built with

SPARK_HADOOP_VERSION=2.3.0 sbt/sbt assembly

and copied the generated jar to the lib/ directory of my application, but it
seems that sbt cannot find the dependencies in the jar?

Everything works with the pre-built jar files downloaded from the link
provided by Patrick.
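
For reference, a minimal build.sbt sketch of the setup I'm describing (the
project name and versions are placeholders); sbt treats jars dropped into
lib/ as unmanaged dependencies, so no libraryDependencies entry for Spark is
needed:

name := "my-spark-app"       // placeholder project name

scalaVersion := "2.10.4"     // Scala line used by the Spark 1.0.x builds

// The locally built spark-assembly-*.jar sits in lib/ and lands on the
// compile and runtime classpath through sbt's unmanaged-jars convention.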

Best, 

-- 
Nan Zhu


On Thursday, May 1, 2014 at 11:16 PM, Madhu wrote:

> I'm guessing EC2 support is not there yet?
> 
> I was able to build using the binary download on both Windows 7 and RHEL 6
> without issues.
> I tried to create an EC2 cluster, but saw this:
> 
> ~/spark-ec2
> Initializing spark
> ~ ~/spark-ec2
> ERROR: Unknown Spark version
> Initializing shark
> ~ ~/spark-ec2 ~/spark-ec2
> ERROR: Unknown Shark version
> 
> The spark dir on the EC2 master has only a conf dir, so it didn't deploy
> properly.
> 
> 
> 
> --
> View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-1-0-0-rc3-tp6427p6456.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com (http://Nabble.com).
> 
> 



Re: Spark 1.0.0 rc3

Posted by Madhu <ma...@madhu.com>.
I'm guessing EC2 support is not there yet?

I was able to build using the binary download on both Windows 7 and RHEL 6
without issues.
I tried to create an EC2 cluster, but saw this:

~/spark-ec2
Initializing spark
~ ~/spark-ec2
ERROR: Unknown Spark version
Initializing shark
~ ~/spark-ec2 ~/spark-ec2
ERROR: Unknown Shark version

The spark dir on the EC2 master has only a conf dir, so it didn't deploy
properly.



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-1-0-0-rc3-tp6427p6456.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.