Posted to dev@spark.apache.org by Reynold Xin <rx...@databricks.com> on 2015/08/26 06:28:14 UTC

[VOTE] Release Apache Spark 1.5.0 (RC2)

Please vote on releasing the following candidate as Apache Spark version
1.5.0. The vote is open until Friday, Aug 29, 2015 at 5:00 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.5.0
[ ] -1 Do not release this package because ...
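
The passing rule above can be sketched as a small tally. This is only an illustration, assuming the usual reading that a vote passes when at least three PMC members vote +1 and the PMC +1 votes outnumber the PMC -1 votes; the function name `vote_passes` and its parameters are hypothetical and not part of any Apache tooling.

```python
def vote_passes(pmc_plus_ones: int, pmc_minus_ones: int, minimum_plus_ones: int = 3) -> bool:
    """Return True if the tally meets the assumed passing rule.

    Assumption: only PMC (binding) votes count, at least
    `minimum_plus_ones` of them must be +1, and the +1 votes must
    outnumber the -1 votes.
    """
    if pmc_plus_ones < 0 or pmc_minus_ones < 0:
        raise ValueError("vote counts must be non-negative")
    return pmc_plus_ones >= minimum_plus_ones and pmc_plus_ones > pmc_minus_ones


print(vote_passes(3, 0))  # True
print(vote_passes(2, 0))  # False: fewer than three +1 votes
print(vote_passes(3, 3))  # False: +1 votes do not outnumber -1 votes
```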

To learn more about Apache Spark, please see http://spark.apache.org/


The tag to be voted on is v1.5.0-rc2:
https://github.com/apache/spark/tree/727771352855dbb780008c449a877f5aaa5fc27a

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc
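
For testers who want to check a downloaded archive against a published digest, a minimal sketch follows. It assumes you have already extracted the expected hex digest string from the digest file; the digest algorithm actually used for this RC is not stated here, so `sha512` is just a default, and the helper names `file_digest` and `digest_matches` are hypothetical. Signature verification against the key above needs a PGP tool and is not covered.

```python
import hashlib


def file_digest(path: str, algorithm: str = "sha512", chunk_size: int = 1 << 20) -> str:
    """Return the hex digest of the file at `path`, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def digest_matches(path: str, expected_hex: str, algorithm: str = "sha512") -> bool:
    """Compare the computed digest with a published one.

    Whitespace and letter case in `expected_hex` are ignored, since
    published digests are sometimes printed in spaced, upper-case groups.
    """
    normalized = "".join(expected_hex.split()).lower()
    return file_digest(path, algorithm) == normalized
```

A call such as `digest_matches("spark-1.5.0-bin-hadoop2.6.tgz", expected)` would then return `True` only when the local file is byte-for-byte what the digest describes.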

The staging repository for this release (published as 1.5.0-rc2) can be
found at:
https://repository.apache.org/content/repositories/orgapachespark-1141/

The staging repository for this release (published as 1.5.0) can be found
at:
https://repository.apache.org/content/repositories/orgapachespark-1140/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-docs/


=======================================
How can I help test this release?
=======================================
If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions.


================================================
What justifies a -1 vote for this release?
================================================
This vote is happening towards the end of the 1.5 QA period, so -1 votes
should only occur for significant regressions from 1.4. Bugs already
present in 1.4, minor regressions, or bugs related to new features will not
block this release.


===============================================================
What should happen to JIRA tickets still targeting 1.5.0?
===============================================================
1. It is OK for documentation patches to target 1.5.0 and still go into
branch-1.5, since documentation will be packaged separately from the
release.
2. New features for non-alpha-modules should target 1.6+.
3. Non-blocker bug fixes should target 1.5.1 or 1.6.0, or drop the target
version.


==================================================
Major changes to help you focus your testing
==================================================

As of today, Spark 1.5 contains more than 1000 commits from 220+
contributors. I've curated a list of important changes for 1.5. For the
complete list, please refer to the Apache JIRA changelog.

RDD/DataFrame/SQL APIs

- New UDAF interface
- DataFrame hints for broadcast join
- expr function for turning a SQL expression into a DataFrame column
- Improved support for NaN values
- StructType now supports ordering
- TimestampType precision is reduced to 1us
- 100 new built-in expressions, including date/time, string, math
- memory and local disk only checkpointing

DataFrame/SQL Backend Execution

- Code generation on by default
- Improved join, aggregation, shuffle, sorting with cache friendly
algorithms and external algorithms
- Improved window function performance
- Better metrics instrumentation and reporting for DF/SQL execution plans

Data Sources, Hive, Hadoop, Mesos and Cluster Management

- Dynamic allocation support in all resource managers (Mesos, YARN,
Standalone)
- Improved Mesos support (framework authentication, roles, dynamic
allocation, constraints)
- Improved YARN support (dynamic allocation with preferred locations)
- Improved Hive support (metastore partition pruning, connectivity to
metastore versions 0.13 through 1.2, internal Hive upgrade to 1.2)
- Support persisting data in Hive compatible format in metastore
- Support data partitioning for JSON data sources
- Parquet improvements (upgrade to 1.7, predicate pushdown, faster metadata
discovery and schema merging, support reading non-standard legacy Parquet
files generated by other libraries)
- Faster and more robust dynamic partition insert
- DataSourceRegister interface for external data sources to specify short
names

SparkR

- YARN cluster mode in R
- GLMs with R formula, binomial/Gaussian families, and elastic-net
regularization
- Improved error messages
- Aliases to make DataFrame functions more R-like

Streaming

- Backpressure for handling bursty input streams.
- Improved Python support for streaming sources (Kafka offsets, Kinesis,
MQTT, Flume)
- Improved Python streaming machine learning algorithms (K-Means, linear
regression, logistic regression)
- Native reliable Kinesis stream support
- Input metadata like Kafka offsets made visible in the batch details UI
- Better load balancing and scheduling of receivers across cluster
- Include streaming storage in web UI

Machine Learning and Advanced Analytics

- Feature transformers: CountVectorizer, Discrete Cosine transformation,
MinMaxScaler, NGram, PCA, RFormula, StopWordsRemover, and VectorSlicer.
- Estimators under pipeline APIs: naive Bayes, k-means, and isotonic
regression.
- Algorithms: multilayer perceptron classifier, PrefixSpan for sequential
pattern mining, association rule generation, 1-sample Kolmogorov-Smirnov
test.
- Improvements to existing algorithms: LDA, trees/ensembles, GMMs
- More efficient Pregel API implementation for GraphX
- Model summary for linear and logistic regression.
- Python API: distributed matrices, streaming k-means and linear models,
LDA, power iteration clustering, etc.
- Tuning and evaluation: train-validation split and multiclass
classification evaluator.
- Documentation: document the release version of public API methods

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

Posted by Calvin Jia <ji...@gmail.com>.

+1, tested that 1.5.0-RC2 works with Tachyon 0.7.1 as external block store.

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

Posted by Reynold Xin <rx...@databricks.com>.

The Scala 2.11 issue should be fixed, but doesn't need to be a blocker,
since Maven builds fine. The sbt build is more aggressive to make sure we
catch warnings.



On Wed, Aug 26, 2015 at 10:01 AM, Sean Owen <so...@cloudera.com> wrote:

> My quick take: no blockers at this point, except for one potential
> issue. Still some 'critical' bugs worth a look. The release seems to
> pass tests but I get a lot of spurious failures; it took about 16
> hours of running tests to get everything to pass at least once.
>
>
> Current score: 56 issues targeted at 1.5.0, of which 14 bugs, of which
> no blockers and 8 critical.
>
> This one might be a blocker as it seems to mean that SBT + Scala 2.11
> does not compile:
> https://issues.apache.org/jira/browse/SPARK-10227
>
> pretty simple issue, but weigh in on the PR:
> https://github.com/apache/spark/pull/8433
>
> For reference here are the Critical ones:
>
> Key | Component | Summary | Assignee
> SPARK-6484 | Spark Core | Ganglia metrics xml reporter doesn't escape correctly | Josh Rosen
> SPARK-6701 | Tests, YARN | Flaky test: o.a.s.deploy.yarn.YarnClusterSuite Python application | -
> SPARK-7420 | Tests | Flaky test: o.a.s.streaming.JobGeneratorSuite "Do not clear received block data too soon" | Tathagata Das
> SPARK-8119 | Spark Core | HeartbeatReceiver should not adjust application executor resources | Andrew Or
> SPARK-8414 | Spark Core | Ensure ContextCleaner actually triggers clean ups | Andrew Or
> SPARK-8447 | Shuffle | Test external shuffle service with all shuffle managers | -
> SPARK-10224 | Streaming | BlockGenerator may lost data in the last block | -
> SPARK-10287 | SQL | After processing a query using JSON data, Spark SQL continuously refreshes metadata of the table | -
> Total: 8 issues
>
>
> I'm seeing the following tests fail intermittently, with "-Phive
> -Phive-thriftserver -Phadoop-2.6" on Ubuntu 15 / Java 7:
>
> - security mismatch password *** FAILED ***
>   Expected exception java.io.IOException to be thrown, but java.nio.channels.CancelledKeyException was thrown. (ConnectionManagerSuite.scala:123)
>
>
> DAGSchedulerSuite:
> ...
> - misbehaved resultHandler should not crash DAGScheduler and SparkContext *** FAILED ***
>   java.lang.UnsupportedOperationException: taskSucceeded() called on a finished JobWaiter was not instance of org.apache.spark.scheduler.DAGSchedulerSuiteDummyException (DAGSchedulerSuite.scala:861)
>
> HeartbeatReceiverSuite:
> ...
> - normal heartbeat *** FAILED ***
>   3 did not equal 2 (HeartbeatReceiverSuite.scala:104)
>
>
> - Unpersisting HttpBroadcast on executors only in distributed mode *** FAILED ***
>   ...
> - Unpersisting HttpBroadcast on executors and driver in distributed mode *** FAILED ***
>   ...
> - Unpersisting TorrentBroadcast on executors only in distributed mode *** FAILED ***
>   ...
> - Unpersisting TorrentBroadcast on executors and driver in distributed mode *** FAILED ***
>
>
> StreamingContextSuite:
> ...
> - stop gracefully *** FAILED ***
>   1749735 did not equal 1190429 Received records = 1749735, processed records = 1190428 (StreamingContextSuite.scala:279)
>
>
> DirectKafkaStreamSuite:
> - offset recovery *** FAILED ***
>   The code passed to eventually never returned normally. Attempted 193 times over 10.010808486 seconds. Last failure message: strings.forall({
>     ((elem: Any) => DirectKafkaStreamSuite.collectedData.contains(elem))
>   }) was false. (DirectKafkaStreamSuite.scala:249)
>

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

Posted by Sean Owen <so...@cloudera.com>.

My quick take: no blockers at this point, except for one potential
issue. Still some 'critical' bugs worth a look. The release seems to
pass tests but I get a lot of spurious failures; it took about 16
hours of running tests to get everything to pass at least once.


Current score: 56 issues targeted at 1.5.0, of which 14 bugs, of which
no blockers and 8 critical.

This one might be a blocker as it seems to mean that SBT + Scala 2.11
does not compile:
https://issues.apache.org/jira/browse/SPARK-10227

pretty simple issue, but weigh in on the PR:
https://github.com/apache/spark/pull/8433

For reference here are the Critical ones:

Key | Component | Summary | Assignee
SPARK-6484 | Spark Core | Ganglia metrics xml reporter doesn't escape correctly | Josh Rosen
SPARK-6701 | Tests, YARN | Flaky test: o.a.s.deploy.yarn.YarnClusterSuite Python application | -
SPARK-7420 | Tests | Flaky test: o.a.s.streaming.JobGeneratorSuite "Do not clear received block data too soon" | Tathagata Das
SPARK-8119 | Spark Core | HeartbeatReceiver should not adjust application executor resources | Andrew Or
SPARK-8414 | Spark Core | Ensure ContextCleaner actually triggers clean ups | Andrew Or
SPARK-8447 | Shuffle | Test external shuffle service with all shuffle managers | -
SPARK-10224 | Streaming | BlockGenerator may lost data in the last block | -
SPARK-10287 | SQL | After processing a query using JSON data, Spark SQL continuously refreshes metadata of the table | -
Total: 8 issues


I'm seeing the following tests fail intermittently, with "-Phive
-Phive-thriftserver -Phadoop-2.6" on Ubuntu 15 / Java 7:

- security mismatch password *** FAILED ***
  Expected exception java.io.IOException to be thrown, but java.nio.channels.CancelledKeyException was thrown. (ConnectionManagerSuite.scala:123)


DAGSchedulerSuite:
...
- misbehaved resultHandler should not crash DAGScheduler and SparkContext *** FAILED ***
  java.lang.UnsupportedOperationException: taskSucceeded() called on a finished JobWaiter was not instance of org.apache.spark.scheduler.DAGSchedulerSuiteDummyException (DAGSchedulerSuite.scala:861)

HeartbeatReceiverSuite:
...
- normal heartbeat *** FAILED ***
  3 did not equal 2 (HeartbeatReceiverSuite.scala:104)


- Unpersisting HttpBroadcast on executors only in distributed mode *** FAILED ***
  ...
- Unpersisting HttpBroadcast on executors and driver in distributed mode *** FAILED ***
  ...
- Unpersisting TorrentBroadcast on executors only in distributed mode *** FAILED ***
  ...
- Unpersisting TorrentBroadcast on executors and driver in distributed mode *** FAILED ***


StreamingContextSuite:
...
- stop gracefully *** FAILED ***
  1749735 did not equal 1190429 Received records = 1749735, processed records = 1190428 (StreamingContextSuite.scala:279)


DirectKafkaStreamSuite:
- offset recovery *** FAILED ***
  The code passed to eventually never returned normally. Attempted 193 times over 10.010808486 seconds. Last failure message: strings.forall({
    ((elem: Any) => DirectKafkaStreamSuite.collectedData.contains(elem))
  }) was false. (DirectKafkaStreamSuite.scala:249)
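
For readers unfamiliar with the DirectKafkaStreamSuite message above: it comes from a polling assertion that retries a check until a deadline passes, which is why timing-sensitive tests can fail intermittently on a slow machine. A rough Python sketch of that pattern follows. It is not the test framework's actual implementation; the helper name `eventually` only mirrors the message, and the `timeout` and `interval` parameters are assumptions.

```python
import time


def eventually(condition, timeout: float = 10.0, interval: float = 0.05) -> int:
    """Poll `condition` until it returns a truthy value or `timeout` seconds elapse.

    Returns the number of attempts made. Raises TimeoutError if the
    condition never held, reporting the attempts and the elapsed time.
    """
    start = time.monotonic()
    attempts = 0
    while True:
        attempts += 1
        if condition():
            return attempts
        elapsed = time.monotonic() - start
        if elapsed >= timeout:
            raise TimeoutError(
                f"condition never held: attempted {attempts} times over {elapsed} seconds"
            )
        time.sleep(interval)
```

If the work being waited on simply needs longer than `timeout` on a loaded host, the check fails even though the code under test is correct.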

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

Posted by Shivaram Venkataraman <sh...@eecs.berkeley.edu>.

I've seen similar tar file warnings and in my case it was because I
was using the default tar on a MacBook. Using gnu-tar from brew made
the warnings go away.

Thanks
Shivaram

On Fri, Aug 28, 2015 at 2:37 PM, Luciano Resende <lu...@gmail.com> wrote:
> The binary archives seem to have some issues, which appear consistently on
> the few different ones (different versions of Hadoop) that I tried.
>
>  tar -xvf spark-1.5.0-bin-hadoop2.6.tgz
>
> x spark-1.5.0-bin-hadoop2.6/lib/spark-examples-1.5.0-hadoop2.6.0.jar
> x spark-1.5.0-bin-hadoop2.6/lib/spark-assembly-1.5.0-hadoop2.6.0.jar
> x spark-1.5.0-bin-hadoop2.6/lib/spark-1.5.0-yarn-shuffle.jar
> x spark-1.5.0-bin-hadoop2.6/README.md
> tar: copyfile unpack
> (spark-1.5.0-bin-hadoop2.6/python/test_support/sql/orc_partitioned/SUCCESS.crc)
> failed: No such file or directory
>
> tar tzf spark-1.5.0-bin-hadoop2.3.tgz | grep SUCCESS.crc
> spark-1.5.0-bin-hadoop2.3/python/test_support/sql/orc_partitioned/._SUCCESS.crc
>
> This seems similar to a problem the Avro release was having recently.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

Posted by Luciano Resende <lu...@gmail.com>.

The binary archives seem to have an issue, and it is consistent across the
few builds (for different versions of Hadoop) that I tried.

 tar -xvf spark-1.5.0-bin-hadoop2.6.tgz

x spark-1.5.0-bin-hadoop2.6/lib/spark-examples-1.5.0-hadoop2.6.0.jar
x spark-1.5.0-bin-hadoop2.6/lib/spark-assembly-1.5.0-hadoop2.6.0.jar
x spark-1.5.0-bin-hadoop2.6/lib/spark-1.5.0-yarn-shuffle.jar
x spark-1.5.0-bin-hadoop2.6/README.md
tar: copyfile unpack (spark-1.5.0-bin-hadoop2.6/python/test_support/sql/orc_partitioned/SUCCESS.crc) failed: No such file or directory

tar tzf spark-1.5.0-bin-hadoop2.3.tgz | grep SUCCESS.crc
spark-1.5.0-bin-hadoop2.3/python/test_support/sql/orc_partitioned/._SUCCESS.crc

This seems similar to a problem an Avro release had recently.
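
A quick way to look for entries like the one above without unpacking is sketched here. This is only a sketch, assuming the stray `._SUCCESS.crc` entry is an AppleDouble metadata file of the kind that tar on OS X can add next to real files; the helper name `find_appledouble_entries` is hypothetical and not part of any Spark tooling.

```python
import posixpath
import tarfile


def find_appledouble_entries(archive_path):
    """Return archive member names whose basename starts with '._'.

    Assumption: such names are AppleDouble metadata entries rather than
    files the release intends to ship.
    """
    # "r:*" lets tarfile detect the compression (gzip for a .tgz) itself.
    with tarfile.open(archive_path, "r:*") as archive:
        return [
            name
            for name in archive.getnames()
            if posixpath.basename(name).startswith("._")
        ]
```

Calling `find_appledouble_entries("spark-1.5.0-bin-hadoop2.3.tgz")` on a downloaded archive should list the same kind of path that the `tar tzf ... | grep` command above reports.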




-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

Posted by Reynold Xin <rx...@databricks.com>.

One small correction -- the vote closes on Saturday, Aug 29, not Friday,
Aug 29.



Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

Posted by rake <ra...@randykerber.com>.

rxin wrote
> ....
> 
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-bin/
> 
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
> 
> ....

I was looking for a version of this release based on Hadoop 2.2.  Is there
some reason there isn't one, especially since 2.2.0 is described as the
default Hadoop version?

-- Randy Kerber



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-5-0-RC2-tp13826p13828.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

Posted by Luc Bourlier <lu...@typesafe.com>.

- Tested the backpressure/rate control in streaming. It works as
expected.
- There is a problem with the Scala 2.11 sbt build:
https://issues.apache.org/jira/browse/SPARK-10227

Luc Bourlier
*Spark Team  - Typesafe, Inc.*
luc.bourlier@typesafe.com

<http://www.typesafe.com>


Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

Posted by Yin Huai <yh...@databricks.com>.

-1

Found a problem when reading partitioned tables. Right now, we may create a
SQL project/filter operator for every partition. When there are thousands of
partitions, this produces a huge number of SQLMetrics (accumulators), which
puts high memory pressure on the driver and then takes down the cluster
(long GC pauses cause various kinds of timeouts).

https://issues.apache.org/jira/browse/SPARK-10339

Will have a fix soon.
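
To make the scaling concrete, here is a back-of-the-envelope sketch in Python. It is not Spark code. The two operators per partition stand for the project/filter pair mentioned above; `metrics_per_operator`, the partition counts, and the "shared" comparison plan are hypothetical values chosen for illustration, not measurements and not a description of the actual fix.

```python
def per_partition_metric_count(num_partitions, operators_per_partition=2,
                               metrics_per_operator=1):
    """Accumulators the driver tracks if operators are created per partition."""
    return num_partitions * operators_per_partition * metrics_per_operator


def shared_metric_count(operators=2, metrics_per_operator=1):
    """Accumulators if one project/filter pair served every partition
    (a hypothetical plan used only for comparison)."""
    return operators * metrics_per_operator


# The per-partition count grows linearly with the number of partitions,
# while the shared count stays constant.
for partitions in (10, 1000, 100000):
    print(partitions,
          per_partition_metric_count(partitions),
          shared_metric_count())
```

Under these assumptions the driver-side count grows linearly with the number of partitions, which is consistent with the memory pressure described above.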


Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

Posted by Jon Bender <jo...@gmail.com>.

Marcelo,

Thanks for replying -- after looking at my test again, I realized I had
misinterpreted another, unrelated issue I'm seeing (note that I'm not using a
pre-built binary; I had to build my own with YARN/Hive support because I want
to use it on an older cluster, CDH5.1.0).

I can start up a pyspark app on YARN, so I don't want to block this.  +1

Best,
Jonathan


Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

Posted by Marcelo Vanzin <va...@cloudera.com>.

Hi Jonathan,

Can you be more specific about what problem you're running into?

SPARK-6869 fixed the issue of pyspark vs. assembly jar by shipping the
pyspark archives separately to YARN. With that fix in place, pyspark
doesn't need to get anything from the Spark assembly, so it has no
problems running on YARN. I just downloaded
spark-1.5.0-bin-hadoop2.6.tgz and tried that out, and pyspark works
fine on YARN for me.





-- 
Marcelo


Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

Posted by Jonathan Bender <jo...@gmail.com>.

-1 for regression on PySpark + YARN support

It seems like this JIRA https://issues.apache.org/jira/browse/SPARK-7733
added a requirement for Java 7 in the build process.  Due to some quirks
with the Java archive format changes between Java 6 and 7, using PySpark
with a YARN uberjar seems to break when compiled with anything after Java 6
(see https://issues.apache.org/jira/browse/SPARK-1920 for reference).
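
One way to check whether a given assembly jar could be affected is sketched below. It assumes the quirk in question is the one I understand SPARK-1920 to describe: a jar with more than 65,535 entries is written in Zip64 format by Java 7, and Python's zip importer may be unable to read it. That reading of the JIRA, and the helper names `jar_entry_count` and `may_need_zip64`, are assumptions for illustration.

```python
import zipfile

# 65,535 is the largest entry count the classic (non-Zip64) ZIP format can record.
CLASSIC_ZIP_MAX_ENTRIES = 0xFFFF


def jar_entry_count(jar_path):
    """Count the entries in a jar (a jar is a ZIP archive)."""
    with zipfile.ZipFile(jar_path) as jar:
        return len(jar.infolist())


def may_need_zip64(jar_path):
    """True if the jar has more entries than the classic ZIP format allows."""
    return jar_entry_count(jar_path) > CLASSIC_ZIP_MAX_ENTRIES
```

Running `may_need_zip64` on a locally built assembly jar gives a quick hint as to whether the entry count alone would push the jar into Zip64 territory.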



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-5-0-RC2-tp13826p13890.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

Posted by Chester Chen <ch...@alpinenow.com>.

Thanks Sean, that makes it clear.
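
For anyone else following along, the two points in Sean's explanation quoted below can be sketched in a few lines of Python. This is a rough sketch under stated assumptions: the tag format follows the `v1.5.0-rc2` tag named in the vote email, the version bump mimics what I understand to be the Maven release plugin default (increment the last component and append `-SNAPSHOT`), and the remote name `origin` and all function names are hypothetical.

```python
def rc_tag(version, rc_number):
    """Tag name in the style used in this thread, e.g. v1.5.0-rc2."""
    return "v{}-rc{}".format(version, rc_number)


def next_snapshot(version):
    """Bump the last numeric component and append -SNAPSHOT.

    Assumption: this mirrors the release plugin's default choice of the
    next development version; it says nothing about which commits go
    into a release.
    """
    parts = version.split(".")
    parts[-1] = str(int(parts[-1]) + 1)
    return ".".join(parts) + "-SNAPSHOT"


def checkout_commands(version, rc_number):
    """Git commands to check out the exact commit an RC was cut from."""
    return [
        ["git", "fetch", "--tags", "origin"],  # "origin" is an assumed remote name
        ["git", "checkout", rc_tag(version, rc_number)],
    ]


print(rc_tag("1.5.0", 2))      # v1.5.0-rc2
print(next_snapshot("1.5.0"))  # 1.5.1-SNAPSHOT
```

The command lists can be passed one at a time to `subprocess.run` inside a Spark checkout; building from the tag rather than from the branch head sidesteps the SNAPSHOT version question entirely.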

On Tue, Sep 1, 2015 at 7:17 AM, Sean Owen <so...@cloudera.com> wrote:

> Any 1.5 RC comes from the latest state of the 1.5 branch at some point
> in time. The next RC will be cut from whatever the latest commit is.
> You can see the tags in git for the specific commits for each RC.
> There's no such thing as "1.5.1 SNAPSHOT" commits, just commits to
> branch 1.5. I would ignore the "SNAPSHOT" version for your purpose.
>
> You can always build from the exact commit that an RC did by looking
> at tags. There is no 1.5.0 yet so you can't build that, but once it's
> released, you would be able to find its tag as well. You can always
> build the latest 1.5.x branch by building from HEAD of that branch.
>
> On Tue, Sep 1, 2015 at 3:13 PM,  <ch...@alpinenow.com> wrote:
> > Thanks for the explanation. Since 1.5.0 rc3 is not yet released, I
> assume it would cut from 1.5 branch, doesn't that bring 1.5.1 snapshot code
> ?
> >
> > The reason I am asking these questions is that I would like to know If I
> want build 1.5.0  myself, which commit should I use ?
> >
> > Sent from my iPad
> >
> >> On Sep 1, 2015, at 6:57 AM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> The head of branch 1.5 will always be a "1.5.x-SNAPSHOT" version. Yeah
> >> technically you would expect it to be 1.5.0-SNAPSHOT until 1.5.0 is
> >> released. In practice I think it's simpler to follow the defaults of
> >> the Maven release plugin, which will set this to 1.5.1-SNAPSHOT after
> >> any 1.5.0-rc is released. It doesn't affect later RCs. This has
> >> nothing to do with what commits go into 1.5.0; it's an ignorable
> >> detail of the version in POMs in the source tree, which don't mean
> >> much anyway as the source tree itself is not a released version.
> >>
>

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

Posted by Sean Owen <so...@cloudera.com>.

Any 1.5 RC comes from the latest state of the 1.5 branch at some point
in time. The next RC will be cut from whatever the latest commit is.
You can see the tags in git for the specific commits for each RC.
There's no such thing as "1.5.1 SNAPSHOT" commits, just commits to
branch 1.5. I would ignore the "SNAPSHOT" version for your purpose.
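
As a rough illustration of why branch-1.5 can show 1.5.1-SNAPSHOT before 1.5.0 is out, here is a minimal Python sketch of the default version bump mentioned in the quoted message below. The function name `next_development_version` is hypothetical, and the sketch assumes plain dotted numeric versions; the real Maven release plugin handles more cases than this.

```python
def next_development_version(release_version: str) -> str:
    """Return the SNAPSHOT version that follows a release version.

    Illustrative only: increments the last numeric component and
    appends "-SNAPSHOT", e.g. "1.5.0" -> "1.5.1-SNAPSHOT".
    """
    parts = release_version.split(".")
    parts[-1] = str(int(parts[-1]) + 1)
    return ".".join(parts) + "-SNAPSHOT"


# The branch POM moves on once the first 1.5.0 RC has been cut, but later
# RCs are still released as 1.5.0: the release version is chosen at
# release time, not read from the branch's SNAPSHOT version.
print(next_development_version("1.5.0"))
```

The SNAPSHOT string only labels the development state of the branch; it says nothing about which commits end up in 1.5.0.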

You can always build from the exact commit that an RC did by looking
at tags. There is no 1.5.0 yet so you can't build that, but once it's
released, you would be able to find its tag as well. You can always
build the latest 1.5.x branch by building from HEAD of that branch.
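
To make the tag-based build concrete, here is a hedged Python sketch that assembles the commands for checking out an RC tag and building it. The tag name `v1.5.0-rc2` and the Maven flags come from this thread; the clone URL is derived from the GitHub link in the vote email, and the helper name `rc_build_commands` and the `spark` checkout directory are assumptions. The function only returns the commands; it does not run them.

```python
def rc_build_commands(tag, repo="https://github.com/apache/spark.git",
                      workdir="spark"):
    """Return the commands (as argument lists) to build Spark from a git tag."""
    return [
        # Fetch the repository, including all release-candidate tags.
        ["git", "clone", repo, workdir],
        # Check out the exact commit the RC was cut from.
        ["git", "-C", workdir, "checkout", tag],
        # Build with the profiles one of the testers in this thread used.
        ["mvn", "-f", workdir + "/pom.xml", "clean", "package",
         "-Pyarn", "-Phadoop-2.6", "-DskipTests"],
    ]


for command in rc_build_commands("v1.5.0-rc2"):
    print(" ".join(command))
```

Each argument list could be passed to `subprocess.run(command, check=True)` to actually run the build. Building HEAD of the branch instead would mean checking out `branch-1.5` rather than a tag.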

On Tue, Sep 1, 2015 at 3:13 PM,  <ch...@alpinenow.com> wrote:
> Thanks for the explanation. Since 1.5.0 rc3 is not yet released, I assume it will be cut from the 1.5 branch. Doesn't that bring in 1.5.1 snapshot code?
>
> The reason I am asking is that I would like to know which commit to use if I want to build 1.5.0 myself.
>
>> On Sep 1, 2015, at 6:57 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>> The head of branch 1.5 will always be a "1.5.x-SNAPSHOT" version. Yeah
>> technically you would expect it to be 1.5.0-SNAPSHOT until 1.5.0 is
>> released. In practice I think it's simpler to follow the defaults of
>> the Maven release plugin, which will set this to 1.5.1-SNAPSHOT after
>> any 1.5.0-rc is released. It doesn't affect later RCs. This has
>> nothing to do with what commits go into 1.5.0; it's an ignorable
>> detail of the version in POMs in the source tree, which don't mean
>> much anyway as the source tree itself is not a released version.
>>

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

Posted by ch...@alpinenow.com.

Thanks for the explanation. Since 1.5.0 rc3 is not yet released, I assume it will be cut from the 1.5 branch. Doesn't that bring in 1.5.1 snapshot code?

The reason I am asking is that I would like to know which commit to use if I want to build 1.5.0 myself.

> On Sep 1, 2015, at 6:57 AM, Sean Owen <so...@cloudera.com> wrote:
> 
> The head of branch 1.5 will always be a "1.5.x-SNAPSHOT" version. Yeah
> technically you would expect it to be 1.5.0-SNAPSHOT until 1.5.0 is
> released. In practice I think it's simpler to follow the defaults of
> the Maven release plugin, which will set this to 1.5.1-SNAPSHOT after
> any 1.5.0-rc is released. It doesn't affect later RCs. This has
> nothing to do with what commits go into 1.5.0; it's an ignorable
> detail of the version in POMs in the source tree, which don't mean
> much anyway as the source tree itself is not a released version.
> 
>> On Tue, Sep 1, 2015 at 2:48 PM,  <ch...@alpinenow.com> wrote:
>> Sorry, I still don't follow. I assume the release would be built from 1.5.0 before moving to 1.5.1. Are you saying that 1.5.0 rc3 could be built from the 1.5.1 snapshot during the release? Or would 1.5.0 rc3 be built from the last commit of 1.5.0 (before the change to the 1.5.1 snapshot)?
>>
>>> On Sep 1, 2015, at 1:52 AM, Sean Owen <so...@cloudera.com> wrote:
>>> 
>>> That's correct for the 1.5 branch, right? This doesn't mean that the
>>> next RC would have this value. You choose the release version during
>>> the release process.
>>> 
>>>> On Tue, Sep 1, 2015 at 2:40 AM, Chester Chen <ch...@alpinenow.com> wrote:
>>>> It seems that the GitHub branch-1.5 has already changed its version to 1.5.1-SNAPSHOT.
>>>>
>>>> I am a bit confused: are we still on 1.5.0 RC3, or are we on 1.5.1?
>>>> 
>>>> Chester
>>>> 
>>>>> On Mon, Aug 31, 2015 at 3:52 PM, Reynold Xin <rx...@databricks.com> wrote:
>>>>> 
>>>>> I'm going to -1 the release myself since the issue @yhuai identified is
>>>>> pretty serious. It basically OOMs the driver for reading any files with a
>>>>> large number of partitions. Looks like the patch for that has already been
>>>>> merged.
>>>>> 
>>>>> I'm going to cut rc3 momentarily.
>>>>> 
>>>>> 
>>>>> On Sun, Aug 30, 2015 at 11:30 AM, Sandy Ryza <sa...@cloudera.com>
>>>>> wrote:
>>>>>> 
>>>>>> +1 (non-binding)
>>>>>> built from source and ran some jobs against YARN
>>>>>> 
>>>>>> -Sandy
>>>>>> 
>>>>>> On Sat, Aug 29, 2015 at 5:50 AM, vaquar khan <va...@gmail.com>
>>>>>> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> +1 (1.5.0 RC2). Compiled on Windows with YARN.
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Vaquar khan
>>>>>>> 
>>>>>>> +1 (non-binding, of course)
>>>>>>> 
>>>>>>> 1. Compiled OSX 10.10 (Yosemite) OK Total time: 42:36 min
>>>>>>>    mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
>>>>>>> 2. Tested pyspark, mllib
>>>>>>> 2.1. statistics (min,max,mean,Pearson,Spearman) OK
>>>>>>> 2.2. Linear/Ridge/Lasso Regression OK
>>>>>>> 2.3. Decision Tree, Naive Bayes OK
>>>>>>> 2.4. KMeans OK
>>>>>>>      Center And Scale OK
>>>>>>> 2.5. RDD operations OK
>>>>>>>     State of the Union Texts - MapReduce, Filter,sortByKey (word
>>>>>>> count)
>>>>>>> 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
>>>>>>>      Model evaluation/optimization (rank, numIter, lambda) with
>>>>>>> itertools OK
>>>>>>> 3. Scala - MLlib
>>>>>>> 3.1. statistics (min,max,mean,Pearson,Spearman) OK
>>>>>>> 3.2. LinearRegressionWithSGD OK
>>>>>>> 3.3. Decision Tree OK
>>>>>>> 3.4. KMeans OK
>>>>>>> 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
>>>>>>> 3.6. saveAsParquetFile OK
>>>>>>> 3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
>>>>>>> registerTempTable, sql OK
>>>>>>> 3.8. result = sqlContext.sql("SELECT
>>>>>>> OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
>>>>>>> JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
>>>>>>> 4.0. Spark SQL from Python OK
>>>>>>> 4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'")
>>>>>>> OK
>>>>>>> 5.0. Packages
>>>>>>> 5.1. com.databricks.spark.csv - read/write OK
>>>>>>> (--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work. But
>>>>>>> com.databricks:spark-csv_2.11:1.2.0 worked)
>>>>>>> 6.0. DataFrames
>>>>>>> 6.1. cast,dtypes OK
>>>>>>> 6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
>>>>>>> 6.3. joins,sql,set operations,udf OK
>>>>>>> 
>>>>>>> Cheers
>>>>>>> <k/>
>>>>>>> 
>>>>>>> On Tue, Aug 25, 2015 at 9:28 PM, Reynold Xin <rx...@databricks.com>
>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>>>> version 1.5.0. The vote is open until Friday, Aug 29, 2015 at 5:00 UTC and
>>>>>>>> passes if a majority of at least 3 +1 PMC votes are cast.
>>>>>>>> 
>>>>>>>> [ ] +1 Release this package as Apache Spark 1.5.0
>>>>>>>> [ ] -1 Do not release this package because ...
>>>>>>>> 
>>>>>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>>>>> 
>>>>>>>> 
>>>>>>>> The tag to be voted on is v1.5.0-rc2:
>>>>>>>> 
>>>>>>>> https://github.com/apache/spark/tree/727771352855dbb780008c449a877f5aaa5fc27a
>>>>>>>> 
>>>>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-bin/
>>>>>>>> 
>>>>>>>> Release artifacts are signed with the following key:
>>>>>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>>>>> 
>>>>>>>> The staging repository for this release (published as 1.5.0-rc2) can be
>>>>>>>> found at:
>>>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1141/
>>>>>>>> 
>>>>>>>> The staging repository for this release (published as 1.5.0) can be
>>>>>>>> found at:
>>>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1140/
>>>>>>>> 
>>>>>>>> The documentation corresponding to this release can be found at:
>>>>>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-docs/
>>>>>>>> 
>>>>>>>> 
>>>>>>>> =======================================
>>>>>>>> How can I help test this release?
>>>>>>>> =======================================
>>>>>>>> If you are a Spark user, you can help us test this release by taking an
>>>>>>>> existing Spark workload and running on this release candidate, then
>>>>>>>> reporting any regressions.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> ================================================
>>>>>>>> What justifies a -1 vote for this release?
>>>>>>>> ================================================
>>>>>>>> This vote is happening towards the end of the 1.5 QA period, so -1
>>>>>>>> votes should only occur for significant regressions from 1.4. Bugs already
>>>>>>>> present in 1.4, minor regressions, or bugs related to new features will not
>>>>>>>> block this release.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> ===============================================================
>>>>>>>> What should happen to JIRA tickets still targeting 1.5.0?
>>>>>>>> ===============================================================
>>>>>>>> 1. It is OK for documentation patches to target 1.5.0 and still go into
>>>>>>>> branch-1.5, since documentations will be packaged separately from the
>>>>>>>> release.
>>>>>>>> 2. New features for non-alpha-modules should target 1.6+.
>>>>>>>> 3. Non-blocker bug fixes should target 1.5.1 or 1.6.0, or drop the
>>>>>>>> target version.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> ==================================================
>>>>>>>> Major changes to help you focus your testing
>>>>>>>> ==================================================
>>>>>>>> 
>>>>>>>> As of today, Spark 1.5 contains more than 1000 commits from 220+
>>>>>>>> contributors. I've curated a list of important changes for 1.5. For the
>>>>>>>> complete list, please refer to Apache JIRA changelog.
>>>>>>>> 
>>>>>>>> RDD/DataFrame/SQL APIs
>>>>>>>> 
>>>>>>>> - New UDAF interface
>>>>>>>> - DataFrame hints for broadcast join
>>>>>>>> - expr function for turning a SQL expression into DataFrame column
>>>>>>>> - Improved support for NaN values
>>>>>>>> - StructType now supports ordering
>>>>>>>> - TimestampType precision is reduced to 1us
>>>>>>>> - 100 new built-in expressions, including date/time, string, math
>>>>>>>> - memory and local disk only checkpointing
>>>>>>>> 
>>>>>>>> DataFrame/SQL Backend Execution
>>>>>>>> 
>>>>>>>> - Code generation on by default
>>>>>>>> - Improved join, aggregation, shuffle, sorting with cache friendly
>>>>>>>> algorithms and external algorithms
>>>>>>>> - Improved window function performance
>>>>>>>> - Better metrics instrumentation and reporting for DF/SQL execution
>>>>>>>> plans
>>>>>>>> 
>>>>>>>> Data Sources, Hive, Hadoop, Mesos and Cluster Management
>>>>>>>> 
>>>>>>>> - Dynamic allocation support in all resource managers (Mesos, YARN,
>>>>>>>> Standalone)
>>>>>>>> - Improved Mesos support (framework authentication, roles, dynamic
>>>>>>>> allocation, constraints)
>>>>>>>> - Improved YARN support (dynamic allocation with preferred locations)
>>>>>>>> - Improved Hive support (metastore partition pruning, metastore
>>>>>>>> connectivity to 0.13 to 1.2, internal Hive upgrade to 1.2)
>>>>>>>> - Support persisting data in Hive compatible format in metastore
>>>>>>>> - Support data partitioning for JSON data sources
>>>>>>>> - Parquet improvements (upgrade to 1.7, predicate pushdown, faster
>>>>>>>> metadata discovery and schema merging, support reading non-standard legacy
>>>>>>>> Parquet files generated by other libraries)
>>>>>>>> - Faster and more robust dynamic partition insert
>>>>>>>> - DataSourceRegister interface for external data sources to specify
>>>>>>>> short names
>>>>>>>> 
>>>>>>>> SparkR
>>>>>>>> 
>>>>>>>> - YARN cluster mode in R
>>>>>>>> - GLMs with R formula, binomial/Gaussian families, and elastic-net
>>>>>>>> regularization
>>>>>>>> - Improved error messages
>>>>>>>> - Aliases to make DataFrame functions more R-like
>>>>>>>> 
>>>>>>>> Streaming
>>>>>>>> 
>>>>>>>> - Backpressure for handling bursty input streams.
>>>>>>>> - Improved Python support for streaming sources (Kafka offsets,
>>>>>>>> Kinesis, MQTT, Flume)
>>>>>>>> - Improved Python streaming machine learning algorithms (K-Means,
>>>>>>>> linear regression, logistic regression)
>>>>>>>> - Native reliable Kinesis stream support
>>>>>>>> - Input metadata like Kafka offsets made visible in the batch details
>>>>>>>> UI
>>>>>>>> - Better load balancing and scheduling of receivers across cluster
>>>>>>>> - Include streaming storage in web UI
>>>>>>>> 
>>>>>>>> Machine Learning and Advanced Analytics
>>>>>>>> 
>>>>>>>> - Feature transformers: CountVectorizer, Discrete Cosine
>>>>>>>> transformation, MinMaxScaler, NGram, PCA, RFormula, StopWordsRemover, and
>>>>>>>> VectorSlicer.
>>>>>>>> - Estimators under pipeline APIs: naive Bayes, k-means, and isotonic
>>>>>>>> regression.
>>>>>>>> - Algorithms: multilayer perceptron classifier, PrefixSpan for
>>>>>>>> sequential pattern mining, association rule generation, 1-sample
>>>>>>>> Kolmogorov-Smirnov test.
>>>>>>>> - Improvements to existing algorithms: LDA, trees/ensembles, GMMs
>>>>>>>> - More efficient Pregel API implementation for GraphX
>>>>>>>> - Model summary for linear and logistic regression.
>>>>>>>> - Python API: distributed matrices, streaming k-means and linear
>>>>>>>> models, LDA, power iteration clustering, etc.
>>>>>>>> - Tuning and evaluation: train-validation split and multiclass
>>>>>>>> classification evaluator.
>>>>>>>> - Documentation: document the release version of public API methods
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

Posted by Sean Owen <so...@cloudera.com>.

The head of branch 1.5 will always be a "1.5.x-SNAPSHOT" version. Yes,
technically you would expect it to be 1.5.0-SNAPSHOT until 1.5.0 is
released. In practice I think it's simpler to follow the defaults of
the Maven release plugin, which will set this to 1.5.1-SNAPSHOT after
any 1.5.0-rc is released. It doesn't affect later RCs. This has
nothing to do with what commits go into 1.5.0; it's an ignorable
detail of the version in POMs in the source tree, which don't mean
much anyway as the source tree itself is not a released version.
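
As a rough illustration of that default (a sketch of the usual
convention, not the plugin's actual code; the function name is made up
for this example), the next development version is the release version
with its last numeric component incremented and -SNAPSHOT appended:

```python
import re


def next_development_version(release_version: str) -> str:
    """Return the development version that conventionally follows a release.

    Hypothetical helper: it mimics the common default of bumping the last
    numeric component and appending -SNAPSHOT. It is not Maven code.
    """
    match = re.fullmatch(r"(\d+(?:\.\d+)*)\.(\d+)", release_version)
    if match is None:
        raise ValueError(f"unsupported version: {release_version!r}")
    prefix, last = match.groups()
    return f"{prefix}.{int(last) + 1}-SNAPSHOT"


# The release version is picked at release time, so every 1.5.0 RC uses
# "1.5.0" here, no matter what the branch POMs currently say.
print(next_development_version("1.5.0"))
```

So tagging any 1.5.0 RC leaves the branch POMs at 1.5.1-SNAPSHOT, while
a later RC is still released as 1.5.0.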

On Tue, Sep 1, 2015 at 2:48 PM, <ch...@alpinenow.com> wrote:
> Sorry, I still don't follow. I assume the release would be built from 1.5.0 before the branch moved to 1.5.1. Are you saying that 1.5.0 rc3 could be built from the 1.5.1 snapshot during the release? Or would 1.5.0 rc3 be built from the last commit of 1.5.0 (before the change to the 1.5.1 snapshot)?
>
>
>> On Sep 1, 2015, at 1:52 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>> That's correct for the 1.5 branch, right? This doesn't mean that the
>> next RC would have this value. You choose the release version during
>> the release process.
>>
>>> On Tue, Sep 1, 2015 at 2:40 AM, Chester Chen <ch...@alpinenow.com> wrote:
>>> It seems that GitHub branch-1.5 has already changed the version to 1.5.1-SNAPSHOT.
>>>
>>> I am a bit confused: are we still on 1.5.0 RC3, or are we on 1.5.1?
>>>
>>> Chester
>>>
>>>> On Mon, Aug 31, 2015 at 3:52 PM, Reynold Xin <rx...@databricks.com> wrote:
>>>>
>>>> I'm going to -1 the release myself since the issue @yhuai identified is
>>>> pretty serious. It basically OOMs the driver for reading any files with a
>>>> large number of partitions. Looks like the patch for that has already been
>>>> merged.
>>>>
>>>> I'm going to cut rc3 momentarily.
>>>>
>>>>
>>>> On Sun, Aug 30, 2015 at 11:30 AM, Sandy Ryza <sa...@cloudera.com>
>>>> wrote:
>>>>>
>>>>> +1 (non-binding)
>>>>> built from source and ran some jobs against YARN
>>>>>
>>>>> -Sandy
>>>>>
>>>>> On Sat, Aug 29, 2015 at 5:50 AM, vaquar khan <va...@gmail.com>
>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> +1 (1.5.0 RC2) Compiled on Windows with YARN.
>>>>>>
>>>>>> Regards,
>>>>>> Vaquar khan
>>>>>>
>>>>>> +1 (non-binding, of course)
>>>>>>
>>>>>> 1. Compiled OSX 10.10 (Yosemite) OK Total time: 42:36 min
>>>>>>     mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
>>>>>> 2. Tested pyspark, mllib
>>>>>> 2.1. statistics (min,max,mean,Pearson,Spearman) OK
>>>>>> 2.2. Linear/Ridge/Lasso Regression OK
>>>>>> 2.3. Decision Tree, Naive Bayes OK
>>>>>> 2.4. KMeans OK
>>>>>>       Center And Scale OK
>>>>>> 2.5. RDD operations OK
>>>>>>      State of the Union Texts - MapReduce, Filter,sortByKey (word
>>>>>> count)
>>>>>> 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
>>>>>>       Model evaluation/optimization (rank, numIter, lambda) with
>>>>>> itertools OK
>>>>>> 3. Scala - MLlib
>>>>>> 3.1. statistics (min,max,mean,Pearson,Spearman) OK
>>>>>> 3.2. LinearRegressionWithSGD OK
>>>>>> 3.3. Decision Tree OK
>>>>>> 3.4. KMeans OK
>>>>>> 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
>>>>>> 3.6. saveAsParquetFile OK
>>>>>> 3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
>>>>>> registerTempTable, sql OK
>>>>>> 3.8. result = sqlContext.sql("SELECT
>>>>>> OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
>>>>>> JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
>>>>>> 4.0. Spark SQL from Python OK
>>>>>> 4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'")
>>>>>> OK
>>>>>> 5.0. Packages
>>>>>> 5.1. com.databricks.spark.csv - read/write OK
>>>>>> (--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work. But
>>>>>> com.databricks:spark-csv_2.11:1.2.0 worked)
>>>>>> 6.0. DataFrames
>>>>>> 6.1. cast,dtypes OK
>>>>>> 6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
>>>>>> 6.3. joins,sql,set operations,udf OK
>>>>>>
>>>>>> Cheers
>>>>>> <k/>
>>>>>>


Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

Posted by ch...@alpinenow.com.

Sorry, I still don't follow. I assume the release would be built from 1.5.0 before the branch moved to 1.5.1. Are you saying that 1.5.0 rc3 could be built from the 1.5.1 snapshot during the release? Or would 1.5.0 rc3 be built from the last commit of 1.5.0 (before the change to the 1.5.1 snapshot)?


> On Sep 1, 2015, at 1:52 AM, Sean Owen <so...@cloudera.com> wrote:
> 
> That's correct for the 1.5 branch, right? This doesn't mean that the
> next RC would have this value. You choose the release version during
> the release process.
> 
>> On Tue, Sep 1, 2015 at 2:40 AM, Chester Chen <ch...@alpinenow.com> wrote:
>> It seems that GitHub branch-1.5 has already changed the version to 1.5.1-SNAPSHOT.
>>
>> I am a bit confused: are we still on 1.5.0 RC3, or are we on 1.5.1?
>> 
>> Chester
>> 
>>> On Mon, Aug 31, 2015 at 3:52 PM, Reynold Xin <rx...@databricks.com> wrote:
>>> 
>>> I'm going to -1 the release myself since the issue @yhuai identified is
>>> pretty serious. It basically OOMs the driver for reading any files with a
>>> large number of partitions. Looks like the patch for that has already been
>>> merged.
>>> 
>>> I'm going to cut rc3 momentarily.
>>> 
>>> 
>>> On Sun, Aug 30, 2015 at 11:30 AM, Sandy Ryza <sa...@cloudera.com>
>>> wrote:
>>>> 
>>>> +1 (non-binding)
>>>> built from source and ran some jobs against YARN
>>>> 
>>>> -Sandy
>>>> 
>>>> On Sat, Aug 29, 2015 at 5:50 AM, vaquar khan <va...@gmail.com>
>>>> wrote:
>>>>> 
>>>>> 
>>>>> +1 (1.5.0 RC2) Compiled on Windows with YARN.
>>>>> 
>>>>> Regards,
>>>>> Vaquar khan
>>>>> 
>>>>> +1 (non-binding, of course)
>>>>> 
>>>>> 1. Compiled OSX 10.10 (Yosemite) OK Total time: 42:36 min
>>>>>     mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
>>>>> 2. Tested pyspark, mllib
>>>>> 2.1. statistics (min,max,mean,Pearson,Spearman) OK
>>>>> 2.2. Linear/Ridge/Lasso Regression OK
>>>>> 2.3. Decision Tree, Naive Bayes OK
>>>>> 2.4. KMeans OK
>>>>>       Center And Scale OK
>>>>> 2.5. RDD operations OK
>>>>>      State of the Union Texts - MapReduce, Filter,sortByKey (word
>>>>> count)
>>>>> 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
>>>>>       Model evaluation/optimization (rank, numIter, lambda) with
>>>>> itertools OK
>>>>> 3. Scala - MLlib
>>>>> 3.1. statistics (min,max,mean,Pearson,Spearman) OK
>>>>> 3.2. LinearRegressionWithSGD OK
>>>>> 3.3. Decision Tree OK
>>>>> 3.4. KMeans OK
>>>>> 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
>>>>> 3.6. saveAsParquetFile OK
>>>>> 3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
>>>>> registerTempTable, sql OK
>>>>> 3.8. result = sqlContext.sql("SELECT
>>>>> OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
>>>>> JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
>>>>> 4.0. Spark SQL from Python OK
>>>>> 4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'")
>>>>> OK
>>>>> 5.0. Packages
>>>>> 5.1. com.databricks.spark.csv - read/write OK
>>>>> (--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work. But
>>>>> com.databricks:spark-csv_2.11:1.2.0 worked)
>>>>> 6.0. DataFrames
>>>>> 6.1. cast,dtypes OK
>>>>> 6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
>>>>> 6.3. joins,sql,set operations,udf OK
>>>>> 
>>>>> Cheers
>>>>> <k/>
>>>>> 


Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

Posted by Sean Owen <so...@cloudera.com>.

That's correct for the 1.5 branch, right? This doesn't mean that the
next RC would have this value. You choose the release version during
the release process.
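
To make that concrete, here is a minimal sketch that only assembles a
command line and does not run anything. It assumes the
maven-release-plugin's release:prepare goal with its releaseVersion,
developmentVersion and tag parameters, and a v<version>-rc<N> tag name
like the v1.5.0-rc2 tag in this thread. The helper name is hypothetical,
and Spark's real release scripts may do this differently.

```python
from typing import List


def release_prepare_args(release_version: str,
                         development_version: str,
                         rc_number: int) -> List[str]:
    """Build (but do not run) a release:prepare command line.

    Hypothetical helper for illustration. The release version, the next
    development version and the tag are all passed explicitly, so none of
    them is derived from the version currently in the branch POMs.
    """
    tag = f"v{release_version}-rc{rc_number}"
    return [
        "mvn",
        "release:prepare",
        f"-DreleaseVersion={release_version}",
        f"-DdevelopmentVersion={development_version}",
        f"-Dtag={tag}",
    ]


# rc3 is still released as 1.5.0 even though the branch POMs already say
# 1.5.1-SNAPSHOT.
print(" ".join(release_prepare_args("1.5.0", "1.5.1-SNAPSHOT", 3)))
```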

On Tue, Sep 1, 2015 at 2:40 AM, Chester Chen <ch...@alpinenow.com> wrote:
> It seems that GitHub branch-1.5 has already changed the version to 1.5.1-SNAPSHOT.
>
> I am a bit confused: are we still on 1.5.0 RC3, or are we on 1.5.1?
>
> Chester
>
> On Mon, Aug 31, 2015 at 3:52 PM, Reynold Xin <rx...@databricks.com> wrote:
>>
>> I'm going to -1 the release myself since the issue @yhuai identified is
>> pretty serious. It basically OOMs the driver for reading any files with a
>> large number of partitions. Looks like the patch for that has already been
>> merged.
>>
>> I'm going to cut rc3 momentarily.
>>
>>
>> On Sun, Aug 30, 2015 at 11:30 AM, Sandy Ryza <sa...@cloudera.com>
>> wrote:
>>>
>>> +1 (non-binding)
>>> built from source and ran some jobs against YARN
>>>
>>> -Sandy
>>>
>>> On Sat, Aug 29, 2015 at 5:50 AM, vaquar khan <va...@gmail.com>
>>> wrote:
>>>>
>>>>
>>>> +1 (1.5.0 RC2) Compiled on Windows with YARN.
>>>>
>>>> Regards,
>>>> Vaquar khan
>>>>
>>>> +1 (non-binding, of course)
>>>>
>>>> 1. Compiled OSX 10.10 (Yosemite) OK Total time: 42:36 min
>>>>      mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
>>>> 2. Tested pyspark, mllib
>>>> 2.1. statistics (min,max,mean,Pearson,Spearman) OK
>>>> 2.2. Linear/Ridge/Lasso Regression OK
>>>> 2.3. Decision Tree, Naive Bayes OK
>>>> 2.4. KMeans OK
>>>>        Center And Scale OK
>>>> 2.5. RDD operations OK
>>>>       State of the Union Texts - MapReduce, Filter,sortByKey (word
>>>> count)
>>>> 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
>>>>        Model evaluation/optimization (rank, numIter, lambda) with
>>>> itertools OK
>>>> 3. Scala - MLlib
>>>> 3.1. statistics (min,max,mean,Pearson,Spearman) OK
>>>> 3.2. LinearRegressionWithSGD OK
>>>> 3.3. Decision Tree OK
>>>> 3.4. KMeans OK
>>>> 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
>>>> 3.6. saveAsParquetFile OK
>>>> 3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
>>>> registerTempTable, sql OK
>>>> 3.8. result = sqlContext.sql("SELECT
>>>> OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
>>>> JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
>>>> 4.0. Spark SQL from Python OK
>>>> 4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'")
>>>> OK
>>>> 5.0. Packages
>>>> 5.1. com.databricks.spark.csv - read/write OK
>>>> (--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work. But
>>>> com.databricks:spark-csv_2.11:1.2.0 worked)
>>>> 6.0. DataFrames
>>>> 6.1. cast,dtypes OK
>>>> 6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
>>>> 6.3. joins,sql,set operations,udf OK
>>>>
>>>> Cheers
>>>> <k/>
>>>>

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

Posted by Chester Chen <ch...@alpinenow.com>.

It seems that the GitHub branch-1.5 has already changed the version to
1.5.1-SNAPSHOT.

I am a bit confused: are we still on 1.5.0 RC3, or are we on 1.5.1?

Chester

On Mon, Aug 31, 2015 at 3:52 PM, Reynold Xin <rx...@databricks.com> wrote:

> I'm going to -1 the release myself since the issue @yhuai identified is
> pretty serious. It basically OOMs the driver for reading any files with a
> large number of partitions. Looks like the patch for that has already been
> merged.
>
> I'm going to cut rc3 momentarily.

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

Posted by Reynold Xin <rx...@databricks.com>.

I'm going to -1 the release myself since the issue @yhuai identified is
pretty serious. It basically OOMs the driver for reading any files with a
large number of partitions. Looks like the patch for that has already been
merged.

I'm going to cut rc3 momentarily.


On Sun, Aug 30, 2015 at 11:30 AM, Sandy Ryza <sa...@cloudera.com>
wrote:

> +1 (non-binding)
> built from source and ran some jobs against YARN
>
> -Sandy

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

Posted by Sandy Ryza <sa...@cloudera.com>.

+1 (non-binding)
built from source and ran some jobs against YARN

-Sandy

On Sat, Aug 29, 2015 at 5:50 AM, vaquar khan <va...@gmail.com> wrote:

>
> +1 (1.5.0 RC2). Compiled on Windows with YARN.
>
> Regards,
> Vaquar khan
> +1 (non-binding, of course)
>
> 1. Compiled OSX 10.10 (Yosemite) OK Total time: 42:36 min
>      mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
> 2. Tested pyspark, mllib
> 2.1. statistics (min,max,mean,Pearson,Spearman) OK
> 2.2. Linear/Ridge/Lasso Regression OK
> 2.3. Decision Tree, Naive Bayes OK
> 2.4. KMeans OK
>        Center And Scale OK
> 2.5. RDD operations OK
>       State of the Union Texts - MapReduce, Filter,sortByKey (word count)
> 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
>        Model evaluation/optimization (rank, numIter, lambda) with
> itertools OK
> 3. Scala - MLlib
> 3.1. statistics (min,max,mean,Pearson,Spearman) OK
> 3.2. LinearRegressionWithSGD OK
> 3.3. Decision Tree OK
> 3.4. KMeans OK
> 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
> 3.6. saveAsParquetFile OK
> 3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
> registerTempTable, sql OK
> 3.8. result = sqlContext.sql("SELECT
> OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
> JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
> 4.0. Spark SQL from Python OK
> 4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
> 5.0. Packages
> 5.1. com.databricks.spark.csv - read/write OK
> (--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work. But
> com.databricks:spark-csv_2.11:1.2.0 worked)
> 6.0. DataFrames
> 6.1. cast,dtypes OK
> 6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
> 6.3. joins,sql,set operations,udf OK
>
> Cheers
> <k/>

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

Posted by vaquar khan <va...@gmail.com>.

+1 (1.5.0 RC2). Compiled on Windows with YARN.

Regards,
Vaquar khan
+1 (non-binding, of course)

1. Compiled OSX 10.10 (Yosemite) OK Total time: 42:36 min
     mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
2. Tested pyspark, mllib
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
       Center And Scale OK
2.5. RDD operations OK
      State of the Union Texts - MapReduce, Filter,sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
       Model evaluation/optimization (rank, numIter, lambda) with itertools
OK
3. Scala - MLlib
3.1. statistics (min,max,mean,Pearson,Spearman) OK
3.2. LinearRegressionWithSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
3.6. saveAsParquetFile OK
3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
registerTempTable, sql OK
3.8. result = sqlContext.sql("SELECT
OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
4.0. Spark SQL from Python OK
4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
5.0. Packages
5.1. com.databricks.spark.csv - read/write OK
(--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work. But
com.databricks:spark-csv_2.11:1.2.0 worked)
6.0. DataFrames
6.1. cast,dtypes OK
6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
6.3. joins,sql,set operations,udf OK
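
For anyone comparing the two spark-csv coordinates in 5.1: the --packages
option takes Maven coordinates of the form groupId:artifactId:version. The
sketch below only splits the two strings from this report into those parts
so the difference is easy to see. It is a minimal illustration, not a
diagnosis of why one coordinate failed to resolve; the helper names
(Coordinate, parse_coordinate, scala_suffix) are hypothetical and not part
of Spark or spark-csv.

```python
from typing import NamedTuple, Optional


class Coordinate(NamedTuple):
    """Hypothetical container for a groupId:artifactId:version string."""
    group_id: str
    artifact_id: str
    version: str


def parse_coordinate(text: str) -> Coordinate:
    """Split a 'groupId:artifactId:version' string into its three parts."""
    parts = text.strip().split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected groupId:artifactId:version, got %r" % text)
    return Coordinate(*parts)


def scala_suffix(artifact_id: str) -> Optional[str]:
    """Return the text after the last underscore of an artifact id, or None.

    For Scala libraries this trailing part is conventionally the Scala
    binary version (for example '2.11').
    """
    name, sep, suffix = artifact_id.rpartition("_")
    return suffix if sep and name else None


if __name__ == "__main__":
    # The two coordinates quoted in item 5.1 of this report.
    for text in (
        "com.databricks:spark-csv_2.11:1.2.0-s_2.11",
        "com.databricks:spark-csv_2.11:1.2.0",
    ):
        coord = parse_coordinate(text)
        print(coord, "scala suffix:", scala_suffix(coord.artifact_id))
```

Both strings share the same group and artifact id (including the _2.11
suffix); they differ only in the version field.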

Cheers
<k/>

On Tue, Aug 25, 2015 at 9:28 PM, Reynold Xin <rx...@databricks.com> wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.5.0. The vote is open until Friday, Aug 29, 2015 at 5:00 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.5.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
>
> The tag to be voted on is v1.5.0-rc2:
>
> https://github.com/apache/spark/tree/727771352855dbb780008c449a877f5aaa5fc27a
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release (published as 1.5.0-rc2) can be
> found at:
> https://repository.apache.org/content/repositories/orgapachespark-1141/
>
> The staging repository for this release (published as 1.5.0) can be found
> at:
> https://repository.apache.org/content/repositories/orgapachespark-1140/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-docs/
>
>
> =======================================
> How can I help test this release?
> =======================================
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
>
> ================================================
> What justifies a -1 vote for this release?
> ================================================
> This vote is happening towards the end of the 1.5 QA period, so -1 votes
> should only occur for significant regressions from 1.4. Bugs already
> present in 1.4, minor regressions, or bugs related to new features will not
> block this release.
>
>
> ===============================================================
> What should happen to JIRA tickets still targeting 1.5.0?
> ===============================================================
> 1. It is OK for documentation patches to target 1.5.0 and still go into
> branch-1.5, since documentations will be packaged separately from the
> release.
> 2. New features for non-alpha-modules should target 1.6+.
> 3. Non-blocker bug fixes should target 1.5.1 or 1.6.0, or drop the target
> version.
>
>
> ==================================================
> Major changes to help you focus your testing
> ==================================================
>
> As of today, Spark 1.5 contains more than 1000 commits from 220+
> contributors. I've curated a list of important changes for 1.5. For the
> complete list, please refer to Apache JIRA changelog.
>
> RDD/DataFrame/SQL APIs
>
> - New UDAF interface
> - DataFrame hints for broadcast join
> - expr function for turning a SQL expression into DataFrame column
> - Improved support for NaN values
> - StructType now supports ordering
> - TimestampType precision is reduced to 1us
> - 100 new built-in expressions, including date/time, string, math
> - memory and local disk only checkpointing
>
> DataFrame/SQL Backend Execution
>
> - Code generation on by default
> - Improved join, aggregation, shuffle, sorting with cache friendly
> algorithms and external algorithms
> - Improved window function performance
> - Better metrics instrumentation and reporting for DF/SQL execution plans
>
> Data Sources, Hive, Hadoop, Mesos and Cluster Management
>
> - Dynamic allocation support in all resource managers (Mesos, YARN,
> Standalone)
> - Improved Mesos support (framework authentication, roles, dynamic
> allocation, constraints)
> - Improved YARN support (dynamic allocation with preferred locations)
> - Improved Hive support (metastore partition pruning, metastore
> connectivity to 0.13 to 1.2, internal Hive upgrade to 1.2)
> - Support persisting data in Hive compatible format in metastore
> - Support data partitioning for JSON data sources
> - Parquet improvements (upgrade to 1.7, predicate pushdown, faster
> metadata discovery and schema merging, support reading non-standard legacy
> Parquet files generated by other libraries)
> - Faster and more robust dynamic partition insert
> - DataSourceRegister interface for external data sources to specify short
> names
>
> SparkR
>
> - YARN cluster mode in R
> - GLMs with R formula, binomial/Gaussian families, and elastic-net
> regularization
> - Improved error messages
> - Aliases to make DataFrame functions more R-like
>
> Streaming
>
> - Backpressure for handling bursty input streams.
> - Improved Python support for streaming sources (Kafka offsets, Kinesis,
> MQTT, Flume)
> - Improved Python streaming machine learning algorithms (K-Means, linear
> regression, logistic regression)
> - Native reliable Kinesis stream support
> - Input metadata like Kafka offsets made visible in the batch details UI
> - Better load balancing and scheduling of receivers across cluster
> - Include streaming storage in web UI
>
> Machine Learning and Advanced Analytics
>
> - Feature transformers: CountVectorizer, Discrete Cosine transformation,
> MinMaxScaler, NGram, PCA, RFormula, StopWordsRemover, and VectorSlicer.
> - Estimators under pipeline APIs: naive Bayes, k-means, and isotonic
> regression.
> - Algorithms: multilayer perceptron classifier, PrefixSpan for sequential
> pattern mining, association rule generation, 1-sample Kolmogorov-Smirnov
> test.
> - Improvements to existing algorithms: LDA, trees/ensembles, GMMs
> - More efficient Pregel API implementation for GraphX
> - Model summary for linear and logistic regression.
> - Python API: distributed matrices, streaming k-means and linear models,
> LDA, power iteration clustering, etc.
> - Tuning and evaluation: train-validation split and multiclass
> classification evaluator.
> - Documentation: document the release version of public API methods
>
>

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

Posted by Krishna Sankar <ks...@gmail.com>.

+1 (non-binding, of course)

1. Compiled OSX 10.10 (Yosemite) OK Total time: 42:36 min
     mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
2. Tested pyspark, mllib
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
       Center And Scale OK
2.5. RDD operations OK
      State of the Union Texts - MapReduce, Filter, sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
       Model evaluation/optimization (rank, numIter, lambda) with itertools OK
3. Scala - MLlib
3.1. statistics (min,max,mean,Pearson,Spearman) OK
3.2. LinearRegressionWithSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
3.6. saveAsParquetFile OK
3.7. Read and verify the 3.6 save (above) - sqlContext.parquetFile,
registerTempTable, sql OK
3.8. result = sqlContext.sql("SELECT
OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
4.0. Spark SQL from Python OK
4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
5.0. Packages
5.1. com.databricks.spark.csv - read/write OK
(--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work. But
com.databricks:spark-csv_2.11:1.2.0 worked)
6.0. DataFrames
6.1. cast,dtypes OK
6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
6.3. joins,sql,set operations,udf OK

Cheers
<k/>


Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

Posted by Sen Fang <fo...@outlook.com>.

Agree on the line fix. I'm submitting from Windows to YARN running on
Linux. I imagine that this isn't that uncommon, especially for developers
working in a corporate setting.

On Thu, Aug 27, 2015 at 12:52 PM Marcelo Vanzin <va...@cloudera.com> wrote:

> Are you just submitting from Windows or are you also running YARN on
> Windows?
>
> If the former, I think the only fix that would be needed is this line
> (from that same patch):
>
> https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L434
>
> I don't believe YARN running on Windows worked at all before that
> patch (regardless of that individual issue). I'll leave it to Reynold
> whether Windows support is critical enough to warrant a new rc.
>
>
> On Thu, Aug 27, 2015 at 8:50 AM, saurfang <fo...@outlook.com> wrote:
> > Nevermind. It looks like this has been fixed in
> > https://github.com/apache/spark/pull/8053 but didn't make the cut? Even
> > though the associated JIRA is targeted for 1.6, I was able to submit to YARN
> > from Windows without a problem with 1.4. I'm wondering if this fix will be
> > merged to 1.5 branch. Let me know if someone thinks I'm just not doing the
> > compile and/or spark-submit right.

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

Posted by Marcelo Vanzin <va...@cloudera.com>.

Are you just submitting from Windows or are you also running YARN on Windows?

If the former, I think the only fix that would be needed is this line
(from that same patch):
https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L434

I don't believe YARN running on Windows worked at all before that
patch (regardless of that individual issue). I'll leave it to Reynold
whether Windows support is critical enough to warrant a new rc.


On Thu, Aug 27, 2015 at 8:50 AM, saurfang <fo...@outlook.com> wrote:
> Nevermind. It looks like this has been fixed in
> https://github.com/apache/spark/pull/8053 but didn't make the cut? Even
> though the associated JIRA is targeted for 1.6, I was able to submit to YARN
> from Windows without a problem with 1.4. I'm wondering if this fix will be
> merged to 1.5 branch. Let me know if someone thinks I'm just not doing the
> compile and/or spark-submit right.



-- 
Marcelo

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

Posted by saurfang <fo...@outlook.com>.

Nevermind. It looks like this has been fixed in
https://github.com/apache/spark/pull/8053 but didn't make the cut? Even
though the associated JIRA is targeted for 1.6, I was able to submit to YARN
from Windows without a problem with 1.4. I'm wondering if this fix will be
merged to 1.5 branch. Let me know if someone thinks I'm just not doing the
compile and/or spark-submit right.



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-5-0-RC2-tp13826p13872.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

Posted by saurfang <fo...@outlook.com>.

Compiled on Windows with YARN and Hive. However, I got an exception when
submitting an application to YARN:

java.net.URISyntaxException: Illegal character in opaque part at index 2:
D:\TEMP\spark-b32c5b5b-a9fa-4cfd-a233-3977588d4092\__spark_conf__1960856096319316224.zip
        at java.net.URI$Parser.fail(URI.java:2829)
        at java.net.URI$Parser.checkChars(URI.java:3002)
        at java.net.URI$Parser.parse(URI.java:3039)
        at java.net.URI.<init>(URI.java:595)
        at org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:321)
        at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:417)

It looks like we can either do `new File(path).toURI` here:
https://github.com/apache/spark/blob/v1.5.0-rc2/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L321

Or make sure the file path uses the '/' separator here:
https://github.com/apache/spark/blob/v1.5.0-rc2/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L417
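
A minimal sketch of the two options above in plain Java, outside Spark. The
class name, helper names, and the sample path are hypothetical; only
java.net.URI and java.io.File are real APIs. It is meant to show that the raw
Windows path is rejected at index 2 (the first backslash), as in the stack
trace, and that either workaround avoids the exception. It is not the actual
change made in Client.scala.

```java
import java.io.File;
import java.net.URI;
import java.net.URISyntaxException;

class WindowsPathUri {
    // Returns the index at which java.net.URI rejects the raw string,
    // or -1 if the string parses as a URI.
    static int failureIndex(String raw) {
        try {
            new URI(raw);
            return -1;
        } catch (URISyntaxException e) {
            return e.getIndex();
        }
    }

    // Option 1: let java.io.File build the URI. Illegal characters are
    // escaped, so no exception is thrown.
    static URI viaFile(String path) {
        return new File(path).toURI();
    }

    // Option 2: normalize separators to '/' before parsing. Note that
    // java.net.URI then reads the drive letter as the URI scheme.
    static URI viaForwardSlashes(String path) {
        return URI.create(path.replace('\\', '/'));
    }

    public static void main(String[] args) {
        // Hypothetical path, shaped like the one in the exception above.
        String path = "D:\\TEMP\\spark-example\\__spark_conf__.zip";
        System.out.println("raw string fails at index: " + failureIndex(path));
        System.out.println("via File.toURI: " + viaFile(path));
        System.out.println("via '/' separators: " + viaForwardSlashes(path));
    }
}
```

The output of viaFile is platform dependent: on Windows it should be a
file:/D:/... URI, while on other systems File.toURI treats the whole string as
a relative file name. With option 2 the caller would still have to cope with
"D" being parsed as the scheme.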




--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-5-0-RC2-tp13826p13871.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

Posted by Reynold Xin <rx...@databricks.com>.

Marcelo - please submit a patch anyway. If we don't include it in this
release, it will go into 1.5.1.

On Thu, Aug 27, 2015 at 4:56 PM, Marcelo Vanzin <va...@cloudera.com> wrote:

> On Thu, Aug 27, 2015 at 4:42 PM, Marcelo Vanzin <va...@cloudera.com>
> wrote:
> > The Windows issue Sen raised could be considered a regression /
> > blocker, though, and it's a one line fix. If we feel that's important,
> > let me know and I'll put up a PR against branch-1.5.
>
> Looks like Josh just found a blocker, so maybe we can squeeze this in?
>

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

Posted by Marcelo Vanzin <va...@cloudera.com>.

On Thu, Aug 27, 2015 at 4:42 PM, Marcelo Vanzin <va...@cloudera.com> wrote:
> The Windows issue Sen raised could be considered a regression /
> blocker, though, and it's a one line fix. If we feel that's important,
> let me know and I'll put up a PR against branch-1.5.

Looks like Josh just found a blocker, so maybe we can squeeze this in?


Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

Posted by Marcelo Vanzin <va...@cloudera.com>.

+1. I tested the "without hadoop" binary package and ran our internal
tests on it with dynamic allocation both on and off.

The Windows issue Sen raised could be considered a regression /
blocker, though, and it's a one line fix. If we feel that's important,
let me know and I'll put up a PR against branch-1.5.




-- 
Marcelo
