Posted to dev@spark.apache.org by Reynold Xin <rx...@databricks.com> on 2015/09/01 22:41:46 UTC

[VOTE] Release Apache Spark 1.5.0 (RC3)

Please vote on releasing the following candidate as Apache Spark version
1.5.0. The vote is open until Friday, Sep 4, 2015 at 21:00 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.5.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/


The tag to be voted on is v1.5.0-rc3:
https://github.com/apache/spark/commit/908e37bcc10132bb2aa7f80ae694a9df6e40f31a

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release (published as 1.5.0-rc3) can be
found at:
https://repository.apache.org/content/repositories/orgapachespark-1143/

The staging repository for this release (published as 1.5.0) can be found
at:
https://repository.apache.org/content/repositories/orgapachespark-1142/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-docs/


=======================================
How can I help test this release?
=======================================
If you are a Spark user, you can help us test this release by taking an
existing Spark workload, running it on this release candidate, and then
reporting any regressions.


================================================
What justifies a -1 vote for this release?
================================================
This vote is happening towards the end of the 1.5 QA period, so -1 votes
should only occur for significant regressions from 1.4. Bugs already
present in 1.4, minor regressions, or bugs related to new features will not
block this release.


===============================================================
What should happen to JIRA tickets still targeting 1.5.0?
===============================================================
1. It is OK for documentation patches to target 1.5.0 and still go into
branch-1.5, since documentation will be packaged separately from the
release.
2. New features for non-alpha-modules should target 1.6+.
3. Non-blocker bug fixes should target 1.5.1 or 1.6.0, or drop the target
version.


==================================================
Major changes to help you focus your testing
==================================================

As of today, Spark 1.5 contains more than 1000 commits from 220+
contributors. I've curated a list of important changes for 1.5. For the
complete list, please refer to Apache JIRA changelog.

RDD/DataFrame/SQL APIs

- New UDAF interface
- DataFrame hints for broadcast join
- expr function for turning a SQL expression into a DataFrame column (see the
sketch after this list)
- Improved support for NaN values
- StructType now supports ordering
- TimestampType precision is reduced to 1us
- 100 new built-in expressions, including date/time, string, math
- Memory- and local-disk-only checkpointing
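
Below is a minimal pyspark sketch (not part of the release notes) of two of the
items above: the expr function and the new built-in date/time expressions. It
assumes a SQLContext named sqlContext, as in the pyspark shell; the column
names and data are made up for illustration.

    from pyspark.sql.functions import expr, year, month

    df = sqlContext.createDataFrame(
        [("2015-09-01", 10.0), ("2015-08-15", 20.0)], ["OrderDate", "Total"])

    # expr() turns a SQL expression string into a Column
    with_tax = df.withColumn("TotalWithTax", expr("Total * 1.1"))

    # year()/month() are among the new built-in date/time expressions,
    # so a Python UDF is no longer needed for this kind of derivation
    with_tax.select(year(df.OrderDate).alias("Year"),
                    month(df.OrderDate).alias("Month"),
                    "TotalWithTax").show()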

DataFrame/SQL Backend Execution

- Code generation on by default
- Improved join, aggregation, shuffle, and sorting with cache-friendly and
external algorithms
- Improved window function performance
- Better metrics instrumentation and reporting for DF/SQL execution plans

Data Sources, Hive, Hadoop, Mesos and Cluster Management

- Dynamic allocation support in all resource managers (Mesos, YARN,
Standalone)
- Improved Mesos support (framework authentication, roles, dynamic
allocation, constraints)
- Improved YARN support (dynamic allocation with preferred locations)
- Improved Hive support (metastore partition pruning, connectivity to
metastore versions 0.13 through 1.2, internal Hive upgrade to 1.2)
- Support for persisting data in a Hive-compatible format in the metastore
- Support data partitioning for JSON data sources (see the sketch after this
list)
- Parquet improvements (upgrade to 1.7, predicate pushdown, faster metadata
discovery and schema merging, support reading non-standard legacy Parquet
files generated by other libraries)
- Faster and more robust dynamic partition insert
- DataSourceRegister interface for external data sources to specify short
names
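
A small pyspark sketch (again, not from the release notes) of the JSON data
partitioning item above: writing a DataFrame partitioned by a column and
reading the partitions back. The SQLContext name and output path are
assumptions for illustration; external packages that implement the
DataSourceRegister interface can be addressed through a short name in format()
in the same way the built-in "json" source is here.

    orders = sqlContext.createDataFrame(
        [(1, 2014, 10.0), (2, 2015, 20.0)], ["OrderID", "Year", "Total"])

    # Write JSON output partitioned by Year: one directory per partition value
    orders.write.partitionBy("Year").format("json").save("/tmp/orders_json")

    # On read, the Year column is reconstructed from the partition directories
    reloaded = sqlContext.read.format("json").load("/tmp/orders_json")
    reloaded.printSchema()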

SparkR

- YARN cluster mode in R
- GLMs with R formula, binomial/Gaussian families, and elastic-net
regularization
- Improved error messages
- Aliases to make DataFrame functions more R-like

Streaming

- Backpressure for handling bursty input streams (see the sketch after this
list)
- Improved Python support for streaming sources (Kafka offsets, Kinesis,
MQTT, Flume)
- Improved Python streaming machine learning algorithms (K-Means, linear
regression, logistic regression)
- Native reliable Kinesis stream support
- Input metadata like Kafka offsets made visible in the batch details UI
- Better load balancing and scheduling of receivers across the cluster
- Include streaming storage in the web UI
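
A short pyspark streaming sketch of the backpressure item above. The
configuration key below is assumed to be the switch for that feature, and the
socket source, host, and port are stand-ins for illustration.

    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext

    conf = (SparkConf()
            .setAppName("backpressure-demo")
            # let receivers slow ingestion down when batches fall behind
            .set("spark.streaming.backpressure.enabled", "true"))
    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, batchDuration=5)

    lines = ssc.socketTextStream("localhost", 9999)
    lines.count().pprint()

    ssc.start()
    ssc.awaitTermination()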

Machine Learning and Advanced Analytics

- Feature transformers: CountVectorizer, Discrete Cosine transformation,
MinMaxScaler, NGram, PCA, RFormula, StopWordsRemover, and VectorSlicer.
- Estimators under pipeline APIs: naive Bayes, k-means, and isotonic
regression.
- Algorithms: multilayer perceptron classifier, PrefixSpan for sequential
pattern mining, association rule generation, 1-sample Kolmogorov-Smirnov
test (see the sketch after this list).
- Improvements to existing algorithms: LDA, trees/ensembles, GMMs
- More efficient Pregel API implementation for GraphX
- Model summary for linear and logistic regression.
- Python API: distributed matrices, streaming k-means and linear models,
LDA, power iteration clustering, etc.
- Tuning and evaluation: train-validation split and multiclass
classification evaluator.
- Documentation: document the release version of public API methods
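
A small pyspark sketch of the new 1-sample Kolmogorov-Smirnov test listed
above. This assumes the Python wrapper for it is included in this release
candidate; if only the Scala API shipped, the equivalent call lives on
org.apache.spark.mllib.stat.Statistics. The sample data is made up.

    from pyspark import SparkContext
    from pyspark.mllib.stat import Statistics

    sc = SparkContext(appName="ks-test-demo")

    # a small sample that should look roughly standard-normal
    data = sc.parallelize([-0.1, 0.2, -1.3, 0.5, 0.9, -0.4, 1.1, -0.7])

    # test the sample against a normal distribution with mean 0 and stddev 1
    result = Statistics.kolmogorovSmirnovTest(data, "norm", 0.0, 1.0)
    print(result)  # prints the KS statistic and the p-value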

Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

Posted by saurfang <fo...@outlook.com>.
+1. Compiled on Windows with YARN and Hive. Tested Tungsten aggregation and
observed similar (good) performance compared to 1.4 with unsafe mode on. Ran a
few workloads and tested the Spark SQL Thrift server.



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-5-0-RC3-tp13928p13953.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

Posted by Davies Liu <da...@databricks.com>.
+1, built 1.5 from source and ran TPC-DS locally and on clusters, and ran
performance benchmarks for aggregation and join at different scales;
all worked well.

On Thu, Sep 3, 2015 at 10:05 AM, Michael Armbrust
<mi...@databricks.com> wrote:
> +1 Ran TPC-DS and ported several jobs over to 1.5
>
> On Thu, Sep 3, 2015 at 9:57 AM, Burak Yavuz <br...@gmail.com> wrote:
>>
>> +1. Tested complex R package support (Scala + R code), BLAS and DataFrame
>> fixes good.
>>
>> Burak
>>
>> On Thu, Sep 3, 2015 at 8:56 AM, mkhaitman <ma...@chango.com>
>> wrote:
>>>
>>> Built and tested on CentOS 7, Hadoop 2.7.1 (Built for 2.6 profile),
>>> Standalone without any problems. Re-tested dynamic allocation
>>> specifically.
>>>
>>> "Lost executor" messages are still an annoyance since they're expected to
>>> occur with dynamic allocation, and shouldn't WARN/ERROR as they do now,
>>> however there's already a JIRA ticket for it:
>>> https://issues.apache.org/jira/browse/SPARK-4134 . Will probably have to
>>> filter these messages out in log4j properties for this release!
>>>
>>> Mark.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-5-0-RC3-tp13928p13948.html
>>> Sent from the Apache Spark Developers List mailing list archive at
>>> Nabble.com.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

Posted by Michael Armbrust <mi...@databricks.com>.
+1 Ran TPC-DS and ported several jobs over to 1.5

On Thu, Sep 3, 2015 at 9:57 AM, Burak Yavuz <br...@gmail.com> wrote:

> +1. Tested complex R package support (Scala + R code), BLAS and DataFrame
> fixes good.
>
> Burak
>
> On Thu, Sep 3, 2015 at 8:56 AM, mkhaitman <ma...@chango.com>
> wrote:
>
>> Built and tested on CentOS 7, Hadoop 2.7.1 (Built for 2.6 profile),
>> Standalone without any problems. Re-tested dynamic allocation
>> specifically.
>>
>> "Lost executor" messages are still an annoyance since they're expected to
>> occur with dynamic allocation, and shouldn't WARN/ERROR as they do now,
>> however there's already a JIRA ticket for it:
>> https://issues.apache.org/jira/browse/SPARK-4134 . Will probably have to
>> filter these messages out in log4j properties for this release!
>>
>> Mark.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-5-0-RC3-tp13928p13948.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> For additional commands, e-mail: dev-help@spark.apache.org
>>
>>
>

Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

Posted by Burak Yavuz <br...@gmail.com>.
+1. Tested complex R package support (Scala + R code); the BLAS and DataFrame
fixes look good.

Burak

On Thu, Sep 3, 2015 at 8:56 AM, mkhaitman <ma...@chango.com> wrote:

> Built and tested on CentOS 7, Hadoop 2.7.1 (Built for 2.6 profile),
> Standalone without any problems. Re-tested dynamic allocation specifically.
>
> "Lost executor" messages are still an annoyance since they're expected to
> occur with dynamic allocation, and shouldn't WARN/ERROR as they do now,
> however there's already a JIRA ticket for it:
> https://issues.apache.org/jira/browse/SPARK-4134 . Will probably have to
> filter these messages out in log4j properties for this release!
>
> Mark.
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-5-0-RC3-tp13928p13948.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
>
>

Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

Posted by mkhaitman <ma...@chango.com>.
Built and tested on CentOS 7 with Hadoop 2.7.1 (built against the 2.6 profile)
in Standalone mode without any problems. Re-tested dynamic allocation
specifically.

"Lost executor" messages are still an annoyance since they're expected to
occur with dynamic allocation and shouldn't WARN/ERROR as they do now;
however, there's already a JIRA ticket for it:
https://issues.apache.org/jira/browse/SPARK-4134 . We'll probably have to
filter these messages out in the log4j properties for this release!

Mark.



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-5-0-RC3-tp13928p13948.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

Posted by Tom Graves <tg...@yahoo.com.INVALID>.
+1. Tested on YARN with Hadoop 2.6.
A few of the things tested: pyspark, Hive integration, the auxiliary shuffle handler, the history server, basic submit CLI behavior, distributed cache behavior, and cluster and client mode...
Tom



Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

Posted by james <yi...@gmail.com>.
I saw a new "spark.shuffle.manager=tungsten-sort" implemented in
https://issues.apache.org/jira/browse/SPARK-7081, but it can't be found its
corresponding description in
http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-docs/configuration.html(Currenlty
there are only 'sort' and 'hash' two options).
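
For what it's worth, a one-line pyspark sketch of how that value would be set,
assuming the option is meant to be user-facing and the configuration page
simply hasn't been updated (spark.shuffle.manager itself is a long-standing
key):

    from pyspark import SparkConf

    # value implemented in SPARK-7081, per the message above
    conf = SparkConf().set("spark.shuffle.manager", "tungsten-sort")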



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-5-0-RC3-tp13928p13984.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

Posted by Sean Owen <so...@cloudera.com>.
- As usual the license and signatures are OK
- No blockers, check
- 9 "Critical" bugs for 1.5.0 are listed below just for everyone's
reference (48 total issues still targeted for 1.5.0)
- Under Java 7 + Ubuntu 15, I only had one consistent test failure,
but obviously it's not failing in Jenkins
- I saw more test failures with Java 8, but they seemed like flaky
tests, so I'm pretending I didn't see them


Test failure

DirectKafkaStreamSuite:
- offset recovery *** FAILED ***
  The code passed to eventually never returned normally. Attempted 197
times over 10.012973046 seconds. Last failure message:
strings.forall({
    ((elem: Any) => DirectKafkaStreamSuite.collectedData.contains(elem))
  }) was false. (DirectKafkaStreamSuite.scala:249)


1.5.0 Critical Bugs

SPARK-6484 Spark Core Ganglia metrics xml reporter doesn't escape
correctly Josh Rosen
SPARK-6701 Tests, YARN Flaky test: o.a.s.deploy.yarn.YarnClusterSuite
Python application
SPARK-7420 Tests Flaky test: o.a.s.streaming.JobGeneratorSuite "Do not
clear received block data too soon" Tathagata Das
SPARK-8119 Spark Core HeartbeatReceiver should not adjust application
executor resources Andrew Or
SPARK-8414 Spark Core Ensure ContextCleaner actually triggers clean
ups Andrew Or
SPARK-8447 Shuffle Test external shuffle service with all shuffle managers
SPARK-10224 Streaming BlockGenerator may lose data in the last block
SPARK-10310 SQL [Spark SQL] All result records will be populated into
ONE line during the script transform due to missing the correct
line/field delimiter
SPARK-10337 SQL Views are broken



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

Posted by Reynold Xin <rx...@databricks.com>.
Krishna - I think the rename happened before rc1 actually. It was done a
couple of months ago.

On Fri, Sep 4, 2015 at 5:00 AM, Krishna Sankar <ks...@gmail.com> wrote:

> Thanks Tom.  Interestingly it happened between RC2 and RC3.
> Now my vote is +1/2 unless the memory error is known and has a workaround.
>
> Cheers
> <k/>
>
>
> On Fri, Sep 4, 2015 at 7:30 AM, Tom Graves <tg...@yahoo.com> wrote:
>
>> The upper/lower case thing is known.
>> https://issues.apache.org/jira/browse/SPARK-9550
>> I assume it was decided to be OK and it's going to be in the release notes,
>> but Reynold or Josh can probably speak to it more.
>>
>> Tom
>>
>>
>>
>> On Thursday, September 3, 2015 10:21 PM, Krishna Sankar <
>> ksankar42@gmail.com> wrote:
>>
>>
>> +?
>>
>> 1. Compiled OSX 10.10 (Yosemite) OK Total time: 26:09 min
>>      mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
>> 2. Tested pyspark, mllib
>> 2.1. statistics (min,max,mean,Pearson,Spearman) OK
>> 2.2. Linear/Ridge/Lasso Regression OK
>> 2.3. Decision Tree, Naive Bayes OK
>> 2.4. KMeans OK
>>        Center And Scale OK
>> 2.5. RDD operations OK
>>       State of the Union Texts - MapReduce, Filter,sortByKey (word count)
>> 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
>>        Model evaluation/optimization (rank, numIter, lambda) with
>> itertools OK
>> 3. Scala - MLlib
>> 3.1. statistics (min,max,mean,Pearson,Spearman) OK
>> 3.2. LinearRegressionWithSGD OK
>> 3.3. Decision Tree OK
>> 3.4. KMeans OK
>> 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
>> 3.6. saveAsParquetFile OK
>> 3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
>> registerTempTable, sql OK
>> 3.8. result = sqlContext.sql("SELECT
>> OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
>> JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
>> 4.0. Spark SQL from Python OK
>> 4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
>> 5.0. Packages
>> 5.1. com.databricks.spark.csv - read/write OK
>> (--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work. But
>> com.databricks:spark-csv_2.11:1.2.0 worked)
>> 6.0. DataFrames
>> 6.1. cast,dtypes OK
>> 6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
>> 6.3. All joins,sql,set operations,udf OK
>>
>> Two Problems:
>>
>> 1. The synthetic column names are lowercase ( i.e. now ‘sum(OrderPrice)’;
>> previously ‘SUM(OrderPrice)’, now ‘avg(Total)’; previously 'AVG(Total)').
>> So programs that depend on the case of the synthetic column names would
>> fail.
>> 2. orders_3.groupBy("Year","Month").sum('Total').show()
>>     fails with the error ‘java.io.IOException: Unable to acquire 4194304
>> bytes of memory’
>>     orders_3.groupBy("CustomerID","Year").sum('Total').show() - fails
>> with the same error
>>     Is this a known bug ?
>> Cheers
>> <k/>
>> P.S: Sorry for the spam, forgot Reply All
>>
>

Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

Posted by james <yi...@gmail.com>.
add a critical bug https://issues.apache.org/jira/browse/SPARK-10474
(Aggregation failed with unable to acquire memory)



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-5-0-RC3-tp13928p13987.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


RE: [VOTE] Release Apache Spark 1.5.0 (RC3)

Posted by "Cheng, Hao" <ha...@intel.com>.
Not sure if it’s too late, but we found a critical bug at https://issues.apache.org/jira/browse/SPARK-10466
UnsafeRow ser/de will cause an assert error, particularly for sort-based shuffle with data spill; this is not acceptable, as it’s very common in large table joins.

From: Reynold Xin [mailto:rxin@databricks.com]
Sent: Saturday, September 5, 2015 3:30 PM
To: Krishna Sankar
Cc: Davies Liu; Yin Huai; Tom Graves; dev@spark.apache.org
Subject: Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

Thanks, Krishna, for the report. We should fix your problem using the Python UDFs in 1.6 too.

I'm going to close this vote now. Thanks everybody for voting. This vote passes with 8 +1 votes (3 binding) and no 0 or -1 votes.

+1:
Reynold Xin*
Tom Graves*
Burak Yavuz
Michael Armbrust*
Davies Liu
Forest Fang
Krishna Sankar
Denny Lee

0:

-1:


I will work on packaging this release in the next few days.



On Fri, Sep 4, 2015 at 8:08 PM, Krishna Sankar <ks...@gmail.com>> wrote:
Excellent & Thanks Davies. Yep, now runs fine and takes 1/2 the time !
This was exactly why I had put in the elapsed time calculations.
And thanks for the new pyspark.sql.functions.

+1 from my side for 1.5.0 RC3.
Cheers
<k/>

On Fri, Sep 4, 2015 at 9:57 PM, Davies Liu <da...@databricks.com>> wrote:
Could you update the notebook to use the builtin SQL functions month and year
instead of Python UDFs? (They were introduced in 1.5.)

Once you remove those two UDFs, it runs successfully and is also much faster.

On Fri, Sep 4, 2015 at 2:22 PM, Krishna Sankar <ks...@gmail.com>> wrote:
> Yin,
>    It is the
> https://github.com/xsankar/global-bd-conf/blob/master/004-Orders.ipynb.
> Cheers
> <k/>
>
> On Fri, Sep 4, 2015 at 9:58 AM, Yin Huai <yh...@databricks.com>> wrote:
>>
>> Hi Krishna,
>>
>> Can you share your code to reproduce the memory allocation issue?
>>
>> Thanks,
>>
>> Yin
>>
>> On Fri, Sep 4, 2015 at 8:00 AM, Krishna Sankar <ks...@gmail.com>>
>> wrote:
>>>
>>> Thanks Tom.  Interestingly it happened between RC2 and RC3.
>>> Now my vote is +1/2 unless the memory error is known and has a
>>> workaround.
>>>
>>> Cheers
>>> <k/>
>>>
>>>
>>> On Fri, Sep 4, 2015 at 7:30 AM, Tom Graves <tg...@yahoo.com>> wrote:
>>>>
>>>> The upper/lower case thing is known.
>>>> https://issues.apache.org/jira/browse/SPARK-9550
>>>> I assume it was decided to be ok and its going to be in the release
>>>> notes  but Reynold or Josh can probably speak to it more.
>>>>
>>>> Tom
>>>>
>>>>
>>>>
>>>> On Thursday, September 3, 2015 10:21 PM, Krishna Sankar
>>>> <ks...@gmail.com>> wrote:
>>>>
>>>>
>>>> +?
>>>>
>>>> 1. Compiled OSX 10.10 (Yosemite) OK Total time: 26:09 min
>>>>      mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
>>>> 2. Tested pyspark, mllib
>>>> 2.1. statistics (min,max,mean,Pearson,Spearman) OK
>>>> 2.2. Linear/Ridge/Laso Regression OK
>>>> 2.3. Decision Tree, Naive Bayes OK
>>>> 2.4. KMeans OK
>>>>        Center And Scale OK
>>>> 2.5. RDD operations OK
>>>>       State of the Union Texts - MapReduce, Filter,sortByKey (word
>>>> count)
>>>> 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
>>>>        Model evaluation/optimization (rank, numIter, lambda) with
>>>> itertools OK
>>>> 3. Scala - MLlib
>>>> 3.1. statistics (min,max,mean,Pearson,Spearman) OK
>>>> 3.2. LinearRegressionWithSGD OK
>>>> 3.3. Decision Tree OK
>>>> 3.4. KMeans OK
>>>> 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
>>>> 3.6. saveAsParquetFile OK
>>>> 3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
>>>> registerTempTable, sql OK
>>>> 3.8. result = sqlContext.sql("SELECT
>>>> OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
>>>> JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
>>>> 4.0. Spark SQL from Python OK
>>>> 4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'")
>>>> OK
>>>> 5.0. Packages
>>>> 5.1. com.databricks.spark.csv - read/write OK
>>>> (--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work. But
>>>> com.databricks:spark-csv_2.11:1.2.0 worked)
>>>> 6.0. DataFrames
>>>> 6.1. cast,dtypes OK
>>>> 6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
>>>> 6.3. All joins,sql,set operations,udf OK
>>>>
>>>> Two Problems:
>>>>
>>>> 1. The synthetic column names are lowercase ( i.e. now
>>>> ‘sum(OrderPrice)’; previously ‘SUM(OrderPrice)’, now ‘avg(Total)’;
>>>> previously 'AVG(Total)'). So programs that depend on the case of the
>>>> synthetic column names would fail.
>>>> 2. orders_3.groupBy("Year","Month").sum('Total').show()
>>>>     fails with the error ‘java.io.IOException: Unable to acquire 4194304
>>>> bytes of memory’
>>>>     orders_3.groupBy("CustomerID","Year").sum('Total').show() - fails
>>>> with the same error
>>>>     Is this a known bug ?
>>>> Cheers
>>>> <k/>
>>>> P.S: Sorry for the spam, forgot Reply All
>>>>
>>>
>>
>



Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

Posted by Reynold Xin <rx...@databricks.com>.
Thanks, Krishna, for the report. We should fix your problem using the
Python UDFs in 1.6 too.

I'm going to close this vote now. Thanks everybody for voting. This vote
passes with 8 +1 votes (3 binding) and no 0 or -1 votes.

+1:
Reynold Xin*
Tom Graves*
Burak Yavuz
Michael Armbrust*
Davies Liu
Forest Fang
Krishna Sankar
Denny Lee

0:

-1:


I will work on packaging this release in the next few days.



On Fri, Sep 4, 2015 at 8:08 PM, Krishna Sankar <ks...@gmail.com> wrote:

> Excellent & Thanks Davies. Yep, now runs fine and takes 1/2 the time !
> This was exactly why I had put in the elapsed time calculations.
> And thanks for the new pyspark.sql.functions.
>
> +1 from my side for 1.5.0 RC3.
> Cheers
> <k/>
>
> On Fri, Sep 4, 2015 at 9:57 PM, Davies Liu <da...@databricks.com> wrote:
>
>> Could you update the notebook to use builtin SQL function month and year,
>> instead of Python UDF? (they are introduced in 1.5).
>>
>> Once remove those two udfs, it runs successfully, also much faster.
>>
>> On Fri, Sep 4, 2015 at 2:22 PM, Krishna Sankar <ks...@gmail.com>
>> wrote:
>> > Yin,
>> >    It is the
>> > https://github.com/xsankar/global-bd-conf/blob/master/004-Orders.ipynb.
>> > Cheers
>> > <k/>
>> >
>> > On Fri, Sep 4, 2015 at 9:58 AM, Yin Huai <yh...@databricks.com> wrote:
>> >>
>> >> Hi Krishna,
>> >>
>> >> Can you share your code to reproduce the memory allocation issue?
>> >>
>> >> Thanks,
>> >>
>> >> Yin
>> >>
>> >> On Fri, Sep 4, 2015 at 8:00 AM, Krishna Sankar <ks...@gmail.com>
>> >> wrote:
>> >>>
>> >>> Thanks Tom.  Interestingly it happened between RC2 and RC3.
>> >>> Now my vote is +1/2 unless the memory error is known and has a
>> >>> workaround.
>> >>>
>> >>> Cheers
>> >>> <k/>
>> >>>
>> >>>
>> >>> On Fri, Sep 4, 2015 at 7:30 AM, Tom Graves <tg...@yahoo.com>
>> wrote:
>> >>>>
>> >>>> The upper/lower case thing is known.
>> >>>> https://issues.apache.org/jira/browse/SPARK-9550
>> >>>> I assume it was decided to be ok and its going to be in the release
>> >>>> notes  but Reynold or Josh can probably speak to it more.
>> >>>>
>> >>>> Tom
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Thursday, September 3, 2015 10:21 PM, Krishna Sankar
>> >>>> <ks...@gmail.com> wrote:
>> >>>>
>> >>>>
>> >>>> +?
>> >>>>
>> >>>> 1. Compiled OSX 10.10 (Yosemite) OK Total time: 26:09 min
>> >>>>      mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
>> >>>> 2. Tested pyspark, mllib
>> >>>> 2.1. statistics (min,max,mean,Pearson,Spearman) OK
>> >>>> 2.2. Linear/Ridge/Laso Regression OK
>> >>>> 2.3. Decision Tree, Naive Bayes OK
>> >>>> 2.4. KMeans OK
>> >>>>        Center And Scale OK
>> >>>> 2.5. RDD operations OK
>> >>>>       State of the Union Texts - MapReduce, Filter,sortByKey (word
>> >>>> count)
>> >>>> 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
>> >>>>        Model evaluation/optimization (rank, numIter, lambda) with
>> >>>> itertools OK
>> >>>> 3. Scala - MLlib
>> >>>> 3.1. statistics (min,max,mean,Pearson,Spearman) OK
>> >>>> 3.2. LinearRegressionWithSGD OK
>> >>>> 3.3. Decision Tree OK
>> >>>> 3.4. KMeans OK
>> >>>> 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
>> >>>> 3.6. saveAsParquetFile OK
>> >>>> 3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
>> >>>> registerTempTable, sql OK
>> >>>> 3.8. result = sqlContext.sql("SELECT
>> >>>> OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders
>> INNER
>> >>>> JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
>> >>>> 4.0. Spark SQL from Python OK
>> >>>> 4.1. result = sqlContext.sql("SELECT * from people WHERE State =
>> 'WA'")
>> >>>> OK
>> >>>> 5.0. Packages
>> >>>> 5.1. com.databricks.spark.csv - read/write OK
>> >>>> (--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work.
>> But
>> >>>> com.databricks:spark-csv_2.11:1.2.0 worked)
>> >>>> 6.0. DataFrames
>> >>>> 6.1. cast,dtypes OK
>> >>>> 6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
>> >>>> 6.3. All joins,sql,set operations,udf OK
>> >>>>
>> >>>> Two Problems:
>> >>>>
>> >>>> 1. The synthetic column names are lowercase ( i.e. now
>> >>>> ‘sum(OrderPrice)’; previously ‘SUM(OrderPrice)’, now ‘avg(Total)’;
>> >>>> previously 'AVG(Total)'). So programs that depend on the case of the
>> >>>> synthetic column names would fail.
>> >>>> 2. orders_3.groupBy("Year","Month").sum('Total').show()
>> >>>>     fails with the error ‘java.io.IOException: Unable to acquire
>> 4194304
>> >>>> bytes of memory’
>> >>>>     orders_3.groupBy("CustomerID","Year").sum('Total').show() - fails
>> >>>> with the same error
>> >>>>     Is this a known bug ?
>> >>>> Cheers
>> >>>> <k/>
>> >>>> P.S: Sorry for the spam, forgot Reply All
>> >>>>
>> >>>
>> >>
>> >
>>
>
>

Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

Posted by Krishna Sankar <ks...@gmail.com>.
Excellent & thanks, Davies. Yep, it now runs fine and takes half the time!
This was exactly why I had put in the elapsed time calculations.
And thanks for the new pyspark.sql.functions.

+1 from my side for 1.5.0 RC3.
Cheers
<k/>
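
For reference, a minimal sketch of the kind of elapsed-time bookkeeping
mentioned above: timing a DataFrame action with the standard library. The
DataFrame name "orders" is an assumption for illustration, not the
notebook's actual code.

    import time

    start = time.time()
    orders.groupBy("Year", "Month").sum("Total").show()   # any action works
    print("elapsed: %.1f s" % (time.time() - start))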

On Fri, Sep 4, 2015 at 9:57 PM, Davies Liu <da...@databricks.com> wrote:

> Could you update the notebook to use the built-in SQL functions month and
> year instead of the Python UDFs? (They were introduced in 1.5.)
>
> Once those two UDFs are removed, it runs successfully and is also much
> faster.
>
> On Fri, Sep 4, 2015 at 2:22 PM, Krishna Sankar <ks...@gmail.com>
> wrote:
> > Yin,
> >    It is the
> > https://github.com/xsankar/global-bd-conf/blob/master/004-Orders.ipynb.
> > Cheers
> > <k/>
> >
> > On Fri, Sep 4, 2015 at 9:58 AM, Yin Huai <yh...@databricks.com> wrote:
> >>
> >> Hi Krishna,
> >>
> >> Can you share your code to reproduce the memory allocation issue?
> >>
> >> Thanks,
> >>
> >> Yin
> >>
> >> On Fri, Sep 4, 2015 at 8:00 AM, Krishna Sankar <ks...@gmail.com>
> >> wrote:
> >>>
> >>> Thanks Tom.  Interestingly it happened between RC2 and RC3.
> >>> Now my vote is +1/2 unless the memory error is known and has a
> >>> workaround.
> >>>
> >>> Cheers
> >>> <k/>
> >>>
> >>>
> >>> On Fri, Sep 4, 2015 at 7:30 AM, Tom Graves <tg...@yahoo.com>
> wrote:
> >>>>
> >>>> The upper/lower case thing is known.
> >>>> https://issues.apache.org/jira/browse/SPARK-9550
>> >>>> I assume it was decided to be OK and it's going to be in the release
>> >>>> notes, but Reynold or Josh can probably speak to it more.
> >>>>
> >>>> Tom
> >>>>
> >>>>
> >>>>
> >>>> On Thursday, September 3, 2015 10:21 PM, Krishna Sankar
> >>>> <ks...@gmail.com> wrote:
> >>>>
> >>>>
> >>>> +?
> >>>>
> >>>> 1. Compiled OSX 10.10 (Yosemite) OK Total time: 26:09 min
> >>>>      mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
> >>>> 2. Tested pyspark, mllib
> >>>> 2.1. statistics (min,max,mean,Pearson,Spearman) OK
> >>>> 2.2. Linear/Ridge/Lasso Regression OK
> >>>> 2.3. Decision Tree, Naive Bayes OK
> >>>> 2.4. KMeans OK
> >>>>        Center And Scale OK
> >>>> 2.5. RDD operations OK
> >>>>       State of the Union Texts - MapReduce, Filter,sortByKey (word
> >>>> count)
> >>>> 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
> >>>>        Model evaluation/optimization (rank, numIter, lambda) with
> >>>> itertools OK
> >>>> 3. Scala - MLlib
> >>>> 3.1. statistics (min,max,mean,Pearson,Spearman) OK
> >>>> 3.2. LinearRegressionWithSGD OK
> >>>> 3.3. Decision Tree OK
> >>>> 3.4. KMeans OK
> >>>> 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
> >>>> 3.6. saveAsParquetFile OK
> >>>> 3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
> >>>> registerTempTable, sql OK
> >>>> 3.8. result = sqlContext.sql("SELECT
> >>>> OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders
> INNER
> >>>> JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
> >>>> 4.0. Spark SQL from Python OK
> >>>> 4.1. result = sqlContext.sql("SELECT * from people WHERE State =
> 'WA'")
> >>>> OK
> >>>> 5.0. Packages
> >>>> 5.1. com.databricks.spark.csv - read/write OK
> >>>> (--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work.
> But
> >>>> com.databricks:spark-csv_2.11:1.2.0 worked)
> >>>> 6.0. DataFrames
> >>>> 6.1. cast,dtypes OK
> >>>> 6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
> >>>> 6.3. All joins,sql,set operations,udf OK
> >>>>
> >>>> Two Problems:
> >>>>
> >>>> 1. The synthetic column names are lowercase (i.e. now ‘sum(OrderPrice)’
> >>>> instead of ‘SUM(OrderPrice)’, and ‘avg(Total)’ instead of ‘AVG(Total)’).
> >>>> So programs that depend on the case of the synthetic column names would
> >>>> fail.
> >>>> 2. orders_3.groupBy("Year","Month").sum('Total').show()
> >>>>     fails with the error ‘java.io.IOException: Unable to acquire
> 4194304
> >>>> bytes of memory’
> >>>>     orders_3.groupBy("CustomerID","Year").sum('Total').show() - fails
> >>>> with the same error
> >>>>     Is this a known bug?
> >>>> Cheers
> >>>> <k/>
> >>>> P.S: Sorry for the spam, forgot Reply All
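
For context, a self-contained sketch of the failing aggregation described
above (PySpark 1.5; the input path and the orders_3 columns are assumptions
based on this report, not the actual notebook):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext("local[*]", "groupby-sum-repro")
    sqlContext = SQLContext(sc)

    # Hypothetical input; the real data comes from the 004-Orders notebook.
    orders_3 = sqlContext.read.json("orders.json")

    # Both aggregations were reported to fail with
    # "java.io.IOException: Unable to acquire 4194304 bytes of memory".
    orders_3.groupBy("Year", "Month").sum("Total").show()
    orders_3.groupBy("CustomerID", "Year").sum("Total").show()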

Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

Posted by Davies Liu <da...@databricks.com>.
Could you update the notebook to use the built-in SQL functions month and
year instead of the Python UDFs? (They were introduced in 1.5.)

Once those two UDFs are removed, it runs successfully and is also much faster.
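
A minimal sketch of that change (assuming a running SQLContext named
sqlContext, as in the pyspark shell, and an OrderDate column; the names and
input path are illustrative, not the notebook's actual code):

    from pyspark.sql import functions as F

    orders = sqlContext.read.json("orders.json")   # hypothetical input

    # Built-in expressions run inside the JVM, avoiding the Python UDF
    # round-trip that the notebook previously paid for on every row.
    orders = (orders
              .withColumn("Year",  F.year("OrderDate"))
              .withColumn("Month", F.month("OrderDate")))
    orders.groupBy("Year", "Month").sum("Total").show()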

On Fri, Sep 4, 2015 at 2:22 PM, Krishna Sankar <ks...@gmail.com> wrote:
> Yin,
>    It is the
> https://github.com/xsankar/global-bd-conf/blob/master/004-Orders.ipynb.
> Cheers
> <k/>
>
> On Fri, Sep 4, 2015 at 9:58 AM, Yin Huai <yh...@databricks.com> wrote:
>>
>> Hi Krishna,
>>
>> Can you share your code to reproduce the memory allocation issue?
>>
>> Thanks,
>>
>> Yin
>>
>> On Fri, Sep 4, 2015 at 8:00 AM, Krishna Sankar <ks...@gmail.com>
>> wrote:
>>>
>>> Thanks Tom.  Interestingly it happened between RC2 and RC3.
>>> Now my vote is +1/2 unless the memory error is known and has a
>>> workaround.
>>>
>>> Cheers
>>> <k/>
>>>
>>>
>>> On Fri, Sep 4, 2015 at 7:30 AM, Tom Graves <tg...@yahoo.com> wrote:
>>>>
>>>> The upper/lower case thing is known.
>>>> https://issues.apache.org/jira/browse/SPARK-9550
>>>> I assume it was decided to be OK and it's going to be in the release
>>>> notes, but Reynold or Josh can probably speak to it more.
>>>>
>>>> Tom
>>>>
>>>>
>>>>



Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

Posted by Krishna Sankar <ks...@gmail.com>.
Yin,
   It is the
https://github.com/xsankar/global-bd-conf/blob/master/004-Orders.ipynb.
Cheers
<k/>

On Fri, Sep 4, 2015 at 9:58 AM, Yin Huai <yh...@databricks.com> wrote:

> Hi Krishna,
>
> Can you share your code to reproduce the memory allocation issue?
>
> Thanks,
>
> Yin
>
> On Fri, Sep 4, 2015 at 8:00 AM, Krishna Sankar <ks...@gmail.com>
> wrote:
>
>> Thanks Tom.  Interestingly it happened between RC2 and RC3.
>> Now my vote is +1/2 unless the memory error is known and has a workaround.
>>
>> Cheers
>> <k/>
>>
>>
>> On Fri, Sep 4, 2015 at 7:30 AM, Tom Graves <tg...@yahoo.com> wrote:
>>
>>> The upper/lower case thing is known.
>>> https://issues.apache.org/jira/browse/SPARK-9550
>>> I assume it was decided to be OK and it's going to be in the release
>>> notes, but Reynold or Josh can probably speak to it more.
>>>
>>> Tom
>>>
>>>
>>>

Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

Posted by Yin Huai <yh...@databricks.com>.
Hi Krishna,

Can you share your code to reproduce the memory allocation issue?

Thanks,

Yin

On Fri, Sep 4, 2015 at 8:00 AM, Krishna Sankar <ks...@gmail.com> wrote:

> Thanks Tom.  Interestingly it happened between RC2 and RC3.
> Now my vote is +1/2 unless the memory error is known and has a workaround.
>
> Cheers
> <k/>
>
>
> On Fri, Sep 4, 2015 at 7:30 AM, Tom Graves <tg...@yahoo.com> wrote:
>
>> The upper/lower case thing is known.
>> https://issues.apache.org/jira/browse/SPARK-9550
>> I assume it was decided to be OK and it's going to be in the release
>> notes, but Reynold or Josh can probably speak to it more.
>>
>> Tom
>>
>>
>>

Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

Posted by Krishna Sankar <ks...@gmail.com>.
Thanks Tom.  Interestingly it happened between RC2 and RC3.
Now my vote is +1/2 unless the memory error is known and has a workaround.

Cheers
<k/>


On Fri, Sep 4, 2015 at 7:30 AM, Tom Graves <tg...@yahoo.com> wrote:

> The upper/lower case thing is known.
> https://issues.apache.org/jira/browse/SPARK-9550
> I assume it was decided to be OK and it's going to be in the release
> notes, but Reynold or Josh can probably speak to it more.
>
> Tom
>
>
>

Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

Posted by Tom Graves <tg...@yahoo.com.INVALID>.
The upper/lower case thing is known:
https://issues.apache.org/jira/browse/SPARK-9550
I assume it was decided to be OK and it's going to be in the release notes,
but Reynold or Josh can probably speak to it more.

Tom
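
One way to write code that is robust to this change (a PySpark sketch; the
DataFrame and column names are assumptions): give the aggregate an explicit
alias instead of relying on the case of the auto-generated column name.

    from pyspark.sql import functions as F

    # Works the same on 1.4 ("SUM(Total)") and 1.5 ("sum(Total)").
    totals = orders.groupBy("Year").agg(F.sum("Total").alias("total_sum"))
    totals.select("total_sum").show()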


     On Thursday, September 3, 2015 10:21 PM, Krishna Sankar <ks...@gmail.com> wrote:
   

 +? 
1. Compiled OSX 10.10 (Yosemite) OK Total time: 26:09 min      mvn clean package -Pyarn -Phadoop-2.6 -DskipTests2. Tested pyspark, mllib2.1. statistics (min,max,mean,Pearson,Spearman) OK2.2. Linear/Ridge/Laso Regression OK 2.3. Decision Tree, Naive Bayes OK2.4. KMeans OK       Center And Scale OK2.5. RDD operations OK      State of the Union Texts - MapReduce, Filter,sortByKey (word count)2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK       Model evaluation/optimization (rank, numIter, lambda) with itertools OK3. Scala - MLlib3.1. statistics (min,max,mean,Pearson,Spearman) OK3.2. LinearRegressionWithSGD OK3.3. Decision Tree OK3.4. KMeans OK3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK3.6. saveAsParquetFile OK3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile, registerTempTable, sql OK3.8. result = sqlContext.sql("SELECT OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK4.0. Spark SQL from Python OK4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK5.0. Packages5.1. com.databricks.spark.csv - read/write OK(--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work. But com.databricks:spark-csv_2.11:1.2.0 worked)6.0. DataFrames 6.1. cast,dtypes OK6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK6.3. All joins,sql,set operations,udf OK
Two Problems:
1. The synthetic column names are lowercase ( i.e. now ‘sum(OrderPrice)’; previously ‘SUM(OrderPrice)’, now ‘avg(Total)’; previously 'AVG(Total)'). So programs that depend on the case of the synthetic column names would fail.2. orders_3.groupBy("Year","Month").sum('Total').show()    fails with the error ‘java.io.IOException: Unable to acquire 4194304 bytes of memory’    orders_3.groupBy("CustomerID","Year").sum('Total').show() - fails with the same error    Is this a known bug ?Cheers<k/>P.S: Sorry for the spam, forgot Reply All 
On Tue, Sep 1, 2015 at 1:41 PM, Reynold Xin <rx...@databricks.com> wrote:

Please vote on releasing the following candidate as Apache Spark version 1.5.0. The vote is open until Friday, Sep 4, 2015 at 21:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast.
[ ] +1 Release this package as Apache Spark 1.5.0[ ] -1 Do not release this package because ...
To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v1.5.0-rc3:https://github.com/apache/spark/commit/908e37bcc10132bb2aa7f80ae694a9df6e40f31a
The release files, including signatures, digests, etc. can be found at:http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-bin/
Release artifacts are signed with the following key:https://people.apache.org/keys/committer/pwendell.asc
The staging repository for this release (published as 1.5.0-rc3) can be found at:https://repository.apache.org/content/repositories/orgapachespark-1143/
The staging repository for this release (published as 1.5.0) can be found at:https://repository.apache.org/content/repositories/orgapachespark-1142/
The documentation corresponding to this release can be found at:http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-docs/

=======================================How can I help test this release?=======================================If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions.

================================================What justifies a -1 vote for this release?================================================This vote is happening towards the end of the 1.5 QA period, so -1 votes should only occur for significant regressions from 1.4. Bugs already present in 1.4, minor regressions, or bugs related to new features will not block this release.

===============================================================What should happen to JIRA tickets still targeting 1.5.0?===============================================================1. It is OK for documentation patches to target 1.5.0 and still go into branch-1.5, since documentations will be packaged separately from the release.2. New features for non-alpha-modules should target 1.6+.3. Non-blocker bug fixes should target 1.5.1 or 1.6.0, or drop the target version.

==================================================Major changes to help you focus your testing==================================================
As of today, Spark 1.5 contains more than 1000 commits from 220+ contributors. I've curated a list of important changes for 1.5. For the complete list, please refer to Apache JIRA changelog.
RDD/DataFrame/SQL APIs
- New UDAF interface
- DataFrame hints for broadcast join
- expr function for turning a SQL expression into DataFrame column
- Improved support for NaN values
- StructType now supports ordering
- TimestampType precision is reduced to 1us
- 100 new built-in expressions, including date/time, string, math
- memory and local disk only checkpointing
DataFrame/SQL Backend Execution
- Code generation on by default
- Improved join, aggregation, shuffle, sorting with cache friendly algorithms and external algorithms
- Improved window function performance
- Better metrics instrumentation and reporting for DF/SQL execution plans
Data Sources, Hive, Hadoop, Mesos and Cluster Management
- Dynamic allocation support in all resource managers (Mesos, YARN, Standalone)
- Improved Mesos support (framework authentication, roles, dynamic allocation, constraints)
- Improved YARN support (dynamic allocation with preferred locations)
- Improved Hive support (metastore partition pruning, metastore connectivity to 0.13 to 1.2, internal Hive upgrade to 1.2)
- Support persisting data in Hive compatible format in metastore
- Support data partitioning for JSON data sources
- Parquet improvements (upgrade to 1.7, predicate pushdown, faster metadata discovery and schema merging, support reading non-standard legacy Parquet files generated by other libraries)
- Faster and more robust dynamic partition insert
- DataSourceRegister interface for external data sources to specify short names
SparkR
- YARN cluster mode in R
- GLMs with R formula, binomial/Gaussian families, and elastic-net regularization
- Improved error messages
- Aliases to make DataFrame functions more R-like
Streaming
- Backpressure for handling bursty input streams.
- Improved Python support for streaming sources (Kafka offsets, Kinesis, MQTT, Flume)
- Improved Python streaming machine learning algorithms (K-Means, linear regression, logistic regression)
- Native reliable Kinesis stream support
- Input metadata like Kafka offsets made visible in the batch details UI
- Better load balancing and scheduling of receivers across cluster
- Include streaming storage in web UI
Machine Learning and Advanced Analytics
- Feature transformers: CountVectorizer, Discrete Cosine transformation, MinMaxScaler, NGram, PCA, RFormula, StopWordsRemover, and VectorSlicer.
- Estimators under pipeline APIs: naive Bayes, k-means, and isotonic regression.
- Algorithms: multilayer perceptron classifier, PrefixSpan for sequential pattern mining, association rule generation, 1-sample Kolmogorov-Smirnov test.
- Improvements to existing algorithms: LDA, trees/ensembles, GMMs
- More efficient Pregel API implementation for GraphX
- Model summary for linear and logistic regression.
- Python API: distributed matrices, streaming k-means and linear models, LDA, power iteration clustering, etc.
- Tuning and evaluation: train-validation split and multiclass classification evaluator.
- Documentation: document the release version of public API methods


Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

Posted by Denny Lee <de...@gmail.com>.
+1

Distinct count test is blazing fast - awesome!
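
For context, one way such a distinct-count check might be run - a hedged sketch only; the Parquet path and column name are illustrative, not from the original test:

import time
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="distinct-count-check")
sqlContext = SQLContext(sc)

# Any existing DataFrame works; the Parquet file and column are placeholders.
df = sqlContext.read.parquet("events.parquet")

start = time.time()
n = df.select("user_id").distinct().count()
print("distinct user_id values: %d in %.1f s" % (n, time.time() - start))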

On Thu, Sep 3, 2015 at 8:21 PM Krishna Sankar <ks...@gmail.com> wrote:

> +?
>
> 1. Compiled OSX 10.10 (Yosemite) OK Total time: 26:09 min
>      mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
> 2. Tested pyspark, mllib
> 2.1. statistics (min,max,mean,Pearson,Spearman) OK
> 2.2. Linear/Ridge/Lasso Regression OK
> 2.3. Decision Tree, Naive Bayes OK
> 2.4. KMeans OK
>        Center And Scale OK
> 2.5. RDD operations OK
>       State of the Union Texts - MapReduce, Filter,sortByKey (word count)
> 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
>        Model evaluation/optimization (rank, numIter, lambda) with
> itertools OK
> 3. Scala - MLlib
> 3.1. statistics (min,max,mean,Pearson,Spearman) OK
> 3.2. LinearRegressionWithSGD OK
> 3.3. Decision Tree OK
> 3.4. KMeans OK
> 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
> 3.6. saveAsParquetFile OK
> 3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
> registerTempTable, sql OK
> 3.8. result = sqlContext.sql("SELECT
> OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
> JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
> 4.0. Spark SQL from Python OK
> 4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
> 5.0. Packages
> 5.1. com.databricks.spark.csv - read/write OK
> (--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work. But
> com.databricks:spark-csv_2.11:1.2.0 worked)
> 6.0. DataFrames
> 6.1. cast,dtypes OK
> 6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
> 6.3. All joins,sql,set operations,udf OK
>
> Two Problems:
>
> 1. The synthetic column names are lowercase ( i.e. now ‘sum(OrderPrice)’;
> previously ‘SUM(OrderPrice)’, now ‘avg(Total)’; previously 'AVG(Total)').
> So programs that depend on the case of the synthetic column names would
> fail.
> 2. orders_3.groupBy("Year","Month").sum('Total').show()
>     fails with the error ‘java.io.IOException: Unable to acquire 4194304
> bytes of memory’
>     orders_3.groupBy("CustomerID","Year").sum('Total').show() - fails with
> the same error
>     Is this a known bug ?
> Cheers
> <k/>
> P.S: Sorry for the spam, forgot Reply All
>
> On Tue, Sep 1, 2015 at 1:41 PM, Reynold Xin <rx...@databricks.com> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.5.0. The vote is open until Friday, Sep 4, 2015 at 21:00 UTC and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.5.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>>
>> The tag to be voted on is v1.5.0-rc3:
>>
>> https://github.com/apache/spark/commit/908e37bcc10132bb2aa7f80ae694a9df6e40f31a
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release (published as 1.5.0-rc3) can be
>> found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1143/
>>
>> The staging repository for this release (published as 1.5.0) can be found
>> at:
>> https://repository.apache.org/content/repositories/orgapachespark-1142/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-docs/
>>
>>
>> =======================================
>> How can I help test this release?
>> =======================================
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>>
>> ================================================
>> What justifies a -1 vote for this release?
>> ================================================
>> This vote is happening towards the end of the 1.5 QA period, so -1 votes
>> should only occur for significant regressions from 1.4. Bugs already
>> present in 1.4, minor regressions, or bugs related to new features will not
>> block this release.
>>
>>
>> ===============================================================
>> What should happen to JIRA tickets still targeting 1.5.0?
>> ===============================================================
>> 1. It is OK for documentation patches to target 1.5.0 and still go into
>> branch-1.5, since documentations will be packaged separately from the
>> release.
>> 2. New features for non-alpha-modules should target 1.6+.
>> 3. Non-blocker bug fixes should target 1.5.1 or 1.6.0, or drop the target
>> version.
>>
>>
>> ==================================================
>> Major changes to help you focus your testing
>> ==================================================
>>
>> As of today, Spark 1.5 contains more than 1000 commits from 220+
>> contributors. I've curated a list of important changes for 1.5. For the
>> complete list, please refer to Apache JIRA changelog.
>>
>> RDD/DataFrame/SQL APIs
>>
>> - New UDAF interface
>> - DataFrame hints for broadcast join
>> - expr function for turning a SQL expression into DataFrame column
>> - Improved support for NaN values
>> - StructType now supports ordering
>> - TimestampType precision is reduced to 1us
>> - 100 new built-in expressions, including date/time, string, math
>> - memory and local disk only checkpointing
>>
>> DataFrame/SQL Backend Execution
>>
>> - Code generation on by default
>> - Improved join, aggregation, shuffle, sorting with cache friendly
>> algorithms and external algorithms
>> - Improved window function performance
>> - Better metrics instrumentation and reporting for DF/SQL execution plans
>>
>> Data Sources, Hive, Hadoop, Mesos and Cluster Management
>>
>> - Dynamic allocation support in all resource managers (Mesos, YARN,
>> Standalone)
>> - Improved Mesos support (framework authentication, roles, dynamic
>> allocation, constraints)
>> - Improved YARN support (dynamic allocation with preferred locations)
>> - Improved Hive support (metastore partition pruning, metastore
>> connectivity to 0.13 to 1.2, internal Hive upgrade to 1.2)
>> - Support persisting data in Hive compatible format in metastore
>> - Support data partitioning for JSON data sources
>> - Parquet improvements (upgrade to 1.7, predicate pushdown, faster
>> metadata discovery and schema merging, support reading non-standard legacy
>> Parquet files generated by other libraries)
>> - Faster and more robust dynamic partition insert
>> - DataSourceRegister interface for external data sources to specify short
>> names
>>
>> SparkR
>>
>> - YARN cluster mode in R
>> - GLMs with R formula, binomial/Gaussian families, and elastic-net
>> regularization
>> - Improved error messages
>> - Aliases to make DataFrame functions more R-like
>>
>> Streaming
>>
>> - Backpressure for handling bursty input streams.
>> - Improved Python support for streaming sources (Kafka offsets, Kinesis,
>> MQTT, Flume)
>> - Improved Python streaming machine learning algorithms (K-Means, linear
>> regression, logistic regression)
>> - Native reliable Kinesis stream support
>> - Input metadata like Kafka offsets made visible in the batch details UI
>> - Better load balancing and scheduling of receivers across cluster
>> - Include streaming storage in web UI
>>
>> Machine Learning and Advanced Analytics
>>
>> - Feature transformers: CountVectorizer, Discrete Cosine transformation,
>> MinMaxScaler, NGram, PCA, RFormula, StopWordsRemover, and VectorSlicer.
>> - Estimators under pipeline APIs: naive Bayes, k-means, and isotonic
>> regression.
>> - Algorithms: multilayer perceptron classifier, PrefixSpan for sequential
>> pattern mining, association rule generation, 1-sample Kolmogorov-Smirnov
>> test.
>> - Improvements to existing algorithms: LDA, trees/ensembles, GMMs
>> - More efficient Pregel API implementation for GraphX
>> - Model summary for linear and logistic regression.
>> - Python API: distributed matrices, streaming k-means and linear models,
>> LDA, power iteration clustering, etc.
>> - Tuning and evaluation: train-validation split and multiclass
>> classification evaluator.
>> - Documentation: document the release version of public API methods
>>
>>
>>
>

Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

Posted by Krishna Sankar <ks...@gmail.com>.
+?

1. Compiled OSX 10.10 (Yosemite) OK Total time: 26:09 min
     mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
2. Tested pyspark, mllib
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
       Center And Scale OK
2.5. RDD operations OK
      State of the Union Texts - MapReduce, Filter,sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
       Model evaluation/optimization (rank, numIter, lambda) with itertools
OK
3. Scala - MLlib
3.1. statistics (min,max,mean,Pearson,Spearman) OK
3.2. LinearRegressionWithSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
3.6. saveAsParquetFile OK
3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
registerTempTable, sql OK
3.8. result = sqlContext.sql("SELECT
OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
4.0. Spark SQL from Python OK
4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
5.0. Packages
5.1. com.databricks.spark.csv - read/write OK
(--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work, but
com.databricks:spark-csv_2.11:1.2.0 worked; see the usage sketch after this
checklist)
6.0. DataFrames
6.1. cast,dtypes OK
6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
6.3. All joins,sql,set operations,udf OK
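
For 5.1, a minimal sketch of how the package test can be reproduced, assuming a pyspark shell launched with the coordinates that worked above (bin/pyspark --packages com.databricks:spark-csv_2.11:1.2.0), where sc and sqlContext are predefined; the orders.csv file name is hypothetical and stands in for the orders data used in this report:

# Read with the spark-csv data source (header and schema inference on).
orders_3 = sqlContext.read.format("com.databricks.spark.csv") \
    .options(header="true", inferSchema="true") \
    .load("orders.csv")
orders_3.printSchema()

# Round-trip write check with the same package.
orders_3.write.format("com.databricks.spark.csv") \
    .options(header="true") \
    .save("orders_roundtrip")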

Two Problems:

1. The synthetic column names are lowercase ( i.e. now ‘sum(OrderPrice)’;
previously ‘SUM(OrderPrice)’, now ‘avg(Total)’; previously 'AVG(Total)').
So programs that depend on the case of the synthetic column names would
fail.
2. orders_3.groupBy("Year","Month").sum('Total').show()
    fails with the error ‘java.io.IOException: Unable to acquire 4194304
bytes of memory’
    orders_3.groupBy("CustomerID","Year").sum('Total').show() - fails with
the same error
    Is this a known bug ?
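
For reference, a hedged PySpark sketch of possible workarounds for the two problems, reusing orders_3 and sqlContext from the spark-csv sketch above; the alias name and the Tungsten fallback are suggestions to try, not confirmed fixes:

from pyspark.sql import functions as F

# Problem 1: rather than depending on the case of synthetic names such as
# 'sum(OrderPrice)' / 'avg(Total)', alias each aggregate explicitly so the
# result column name is the same on 1.4 and 1.5.
monthly = orders_3.groupBy("Year", "Month").agg(F.sum("Total").alias("TotalSum"))
monthly.show()

# Problem 2 (guess only): fall back to the pre-Tungsten aggregation path and
# retry the query that hit "Unable to acquire 4194304 bytes of memory".
sqlContext.setConf("spark.sql.tungsten.enabled", "false")
orders_3.groupBy("CustomerID", "Year").agg(F.sum("Total").alias("TotalSum")).show()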
Cheers
<k/>
P.S: Sorry for the spam, forgot Reply All

On Tue, Sep 1, 2015 at 1:41 PM, Reynold Xin <rx...@databricks.com> wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.5.0. The vote is open until Friday, Sep 4, 2015 at 21:00 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.5.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
>
> The tag to be voted on is v1.5.0-rc3:
>
> https://github.com/apache/spark/commit/908e37bcc10132bb2aa7f80ae694a9df6e40f31a
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release (published as 1.5.0-rc3) can be
> found at:
> https://repository.apache.org/content/repositories/orgapachespark-1143/
>
> The staging repository for this release (published as 1.5.0) can be found
> at:
> https://repository.apache.org/content/repositories/orgapachespark-1142/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-docs/
>
>
> =======================================
> How can I help test this release?
> =======================================
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
>
> ================================================
> What justifies a -1 vote for this release?
> ================================================
> This vote is happening towards the end of the 1.5 QA period, so -1 votes
> should only occur for significant regressions from 1.4. Bugs already
> present in 1.4, minor regressions, or bugs related to new features will not
> block this release.
>
>
> ===============================================================
> What should happen to JIRA tickets still targeting 1.5.0?
> ===============================================================
> 1. It is OK for documentation patches to target 1.5.0 and still go into
> branch-1.5, since documentations will be packaged separately from the
> release.
> 2. New features for non-alpha-modules should target 1.6+.
> 3. Non-blocker bug fixes should target 1.5.1 or 1.6.0, or drop the target
> version.
>
>
> ==================================================
> Major changes to help you focus your testing
> ==================================================
>
> As of today, Spark 1.5 contains more than 1000 commits from 220+
> contributors. I've curated a list of important changes for 1.5. For the
> complete list, please refer to Apache JIRA changelog.
>
> RDD/DataFrame/SQL APIs
>
> - New UDAF interface
> - DataFrame hints for broadcast join
> - expr function for turning a SQL expression into DataFrame column
> - Improved support for NaN values
> - StructType now supports ordering
> - TimestampType precision is reduced to 1us
> - 100 new built-in expressions, including date/time, string, math
> - memory and local disk only checkpointing
>
> DataFrame/SQL Backend Execution
>
> - Code generation on by default
> - Improved join, aggregation, shuffle, sorting with cache friendly
> algorithms and external algorithms
> - Improved window function performance
> - Better metrics instrumentation and reporting for DF/SQL execution plans
>
> Data Sources, Hive, Hadoop, Mesos and Cluster Management
>
> - Dynamic allocation support in all resource managers (Mesos, YARN,
> Standalone)
> - Improved Mesos support (framework authentication, roles, dynamic
> allocation, constraints)
> - Improved YARN support (dynamic allocation with preferred locations)
> - Improved Hive support (metastore partition pruning, metastore
> connectivity to 0.13 to 1.2, internal Hive upgrade to 1.2)
> - Support persisting data in Hive compatible format in metastore
> - Support data partitioning for JSON data sources
> - Parquet improvements (upgrade to 1.7, predicate pushdown, faster
> metadata discovery and schema merging, support reading non-standard legacy
> Parquet files generated by other libraries)
> - Faster and more robust dynamic partition insert
> - DataSourceRegister interface for external data sources to specify short
> names
>
> SparkR
>
> - YARN cluster mode in R
> - GLMs with R formula, binomial/Gaussian families, and elastic-net
> regularization
> - Improved error messages
> - Aliases to make DataFrame functions more R-like
>
> Streaming
>
> - Backpressure for handling bursty input streams.
> - Improved Python support for streaming sources (Kafka offsets, Kinesis,
> MQTT, Flume)
> - Improved Python streaming machine learning algorithms (K-Means, linear
> regression, logistic regression)
> - Native reliable Kinesis stream support
> - Input metadata like Kafka offsets made visible in the batch details UI
> - Better load balancing and scheduling of receivers across cluster
> - Include streaming storage in web UI
>
> Machine Learning and Advanced Analytics
>
> - Feature transformers: CountVectorizer, Discrete Cosine transformation,
> MinMaxScaler, NGram, PCA, RFormula, StopWordsRemover, and VectorSlicer.
> - Estimators under pipeline APIs: naive Bayes, k-means, and isotonic
> regression.
> - Algorithms: multilayer perceptron classifier, PrefixSpan for sequential
> pattern mining, association rule generation, 1-sample Kolmogorov-Smirnov
> test.
> - Improvements to existing algorithms: LDA, trees/ensembles, GMMs
> - More efficient Pregel API implementation for GraphX
> - Model summary for linear and logistic regression.
> - Python API: distributed matrices, streaming k-means and linear models,
> LDA, power iteration clustering, etc.
> - Tuning and evaluation: train-validation split and multiclass
> classification evaluator.
> - Documentation: document the release version of public API methods
>
>
>