Posted to dev@spark.apache.org by Michael Armbrust <mi...@databricks.com> on 2015/12/02 21:26:53 UTC

[VOTE] Release Apache Spark 1.6.0 (RC1)

Please vote on releasing the following candidate as Apache Spark version
1.6.0!

The vote is open until Saturday, December 5, 2015 at 21:00 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.6.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is *v1.6.0-rc1
(bf525845cef159d2d4c9f4d64e158f037179b5c4)
<https://github.com/apache/spark/tree/v1.6.0-rc1>*

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-v1.6.0-rc1-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1165/

The test repository (versioned as v1.6.0-rc1) for this release can be found
at:
https://repository.apache.org/content/repositories/orgapachespark-1164/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc1-docs/


=======================================
== How can I help test this release? ==
=======================================
If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions.
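
For example, to compile an existing Scala job against the RC artifacts, you
can point your build at the staging repository above. A rough sbt sketch (the
resolver URL is from this email; the "1.6.0" version string is an assumption,
so please verify it against the staging repository index):

  // build.sbt sketch for testing an existing workload against the RC.
  // The resolver URL comes from this email; the "1.6.0" artifact version is an
  // assumption, so check the staging repository index for the exact version.
  scalaVersion := "2.10.5"

  resolvers += "spark-1.6.0-rc1 staging" at
    "https://repository.apache.org/content/repositories/orgapachespark-1165/"

  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core" % "1.6.0" % "provided",
    "org.apache.spark" %% "spark-sql"  % "1.6.0" % "provided"
  )

Then run the job with spark-submit from the binary distribution linked above
and compare results and timings against 1.5.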

================================================
== What justifies a -1 vote for this release? ==
================================================
This vote is happening towards the end of the 1.6 QA period, so -1 votes
should only occur for significant regressions from 1.5. Bugs already
present in 1.5, minor regressions, or bugs related to new features will not
block this release.

===============================================================
== What should happen to JIRA tickets still targeting 1.6.0? ==
===============================================================
1. It is OK for documentation patches to target 1.6.0 and still go into
branch-1.6, since documentation will be published separately from the
release.
2. New features for non-alpha-modules should target 1.7+.
3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
version.


==================================================
== Major changes to help you focus your testing ==
==================================================

Spark SQL

   - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
   Session Management - The ability to create multiple isolated SQL
   Contexts that have their own configuration and default database.  This is
   turned on by default in the thrift server.
   - SPARK-9999  <https://issues.apache.org/jira/browse/SPARK-9999> Dataset
   API - A type-safe API (similar to RDDs) that performs many operations on
   serialized binary data and code generation (i.e. Project Tungsten). See the
   sketch after this list.
   - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000> Unified
   Memory Management - Shared memory for execution and caching instead of
   exclusive division of the regions.
   - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
   Queries on Files - Concise syntax for running SQL queries over files of
   any supported format without registering a table.
   - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745> Reading
   non-standard JSON files - Added options to read non-standard JSON files
   (e.g. single-quotes, unquoted attributes)
   - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412> Per-operator
   Metrics for SQL Execution - Display statistics on a per-operator basis
   for memory usage and spilled data size.
   - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
   (*) expansion for StructTypes - Makes it easier to nest and unnest
   arbitrary numbers of columns
   - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
   SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149> In-memory
   Columnar Cache Performance - Significant (up to 14x) speed up when
   caching data that contains complex types in DataFrames or SQL.
   - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
   null-safe joins - Joins using null-safe equality (<=>) will now execute
   using SortMergeJoin instead of computing a cartesian product.
   - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
   Execution Using Off-Heap Memory - Support for configuring query
   execution to occur using off-heap memory to avoid GC overhead
   - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978> Datasource
   API Avoid Double Filter - When implementing a datasource with filter
   pushdown, developers can now tell Spark SQL to avoid double evaluating a
   pushed-down filter.
   - SPARK-4849  <https://issues.apache.org/jira/browse/SPARK-4849> Advanced
   Layout of Cached Data - storing partitioning and ordering schemes in the
   in-memory table scan, and adding distributeBy and localSort to the DataFrame API
   - SPARK-9858  <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive
   query execution - Initial support for automatically selecting the number
   of reducers for joins and aggregations.
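
To make the Dataset API (SPARK-9999) and SQL-queries-on-files (SPARK-11197)
items above concrete, here is a minimal sketch; it assumes a SQLContext named
sqlContext, and the case class and file path are hypothetical:

  // Sketch of the typed Dataset API and of querying a file without registering a table.
  import sqlContext.implicits._

  case class Person(name: String, age: Long)

  // Datasets are strongly typed and operate on serialized (Tungsten) binary data.
  val people = Seq(Person("alice", 30L), Person("bob", 17L)).toDS()
  val adultNames = people.filter(_.age >= 18).map(_.name)
  adultNames.show()

  // Run SQL directly over a Parquet file, no temporary table needed.
  sqlContext.sql("SELECT name, age FROM parquet.`/path/to/people.parquet`").show()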

Spark Streaming

   - API Updates
      - SPARK-2629  <https://issues.apache.org/jira/browse/SPARK-2629> New,
      improved state management - trackStateByKey - a DStream
      transformation for stateful stream processing that supersedes
      updateStateByKey in functionality and performance (see the sketch
      after this list).
      - SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198> Kinesis
      record deaggregation - Kinesis streams have been upgraded to use KCL
      1.4.0 and support transparent deaggregation of KPL-aggregated records.
      - SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891> Kinesis
      message handler function - Allows an arbitrary function to be applied to
      a Kinesis record in the Kinesis receiver to customize what data is to be
      stored in memory.
      - SPARK-6328  <https://issues.apache.org/jira/browse/SPARK-6328>
       Python Streaming Listener API - Get streaming statistics (scheduling
      delays, batch processing times, etc.) in streaming.


   - UI Improvements
      - Made failures visible in the streaming tab, in the timelines, batch
      list, and batch details page.
      - Made output operations visible in the streaming tab as progress bars
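
As a concrete illustration of the trackStateByKey item above, here is a
minimal sketch of a stateful running word count. The StateSpec.function and
State method names follow the SPARK-2629 design and should be verified against
the RC scaladoc; wordCounts is a hypothetical DStream[(String, Int)]:

  import org.apache.spark.streaming.{State, StateSpec, Time}

  // Tracking function: combine the new counts for a key with the stored total.
  def trackingFunc(batchTime: Time, word: String, count: Option[Int],
                   state: State[Long]): Option[(String, Long)] = {
    val sum = count.getOrElse(0).toLong + state.getOption.getOrElse(0L)
    state.update(sum)          // persist the running total across batches
    Some((word, sum))          // emit the updated (word, total) pair downstream
  }

  val totals = wordCounts.trackStateByKey(StateSpec.function(trackingFunc _))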

MLlib

New algorithms/models

   - SPARK-8518  <https://issues.apache.org/jira/browse/SPARK-8518> Survival
   analysis - Log-linear model for survival analysis
   - SPARK-9834  <https://issues.apache.org/jira/browse/SPARK-9834> Normal
   equation for least squares - Normal equation solver, providing R-like
   model summary statistics
   - SPARK-3147  <https://issues.apache.org/jira/browse/SPARK-3147> Online
   hypothesis testing - A/B testing in the Spark Streaming framework
   - SPARK-9930  <https://issues.apache.org/jira/browse/SPARK-9930> New
   feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
   transformer
   - SPARK-6517  <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting
   K-Means clustering - Fast top-down clustering variant of K-Means (see the
   sketch after this list)
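
A minimal sketch of the bisecting K-Means item above; the data is hypothetical
and the builder methods are assumptions based on the usual spark.mllib
conventions:

  import org.apache.spark.mllib.clustering.BisectingKMeans
  import org.apache.spark.mllib.linalg.Vectors

  // Two obvious clusters of 2-D points (hypothetical data).
  val points = sc.parallelize(Seq(
    Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
    Vectors.dense(9.0, 8.0), Vectors.dense(8.0, 9.0)))

  // Top-down (divisive) clustering: repeatedly bisect until k leaf clusters remain.
  val model = new BisectingKMeans().setK(2).run(points)
  model.clusterCenters.foreach(println)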

API improvements

   - ML Pipelines
      - SPARK-6725  <https://issues.apache.org/jira/browse/SPARK-6725> Pipeline
      persistence - Save/load for ML Pipelines, with partial coverage of
      spark.ml algorithms (see the sketch after this list)
      - SPARK-5565  <https://issues.apache.org/jira/browse/SPARK-5565> LDA
      in ML Pipelines - API for Latent Dirichlet Allocation in ML Pipelines
   - R API
      - SPARK-9836  <https://issues.apache.org/jira/browse/SPARK-9836> R-like
      statistics for GLMs - (Partial) R-like stats for ordinary least
      squares via summary(model)
      - SPARK-9681  <https://issues.apache.org/jira/browse/SPARK-9681> Feature
      interactions in R formula - Interaction operator ":" in R formula
   - Python API - Many improvements to Python API to approach feature parity
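
For the pipeline persistence item above, a rough sketch of the intended
save/load round trip; the stages and path are hypothetical, and the exact
writer/reader entry points are assumptions to verify against the 1.6 API docs:

  import org.apache.spark.ml.Pipeline
  import org.apache.spark.ml.classification.LogisticRegression
  import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

  // A small text-classification pipeline (hypothetical column names).
  val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
  val lr = new LogisticRegression().setMaxIter(10)
  val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

  // Round-trip the (unfitted) pipeline through save/load.
  pipeline.write.overwrite().save("/tmp/spark-1.6-lr-pipeline")
  val restored = Pipeline.load("/tmp/spark-1.6-lr-pipeline")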

Misc improvements

   - SPARK-7685  <https://issues.apache.org/jira/browse/SPARK-7685>,
   SPARK-9642  <https://issues.apache.org/jira/browse/SPARK-9642> Instance
   weights for GLMs - Logistic and Linear Regression can take instance
   weights
   - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
   SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate
   and bivariate statistics in DataFrames - Variance, stddev, correlations,
   etc.
   - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM
   data source - LIBSVM as a SQL data source (see the sketch below)

Documentation improvements

   - SPARK-7751  <https://issues.apache.org/jira/browse/SPARK-7751> @since
   versions - Documentation includes initial version when classes and
   methods were added
   - SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable
   example code - Automated testing for code in user guide examples
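
To illustrate the DataFrame statistics and LIBSVM items above, a short sketch
where df is a hypothetical DataFrame with numeric columns x and y, and the
file path is hypothetical:

  import org.apache.spark.sql.functions.{corr, stddev, variance}

  // Univariate and bivariate statistics directly in a DataFrame aggregation.
  df.agg(variance("x"), stddev("x"), corr("x", "y")).show()

  // Read a LIBSVM file as a DataFrame with "label" and "features" columns.
  val training = sqlContext.read.format("libsvm").load("/path/to/sample_libsvm_data.txt")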

Deprecations

   - In spark.mllib.clustering.KMeans, the "runs" parameter has been
   deprecated.
   - In spark.ml.classification.LogisticRegressionModel and
   spark.ml.regression.LinearRegressionModel, the "weights" field has been
   deprecated, in favor of the new name "coefficients." This helps
   disambiguate from instance (row) weights given to algorithms.

Changes of behavior

   - spark.mllib.tree.GradientBoostedTrees validationTol has changed
   semantics in 1.6. Previously, it was a threshold for absolute change in
   error. Now, it resembles the behavior of GradientDescent convergenceTol:
   For large errors, it uses relative error (relative to the previous error);
   for small errors (< 0.01), it uses absolute error.
   - spark.ml.feature.RegexTokenizer: Previously, it did not convert
   strings to lowercase before tokenizing. Now, it converts to lowercase by
   default, with an option not to. This matches the behavior of the simpler
   Tokenizer transformer (see the sketch below).
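
For the RegexTokenizer change, a one-line sketch of opting back out of
lowercasing; the setToLowercase setter name is an assumption based on the
description above, and the column names are hypothetical:

  import org.apache.spark.ml.feature.RegexTokenizer

  // Restore the pre-1.6 behavior of not lowercasing before tokenizing.
  val tokenizer = new RegexTokenizer()
    .setInputCol("sentence").setOutputCol("words")
    .setToLowercase(false)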

Re: [VOTE] Release Apache Spark 1.6.0 (RC1)

Posted by Ted Yu <yu...@gmail.com>.
+1

Ran through the test suite (minus docker-integration-tests), which passed.

Overall experience was much better compared with some of the prior RCs.

[INFO] Spark Project External Kafka ....................... SUCCESS [
53.956 s]
[INFO] Spark Project Examples ............................. SUCCESS [02:05
min]
[INFO] Spark Project External Kafka Assembly .............. SUCCESS [
11.298 s]
[INFO]
------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO]
------------------------------------------------------------------------
[INFO] Total time: 01:42 h
[INFO] Finished at: 2015-12-02T17:19:02-08:00

On Wed, Dec 2, 2015 at 4:23 PM, Michael Armbrust <mi...@databricks.com>
wrote:

> I'm going to kick the voting off with a +1 (binding).  We ran TPC-DS and
> most queries are faster than 1.5.  We've also ported several production
> pipelines to 1.6.
>

Re: [VOTE] Release Apache Spark 1.6.0 (RC1)

Posted by Michael Armbrust <mi...@databricks.com>.
I'm going to kick the voting off with a +1 (binding).  We ran TPC-DS and
most queries are faster than 1.5.  We've also ported several production
pipelines to 1.6.

Re: [VOTE] Release Apache Spark 1.6.0 (RC1)

Posted by Ted Yu <yu...@gmail.com>.
I tried to run the test suite and encountered the following:

http://pastebin.com/DPnwMGrm

FYI

On Wed, Dec 2, 2015 at 12:39 PM, Nicholas Chammas <
nicholas.chammas@gmail.com> wrote:

> -0
>
> If spark-ec2 is still a supported part of the project, then we should
> update its version lists as new releases are made. 1.5.2 had the same issue.
>
> https://github.com/apache/spark/blob/v1.6.0-rc1/ec2/spark_ec2.py#L54-L91
>
> (I guess as part of the 2.0 discussions we should continue to discuss
> whether spark-ec2 still belongs in the project. I'm starting to feel
> awkward reporting spark-ec2 release issues...)
>
> Nick

Re: [VOTE] Release Apache Spark 1.6.0 (RC1)

Posted by Nicholas Chammas <ni...@gmail.com>.
-0

If spark-ec2 is still a supported part of the project, then we should
update its version lists as new releases are made. 1.5.2 had the same issue.

https://github.com/apache/spark/blob/v1.6.0-rc1/ec2/spark_ec2.py#L54-L91

(I guess as part of the 2.0 discussions we should continue to discuss
whether spark-ec2 still belongs in the project. I'm starting to feel
awkward reporting spark-ec2 release issues...)

Nick


Re: [VOTE] Release Apache Spark 1.6.0 (RC1)

Posted by robineast <ro...@xense.co.uk>.
+1

OSX 10.10.5, java version "1.8.0_40", scala 2.10

mvn clean package -DskipTests

[INFO] Spark Project External Kafka ....................... SUCCESS [ 18.161
s]
[INFO] Spark Project Examples ............................. SUCCESS [01:18
min]
[INFO] Spark Project External Kafka Assembly .............. SUCCESS [  5.724
s]
[INFO]
------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO]
------------------------------------------------------------------------
[INFO] Total time: 14:59 min
[INFO] Finished at: 2015-12-03T09:46:38+00:00
[INFO] Final Memory: 105M/2668M
[INFO]
------------------------------------------------------------------------

Basic graph tests
  Load graph using edgeListFile...SUCCESS
  Run PageRank...SUCCESS
Connected Components tests
  Kaggle social circles competition...SUCCESS
Minimum Spanning Tree Algorithm
  Run basic Minimum Spanning Tree algorithm...SUCCESS
  Run Minimum Spanning Tree taxonomy creation...SUCCESS





Re: [VOTE] Release Apache Spark 1.6.0 (RC1)

Posted by Michael Armbrust <mi...@databricks.com>.
Trying again now that eec36607
<https://github.com/apache/spark/commit/eec36607f9fc92b6c4d306e3930fcf03961625eb>
is merged.

On Thu, Dec 10, 2015 at 6:44 PM, Michael Armbrust <mi...@databricks.com>
wrote:

> Cutting RC2 now.
>
> On Thu, Dec 10, 2015 at 12:59 PM, Michael Armbrust <michael@databricks.com
> > wrote:
>
>> We are getting close to merging patches for SPARK-12155
>> <https://issues.apache.org/jira/browse/SPARK-12155> and SPARK-12253
>> <https://issues.apache.org/jira/browse/SPARK-12253>.  I'll be cutting
>> RC2 shortly after that.
>>
>> Michael
>>
>> On Tue, Dec 8, 2015 at 10:31 AM, Michael Armbrust <michael@databricks.com
>> > wrote:
>>
>>> An update: the vote fails due to the -1.   I'll post another RC as soon
>>> as we've resolved these issues.  In the mean time I encourage people to
>>> continue testing and post any problems they encounter here.
>>>
>>> On Sun, Dec 6, 2015 at 6:24 PM, Yin Huai <yh...@databricks.com> wrote:
>>>
>>>> -1
>>>>
>>>> Two blocker bugs have been found after this RC.
>>>> https://issues.apache.org/jira/browse/SPARK-12089 can cause data
>>>> corruption when an external sorter spills data.
>>>> https://issues.apache.org/jira/browse/SPARK-12155 can prevent tasks
>>>> from acquiring memory even when the executor indeed can allocate memory by
>>>> evicting storage memory.
>>>>
>>>> https://issues.apache.org/jira/browse/SPARK-12089 has been fixed. We
>>>> are still working on https://issues.apache.org/jira/browse/SPARK-12155.
>>>>
>>>> On Fri, Dec 4, 2015 at 3:04 PM, Mark Hamstra <ma...@clearstorydata.com>
>>>> wrote:
>>>>
>>>>> 0
>>>>>
>>>>> Currently figuring out who is responsible for the regression that I am
>>>>> seeing in some user code ScalaUDFs that make use of Timestamps and where
>>>>> NULL from a CSV file read in via a TestHive#registerTestTable is now
>>>>> producing 1969-12-31 23:59:59.999999 instead of null.
>>>>>
>>>>> On Thu, Dec 3, 2015 at 1:57 PM, Sean Owen <so...@cloudera.com> wrote:
>>>>>
>>>>>> Licenses and signature are all fine.
>>>>>>
>>>>>> Docker integration tests consistently fail for me with Java 7 / Ubuntu
>>>>>> and "-Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver"
>>>>>>
>>>>>> *** RUN ABORTED ***
>>>>>>   java.lang.NoSuchMethodError:
>>>>>>
>>>>>> org.apache.http.impl.client.HttpClientBuilder.setConnectionManagerShared(Z)Lorg/apache/http/impl/client/HttpClientBuilder;
>>>>>>   at
>>>>>> org.glassfish.jersey.apache.connector.ApacheConnector.<init>(ApacheConnector.java:240)
>>>>>>   at
>>>>>> org.glassfish.jersey.apache.connector.ApacheConnectorProvider.getConnector(ApacheConnectorProvider.java:115)
>>>>>>   at
>>>>>> org.glassfish.jersey.client.ClientConfig$State.initRuntime(ClientConfig.java:418)
>>>>>>   at
>>>>>> org.glassfish.jersey.client.ClientConfig$State.access$000(ClientConfig.java:88)
>>>>>>   at
>>>>>> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:120)
>>>>>>   at
>>>>>> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:117)
>>>>>>   at
>>>>>> org.glassfish.jersey.internal.util.collection.Values$LazyValueImpl.get(Values.java:340)
>>>>>>   at
>>>>>> org.glassfish.jersey.client.ClientConfig.getRuntime(ClientConfig.java:726)
>>>>>>   at
>>>>>> org.glassfish.jersey.client.ClientRequest.getConfiguration(ClientRequest.java:285)
>>>>>>   at
>>>>>> org.glassfish.jersey.client.JerseyInvocation.validateHttpMethodAndEntity(JerseyInvocation.java:126)
>>>>>>
>>>>>> I also get this failure consistently:
>>>>>>
>>>>>> DirectKafkaStreamSuite
>>>>>> - offset recovery *** FAILED ***
>>>>>>   recoveredOffsetRanges.forall(((or: (org.apache.spark.streaming.Time,
>>>>>> Array[org.apache.spark.streaming.kafka.OffsetRange])) =>
>>>>>>
>>>>>> earlierOffsetRangesAsSets.contains(scala.Tuple2.apply[org.apache.spark.streaming.Time,
>>>>>>
>>>>>> scala.collection.immutable.Set[org.apache.spark.streaming.kafka.OffsetRange]](or._1,
>>>>>>
>>>>>> scala.this.Predef.refArrayOps[org.apache.spark.streaming.kafka.OffsetRange](or._2).toSet[org.apache.spark.streaming.kafka.OffsetRange]))))
>>>>>> was false Recovered ranges are not the same as the ones generated
>>>>>> (DirectKafkaStreamSuite.scala:301)

Re: [VOTE] Release Apache Spark 1.6.0 (RC1)

Posted by Michael Armbrust <mi...@databricks.com>.
Cutting RC2 now.

On Thu, Dec 10, 2015 at 12:59 PM, Michael Armbrust <mi...@databricks.com>
wrote:

> We are getting close to merging patches for SPARK-12155
> <https://issues.apache.org/jira/browse/SPARK-12155> and SPARK-12253
> <https://issues.apache.org/jira/browse/SPARK-12253>.  I'll be cutting RC2
> shortly after that.
>
> Michael
>
> On Tue, Dec 8, 2015 at 10:31 AM, Michael Armbrust <mi...@databricks.com>
> wrote:
>
>> An update: the vote fails due to the -1.   I'll post another RC as soon
>> as we've resolved these issues.  In the mean time I encourage people to
>> continue testing and post any problems they encounter here.
>>
>> On Sun, Dec 6, 2015 at 6:24 PM, Yin Huai <yh...@databricks.com> wrote:
>>
>>> -1
>>>
>>> Tow blocker bugs have been found after this RC.
>>> https://issues.apache.org/jira/browse/SPARK-12089 can cause data
>>> corruption when an external sorter spills data.
>>> https://issues.apache.org/jira/browse/SPARK-12155 can prevent tasks
>>> from acquiring memory even when the executor indeed can allocate memory by
>>> evicting storage memory.
>>>
>>> https://issues.apache.org/jira/browse/SPARK-12089 has been fixed. We
>>> are still working on https://issues.apache.org/jira/browse/SPARK-12155.
>>>
>>> On Fri, Dec 4, 2015 at 3:04 PM, Mark Hamstra <ma...@clearstorydata.com>
>>> wrote:
>>>
>>>> 0
>>>>
>>>> Currently figuring out who is responsible for the regression that I am
>>>> seeing in some user code ScalaUDFs that make use of Timestamps and where
>>>> NULL from a CSV file read in via a TestHive#registerTestTable is now
>>>> producing 1969-12-31 23:59:59.999999 instead of null.
>>>>
>>>> On Thu, Dec 3, 2015 at 1:57 PM, Sean Owen <so...@cloudera.com> wrote:
>>>>
>>>>> Licenses and signature are all fine.
>>>>>
>>>>> Docker integration tests consistently fail for me with Java 7 / Ubuntu
>>>>> and "-Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver"
>>>>>
>>>>> *** RUN ABORTED ***
>>>>>   java.lang.NoSuchMethodError:
>>>>>
>>>>> org.apache.http.impl.client.HttpClientBuilder.setConnectionManagerShared(Z)Lorg/apache/http/impl/client/HttpClientBuilder;
>>>>>   at
>>>>> org.glassfish.jersey.apache.connector.ApacheConnector.<init>(ApacheConnector.java:240)
>>>>>   at
>>>>> org.glassfish.jersey.apache.connector.ApacheConnectorProvider.getConnector(ApacheConnectorProvider.java:115)
>>>>>   at
>>>>> org.glassfish.jersey.client.ClientConfig$State.initRuntime(ClientConfig.java:418)
>>>>>   at
>>>>> org.glassfish.jersey.client.ClientConfig$State.access$000(ClientConfig.java:88)
>>>>>   at
>>>>> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:120)
>>>>>   at
>>>>> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:117)
>>>>>   at
>>>>> org.glassfish.jersey.internal.util.collection.Values$LazyValueImpl.get(Values.java:340)
>>>>>   at
>>>>> org.glassfish.jersey.client.ClientConfig.getRuntime(ClientConfig.java:726)
>>>>>   at
>>>>> org.glassfish.jersey.client.ClientRequest.getConfiguration(ClientRequest.java:285)
>>>>>   at
>>>>> org.glassfish.jersey.client.JerseyInvocation.validateHttpMethodAndEntity(JerseyInvocation.java:126)
>>>>>
>>>>> I also get this failure consistently:
>>>>>
>>>>> DirectKafkaStreamSuite
>>>>> - offset recovery *** FAILED ***
>>>>>   recoveredOffsetRanges.forall(((or: (org.apache.spark.streaming.Time,
>>>>> Array[org.apache.spark.streaming.kafka.OffsetRange])) =>
>>>>>
>>>>> earlierOffsetRangesAsSets.contains(scala.Tuple2.apply[org.apache.spark.streaming.Time,
>>>>>
>>>>> scala.collection.immutable.Set[org.apache.spark.streaming.kafka.OffsetRange]](or._1,
>>>>>
>>>>> scala.this.Predef.refArrayOps[org.apache.spark.streaming.kafka.OffsetRange](or._2).toSet[org.apache.spark.streaming.kafka.OffsetRange]))))
>>>>> was false Recovered ranges are not the same as the ones generated
>>>>> (DirectKafkaStreamSuite.scala:301)
>>>>>
>>>>> On Wed, Dec 2, 2015 at 8:26 PM, Michael Armbrust <
>>>>> michael@databricks.com> wrote:
>>>>> > Please vote on releasing the following candidate as Apache Spark
>>>>> version
>>>>> > 1.6.0!
>>>>> >
>>>>> > The vote is open until Saturday, December 5, 2015 at 21:00 UTC and
>>>>> passes if
>>>>> > a majority of at least 3 +1 PMC votes are cast.
>>>>> >
>>>>> > [ ] +1 Release this package as Apache Spark 1.6.0
>>>>> > [ ] -1 Do not release this package because ...
>>>>> >
>>>>> > To learn more about Apache Spark, please see
>>>>> http://spark.apache.org/
>>>>> >
>>>>> > The tag to be voted on is v1.6.0-rc1
>>>>> > (bf525845cef159d2d4c9f4d64e158f037179b5c4)
>>>>> >
>>>>> > The release files, including signatures, digests, etc. can be found
>>>>> at:
>>>>> >
>>>>> http://people.apache.org/~pwendell/spark-releases/spark-v1.6.0-rc1-bin/
>>>>> >
>>>>> > Release artifacts are signed with the following key:
>>>>> > https://people.apache.org/keys/committer/pwendell.asc
>>>>> >
>>>>> > The staging repository for this release can be found at:
>>>>> >
>>>>> https://repository.apache.org/content/repositories/orgapachespark-1165/
>>>>> >
>>>>> > The test repository (versioned as v1.6.0-rc1) for this release can
>>>>> be found
>>>>> > at:
>>>>> >
>>>>> https://repository.apache.org/content/repositories/orgapachespark-1164/
>>>>> >
>>>>> > The documentation corresponding to this release can be found at:
>>>>> >
>>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc1-docs/
>>>>> >
>>>>> >
>>>>> > =======================================
>>>>> > == How can I help test this release? ==
>>>>> > =======================================
>>>>> > If you are a Spark user, you can help us test this release by taking
>>>>> an
>>>>> > existing Spark workload and running on this release candidate, then
>>>>> > reporting any regressions.
>>>>> >
>>>>> > ================================================
>>>>> > == What justifies a -1 vote for this release? ==
>>>>> > ================================================
>>>>> > This vote is happening towards the end of the 1.6 QA period, so -1
>>>>> votes
>>>>> > should only occur for significant regressions from 1.5. Bugs already
>>>>> present
>>>>> > in 1.5, minor regressions, or bugs related to new features will not
>>>>> block
>>>>> > this release.
>>>>> >
>>>>> > ===============================================================
>>>>> > == What should happen to JIRA tickets still targeting 1.6.0? ==
>>>>> > ===============================================================
>>>>> > 1. It is OK for documentation patches to target 1.6.0 and still go
>>>>> into
>>>>> > branch-1.6, since documentations will be published separately from
>>>>> the
>>>>> > release.
>>>>> > 2. New features for non-alpha-modules should target 1.7+.
>>>>> > 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the
>>>>> target
>>>>> > version.
>>>>> >
>>>>> >
>>>>> > ==================================================
>>>>> > == Major changes to help you focus your testing ==
>>>>> > ==================================================
>>>>> >
>>>>> > Spark SQL
>>>>> >
>>>>> > SPARK-10810 Session Management - The ability to create multiple
>>>>> isolated SQL
>>>>> > Contexts that have their own configuration and default database.
>>>>> This is
>>>>> > turned on by default in the thrift server.
>>>>> > SPARK-9999  Dataset API - A type-safe API (similar to RDDs) that
>>>>> performs
>>>>> > many operations on serialized binary data and code generation (i.e.
>>>>> Project
>>>>> > Tungsten).
>>>>> > SPARK-10000 Unified Memory Management - Shared memory for execution
>>>>> and
>>>>> > caching instead of exclusive division of the regions.
>>>>> > SPARK-11197 SQL Queries on Files - Concise syntax for running SQL
>>>>> queries
>>>>> > over files of any supported format without registering a table.
>>>>> > SPARK-11745 Reading non-standard JSON files - Added options to read
>>>>> > non-standard JSON files (e.g. single-quotes, unquoted attributes)
>>>>> > SPARK-10412 Per-operator Metrics for SQL Execution - Display
>>>>> statistics on a
>>>>> > per-operator basis for memory usage and spilled data size.
>>>>> > SPARK-11329 Star (*) expansion for StructTypes - Makes it easier to
>>>>> nest and
>>>>> > unnest arbitrary numbers of columns
>>>>> > SPARK-10917, SPARK-11149 In-memory Columnar Cache Performance -
>>>>> Significant
>>>>> > (up to 14x) speed up when caching data that contains complex types in
>>>>> > DataFrames or SQL.
>>>>> > SPARK-11111 Fast null-safe joins - Joins using null-safe equality
>>>>> (<=>) will
>>>>> > now execute using SortMergeJoin instead of computing a Cartesian
>>>>> product.
>>>>> > SPARK-11389 SQL Execution Using Off-Heap Memory - Support for
>>>>> configuring
>>>>> > query execution to occur using off-heap memory to avoid GC overhead
>>>>> > SPARK-10978 Datasource API Avoid Double Filter - When implementing a
>>>>> > datasource with filter pushdown, developers can now tell Spark SQL
>>>>> to avoid
>>>>> > double evaluating a pushed-down filter.
>>>>> > SPARK-4849  Advanced Layout of Cached Data - storing partitioning and
>>>>> > ordering schemes in In-memory table scan, and adding distributeBy and
>>>>> > localSort to DF API
>>>>> > SPARK-9858  Adaptive query execution - Initial support for
>>>>> automatically
>>>>> > selecting the number of reducers for joins and aggregations.
>>>>> >
>>>>> > Spark Streaming
>>>>> >
>>>>> > API Updates
>>>>> >
>>>>> > SPARK-2629  New improved state management - trackStateByKey - a
>>>>> DStream
>>>>> > transformation for stateful stream processing, supersedes
>>>>> updateStateByKey
>>>>> > in functionality and performance.
>>>>> > SPARK-11198 Kinesis record deaggregation - Kinesis streams have been
>>>>> > upgraded to use KCL 1.4.0 and support transparent deaggregation of
>>>>> > KPL-aggregated records.
>>>>> > SPARK-10891 Kinesis message handler function - Allows arbitrary
>>>>> function to
>>>>> > be applied to a Kinesis record in the Kinesis receiver to customize
>>>>> > what data is to be stored in memory.
>>>>> > SPARK-6328  Python Streaming Listener API - Get streaming statistics
>>>>> > (scheduling delays, batch processing times, etc.) in streaming.
>>>>> >
>>>>> > UI Improvements
>>>>> >
>>>>> > Made failures visible in the streaming tab, in the timelines, batch
>>>>> list,
>>>>> > and batch details page.
>>>>> > Made output operations visible in the streaming tab as progress bars
>>>>> >
>>>>> > MLlib
>>>>> >
>>>>> > New algorithms/models
>>>>> >
>>>>> > SPARK-8518  Survival analysis - Log-linear model for survival
>>>>> analysis
>>>>> > SPARK-9834  Normal equation for least squares - Normal equation
>>>>> solver,
>>>>> > providing R-like model summary statistics
>>>>> > SPARK-3147  Online hypothesis testing - A/B testing in the Spark
>>>>> Streaming
>>>>> > framework
>>>>> > SPARK-9930  New feature transformers - ChiSqSelector,
>>>>> QuantileDiscretizer,
>>>>> > SQL transformer
>>>>> > SPARK-6517  Bisecting K-Means clustering - Fast top-down clustering
>>>>> variant
>>>>> > of K-Means
>>>>> >
>>>>> > API improvements
>>>>> >
>>>>> > ML Pipelines
>>>>> >
>>>>> > SPARK-6725  Pipeline persistence - Save/load for ML Pipelines, with
>>>>> partial
>>>>> > coverage of spark.ml algorithms
>>>>> > SPARK-5565  LDA in ML Pipelines - API for Latent Dirichlet
>>>>> Allocation in ML
>>>>> > Pipelines
>>>>> >
>>>>> > R API
>>>>> >
>>>>> > SPARK-9836  R-like statistics for GLMs - (Partial) R-like stats for
>>>>> ordinary
>>>>> > least squares via summary(model)
>>>>> > SPARK-9681  Feature interactions in R formula - Interaction operator
>>>>> ":" in
>>>>> > R formula
>>>>> >
>>>>> > Python API - Many improvements to Python API to approach feature
>>>>> parity
>>>>> >
>>>>> > Misc improvements
>>>>> >
>>>>> > SPARK-7685 , SPARK-9642  Instance weights for GLMs - Logistic and
>>>>> Linear
>>>>> > Regression can take instance weights
>>>>> > SPARK-10384, SPARK-10385 Univariate and bivariate statistics in
>>>>> DataFrames -
>>>>> > Variance, stddev, correlations, etc.
>>>>> > SPARK-10117 LIBSVM data source - LIBSVM as a SQL data source
>>>>> >
>>>>> > Documentation improvements
>>>>> >
>>>>> > SPARK-7751  @since versions - Documentation includes initial version
>>>>> when
>>>>> > classes and methods were added
>>>>> > SPARK-11337 Testable example code - Automated testing for code in
>>>>> user guide
>>>>> > examples
>>>>> >
>>>>> > Deprecations
>>>>> >
>>>>> > In spark.mllib.clustering.KMeans, the "runs" parameter has been
>>>>> deprecated.
>>>>> > In spark.ml.classification.LogisticRegressionModel and
>>>>> > spark.ml.regression.LinearRegressionModel, the "weights" field has
>>>>> been
>>>>> > deprecated, in favor of the new name "coefficients." This helps
>>>>> disambiguate
>>>>> > from instance (row) weights given to algorithms.
>>>>> >
>>>>> > Changes of behavior
>>>>> >
>>>>> > spark.mllib.tree.GradientBoostedTrees validationTol has changed
>>>>> semantics in
>>>>> > 1.6. Previously, it was a threshold for absolute change in error.
>>>>> Now, it
>>>>> > resembles the behavior of GradientDescent convergenceTol: For large
>>>>> errors,
>>>>> > it uses relative error (relative to the previous error); for small
>>>>> errors (<
>>>>> > 0.01), it uses absolute error.
>>>>> > spark.ml.feature.RegexTokenizer: Previously, it did not convert
>>>>> strings to
>>>>> > lowercase before tokenizing. Now, it converts to lowercase by
>>>>> default, with
>>>>> > an option not to. This matches the behavior of the simpler Tokenizer
>>>>> > transformer.
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: [VOTE] Release Apache Spark 1.6.0 (RC1)

Posted by Michael Armbrust <mi...@databricks.com>.
We are getting close to merging patches for SPARK-12155
<https://issues.apache.org/jira/browse/SPARK-12155> and SPARK-12253
<https://issues.apache.org/jira/browse/SPARK-12253>.  I'll be cutting RC2
shortly after that.

Michael

On Tue, Dec 8, 2015 at 10:31 AM, Michael Armbrust <mi...@databricks.com>
wrote:

> An update: the vote fails due to the -1.   I'll post another RC as soon as
> we've resolved these issues.  In the meantime I encourage people to
> continue testing and post any problems they encounter here.
>
> On Sun, Dec 6, 2015 at 6:24 PM, Yin Huai <yh...@databricks.com> wrote:
>
>> -1
>>
>> Two blocker bugs have been found after this RC.
>> https://issues.apache.org/jira/browse/SPARK-12089 can cause data
>> corruption when an external sorter spills data.
>> https://issues.apache.org/jira/browse/SPARK-12155 can prevent tasks from
>> acquiring memory even when the executor can in fact allocate memory by
>> evicting storage memory.
>>
>> https://issues.apache.org/jira/browse/SPARK-12089 has been fixed. We are
>> still working on https://issues.apache.org/jira/browse/SPARK-12155.
>>
>> On Fri, Dec 4, 2015 at 3:04 PM, Mark Hamstra <ma...@clearstorydata.com>
>> wrote:
>>
>>> 0
>>>
>>> Currently figuring out who is responsible for the regression that I am
>>> seeing in some user code ScalaUDFs that make use of Timestamps, where
>>> NULL from a CSV file read in via TestHive#registerTestTable is now
>>> producing 1969-12-31 23:59:59.999999 instead of null.
>>>
>>> On Thu, Dec 3, 2015 at 1:57 PM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>>> Licenses and signature are all fine.
>>>>
>>>> Docker integration tests consistently fail for me with Java 7 / Ubuntu
>>>> and "-Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver"
>>>>
>>>> *** RUN ABORTED ***
>>>>   java.lang.NoSuchMethodError:
>>>>
>>>> org.apache.http.impl.client.HttpClientBuilder.setConnectionManagerShared(Z)Lorg/apache/http/impl/client/HttpClientBuilder;
>>>>   at
>>>> org.glassfish.jersey.apache.connector.ApacheConnector.<init>(ApacheConnector.java:240)
>>>>   at
>>>> org.glassfish.jersey.apache.connector.ApacheConnectorProvider.getConnector(ApacheConnectorProvider.java:115)
>>>>   at
>>>> org.glassfish.jersey.client.ClientConfig$State.initRuntime(ClientConfig.java:418)
>>>>   at
>>>> org.glassfish.jersey.client.ClientConfig$State.access$000(ClientConfig.java:88)
>>>>   at
>>>> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:120)
>>>>   at
>>>> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:117)
>>>>   at
>>>> org.glassfish.jersey.internal.util.collection.Values$LazyValueImpl.get(Values.java:340)
>>>>   at
>>>> org.glassfish.jersey.client.ClientConfig.getRuntime(ClientConfig.java:726)
>>>>   at
>>>> org.glassfish.jersey.client.ClientRequest.getConfiguration(ClientRequest.java:285)
>>>>   at
>>>> org.glassfish.jersey.client.JerseyInvocation.validateHttpMethodAndEntity(JerseyInvocation.java:126)
>>>>
>>>> I also get this failure consistently:
>>>>
>>>> DirectKafkaStreamSuite
>>>> - offset recovery *** FAILED ***
>>>>   recoveredOffsetRanges.forall(((or: (org.apache.spark.streaming.Time,
>>>> Array[org.apache.spark.streaming.kafka.OffsetRange])) =>
>>>>
>>>> earlierOffsetRangesAsSets.contains(scala.Tuple2.apply[org.apache.spark.streaming.Time,
>>>>
>>>> scala.collection.immutable.Set[org.apache.spark.streaming.kafka.OffsetRange]](or._1,
>>>>
>>>> scala.this.Predef.refArrayOps[org.apache.spark.streaming.kafka.OffsetRange](or._2).toSet[org.apache.spark.streaming.kafka.OffsetRange]))))
>>>> was false Recovered ranges are not the same as the ones generated
>>>> (DirectKafkaStreamSuite.scala:301)
>>>>
>>>> On Wed, Dec 2, 2015 at 8:26 PM, Michael Armbrust <
>>>> michael@databricks.com> wrote:
>>>> > Please vote on releasing the following candidate as Apache Spark
>>>> version
>>>> > 1.6.0!
>>>> >
>>>> > The vote is open until Saturday, December 5, 2015 at 21:00 UTC and
>>>> passes if
>>>> > a majority of at least 3 +1 PMC votes are cast.
>>>> >
>>>> > [ ] +1 Release this package as Apache Spark 1.6.0
>>>> > [ ] -1 Do not release this package because ...
>>>> >
>>>> > To learn more about Apache Spark, please see http://spark.apache.org/
>>>> >
>>>> > The tag to be voted on is v1.6.0-rc1
>>>> > (bf525845cef159d2d4c9f4d64e158f037179b5c4)
>>>> >
>>>> > The release files, including signatures, digests, etc. can be found
>>>> at:
>>>> >
>>>> http://people.apache.org/~pwendell/spark-releases/spark-v1.6.0-rc1-bin/
>>>> >
>>>> > Release artifacts are signed with the following key:
>>>> > https://people.apache.org/keys/committer/pwendell.asc
>>>> >
>>>> > The staging repository for this release can be found at:
>>>> >
>>>> https://repository.apache.org/content/repositories/orgapachespark-1165/
>>>> >
>>>> > The test repository (versioned as v1.6.0-rc1) for this release can be
>>>> found
>>>> > at:
>>>> >
>>>> https://repository.apache.org/content/repositories/orgapachespark-1164/
>>>> >
>>>> > The documentation corresponding to this release can be found at:
>>>> >
>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc1-docs/
>>>> >
>>>> >
>>>> > =======================================
>>>> > == How can I help test this release? ==
>>>> > =======================================
>>>> > If you are a Spark user, you can help us test this release by taking
>>>> an
>>>> > existing Spark workload and running on this release candidate, then
>>>> > reporting any regressions.
>>>> >
>>>> > ================================================
>>>> > == What justifies a -1 vote for this release? ==
>>>> > ================================================
>>>> > This vote is happening towards the end of the 1.6 QA period, so -1
>>>> votes
>>>> > should only occur for significant regressions from 1.5. Bugs already
>>>> present
>>>> > in 1.5, minor regressions, or bugs related to new features will not
>>>> block
>>>> > this release.
>>>> >
>>>> > ===============================================================
>>>> > == What should happen to JIRA tickets still targeting 1.6.0? ==
>>>> > ===============================================================
>>>> > 1. It is OK for documentation patches to target 1.6.0 and still go
>>>> into
>>>> > branch-1.6, since documentation will be published separately from the
>>>> > release.
>>>> > 2. New features for non-alpha-modules should target 1.7+.
>>>> > 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the
>>>> target
>>>> > version.
>>>> >
>>>> >
>>>> > ==================================================
>>>> > == Major changes to help you focus your testing ==
>>>> > ==================================================
>>>> >
>>>> > Spark SQL
>>>> >
>>>> > SPARK-10810 Session Management - The ability to create multiple
>>>> isolated SQL
>>>> > Contexts that have their own configuration and default database.
>>>> This is
>>>> > turned on by default in the thrift server.
>>>> > SPARK-9999  Dataset API - A type-safe API (similar to RDDs) that
>>>> performs
>>>> > many operations on serialized binary data and code generation (i.e.
>>>> Project
>>>> > Tungsten).
>>>> > SPARK-10000 Unified Memory Management - Shared memory for execution
>>>> and
>>>> > caching instead of exclusive division of the regions.
>>>> > SPARK-11197 SQL Queries on Files - Concise syntax for running SQL
>>>> queries
>>>> > over files of any supported format without registering a table.
>>>> > SPARK-11745 Reading non-standard JSON files - Added options to read
>>>> > non-standard JSON files (e.g. single-quotes, unquoted attributes)
>>>> > SPARK-10412 Per-operator Metrics for SQL Execution - Display
>>>> statistics on a
>>>> > per-operator basis for memory usage and spilled data size.
>>>> > SPARK-11329 Star (*) expansion for StructTypes - Makes it easier to
>>>> nest and
>>>> > unnest arbitrary numbers of columns
>>>> > SPARK-10917, SPARK-11149 In-memory Columnar Cache Performance -
>>>> Significant
>>>> > (up to 14x) speed up when caching data that contains complex types in
>>>> > DataFrames or SQL.
>>>> > SPARK-11111 Fast null-safe joins - Joins using null-safe equality
>>>> (<=>) will
>>>> > now execute using SortMergeJoin instead of computing a Cartesian
>>>> product.
>>>> > SPARK-11389 SQL Execution Using Off-Heap Memory - Support for
>>>> configuring
>>>> > query execution to occur using off-heap memory to avoid GC overhead
>>>> > SPARK-10978 Datasource API Avoid Double Filter - When implementing a
>>>> > datasource with filter pushdown, developers can now tell Spark SQL to
>>>> avoid
>>>> > double evaluating a pushed-down filter.
>>>> > SPARK-4849  Advanced Layout of Cached Data - storing partitioning and
>>>> > ordering schemes in In-memory table scan, and adding distributeBy and
>>>> > localSort to DF API
>>>> > SPARK-9858  Adaptive query execution - Initial support for
>>>> automatically
>>>> > selecting the number of reducers for joins and aggregations.
>>>> >
>>>> > Spark Streaming
>>>> >
>>>> > API Updates
>>>> >
>>>> > SPARK-2629  New improved state management - trackStateByKey - a
>>>> DStream
>>>> > transformation for stateful stream processing, supersedes
>>>> updateStateByKey
>>>> > in functionality and performance.
>>>> > SPARK-11198 Kinesis record deaggregation - Kinesis streams have been
>>>> > upgraded to use KCL 1.4.0 and support transparent deaggregation of
>>>> > KPL-aggregated records.
>>>> > SPARK-10891 Kinesis message handler function - Allows arbitrary
>>>> function to
>>>> > be applied to a Kinesis record in the Kinesis receiver to customize
>>>> > what data is to be stored in memory.
>>>> > SPARK-6328  Python Streaming Listener API - Get streaming statistics
>>>> > (scheduling delays, batch processing times, etc.) in streaming.
>>>> >
>>>> > UI Improvements
>>>> >
>>>> > Made failures visible in the streaming tab, in the timelines, batch
>>>> list,
>>>> > and batch details page.
>>>> > Made output operations visible in the streaming tab as progress bars
>>>> >
>>>> > MLlib
>>>> >
>>>> > New algorithms/models
>>>> >
>>>> > SPARK-8518  Survival analysis - Log-linear model for survival analysis
>>>> > SPARK-9834  Normal equation for least squares - Normal equation
>>>> solver,
>>>> > providing R-like model summary statistics
>>>> > SPARK-3147  Online hypothesis testing - A/B testing in the Spark
>>>> Streaming
>>>> > framework
>>>> > SPARK-9930  New feature transformers - ChiSqSelector,
>>>> QuantileDiscretizer,
>>>> > SQL transformer
>>>> > SPARK-6517  Bisecting K-Means clustering - Fast top-down clustering
>>>> variant
>>>> > of K-Means
>>>> >
>>>> > API improvements
>>>> >
>>>> > ML Pipelines
>>>> >
>>>> > SPARK-6725  Pipeline persistence - Save/load for ML Pipelines, with
>>>> partial
>>>> > coverage of spark.ml algorithms
>>>> > SPARK-5565  LDA in ML Pipelines - API for Latent Dirichlet Allocation
>>>> in ML
>>>> > Pipelines
>>>> >
>>>> > R API
>>>> >
>>>> > SPARK-9836  R-like statistics for GLMs - (Partial) R-like stats for
>>>> ordinary
>>>> > least squares via summary(model)
>>>> > SPARK-9681  Feature interactions in R formula - Interaction operator
>>>> ":" in
>>>> > R formula
>>>> >
>>>> > Python API - Many improvements to Python API to approach feature
>>>> parity
>>>> >
>>>> > Misc improvements
>>>> >
>>>> > SPARK-7685 , SPARK-9642  Instance weights for GLMs - Logistic and
>>>> Linear
>>>> > Regression can take instance weights
>>>> > SPARK-10384, SPARK-10385 Univariate and bivariate statistics in
>>>> DataFrames -
>>>> > Variance, stddev, correlations, etc.
>>>> > SPARK-10117 LIBSVM data source - LIBSVM as a SQL data source
>>>> >
>>>> > Documentation improvements
>>>> >
>>>> > SPARK-7751  @since versions - Documentation includes initial version
>>>> when
>>>> > classes and methods were added
>>>> > SPARK-11337 Testable example code - Automated testing for code in
>>>> user guide
>>>> > examples
>>>> >
>>>> > Deprecations
>>>> >
>>>> > In spark.mllib.clustering.KMeans, the "runs" parameter has been
>>>> deprecated.
>>>> > In spark.ml.classification.LogisticRegressionModel and
>>>> > spark.ml.regression.LinearRegressionModel, the "weights" field has
>>>> been
>>>> > deprecated, in favor of the new name "coefficients." This helps
>>>> disambiguate
>>>> > from instance (row) weights given to algorithms.
>>>> >
>>>> > Changes of behavior
>>>> >
>>>> > spark.mllib.tree.GradientBoostedTrees validationTol has changed
>>>> semantics in
>>>> > 1.6. Previously, it was a threshold for absolute change in error.
>>>> Now, it
>>>> > resembles the behavior of GradientDescent convergenceTol: For large
>>>> errors,
>>>> > it uses relative error (relative to the previous error); for small
>>>> errors (<
>>>> > 0.01), it uses absolute error.
>>>> > spark.ml.feature.RegexTokenizer: Previously, it did not convert
>>>> strings to
>>>> > lowercase before tokenizing. Now, it converts to lowercase by
>>>> default, with
>>>> > an option not to. This matches the behavior of the simpler Tokenizer
>>>> > transformer.
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>>
>>>>
>>>
>>
>

Re: [VOTE] Release Apache Spark 1.6.0 (RC1)

Posted by Michael Armbrust <mi...@databricks.com>.
An update: the vote fails due to the -1.   I'll post another RC as soon as
we've resolved these issues.  In the meantime I encourage people to
continue testing and post any problems they encounter here.

On Sun, Dec 6, 2015 at 6:24 PM, Yin Huai <yh...@databricks.com> wrote:

> -1
>
> Two blocker bugs have been found after this RC.
> https://issues.apache.org/jira/browse/SPARK-12089 can cause data
> corruption when an external sorter spills data.
> https://issues.apache.org/jira/browse/SPARK-12155 can prevent tasks from
> acquiring memory even when the executor can in fact allocate memory by
> evicting storage memory.
>
> https://issues.apache.org/jira/browse/SPARK-12089 has been fixed. We are
> still working on https://issues.apache.org/jira/browse/SPARK-12155.
>
> On Fri, Dec 4, 2015 at 3:04 PM, Mark Hamstra <ma...@clearstorydata.com>
> wrote:
>
>> 0
>>
>> Currently figuring out who is responsible for the regression that I am
>> seeing in some user code ScalaUDFs that make use of Timestamps, where
>> NULL from a CSV file read in via TestHive#registerTestTable is now
>> producing 1969-12-31 23:59:59.999999 instead of null.
>>
>> On Thu, Dec 3, 2015 at 1:57 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> Licenses and signature are all fine.
>>>
>>> Docker integration tests consistently fail for me with Java 7 / Ubuntu
>>> and "-Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver"
>>>
>>> *** RUN ABORTED ***
>>>   java.lang.NoSuchMethodError:
>>>
>>> org.apache.http.impl.client.HttpClientBuilder.setConnectionManagerShared(Z)Lorg/apache/http/impl/client/HttpClientBuilder;
>>>   at
>>> org.glassfish.jersey.apache.connector.ApacheConnector.<init>(ApacheConnector.java:240)
>>>   at
>>> org.glassfish.jersey.apache.connector.ApacheConnectorProvider.getConnector(ApacheConnectorProvider.java:115)
>>>   at
>>> org.glassfish.jersey.client.ClientConfig$State.initRuntime(ClientConfig.java:418)
>>>   at
>>> org.glassfish.jersey.client.ClientConfig$State.access$000(ClientConfig.java:88)
>>>   at
>>> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:120)
>>>   at
>>> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:117)
>>>   at
>>> org.glassfish.jersey.internal.util.collection.Values$LazyValueImpl.get(Values.java:340)
>>>   at
>>> org.glassfish.jersey.client.ClientConfig.getRuntime(ClientConfig.java:726)
>>>   at
>>> org.glassfish.jersey.client.ClientRequest.getConfiguration(ClientRequest.java:285)
>>>   at
>>> org.glassfish.jersey.client.JerseyInvocation.validateHttpMethodAndEntity(JerseyInvocation.java:126)
>>>
>>> I also get this failure consistently:
>>>
>>> DirectKafkaStreamSuite
>>> - offset recovery *** FAILED ***
>>>   recoveredOffsetRanges.forall(((or: (org.apache.spark.streaming.Time,
>>> Array[org.apache.spark.streaming.kafka.OffsetRange])) =>
>>>
>>> earlierOffsetRangesAsSets.contains(scala.Tuple2.apply[org.apache.spark.streaming.Time,
>>>
>>> scala.collection.immutable.Set[org.apache.spark.streaming.kafka.OffsetRange]](or._1,
>>>
>>> scala.this.Predef.refArrayOps[org.apache.spark.streaming.kafka.OffsetRange](or._2).toSet[org.apache.spark.streaming.kafka.OffsetRange]))))
>>> was false Recovered ranges are not the same as the ones generated
>>> (DirectKafkaStreamSuite.scala:301)
>>>
>>> On Wed, Dec 2, 2015 at 8:26 PM, Michael Armbrust <mi...@databricks.com>
>>> wrote:
>>> > Please vote on releasing the following candidate as Apache Spark
>>> version
>>> > 1.6.0!
>>> >
>>> > The vote is open until Saturday, December 5, 2015 at 21:00 UTC and
>>> passes if
>>> > a majority of at least 3 +1 PMC votes are cast.
>>> >
>>> > [ ] +1 Release this package as Apache Spark 1.6.0
>>> > [ ] -1 Do not release this package because ...
>>> >
>>> > To learn more about Apache Spark, please see http://spark.apache.org/
>>> >
>>> > The tag to be voted on is v1.6.0-rc1
>>> > (bf525845cef159d2d4c9f4d64e158f037179b5c4)
>>> >
>>> > The release files, including signatures, digests, etc. can be found at:
>>> >
>>> http://people.apache.org/~pwendell/spark-releases/spark-v1.6.0-rc1-bin/
>>> >
>>> > Release artifacts are signed with the following key:
>>> > https://people.apache.org/keys/committer/pwendell.asc
>>> >
>>> > The staging repository for this release can be found at:
>>> >
>>> https://repository.apache.org/content/repositories/orgapachespark-1165/
>>> >
>>> > The test repository (versioned as v1.6.0-rc1) for this release can be
>>> found
>>> > at:
>>> >
>>> https://repository.apache.org/content/repositories/orgapachespark-1164/
>>> >
>>> > The documentation corresponding to this release can be found at:
>>> >
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc1-docs/
>>> >
>>> >
>>> > =======================================
>>> > == How can I help test this release? ==
>>> > =======================================
>>> > If you are a Spark user, you can help us test this release by taking an
>>> > existing Spark workload and running on this release candidate, then
>>> > reporting any regressions.
>>> >
>>> > ================================================
>>> > == What justifies a -1 vote for this release? ==
>>> > ================================================
>>> > This vote is happening towards the end of the 1.6 QA period, so -1
>>> votes
>>> > should only occur for significant regressions from 1.5. Bugs already
>>> present
>>> > in 1.5, minor regressions, or bugs related to new features will not
>>> block
>>> > this release.
>>> >
>>> > ===============================================================
>>> > == What should happen to JIRA tickets still targeting 1.6.0? ==
>>> > ===============================================================
>>> > 1. It is OK for documentation patches to target 1.6.0 and still go into
>>> > branch-1.6, since documentation will be published separately from the
>>> > release.
>>> > 2. New features for non-alpha-modules should target 1.7+.
>>> > 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the
>>> target
>>> > version.
>>> >
>>> >
>>> > ==================================================
>>> > == Major changes to help you focus your testing ==
>>> > ==================================================
>>> >
>>> > Spark SQL
>>> >
>>> > SPARK-10810 Session Management - The ability to create multiple
>>> isolated SQL
>>> > Contexts that have their own configuration and default database.  This
>>> is
>>> > turned on by default in the thrift server.
>>> > SPARK-9999  Dataset API - A type-safe API (similar to RDDs) that
>>> performs
>>> > many operations on serialized binary data and code generation (i.e.
>>> Project
>>> > Tungsten).
>>> > SPARK-10000 Unified Memory Management - Shared memory for execution and
>>> > caching instead of exclusive division of the regions.
>>> > SPARK-11197 SQL Queries on Files - Concise syntax for running SQL
>>> queries
>>> > over files of any supported format without registering a table.
>>> > SPARK-11745 Reading non-standard JSON files - Added options to read
>>> > non-standard JSON files (e.g. single-quotes, unquoted attributes)
>>> > SPARK-10412 Per-operator Metrics for SQL Execution - Display statistics
>>> on a
>>> > per-operator basis for memory usage and spilled data size.
>>> > SPARK-11329 Star (*) expansion for StructTypes - Makes it easier to
>>> nest and
>>> > unnest arbitrary numbers of columns
>>> > SPARK-10917, SPARK-11149 In-memory Columnar Cache Performance -
>>> Significant
>>> > (up to 14x) speed up when caching data that contains complex types in
>>> > DataFrames or SQL.
>>> > SPARK-11111 Fast null-safe joins - Joins using null-safe equality
>>> (<=>) will
>>> > now execute using SortMergeJoin instead of computing a Cartesian
>>> product.
>>> > SPARK-11389 SQL Execution Using Off-Heap Memory - Support for
>>> configuring
>>> > query execution to occur using off-heap memory to avoid GC overhead
>>> > SPARK-10978 Datasource API Avoid Double Filter - When implementing a
>>> > datasource with filter pushdown, developers can now tell Spark SQL to
>>> avoid
>>> > double evaluating a pushed-down filter.
>>> > SPARK-4849  Advanced Layout of Cached Data - storing partitioning and
>>> > ordering schemes in In-memory table scan, and adding distributeBy and
>>> > localSort to DF API
>>> > SPARK-9858  Adaptive query execution - Initial support for
>>> automatically
>>> > selecting the number of reducers for joins and aggregations.
>>> >
>>> > Spark Streaming
>>> >
>>> > API Updates
>>> >
>>> > SPARK-2629  New improved state management - trackStateByKey - a DStream
>>> > transformation for stateful stream processing, supersedes
>>> updateStateByKey
>>> > in functionality and performance.
>>> > SPARK-11198 Kinesis record deaggregation - Kinesis streams have been
>>> > upgraded to use KCL 1.4.0 and support transparent deaggregation of
>>> > KPL-aggregated records.
>>> > SPARK-10891 Kinesis message handler function - Allows arbitrary
>>> function to
>>> > be applied to a Kinesis record in the Kinesis receiver to customize
>>> > what data is to be stored in memory.
>>> > SPARK-6328  Python Streaming Listener API - Get streaming statistics
>>> > (scheduling delays, batch processing times, etc.) in streaming.
>>> >
>>> > UI Improvements
>>> >
>>> > Made failures visible in the streaming tab, in the timelines, batch
>>> list,
>>> > and batch details page.
>>> > Made output operations visible in the streaming tab as progress bars
>>> >
>>> > MLlib
>>> >
>>> > New algorithms/models
>>> >
>>> > SPARK-8518  Survival analysis - Log-linear model for survival analysis
>>> > SPARK-9834  Normal equation for least squares - Normal equation solver,
>>> > providing R-like model summary statistics
>>> > SPARK-3147  Online hypothesis testing - A/B testing in the Spark
>>> Streaming
>>> > framework
>>> > SPARK-9930  New feature transformers - ChiSqSelector,
>>> QuantileDiscretizer,
>>> > SQL transformer
>>> > SPARK-6517  Bisecting K-Means clustering - Fast top-down clustering
>>> variant
>>> > of K-Means
>>> >
>>> > API improvements
>>> >
>>> > ML Pipelines
>>> >
>>> > SPARK-6725  Pipeline persistence - Save/load for ML Pipelines, with
>>> partial
>>> > coverage of spark.ml algorithms
>>> > SPARK-5565  LDA in ML Pipelines - API for Latent Dirichlet Allocation
>>> in ML
>>> > Pipelines
>>> >
>>> > R API
>>> >
>>> > SPARK-9836  R-like statistics for GLMs - (Partial) R-like stats for
>>> ordinary
>>> > least squares via summary(model)
>>> > SPARK-9681  Feature interactions in R formula - Interaction operator
>>> ":" in
>>> > R formula
>>> >
>>> > Python API - Many improvements to Python API to approach feature parity
>>> >
>>> > Misc improvements
>>> >
>>> > SPARK-7685 , SPARK-9642  Instance weights for GLMs - Logistic and
>>> Linear
>>> > Regression can take instance weights
>>> > SPARK-10384, SPARK-10385 Univariate and bivariate statistics in
>>> DataFrames -
>>> > Variance, stddev, correlations, etc.
>>> > SPARK-10117 LIBSVM data source - LIBSVM as a SQL data source
>>> >
>>> > Documentation improvements
>>> >
>>> > SPARK-7751  @since versions - Documentation includes initial version
>>> when
>>> > classes and methods were added
>>> > SPARK-11337 Testable example code - Automated testing for code in user
>>> guide
>>> > examples
>>> >
>>> > Deprecations
>>> >
>>> > In spark.mllib.clustering.KMeans, the "runs" parameter has been
>>> deprecated.
>>> > In spark.ml.classification.LogisticRegressionModel and
>>> > spark.ml.regression.LinearRegressionModel, the "weights" field has been
>>> > deprecated, in favor of the new name "coefficients." This helps
>>> disambiguate
>>> > from instance (row) weights given to algorithms.
>>> >
>>> > Changes of behavior
>>> >
>>> > spark.mllib.tree.GradientBoostedTrees validationTol has changed
>>> semantics in
>>> > 1.6. Previously, it was a threshold for absolute change in error. Now,
>>> it
>>> > resembles the behavior of GradientDescent convergenceTol: For large
>>> errors,
>>> > it uses relative error (relative to the previous error); for small
>>> errors (<
>>> > 0.01), it uses absolute error.
>>> > spark.ml.feature.RegexTokenizer: Previously, it did not convert
>>> strings to
>>> > lowercase before tokenizing. Now, it converts to lowercase by default,
>>> with
>>> > an option not to. This matches the behavior of the simpler Tokenizer
>>> > transformer.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>
>>>
>>
>

Re: [VOTE] Release Apache Spark 1.6.0 (RC1)

Posted by Yin Huai <yh...@databricks.com>.
-1

Two blocker bugs have been found after this RC.
https://issues.apache.org/jira/browse/SPARK-12089 can cause data corruption
when an external sorter spills data.
https://issues.apache.org/jira/browse/SPARK-12155 can prevent tasks from
acquiring memory even when the executor can in fact allocate memory by
evicting storage memory.

https://issues.apache.org/jira/browse/SPARK-12089 has been fixed. We are
still working on https://issues.apache.org/jira/browse/SPARK-12155.
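
For anyone who wants to stress this area while testing the RC, here is a
minimal sketch of the kind of workload that exercises the
execution-versus-storage interplay involved (the app name, local master,
and data sizes are illustrative assumptions; the two config keys are the
1.6 unified-memory settings):

import org.apache.spark.{SparkConf, SparkContext}

object UnifiedMemorySmokeTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("unified-memory-smoke-test")
      .setMaster("local[2]")
      .set("spark.memory.fraction", "0.75")        // combined execution + storage pool
      .set("spark.memory.storageFraction", "0.5")  // storage portion protected from eviction
    val sc = new SparkContext(conf)

    // Fill storage memory with a cached RDD ...
    val cached = sc.parallelize(1 to 5000000).cache()
    cached.count()

    // ... then run a shuffle-heavy aggregation so execution memory has to grow,
    // which may require evicting cached blocks (the interaction described above).
    val sums = cached.map(i => (i % 1000, i.toLong)).reduceByKey(_ + _)
    println(sums.count())

    sc.stop()
  }
}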

On Fri, Dec 4, 2015 at 3:04 PM, Mark Hamstra <ma...@clearstorydata.com>
wrote:

> 0
>
> Currently figuring out who is responsible for the regression that I am
> seeing in some user code ScalaUDFs that make use of Timestamps, where
> NULL from a CSV file read in via TestHive#registerTestTable is now
> producing 1969-12-31 23:59:59.999999 instead of null.
>
> On Thu, Dec 3, 2015 at 1:57 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> Licenses and signature are all fine.
>>
>> Docker integration tests consistently fail for me with Java 7 / Ubuntu
>> and "-Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver"
>>
>> *** RUN ABORTED ***
>>   java.lang.NoSuchMethodError:
>>
>> org.apache.http.impl.client.HttpClientBuilder.setConnectionManagerShared(Z)Lorg/apache/http/impl/client/HttpClientBuilder;
>>   at
>> org.glassfish.jersey.apache.connector.ApacheConnector.<init>(ApacheConnector.java:240)
>>   at
>> org.glassfish.jersey.apache.connector.ApacheConnectorProvider.getConnector(ApacheConnectorProvider.java:115)
>>   at
>> org.glassfish.jersey.client.ClientConfig$State.initRuntime(ClientConfig.java:418)
>>   at
>> org.glassfish.jersey.client.ClientConfig$State.access$000(ClientConfig.java:88)
>>   at
>> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:120)
>>   at
>> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:117)
>>   at
>> org.glassfish.jersey.internal.util.collection.Values$LazyValueImpl.get(Values.java:340)
>>   at
>> org.glassfish.jersey.client.ClientConfig.getRuntime(ClientConfig.java:726)
>>   at
>> org.glassfish.jersey.client.ClientRequest.getConfiguration(ClientRequest.java:285)
>>   at
>> org.glassfish.jersey.client.JerseyInvocation.validateHttpMethodAndEntity(JerseyInvocation.java:126)
>>
>> I also get this failure consistently:
>>
>> DirectKafkaStreamSuite
>> - offset recovery *** FAILED ***
>>   recoveredOffsetRanges.forall(((or: (org.apache.spark.streaming.Time,
>> Array[org.apache.spark.streaming.kafka.OffsetRange])) =>
>>
>> earlierOffsetRangesAsSets.contains(scala.Tuple2.apply[org.apache.spark.streaming.Time,
>>
>> scala.collection.immutable.Set[org.apache.spark.streaming.kafka.OffsetRange]](or._1,
>>
>> scala.this.Predef.refArrayOps[org.apache.spark.streaming.kafka.OffsetRange](or._2).toSet[org.apache.spark.streaming.kafka.OffsetRange]))))
>> was false Recovered ranges are not the same as the ones generated
>> (DirectKafkaStreamSuite.scala:301)
>>
>> On Wed, Dec 2, 2015 at 8:26 PM, Michael Armbrust <mi...@databricks.com>
>> wrote:
>> > Please vote on releasing the following candidate as Apache Spark version
>> > 1.6.0!
>> >
>> > The vote is open until Saturday, December 5, 2015 at 21:00 UTC and
>> passes if
>> > a majority of at least 3 +1 PMC votes are cast.
>> >
>> > [ ] +1 Release this package as Apache Spark 1.6.0
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see http://spark.apache.org/
>> >
>> > The tag to be voted on is v1.6.0-rc1
>> > (bf525845cef159d2d4c9f4d64e158f037179b5c4)
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > http://people.apache.org/~pwendell/spark-releases/spark-v1.6.0-rc1-bin/
>> >
>> > Release artifacts are signed with the following key:
>> > https://people.apache.org/keys/committer/pwendell.asc
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1165/
>> >
>> > The test repository (versioned as v1.6.0-rc1) for this release can be
>> found
>> > at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1164/
>> >
>> > The documentation corresponding to this release can be found at:
>> > http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc1-docs/
>> >
>> >
>> > =======================================
>> > == How can I help test this release? ==
>> > =======================================
>> > If you are a Spark user, you can help us test this release by taking an
>> > existing Spark workload and running on this release candidate, then
>> > reporting any regressions.
>> >
>> > ================================================
>> > == What justifies a -1 vote for this release? ==
>> > ================================================
>> > This vote is happening towards the end of the 1.6 QA period, so -1 votes
>> > should only occur for significant regressions from 1.5. Bugs already
>> present
>> > in 1.5, minor regressions, or bugs related to new features will not
>> block
>> > this release.
>> >
>> > ===============================================================
>> > == What should happen to JIRA tickets still targeting 1.6.0? ==
>> > ===============================================================
>> > 1. It is OK for documentation patches to target 1.6.0 and still go into
>> > branch-1.6, since documentation will be published separately from the
>> > release.
>> > 2. New features for non-alpha-modules should target 1.7+.
>> > 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the
>> target
>> > version.
>> >
>> >
>> > ==================================================
>> > == Major changes to help you focus your testing ==
>> > ==================================================
>> >
>> > Spark SQL
>> >
>> > SPARK-10810 Session Management - The ability to create multiple
>> isolated SQL
>> > Contexts that have their own configuration and default database.  This
>> is
>> > turned on by default in the thrift server.
>> > SPARK-9999  Dataset API - A type-safe API (similar to RDDs) that
>> performs
>> > many operations on serialized binary data and code generation (i.e.
>> Project
>> > Tungsten).
>> > SPARK-10000 Unified Memory Management - Shared memory for execution and
>> > caching instead of exclusive division of the regions.
>> > SPARK-11197 SQL Queries on Files - Concise syntax for running SQL
>> queries
>> > over files of any supported format without registering a table.
>> > SPARK-11745 Reading non-standard JSON files - Added options to read
>> > non-standard JSON files (e.g. single-quotes, unquoted attributes)
>> > SPARK-10412 Per-operator Metrics for SQL Execution - Display statistics
>> on a
>> > per-operator basis for memory usage and spilled data size.
>> > SPARK-11329 Star (*) expansion for StructTypes - Makes it easier to
>> nest and
>> > unnest arbitrary numbers of columns
>> > SPARK-10917, SPARK-11149 In-memory Columnar Cache Performance -
>> Significant
>> > (up to 14x) speed up when caching data that contains complex types in
>> > DataFrames or SQL.
>> > SPARK-11111 Fast null-safe joins - Joins using null-safe equality (<=>)
>> will
>> > now execute using SortMergeJoin instead of computing a Cartesian
>> product.
>> > SPARK-11389 SQL Execution Using Off-Heap Memory - Support for
>> configuring
>> > query execution to occur using off-heap memory to avoid GC overhead
>> > SPARK-10978 Datasource API Avoid Double Filter - When implementing a
>> > datasource with filter pushdown, developers can now tell Spark SQL to
>> avoid
>> > double evaluating a pushed-down filter.
>> > SPARK-4849  Advanced Layout of Cached Data - storing partitioning and
>> > ordering schemes in In-memory table scan, and adding distributeBy and
>> > localSort to DF API
>> > SPARK-9858  Adaptive query execution - Initial support for automatically
>> > selecting the number of reducers for joins and aggregations.
>> >
>> > Spark Streaming
>> >
>> > API Updates
>> >
>> > SPARK-2629  New improved state management - trackStateByKey - a DStream
>> > transformation for stateful stream processing, supersedes
>> updateStateByKey
>> > in functionality and performance.
>> > SPARK-11198 Kinesis record deaggregation - Kinesis streams have been
>> > upgraded to use KCL 1.4.0 and support transparent deaggregation of
>> > KPL-aggregated records.
>> > SPARK-10891 Kinesis message handler function - Allows arbitrary
>> function to
>> > be applied to a Kinesis record in the Kinesis receiver to customize
>> > what data is to be stored in memory.
>> > SPARK-6328  Python Streaming Listener API - Get streaming statistics
>> > (scheduling delays, batch processing times, etc.) in streaming.
>> >
>> > UI Improvements
>> >
>> > Made failures visible in the streaming tab, in the timelines, batch
>> list,
>> > and batch details page.
>> > Made output operations visible in the streaming tab as progress bars
>> >
>> > MLlib
>> >
>> > New algorithms/models
>> >
>> > SPARK-8518  Survival analysis - Log-linear model for survival analysis
>> > SPARK-9834  Normal equation for least squares - Normal equation solver,
>> > providing R-like model summary statistics
>> > SPARK-3147  Online hypothesis testing - A/B testing in the Spark
>> Streaming
>> > framework
>> > SPARK-9930  New feature transformers - ChiSqSelector,
>> QuantileDiscretizer,
>> > SQL transformer
>> > SPARK-6517  Bisecting K-Means clustering - Fast top-down clustering
>> variant
>> > of K-Means
>> >
>> > API improvements
>> >
>> > ML Pipelines
>> >
>> > SPARK-6725  Pipeline persistence - Save/load for ML Pipelines, with
>> partial
>> > coverage of spark.ml algorithms
>> > SPARK-5565  LDA in ML Pipelines - API for Latent Dirichlet Allocation
>> in ML
>> > Pipelines
>> >
>> > R API
>> >
>> > SPARK-9836  R-like statistics for GLMs - (Partial) R-like stats for
>> ordinary
>> > least squares via summary(model)
>> > SPARK-9681  Feature interactions in R formula - Interaction operator
>> ":" in
>> > R formula
>> >
>> > Python API - Many improvements to Python API to approach feature parity
>> >
>> > Misc improvements
>> >
>> > SPARK-7685 , SPARK-9642  Instance weights for GLMs - Logistic and Linear
>> > Regression can take instance weights
>> > SPARK-10384, SPARK-10385 Univariate and bivariate statistics in
>> DataFrames -
>> > Variance, stddev, correlations, etc.
>> > SPARK-10117 LIBSVM data source - LIBSVM as a SQL data source
>> >
>> > Documentation improvements
>> >
>> > SPARK-7751  @since versions - Documentation includes initial version
>> when
>> > classes and methods were added
>> > SPARK-11337 Testable example code - Automated testing for code in user
>> guide
>> > examples
>> >
>> > Deprecations
>> >
>> > In spark.mllib.clustering.KMeans, the "runs" parameter has been
>> deprecated.
>> > In spark.ml.classification.LogisticRegressionModel and
>> > spark.ml.regression.LinearRegressionModel, the "weights" field has been
>> > deprecated, in favor of the new name "coefficients." This helps
>> disambiguate
>> > from instance (row) weights given to algorithms.
>> >
>> > Changes of behavior
>> >
>> > spark.mllib.tree.GradientBoostedTrees validationTol has changed
>> semantics in
>> > 1.6. Previously, it was a threshold for absolute change in error. Now,
>> it
>> > resembles the behavior of GradientDescent convergenceTol: For large
>> errors,
>> > it uses relative error (relative to the previous error); for small
>> errors (<
>> > 0.01), it uses absolute error.
>> > spark.ml.feature.RegexTokenizer: Previously, it did not convert strings
>> to
>> > lowercase before tokenizing. Now, it converts to lowercase by default,
>> with
>> > an option not to. This matches the behavior of the simpler Tokenizer
>> > transformer.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> For additional commands, e-mail: dev-help@spark.apache.org
>>
>>
>

Re: [VOTE] Release Apache Spark 1.6.0 (RC1)

Posted by Mark Hamstra <ma...@clearstorydata.com>.
0

Currently figuring out who is responsible for the regression that I am
seeing in some user code ScalaUDFs that make use of Timestamps, where
NULL from a CSV file read in via TestHive#registerTestTable is now
producing 1969-12-31 23:59:59.999999 instead of null.
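
For context, a minimal sketch of the kind of ScalaUDF-over-Timestamps usage
in question (the column names, sample rows, and plain SQLContext are
illustrative assumptions; the actual case reads NULLs from a CSV file
registered via TestHive#registerTestTable):

import java.sql.Timestamp
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.udf

object TimestampUdfNullCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("timestamp-udf-null-check").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // A nullable timestamp column standing in for the CSV-backed test table.
    val df = Seq(
      (1, Some(Timestamp.valueOf("2015-12-04 12:00:00"))),
      (2, None: Option[Timestamp])
    ).toDF("id", "ts")

    // A ScalaUDF over Timestamps: a null input should stay null in the output,
    // not turn into 1969-12-31 23:59:59.999999.
    val tsToMillis = udf((t: Timestamp) => Option(t).map(_.getTime))

    df.select($"id", tsToMillis($"ts").as("millis")).show()
    sc.stop()
  }
}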

On Thu, Dec 3, 2015 at 1:57 PM, Sean Owen <so...@cloudera.com> wrote:

> Licenses and signature are all fine.
>
> Docker integration tests consistently fail for me with Java 7 / Ubuntu
> and "-Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver"
>
> *** RUN ABORTED ***
>   java.lang.NoSuchMethodError:
>
> org.apache.http.impl.client.HttpClientBuilder.setConnectionManagerShared(Z)Lorg/apache/http/impl/client/HttpClientBuilder;
>   at
> org.glassfish.jersey.apache.connector.ApacheConnector.<init>(ApacheConnector.java:240)
>   at
> org.glassfish.jersey.apache.connector.ApacheConnectorProvider.getConnector(ApacheConnectorProvider.java:115)
>   at
> org.glassfish.jersey.client.ClientConfig$State.initRuntime(ClientConfig.java:418)
>   at
> org.glassfish.jersey.client.ClientConfig$State.access$000(ClientConfig.java:88)
>   at
> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:120)
>   at
> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:117)
>   at
> org.glassfish.jersey.internal.util.collection.Values$LazyValueImpl.get(Values.java:340)
>   at
> org.glassfish.jersey.client.ClientConfig.getRuntime(ClientConfig.java:726)
>   at
> org.glassfish.jersey.client.ClientRequest.getConfiguration(ClientRequest.java:285)
>   at
> org.glassfish.jersey.client.JerseyInvocation.validateHttpMethodAndEntity(JerseyInvocation.java:126)
>
> I also get this failure consistently:
>
> DirectKafkaStreamSuite
> - offset recovery *** FAILED ***
>   recoveredOffsetRanges.forall(((or: (org.apache.spark.streaming.Time,
> Array[org.apache.spark.streaming.kafka.OffsetRange])) =>
>
> earlierOffsetRangesAsSets.contains(scala.Tuple2.apply[org.apache.spark.streaming.Time,
>
> scala.collection.immutable.Set[org.apache.spark.streaming.kafka.OffsetRange]](or._1,
>
> scala.this.Predef.refArrayOps[org.apache.spark.streaming.kafka.OffsetRange](or._2).toSet[org.apache.spark.streaming.kafka.OffsetRange]))))
> was false Recovered ranges are not the same as the ones generated
> (DirectKafkaStreamSuite.scala:301)
>
> On Wed, Dec 2, 2015 at 8:26 PM, Michael Armbrust <mi...@databricks.com>
> wrote:
> > Please vote on releasing the following candidate as Apache Spark version
> > 1.6.0!
> >
> > The vote is open until Saturday, December 5, 2015 at 21:00 UTC and
> passes if
> > a majority of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 1.6.0
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > The tag to be voted on is v1.6.0-rc1
> > (bf525845cef159d2d4c9f4d64e158f037179b5c4)
> >
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~pwendell/spark-releases/spark-v1.6.0-rc1-bin/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1165/
> >
> > The test repository (versioned as v1.6.0-rc1) for this release can be
> found
> > at:
> > https://repository.apache.org/content/repositories/orgapachespark-1164/
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc1-docs/
> >
> >
> > =======================================
> > == How can I help test this release? ==
> > =======================================
> > If you are a Spark user, you can help us test this release by taking an
> > existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > ================================================
> > == What justifies a -1 vote for this release? ==
> > ================================================
> > This vote is happening towards the end of the 1.6 QA period, so -1 votes
> > should only occur for significant regressions from 1.5. Bugs already
> present
> > in 1.5, minor regressions, or bugs related to new features will not block
> > this release.
> >
> > ===============================================================
> > == What should happen to JIRA tickets still targeting 1.6.0? ==
> > ===============================================================
> > 1. It is OK for documentation patches to target 1.6.0 and still go into
> > branch-1.6, since documentation will be published separately from the
> > release.
> > 2. New features for non-alpha-modules should target 1.7+.
> > 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
> > version.
> >
> >
> > ==================================================
> > == Major changes to help you focus your testing ==
> > ==================================================
> >
> > Spark SQL
> >
> > SPARK-10810 Session Management - The ability to create multiple isolated
> SQL
> > Contexts that have their own configuration and default database.  This is
> > turned on by default in the thrift server.
> > SPARK-9999  Dataset API - A type-safe API (similar to RDDs) that performs
> > many operations on serialized binary data and code generation (i.e.
> Project
> > Tungsten).
> > SPARK-10000 Unified Memory Management - Shared memory for execution and
> > caching instead of exclusive division of the regions.
> > SPARK-11197 SQL Queries on Files - Concise syntax for running SQL queries
> > over files of any supported format without registering a table.
> > SPARK-11745 Reading non-standard JSON files - Added options to read
> > non-standard JSON files (e.g. single-quotes, unquoted attributes)
> > SPARK-10412 Per-operator Metrics for SQL Execution - Display statistics
> on a
> > per-operator basis for memory usage and spilled data size.
> > SPARK-11329 Star (*) expansion for StructTypes - Makes it easier to nest
> and
> > unnest arbitrary numbers of columns
> > SPARK-10917, SPARK-11149 In-memory Columnar Cache Performance -
> Significant
> > (up to 14x) speed up when caching data that contains complex types in
> > DataFrames or SQL.
> > SPARK-11111 Fast null-safe joins - Joins using null-safe equality (<=>)
> will
> > now execute using SortMergeJoin instead of computing a Cartesian product.
> > SPARK-11389 SQL Execution Using Off-Heap Memory - Support for configuring
> > query execution to occur using off-heap memory to avoid GC overhead
> > SPARK-10978 Datasource API Avoid Double Filter - When implementing a
> > datasource with filter pushdown, developers can now tell Spark SQL to
> avoid
> > double evaluating a pushed-down filter.
> > SPARK-4849  Advanced Layout of Cached Data - storing partitioning and
> > ordering schemes in In-memory table scan, and adding distributeBy and
> > localSort to DF API
> > SPARK-9858  Adaptive query execution - Initial support for automatically
> > selecting the number of reducers for joins and aggregations.
> >
> > Spark Streaming
> >
> > API Updates
> >
> > SPARK-2629  New improved state management - trackStateByKey - a DStream
> > transformation for stateful stream processing, supersedes
> updateStateByKey
> > in functionality and performance.
> > SPARK-11198 Kinesis record deaggregation - Kinesis streams have been
> > upgraded to use KCL 1.4.0 and support transparent deaggregation of
> > KPL-aggregated records.
> > SPARK-10891 Kinesis message handler function - Allows arbitrary function
> to
> > be applied to a Kinesis record in the Kinesis receiver to customize
> > what data is to be stored in memory.
> > SPARK-6328  Python Streaming Listener API - Get streaming statistics
> > (scheduling delays, batch processing times, etc.) in streaming.
> >
> > UI Improvements
> >
> > Made failures visible in the streaming tab, in the timelines, batch list,
> > and batch details page.
> > Made output operations visible in the streaming tab as progress bars
> >
> > MLlib
> >
> > New algorithms/models
> >
> > SPARK-8518  Survival analysis - Log-linear model for survival analysis
> > SPARK-9834  Normal equation for least squares - Normal equation solver,
> > providing R-like model summary statistics
> > SPARK-3147  Online hypothesis testing - A/B testing in the Spark
> Streaming
> > framework
> > SPARK-9930  New feature transformers - ChiSqSelector,
> QuantileDiscretizer,
> > SQL transformer
> > SPARK-6517  Bisecting K-Means clustering - Fast top-down clustering
> variant
> > of K-Means
> >
> > API improvements
> >
> > ML Pipelines
> >
> > SPARK-6725  Pipeline persistence - Save/load for ML Pipelines, with
> partial
> > coverage of spark.ml algorithms
> > SPARK-5565  LDA in ML Pipelines - API for Latent Dirichlet Allocation in
> ML
> > Pipelines
> >
> > R API
> >
> > SPARK-9836  R-like statistics for GLMs - (Partial) R-like stats for
> ordinary
> > least squares via summary(model)
> > SPARK-9681  Feature interactions in R formula - Interaction operator ":"
> in
> > R formula
> >
> > Python API - Many improvements to Python API to approach feature parity
> >
> > Misc improvements
> >
> > SPARK-7685 , SPARK-9642  Instance weights for GLMs - Logistic and Linear
> > Regression can take instance weights
> > SPARK-10384, SPARK-10385 Univariate and bivariate statistics in
> DataFrames -
> > Variance, stddev, correlations, etc.
> > SPARK-10117 LIBSVM data source - LIBSVM as a SQL data source
> >
> > Documentation improvements
> >
> > SPARK-7751  @since versions - Documentation includes initial version when
> > classes and methods were added
> > SPARK-11337 Testable example code - Automated testing for code in user
> guide
> > examples
> >
> > Deprecations
> >
> > In spark.mllib.clustering.KMeans, the "runs" parameter has been
> deprecated.
> > In spark.ml.classification.LogisticRegressionModel and
> > spark.ml.regression.LinearRegressionModel, the "weights" field has been
> > deprecated, in favor of the new name "coefficients." This helps
> disambiguate
> > from instance (row) weights given to algorithms.
> >
> > Changes of behavior
> >
> > spark.mllib.tree.GradientBoostedTrees validationTol has changed
> semantics in
> > 1.6. Previously, it was a threshold for absolute change in error. Now, it
> > resembles the behavior of GradientDescent convergenceTol: For large
> errors,
> > it uses relative error (relative to the previous error); for small
> errors (<
> > 0.01), it uses absolute error.
> > spark.ml.feature.RegexTokenizer: Previously, it did not convert strings
> to
> > lowercase before tokenizing. Now, it converts to lowercase by default,
> with
> > an option not to. This matches the behavior of the simpler Tokenizer
> > transformer.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
>
>

Re: [VOTE] Release Apache Spark 1.6.0 (RC1)

Posted by Sean Owen <so...@cloudera.com>.
Licenses and signature are all fine.

Docker integration tests consistently fail for me with Java 7 / Ubuntu
and "-Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver"

*** RUN ABORTED ***
  java.lang.NoSuchMethodError:
org.apache.http.impl.client.HttpClientBuilder.setConnectionManagerShared(Z)Lorg/apache/http/impl/client/HttpClientBuilder;
  at org.glassfish.jersey.apache.connector.ApacheConnector.<init>(ApacheConnector.java:240)
  at org.glassfish.jersey.apache.connector.ApacheConnectorProvider.getConnector(ApacheConnectorProvider.java:115)
  at org.glassfish.jersey.client.ClientConfig$State.initRuntime(ClientConfig.java:418)
  at org.glassfish.jersey.client.ClientConfig$State.access$000(ClientConfig.java:88)
  at org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:120)
  at org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:117)
  at org.glassfish.jersey.internal.util.collection.Values$LazyValueImpl.get(Values.java:340)
  at org.glassfish.jersey.client.ClientConfig.getRuntime(ClientConfig.java:726)
  at org.glassfish.jersey.client.ClientRequest.getConfiguration(ClientRequest.java:285)
  at org.glassfish.jersey.client.JerseyInvocation.validateHttpMethodAndEntity(JerseyInvocation.java:126)

I also get this failure consistently:

DirectKafkaStreamSuite
- offset recovery *** FAILED ***
  recoveredOffsetRanges.forall(((or: (org.apache.spark.streaming.Time,
Array[org.apache.spark.streaming.kafka.OffsetRange])) =>
earlierOffsetRangesAsSets.contains(scala.Tuple2.apply[org.apache.spark.streaming.Time,
scala.collection.immutable.Set[org.apache.spark.streaming.kafka.OffsetRange]](or._1,
scala.this.Predef.refArrayOps[org.apache.spark.streaming.kafka.OffsetRange](or._2).toSet[org.apache.spark.streaming.kafka.OffsetRange]))))
was false Recovered ranges are not the same as the ones generated
(DirectKafkaStreamSuite.scala:301)


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: [VOTE] Release Apache Spark 1.6.0 (RC1)

Posted by Davies Liu <da...@databricks.com>.
Is https://github.com/apache/spark/pull/10134 a valid fix?
(Still worse than 1.5.)

On Thu, Dec 3, 2015 at 8:45 AM, mkhaitman <ma...@chango.com> wrote:
> I reported this in the 1.6 preview thread, but wouldn't mind if someone could
> confirm that Ctrl-C no longer interrupts / clears the current line of input
> in the pyspark shell. I saw the change that kills the currently running job
> on Ctrl-C, but now the only way to clear the current line of input is to hit
> Enter (which throws an exception). Anyone else seeing this?

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: [VOTE] Release Apache Spark 1.6.0 (RC1)

Posted by mkhaitman <ma...@chango.com>.
I reported this in the 1.6 preview thread, but wouldn't mind if someone could
confirm that Ctrl-C no longer interrupts / clears the current line of input
in the pyspark shell. I saw the change that kills the currently running job
on Ctrl-C, but now the only way to clear the current line of input is to hit
Enter (which throws an exception). Anyone else seeing this?
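
For anyone who wants to reproduce it, a minimal sketch (run inside the pyspark
shell, where sc is already defined; the range size below is arbitrary -- just
large enough that the job keeps running for a while):

    # Start a long job and press Ctrl-C while it runs: the running job is killed.
    sc.range(0, 10 ** 9).map(lambda x: x * x).count()

    # Afterwards, type a partial line at the prompt and press Ctrl-C: the line
    # is no longer cleared; only pressing Enter (which throws an exception)
    # gets back to a clean prompt.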





--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-6-0-RC1-tp15424p15450.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: [VOTE] Release Apache Spark 1.6.0 (RC1)

Posted by dodobidu <ri...@actnowib.com>.
+1 (non-binding)

Tested our pipelines on a Spark 1.6.0 standalone cluster (Python only):
- Pyspark package
- Spark SQL
- Dataframes
- Spark MLlib

No major issues, good performance.


Just one minor behavior difference from version 1.4.1 when using a SQLContext:
"select case myColumn when null then 'Y' else 'N' end" doesn't match any
NULL values in 1.6.0. Had to change it to:
"select case when (myColumn is null) then 'Y' else 'N' end"


Otherwise... good work guys!

Ricardo Almeida



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-6-0-RC1-tp15424p15451.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: [VOTE] Release Apache Spark 1.6.0 (RC1)

Posted by taishi takahashi <ta...@gmail.com>.
Excuse me,

I'm working on SPARK-10259 (its parent issue is SPARK-7751):
https://issues.apache.org/jira/browse/SPARK-10259

The purpose of this issue is to add the @Since annotation to stable and
experimental methods in MLlib.

Under SPARK-7751, this issue and its siblings target version 1.6.0, but some
of them are still in progress.
(One reason is a delay in my own work; for the present, I'm waiting on the
Jenkins tests.)

If these issues are to be merged into v1.6.0, please run the Jenkins tests for
SPARK-10259.
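
For anyone unfamiliar with what the annotation provides, here is a tiny
illustrative sketch in Python of the underlying idea -- tagging a method with
the version it first appeared in and surfacing that in its documentation. This
is not Spark's implementation (MLlib's @Since is a Scala annotation), and the
decorator and method below are made up:

    # Illustrative only: record the "added in" version in the docstring.
    def since(version):
        def decorator(func):
            func.__doc__ = (func.__doc__ or "") + "\n\n.. versionadded:: " + version
            return func
        return decorator

    @since("1.6.0")
    def train(data):
        """Fit a hypothetical model on the given data."""
        return data

    print(train.__doc__)  # the docstring now ends with ".. versionadded:: 1.6.0"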

Thanks,
Hiroshi Takahashi

2015-12-03 11:13 GMT+09:00 Ted Yu <yu...@gmail.com>:

> +1
>
> Ran through test suite (minus docker-integration-tests) which passed.
>
> Overall experience was much better compared with some of the prior RC's.
>
> [INFO] Spark Project External Kafka ....................... SUCCESS [ 53.956 s]
> [INFO] Spark Project Examples ............................. SUCCESS [02:05 min]
> [INFO] Spark Project External Kafka Assembly .............. SUCCESS [ 11.298 s]
> [INFO] ------------------------------------------------------------------------
> [INFO] BUILD SUCCESS
> [INFO] ------------------------------------------------------------------------
> [INFO] Total time: 01:42 h
> [INFO] Finished at: 2015-12-02T17:19:02-08:00
>
> On Wed, Dec 2, 2015 at 4:23 PM, Michael Armbrust <mi...@databricks.com>
>  wrote:
>
>> I'm going to kick the voting off with a +1 (binding).  We ran TPC-DS and
>> most queries are faster than 1.5.  We've also ported several production
>> pipelines to 1.6.
>>
>
>