Posted to dev@spark.apache.org by Michael Armbrust <mi...@databricks.com> on 2015/12/22 21:10:46 UTC

[VOTE] Release Apache Spark 1.6.0 (RC4)

Please vote on releasing the following candidate as Apache Spark version
1.6.0!

The vote is open until Friday, December 25, 2015 at 18:00 UTC and passes if
a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.6.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v1.6.0-rc4
(4062cda3087ae42c6c3cb24508fc1d3a931accdf):
https://github.com/apache/spark/tree/v1.6.0-rc4

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1176/

The test repository (versioned as v1.6.0-rc4) for this release can be found
at:
https://repository.apache.org/content/repositories/orgapachespark-1175/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-docs/

=======================================
== How can I help test this release? ==
=======================================
If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running it on this release candidate, then
reporting any regressions.

================================================
== What justifies a -1 vote for this release? ==
================================================
This vote is happening towards the end of the 1.6 QA period, so -1 votes
should only occur for significant regressions from 1.5. Bugs already
present in 1.5, minor regressions, or bugs related to new features will not
block this release.

===============================================================
== What should happen to JIRA tickets still targeting 1.6.0? ==
===============================================================
1. It is OK for documentation patches to target 1.6.0 and still go into
branch-1.6, since documentation will be published separately from the
release.
2. New features for non-alpha-modules should target 1.7+.
3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
version.


==================================================
== Major changes to help you focus your testing ==
==================================================

Notable changes since 1.6 RC3

  - SPARK-12404 - Fix serialization error for Datasets with
Timestamps/Arrays/Decimal
  - SPARK-12218 - Fix incorrect pushdown of filters to Parquet
  - SPARK-12395 - Fix join columns of outer joins for DataFrame using-column
joins
  - SPARK-12413 - Fix Mesos HA

Notable changes since 1.6 RC2

   - SPARK_VERSION has been set correctly
   - SPARK-12199 - ML Docs are publishing correctly
   - SPARK-12345 - Mesos cluster mode has been fixed

Notable changes since 1.6 RC1

Spark Streaming

   - SPARK-2629  <https://issues.apache.org/jira/browse/SPARK-2629>
   trackStateByKey has been renamed to mapWithState

Spark SQL

   - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
   SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix bugs
   in eviction of storage memory by execution.
   - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258> Fix
   passing null into ScalaUDF

Notable Features Since 1.5

Spark SQL

   - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> Parquet
   Performance - Improve Parquet scan performance when using flat schemas.
   - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
   Session Management - Isolated default database (i.e. USE mydb) even on
   shared clusters.
   - SPARK-9999  <https://issues.apache.org/jira/browse/SPARK-9999> Dataset
   API - A type-safe API (similar to RDDs) that performs many operations on
   serialized binary data and uses code generation (i.e. Project Tungsten).
   See the sketch after this list.
   - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000> Unified
   Memory Management - Shared memory for execution and caching instead of
   exclusive division of the regions.
   - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
   Queries on Files - Concise syntax for running SQL queries over files of
   any supported format without registering a table.
   - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745> Reading
   non-standard JSON files - Added options to read non-standard JSON files
   (e.g. single-quotes, unquoted attributes)
   - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412>
   Per-operator Metrics for SQL Execution - Display statistics on a
   per-operator basis for memory usage and spilled data size.
   - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
   (*) expansion for StructTypes - Makes it easier to nest and unnest
   arbitrary numbers of columns
   - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
   SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149> In-memory
   Columnar Cache Performance - Significant (up to 14x) speed up when
   caching data that contains complex types in DataFrames or SQL.
   - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
   null-safe joins - Joins using null-safe equality (<=>) will now execute
   using SortMergeJoin instead of computing a cartesian product.
   - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
   Execution Using Off-Heap Memory - Support for configuring query
   execution to occur using off-heap memory to avoid GC overhead
   - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978> Datasource
   API Avoid Double Filter - When implementing a datasource with filter
   pushdown, developers can now tell Spark SQL to avoid double-evaluating a
   pushed-down filter.
   - SPARK-4849  <https://issues.apache.org/jira/browse/SPARK-4849> Advanced
   Layout of Cached Data - Storing partitioning and ordering schemes in the
   in-memory table scan, and adding distributeBy and localSort to the
   DataFrame API
   - SPARK-9858  <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive
   query execution - Initial support for automatically selecting the number
   of reducers for joins and aggregations.
   - SPARK-9241  <https://issues.apache.org/jira/browse/SPARK-9241> Improved
   query planner for queries having distinct aggregations - Query plans of
   distinct aggregations are more robust when distinct columns have high
   cardinality.
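
As a quick illustration of the Dataset API above, here is a minimal sketch
against the 1.6 Scala API; the Person case class and the input rows are
hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Long)

    object DatasetSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("dataset-sketch").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // toDS() builds a Dataset[Person] from an implicitly derived encoder;
        // operations can then run over the serialized (Tungsten) form.
        val people = Seq(Person("Ann", 30), Person("Bob", 17)).toDS()

        // Typed transformations are checked at compile time.
        val adultNames = people.filter(_.age >= 18).map(_.name)
        adultNames.collect().foreach(println)

        sc.stop()
      }
    }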

Spark Streaming

   - API Updates
      - SPARK-2629  <https://issues.apache.org/jira/browse/SPARK-2629> New
      improved state management - mapWithState - a DStream transformation
      for stateful stream processing that supersedes updateStateByKey in
      functionality and performance. See the sketch after this list.
      - SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198> Kinesis
      record deaggregation - Kinesis streams have been upgraded to use KCL
      1.4.0 and support transparent deaggregation of KPL-aggregated records.
      - SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891> Kinesis
      message handler function - Allows an arbitrary function to be applied
      to a Kinesis record in the Kinesis receiver to customize what data is
      stored in memory.
      - SPARK-6328  <https://issues.apache.org/jira/browse/SPARK-6328> Python
      Streaming Listener API - Get streaming statistics (scheduling delays,
      batch processing times, etc.) from Python.


   - UI Improvements
      - Made failures visible in the streaming tab, in the timelines, batch
      list, and batch details page.
      - Made output operations visible in the streaming tab as progress
      bars.
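
A minimal sketch of the mapWithState transformation noted above, against the
1.6 streaming API; the socket source, port, and checkpoint directory are
hypothetical:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

    object MapWithStateSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("map-with-state").setMaster("local[2]"),
          Seconds(1))
        ssc.checkpoint("/tmp/checkpoint")  // required for stateful streams

        val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

        // Keep a running count per word in managed state.
        val spec = StateSpec.function(
          (word: String, one: Option[Int], state: State[Int]) => {
            val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
            state.update(sum)
            (word, sum)
          })

        words.map((_, 1)).mapWithState(spec).print()

        ssc.start()
        ssc.awaitTermination()
      }
    }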

MLlib

New algorithms/models

   - SPARK-8518  <https://issues.apache.org/jira/browse/SPARK-8518> Survival
   analysis - Log-linear model for survival analysis
   - SPARK-9834  <https://issues.apache.org/jira/browse/SPARK-9834> Normal
   equation for least squares - Normal equation solver, providing R-like
   model summary statistics
   - SPARK-3147  <https://issues.apache.org/jira/browse/SPARK-3147> Online
   hypothesis testing - A/B testing in the Spark Streaming framework
   - SPARK-9930  <https://issues.apache.org/jira/browse/SPARK-9930> New
   feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
   transformer. See the sketch after this list.
   - SPARK-6517  <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting
   K-Means clustering - Fast top-down clustering variant of K-Means
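
As an example of the new feature transformers above, a minimal
QuantileDiscretizer sketch against the 1.6 spark.ml API (assumes a
spark-shell session with sqlContext in scope; the data is hypothetical):

    import org.apache.spark.ml.feature.QuantileDiscretizer

    // A toy DataFrame with a numeric "hour" column.
    val df = sqlContext.createDataFrame(
      Seq((0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2))
    ).toDF("id", "hour")

    // Fit bucket boundaries from approximate quantiles, then bucketize.
    val discretizer = new QuantileDiscretizer()
      .setInputCol("hour")
      .setOutputCol("bucket")
      .setNumBuckets(3)

    discretizer.fit(df).transform(df).show()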

API improvements

   - ML Pipelines
      - SPARK-6725  <https://issues.apache.org/jira/browse/SPARK-6725> Pipeline
      persistence - Save/load for ML Pipelines, with partial coverage of
      spark.ml algorithms
      - SPARK-5565  <https://issues.apache.org/jira/browse/SPARK-5565> LDA
      in ML Pipelines - API for Latent Dirichlet Allocation in ML Pipelines
   - R API
      - SPARK-9836  <https://issues.apache.org/jira/browse/SPARK-9836> R-like
      statistics for GLMs - (Partial) R-like stats for ordinary least
      squares via summary(model)
      - SPARK-9681  <https://issues.apache.org/jira/browse/SPARK-9681> Feature
      interactions in R formula - Interaction operator ":" in R formula
   - Python API - Many improvements to the Python API to approach feature
   parity

Misc improvements

   - SPARK-7685  <https://issues.apache.org/jira/browse/SPARK-7685>,
   SPARK-9642  <https://issues.apache.org/jira/browse/SPARK-9642> Instance
   weights for GLMs - Logistic and Linear Regression can take instance
   weights
   - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
   SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate
   and bivariate statistics in DataFrames - Variance, stddev, correlations,
   etc.
   - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM
   data source - LIBSVM as a SQL data source. See the sketch below.
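
A one-line illustration of the LIBSVM data source above (a sketch assuming a
spark-shell session with sqlContext in scope; the path is hypothetical):

    // Loads LIBSVM-formatted data as a DataFrame with "label" and
    // "features" columns.
    val training = sqlContext.read
      .format("libsvm")
      .load("data/mllib/sample_libsvm_data.txt")
    training.printSchema()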

Documentation improvements

   - SPARK-7751  <https://issues.apache.org/jira/browse/SPARK-7751> @since
   versions - Documentation includes initial version when classes and
   methods were added
   - SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable
   example code - Automated testing for code in user guide examples

Deprecations

   - In spark.mllib.clustering.KMeans, the "runs" parameter has been
   deprecated.
   - In spark.ml.classification.LogisticRegressionModel and
   spark.ml.regression.LinearRegressionModel, the "weights" field has been
   deprecated in favor of the new name "coefficients." This helps
   disambiguate it from instance (row) weights given to algorithms. See the
   sketch after this list.
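
A short sketch of the rename above, using the 1.6 spark.ml API (the training
DataFrame with "label" and "features" columns is hypothetical):

    import org.apache.spark.ml.classification.LogisticRegression

    val lrModel = new LogisticRegression().fit(training)
    println(lrModel.coefficients)  // new name in 1.6
    // lrModel.weights             // still present, but deprecated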

Changes of behavior

   - spark.mllib.tree.GradientBoostedTrees validationTol has changed
   semantics in 1.6. Previously, it was a threshold for absolute change in
   error. Now, it resembles the behavior of GradientDescent convergenceTol:
   For large errors, it uses relative error (relative to the previous error);
   for small errors (< 0.01), it uses absolute error.
   - spark.ml.feature.RegexTokenizer: Previously, it did not convert
   strings to lowercase before tokenizing. Now, it converts to lowercase by
   default, with an option not to. This matches the behavior of the simpler
   Tokenizer transformer.
   - Spark SQL's partition discovery has been changed to only discover
   partition directories that are children of the given path (i.e. if
   path="/my/data/x=1", then x=1 itself will no longer be considered a
   partition column; only the children of x=1 will be). This behavior can be
   overridden by manually specifying the basePath at which partition
   discovery should start (SPARK-11678
   <https://issues.apache.org/jira/browse/SPARK-11678>). See the sketch
   after this list.
   - When casting a value of an integral type to timestamp (e.g. casting a
   long value to timestamp), the value is treated as being in seconds instead
   of milliseconds (SPARK-11724
   <https://issues.apache.org/jira/browse/SPARK-11724>).
   - With the improved query planner for queries having distinct
   aggregations (SPARK-9241
   <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a query
   having a single distinct aggregation has been changed to a more robust
   version. To switch back to the plan generated by Spark 1.5's planner,
   please set spark.sql.specializeSingleDistinctAggPlanning to true (
   SPARK-12077 <https://issues.apache.org/jira/browse/SPARK-12077>).
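
To illustrate the partition-discovery change above, a sketch of opting back
into the old behavior via basePath (paths are hypothetical; assumes a
spark-shell session with sqlContext in scope):

    // With only the leaf path, x=1 is no longer treated as a partition column:
    val leafOnly = sqlContext.read.parquet("/my/data/x=1")

    // Specifying basePath makes discovery start at /my/data, so the x=1
    // directory is again interpreted as a partition column.
    val partitioned = sqlContext.read
      .option("basePath", "/my/data")
      .parquet("/my/data/x=1")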

Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

Posted by Zsolt Tóth <to...@gmail.com>.
+1 (non binding)

(PySpark K-Means still shows the numeric diff, of course.)

2015-12-23 9:33 GMT+01:00 Kousuke Saruta <sa...@oss.nttdata.co.jp>:

> +1
>
>
> On 2015/12/23 16:14, Jean-Baptiste Onofré wrote:
>
>> +1 (non binding)
>>
>> Tested with samples on standalone and yarn.
>>
>> Regards
>> JB
>>
>> On 12/22/2015 09:10 PM, Michael Armbrust wrote:
>>> [original vote email quoted in full; snipped]

Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

Posted by Kousuke Saruta <sa...@oss.nttdata.co.jp>.
+1

On 2015/12/23 16:14, Jean-Baptiste Onofré wrote:
> +1 (non binding)
>
> Tested with samples on standalone and yarn.
>
> Regards
> JB
>
> On 12/22/2015 09:10 PM, Michael Armbrust wrote:
>> [original vote email quoted in full; snipped]


Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
+1 (non binding)

Tested with samples on standalone and yarn.

Regards
JB

On 12/22/2015 09:10 PM, Michael Armbrust wrote:
> [original vote email quoted in full; snipped]

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com



Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

Posted by Ted Yu <yu...@gmail.com>.
I found that the SBT build for Scala 2.11 has been failing (
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/SPARK-branch-1.6-COMPILE-SBT-SCALA-2.11/3/consoleFull
)

I logged SPARK-12527 and sent a PR.

FYI

Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

Posted by Bhupendra Mishra <bh...@gmail.com>.
+1


Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

Posted by vaquar khan <va...@gmail.com>.
+1

Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

Posted by Vinay Shukla <vi...@gmail.com>.
+1
Tested on HDP 2.3, YARN cluster mode, spark-shell


Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

Posted by Allen Zhang <al...@126.com>.

+1 (non-binding)


I have just built a new binary tarball and tested am.nodelabelexpression and executor.nodelabelexpression manually; the results are as expected.
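
For anyone who wants to repeat this, a minimal sketch (assuming the full
property names from the running-on-YARN docs, spark.yarn.am.nodeLabelExpression
and spark.yarn.executor.nodeLabelExpression; the "gpu" label is only an
example and must actually exist in your YARN cluster):

    import org.apache.spark.{SparkConf, SparkContext}

    // Submit with --master yarn; the label expressions steer the AM and the
    // executors onto nodes carrying the matching YARN node label.
    val conf = new SparkConf()
      .setAppName("node-label-check")
      .set("spark.yarn.am.nodeLabelExpression", "gpu")
      .set("spark.yarn.executor.nodeLabelExpression", "gpu")
    val sc = new SparkContext(conf)
    sc.parallelize(1 to 100).count()  // trivial job to force executor allocation
    sc.stop()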





At 2015-12-23 21:44:08, "Iulian Dragoș" <iu...@typesafe.com> wrote:

+1 (non-binding)


Tested Mesos deployments (client and cluster-mode, fine-grained and coarse-grained). Things look good.


iulian


On Wed, Dec 23, 2015 at 2:35 PM, Sean Owen <so...@cloudera.com> wrote:
Docker integration tests still fail for Mark and me, and should
probably be disabled:
https://issues.apache.org/jira/browse/SPARK-12426

... but if anyone else successfully runs these (and I assume Jenkins
does) then not a blocker.

I'm having intermittent trouble with other tests passing, but nothing unusual.
Sigs and hashes are OK.

We have 30 issues marked as fixed for 1.6.1. All but those resolved in the last
24 hours or so should be marked as fixed for 1.6.0, right? I can touch that up.






On Tue, Dec 22, 2015 at 8:10 PM, Michael Armbrust
<mi...@databricks.com> wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 1.6.0!
>
> The vote is open until Friday, December 25, 2015 at 18:00 UTC and passes if
> a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v1.6.0-rc4
> (4062cda3087ae42c6c3cb24508fc1d3a931accdf)
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1176/
>
> The test repository (versioned as v1.6.0-rc4) for this release can be found
> at:
> https://repository.apache.org/content/repositories/orgapachespark-1175/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-docs/
>
> =======================================
> == How can I help test this release? ==
> =======================================
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> ================================================
> == What justifies a -1 vote for this release? ==
> ================================================
> This vote is happening towards the end of the 1.6 QA period, so -1 votes
> should only occur for significant regressions from 1.5. Bugs already present
> in 1.5, minor regressions, or bugs related to new features will not block
> this release.
>
> ===============================================================
> == What should happen to JIRA tickets still targeting 1.6.0? ==
> ===============================================================
> 1. It is OK for documentation patches to target 1.6.0 and still go into
> branch-1.6, since documentations will be published separately from the
> release.
> 2. New features for non-alpha-modules should target 1.7+.
> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
> version.
>
>
> ==================================================
> == Major changes to help you focus your testing ==
> ==================================================
>
> Notable changes since 1.6 RC3
>
>
>   - SPARK-12404 - Fix serialization error for Datasets with
> Timestamps/Arrays/Decimal
>   - SPARK-12218 - Fix incorrect pushdown of filters to parquet
>   - SPARK-12395 - Fix join columns of outer join for DataFrame using
>   - SPARK-12413 - Fix mesos HA
>
>
> Notable changes since 1.6 RC2
>
>
> - SPARK_VERSION has been set correctly
> - SPARK-12199 ML Docs are publishing correctly
> - SPARK-12345 Mesos cluster mode has been fixed
>
> Notable changes since 1.6 RC1
>
> Spark Streaming
>
> SPARK-2629  trackStateByKey has been renamed to mapWithState
>
> Spark SQL
>
> SPARK-12165 SPARK-12189 Fix bugs in eviction of storage memory by execution.
> SPARK-12258 correct passing null into ScalaUDF
>
> Notable Features Since 1.5
>
> Spark SQL
>
> SPARK-11787 Parquet Performance - Improve Parquet scan performance when
> using flat schemas.
> SPARK-10810 Session Management - Isolated devault database (i.e USE mydb)
> even on shared clusters.
> SPARK-9999  Dataset API - A type-safe API (similar to RDDs) that performs
> many operations on serialized binary data and code generation (i.e. Project
> Tungsten).
> SPARK-10000 Unified Memory Management - Shared memory for execution and
> caching instead of exclusive division of the regions.
> SPARK-11197 SQL Queries on Files - Concise syntax for running SQL queries
> over files of any supported format without registering a table.
> SPARK-11745 Reading non-standard JSON files - Added options to read
> non-standard JSON files (e.g. single-quotes, unquoted attributes)
> SPARK-10412 Per-operator Metrics for SQL Execution - Display statistics on a
> peroperator basis for memory usage and spilled data size.
> SPARK-11329 Star (*) expansion for StructTypes - Makes it easier to nest and
> unest arbitrary numbers of columns
> SPARK-10917, SPARK-11149 In-memory Columnar Cache Performance - Significant
> (up to 14x) speed up when caching data that contains complex types in
> DataFrames or SQL.
> SPARK-11111 Fast null-safe joins - Joins using null-safe equality (<=>) will
> now execute using SortMergeJoin instead of computing a cartisian product.
> SPARK-11389 SQL Execution Using Off-Heap Memory - Support for configuring
> query execution to occur using off-heap memory to avoid GC overhead
> SPARK-10978 Datasource API Avoid Double Filter - When implemeting a
> datasource with filter pushdown, developers can now tell Spark SQL to avoid
> double evaluating a pushed-down filter.
> SPARK-4849  Advanced Layout of Cached Data - storing partitioning and
> ordering schemes in In-memory table scan, and adding distributeBy and
> localSort to DF API
> SPARK-9858  Adaptive query execution - Intial support for automatically
> selecting the number of reducers for joins and aggregations.
> SPARK-9241  Improved query planner for queries having distinct aggregations
> - Query plans of distinct aggregations are more robust when distinct columns
> have high cardinality.
>
> Spark Streaming
>
> API Updates
>
> SPARK-2629  New improved state management - mapWithState - a DStream
> transformation for stateful stream processing, supercedes updateStateByKey
> in functionality and performance.
> SPARK-11198 Kinesis record deaggregation - Kinesis streams have been
> upgraded to use KCL 1.4.0 and supports transparent deaggregation of
> KPL-aggregated records.
> SPARK-10891 Kinesis message handler function - Allows arbitraray function to
> be applied to a Kinesis record in the Kinesis receiver before to customize
> what data is to be stored in memory.
> SPARK-6328  Python Streamng Listener API - Get streaming statistics
> (scheduling delays, batch processing times, etc.) in streaming.
>
> UI Improvements
>
> - Made failures visible in the streaming tab, in the timelines, batch
> list, and batch details page.
> - Made output operations visible in the streaming tab as progress bars.
>
> MLlib
>
> New algorithms/models
>
> - SPARK-8518  Survival analysis - Log-linear model for survival analysis
> - SPARK-9834  Normal equation for least squares - Normal equation solver,
> providing R-like model summary statistics
> - SPARK-3147  Online hypothesis testing - A/B testing in the Spark
> Streaming framework
> - SPARK-9930  New feature transformers - ChiSqSelector, QuantileDiscretizer,
> SQL transformer
> - SPARK-6517  Bisecting K-Means clustering - Fast top-down clustering
> variant of K-Means
>
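A quick way to exercise the new bisecting k-means (SPARK-6517); the
two-dimensional points below are invented, and sc is an existing
SparkContext:

    import org.apache.spark.mllib.clustering.BisectingKMeans
    import org.apache.spark.mllib.linalg.Vectors

    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
      Vectors.dense(9.0, 8.0), Vectors.dense(8.0, 9.0)))

    // Top-down (bisecting) variant of k-means, new in 1.6.
    val model = new BisectingKMeans().setK(2).run(points)
    model.clusterCenters.foreach(println)
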
> API improvements
>
> ML Pipelines
>
> - SPARK-6725  Pipeline persistence - Save/load for ML Pipelines, with
> partial coverage of spark.ml algorithms
> - SPARK-5565  LDA in ML Pipelines - API for Latent Dirichlet Allocation in
> ML Pipelines
>
> R API
>
> - SPARK-9836  R-like statistics for GLMs - (Partial) R-like stats for
> ordinary least squares via summary(model)
> - SPARK-9681  Feature interactions in R formula - Interaction operator ":"
> in R formula
>
> Python API - Many improvements to Python API to approach feature parity
>
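A sketch of pipeline persistence (SPARK-6725); the stage, columns, and path
are placeholders, and as noted above only part of spark.ml is covered:

    import org.apache.spark.ml.{Pipeline, PipelineStage}
    import org.apache.spark.ml.feature.Tokenizer

    val tok = new Tokenizer().setInputCol("text").setOutputCol("words")
    val pipeline = new Pipeline().setStages(Array[PipelineStage](tok))

    pipeline.write.overwrite().save("/tmp/my-pipeline")  // save
    val restored = Pipeline.load("/tmp/my-pipeline")     // load
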
> Misc improvements
>
> - SPARK-7685, SPARK-9642  Instance weights for GLMs - Logistic and Linear
> Regression can take instance weights
> - SPARK-10384, SPARK-10385 Univariate and bivariate statistics in
> DataFrames - Variance, stddev, correlations, etc.
> - SPARK-10117 LIBSVM data source - LIBSVM as a SQL data source
>
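The LIBSVM data source (SPARK-10117) can be exercised with a one-liner; the
file path here is made up:

    // Yields a DataFrame with "label" and "features" columns.
    val svm = sqlContext.read.format("libsvm").load("/tmp/sample_libsvm_data.txt")
    svm.select("label", "features").show(5)
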
> Documentation improvements
>
> - SPARK-7751  @since versions - Documentation includes the initial version
> in which classes and methods were added
> - SPARK-11337 Testable example code - Automated testing for code in user
> guide examples
>
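For SPARK-7751, the versions surface as annotations in the Spark sources; a
made-up example of the style (MyApi and myNewMethod are hypothetical):

    import org.apache.spark.annotation.Since

    object MyApi {
      // Marks the version in which this method was added.
      @Since("1.6.0")
      def myNewMethod(): Unit = ()
    }
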
> Deprecations
>
> - In spark.mllib.clustering.KMeans, the "runs" parameter has been
> deprecated.
> - In spark.ml.classification.LogisticRegressionModel and
> spark.ml.regression.LinearRegressionModel, the "weights" field has been
> deprecated in favor of the new name "coefficients." This helps
> disambiguate it from instance (row) weights given to algorithms.
>
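The rename is visible on fitted models; a sketch, where lrModel is assumed
to be a fitted spark.ml LogisticRegressionModel:

    val coeffs = lrModel.coefficients  // new name in 1.6
    // val w = lrModel.weights         // still works, but deprecated
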
> Changes of behavior
>
> - spark.mllib.tree.GradientBoostedTrees validationTol has changed
> semantics in 1.6. Previously, it was a threshold for absolute change in
> error. Now, it resembles the behavior of GradientDescent convergenceTol:
> for large errors, it uses relative error (relative to the previous error);
> for small errors (< 0.01), it uses absolute error.
> - spark.ml.feature.RegexTokenizer: Previously, it did not convert strings
> to lowercase before tokenizing. Now, it converts to lowercase by default,
> with an option not to. This matches the behavior of the simpler Tokenizer
> transformer.
> - Spark SQL's partition discovery has been changed to only discover
> partition directories that are children of the given path (i.e. if
> path="/my/data/x=1", then x=1 will no longer be considered a partition,
> but only children of x=1 will). This behavior can be overridden by
> manually specifying the basePath that partition discovery should start
> with (SPARK-11678).
> - When casting a value of an integral type to timestamp (e.g. casting a
> long value to timestamp), the value is treated as being in seconds instead
> of milliseconds (SPARK-11724).
> - With the improved query planner for queries having distinct aggregations
> (SPARK-9241), the plan of a query having a single distinct aggregation has
> been changed to a more robust version. To switch back to the plan
> generated by Spark 1.5's planner, please set
> spark.sql.specializeSingleDistinctAggPlanning to true (SPARK-12077).
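Two of the behavior changes above are easy to check from a spark-shell; the
paths and the literal below are invented:

    // SPARK-11678: keep x=1 treated as a partition column by giving the
    // base path that partition discovery should start from.
    val part = sqlContext.read.option("basePath", "/my/data")
      .parquet("/my/data/x=1")

    // SPARK-11724: the integral value is now interpreted as seconds.
    sqlContext.sql("SELECT CAST(1450000000 AS TIMESTAMP)").show()
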


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org







--
Iulian Dragos

------
Reactive Apps on the JVM
www.typesafe.com


Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

Posted by Iulian Dragoș <iu...@typesafe.com>.
+1 (non-binding)

Tested Mesos deployments (client and cluster-mode, fine-grained and
coarse-grained). Things look good
<https://ci.typesafe.com/view/Spark/job/mit-docker-test-ref/8/console>.

iulian

On Wed, Dec 23, 2015 at 2:35 PM, Sean Owen <so...@cloudera.com> wrote:

> Docker integration tests still fail for Mark and me, and should
> probably be disabled:
> https://issues.apache.org/jira/browse/SPARK-12426
>
> ... but if anyone else successfully runs these (and I assume Jenkins
> does) then not a blocker.
>
> I'm having intermittent trouble with other tests passing, but nothing
> unusual.
> Sigs and hashes are OK.
>
> We have 30 issues fixed for 1.6.1. All but those resolved in the last
> 24 hours or so should be fixed for 1.6.0, right? I can touch that up.
>
>
>
>
>
> On Tue, Dec 22, 2015 at 8:10 PM, Michael Armbrust
> <mi...@databricks.com> wrote:
> > [...]


--
Iulian Dragos

------
Reactive Apps on the JVM
www.typesafe.com

Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

Posted by Sean Owen <so...@cloudera.com>.
Docker integration tests still fail for Mark and me, and should
probably be disabled:
https://issues.apache.org/jira/browse/SPARK-12426

... but if anyone else successfully runs these (and I assume Jenkins
does) then not a blocker.

I'm having intermittent trouble with other tests passing, but nothing unusual.
Sigs and hashes are OK.

We have 30 issues fixed for 1.6.1. All but those resolved in the last
24 hours or so should be fixed for 1.6.0, right? I can touch that up.





On Tue, Dec 22, 2015 at 8:10 PM, Michael Armbrust
<mi...@databricks.com> wrote:
> [...]


Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

Posted by Krishna Sankar <ks...@gmail.com>.
+1 (non-binding, of course)

1. Compiled on OS X 10.10 (Yosemite) OK. Total time: 29:25 min
     mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
2. Tested pyspark, mllib (IPython 4.0)
2.0 Spark version is 1.6.0
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
       Center And Scale OK
2.5. RDD operations OK
      State of the Union Texts - MapReduce, Filter, sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
       Model evaluation/optimization (rank, numIter, lambda) with itertools
OK
3. Scala - MLlib
3.1. statistics (min,max,mean,Pearson,Spearman) OK
3.2. LinearRegressionWithSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
3.6. saveAsParquetFile OK
3.7. Read and verify the 3.6 save (above) - sqlContext.parquetFile,
registerTempTable, sql OK
3.8. result = sqlContext.sql("SELECT
OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
4.0. Spark SQL from Python OK
4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
5.0. Packages
5.1. com.databricks.spark.csv - read/write OK (--packages
com.databricks:spark-csv_2.10:1.3.0)
6.0. DataFrames
6.1. cast,dtypes OK
6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
6.3. All joins,sql,set operations,udf OK

Cheers & Holiday "Spark-ling" Wishes !
<k/>

On Tue, Dec 22, 2015 at 12:10 PM, Michael Armbrust <mi...@databricks.com>
wrote:

> [...]

Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

Posted by Ricardo Almeida <ri...@actnowib.com>.
+1 (non-binding)
Tested Python API, Spark Core, Spark SQL, and Spark MLlib on a standalone
cluster





Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

Posted by Michael Armbrust <mi...@databricks.com>.
Thanks for testing and voting, everyone. The vote passes unanimously with
21 +1 votes and no -1 votes. I'll start finalizing the release now.

+1
Michael Armbrust*
Reynold Xin*
Andrew Or*
Benjamin Fradet
Mark Hamstra*
Jeff Zhang
Josh Rosen*
Aaron Davidson*
Denny Lee
Yin Huai
Jean-Baptiste Onofré
Kousuke Saruta
Zsolt Tóth
Iulian Dragoș
Allen Zhang
Vinay Shukla
Vaquar Khan
Bhupendra Mishra
Krishna Sankar
Ricardo Almeida
Cheng Lian

-1:
none

On Sat, Dec 26, 2015 at 6:11 AM, Cheng Lian <li...@gmail.com> wrote:

> +1
>
> On 12/23/15 12:39 PM, Yin Huai wrote:
>
> +1
>
> On Tue, Dec 22, 2015 at 8:10 PM, Denny Lee <de...@gmail.com> wrote:
>
>> +1
>>
>> On Tue, Dec 22, 2015 at 7:05 PM Aaron Davidson <il...@gmail.com>
>> wrote:
>>
>>> +1
>>>
>>> On Tue, Dec 22, 2015 at 7:01 PM, Josh Rosen <jo...@databricks.com>
>>> wrote:
>>>
>>>> +1
>>>>
>>>> On Tue, Dec 22, 2015 at 7:00 PM, Jeff Zhang <zj...@gmail.com>
>>>> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> On Wed, Dec 23, 2015 at 7:36 AM, Mark Hamstra
>>>>> <ma...@clearstorydata.com> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> On Tue, Dec 22, 2015 at 12:10 PM, Michael Armbrust
>>>>>> <mi...@databricks.com> wrote:
>>>>>>
>>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>>> version 1.6.0!
>>>>>>>
>>>>>>> The vote is open until Friday, December 25, 2015 at 18:00 UTC and
>>>>>>> passes if a majority of at least 3 +1 PMC votes are cast.
>>>>>>>
>>>>>>> [ ] +1 Release this package as Apache Spark 1.6.0
>>>>>>> [ ] -1 Do not release this package because ...
>>>>>>>
>>>>>>> To learn more about Apache Spark, please see
>>>>>>> <http://spark.apache.org/>http://spark.apache.org/
>>>>>>>
>>>>>>> The tag to be voted on is *v1.6.0-rc4
>>>>>>> (4062cda3087ae42c6c3cb24508fc1d3a931accdf)
>>>>>>> <https://github.com/apache/spark/tree/v1.6.0-rc4>*
>>>>>>>
>>>>>>> The release files, including signatures, digests, etc. can be found
>>>>>>> at:
>>>>>>>
>>>>>>> <http://people.apache.org/%7Epwendell/spark-releases/spark-1.6.0-rc4-bin/>
>>>>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-bin/
>>>>>>>
>>>>>>> Release artifacts are signed with the following key:
>>>>>>> <https://people.apache.org/keys/committer/pwendell.asc>
>>>>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>>>>
>>>>>>> The staging repository for this release can be found at:
>>>>>>>
>>>>>>> <https://repository.apache.org/content/repositories/orgapachespark-1176/>
>>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1176/
>>>>>>>
>>>>>>> The test repository (versioned as v1.6.0-rc4) for this release can
>>>>>>> be found at:
>>>>>>>
>>>>>>> <https://repository.apache.org/content/repositories/orgapachespark-1175/>
>>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1175/
>>>>>>>
>>>>>>> The documentation corresponding to this release can be found at:
>>>>>>>
>>>>>>> <http://people.apache.org/%7Epwendell/spark-releases/spark-1.6.0-rc4-docs/>
>>>>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-docs/
>>>>>>>
>>>>>>> =======================================
>>>>>>> == How can I help test this release? ==
>>>>>>> =======================================
>>>>>>> If you are a Spark user, you can help us test this release by taking
>>>>>>> an existing Spark workload and running on this release candidate, then
>>>>>>> reporting any regressions.
>>>>>>>
>>>>>>> ================================================
>>>>>>> == What justifies a -1 vote for this release? ==
>>>>>>> ================================================
>>>>>>> This vote is happening towards the end of the 1.6 QA period, so -1
>>>>>>> votes should only occur for significant regressions from 1.5. Bugs already
>>>>>>> present in 1.5, minor regressions, or bugs related to new features will not
>>>>>>> block this release.
>>>>>>>
>>>>>>> ===============================================================
>>>>>>> == What should happen to JIRA tickets still targeting 1.6.0? ==
>>>>>>> ===============================================================
>>>>>>> 1. It is OK for documentation patches to target 1.6.0 and still go
>>>>>>> into branch-1.6, since documentations will be published separately from the
>>>>>>> release.
>>>>>>> 2. New features for non-alpha-modules should target 1.7+.
>>>>>>> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the
>>>>>>> target version.
>>>>>>>
>>>>>>>
>>>>>>> ==================================================
>>>>>>> == Major changes to help you focus your testing ==
>>>>>>> ==================================================
>>>>>>>
>>>>>>> Notable changes since 1.6 RC3
>>>>>>>
>>>>>>>   - SPARK-12404 - Fix serialization error for Datasets with
>>>>>>> Timestamps/Arrays/Decimal
>>>>>>>   - SPARK-12218 - Fix incorrect pushdown of filters to parquet
>>>>>>>   - SPARK-12395 - Fix join columns of outer join for DataFrame using
>>>>>>>   - SPARK-12413 - Fix mesos HA
>>>>>>>
>>>>>>> Notable changes since 1.6 RC2
>>>>>>> - SPARK_VERSION has been set correctly
>>>>>>> - SPARK-12199 ML Docs are publishing correctly
>>>>>>> - SPARK-12345 Mesos cluster mode has been fixed
>>>>>>>
>>>>>>> Notable changes since 1.6 RC1
>>>>>>> Spark Streaming
>>>>>>>
>>>>>>>    - SPARK-2629  <https://issues.apache.org/jira/browse/SPARK-2629>
>>>>>>>    trackStateByKey has been renamed to mapWithState
>>>>>>>
>>>>>>> Spark SQL
>>>>>>>
>>>>>>>    - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
>>>>>>>     SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix
>>>>>>>    bugs in eviction of storage memory by execution.
>>>>>>>    - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258> correct
>>>>>>>    passing null into ScalaUDF
>>>>>>>
>>>>>>> Notable Features Since 1.5 Spark SQL
>>>>>>>
>>>>>>>    - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787>
>>>>>>>     Parquet Performance - Improve Parquet scan performance when
>>>>>>>    using flat schemas.
>>>>>>>    - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
>>>>>>>     Session Management - Isolated devault database (i.e USE mydb)
>>>>>>>    even on shared clusters.
>>>>>>>    - SPARK-9999  <https://issues.apache.org/jira/browse/SPARK-9999> Dataset
>>>>>>>    API - A type-safe API (similar to RDDs) that performs many
>>>>>>>    operations on serialized binary data and code generation (i.e. Project
>>>>>>>    Tungsten).
>>>>>>>    - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000>
>>>>>>>     Unified Memory Management - Shared memory for execution and
>>>>>>>    caching instead of exclusive division of the regions.
>>>>>>>    - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197>
>>>>>>>     SQL Queries on Files - Concise syntax for running SQL queries
>>>>>>>    over files of any supported format without registering a table.
>>>>>>>    - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745>
>>>>>>>     Reading non-standard JSON files - Added options to read
>>>>>>>    non-standard JSON files (e.g. single-quotes, unquoted attributes)
>>>>>>>    - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412>
>>>>>>>     Per-operator Metrics for SQL Execution - Display statistics on
>>>>>>>    a peroperator basis for memory usage and spilled data size.
>>>>>>>    - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329>
>>>>>>>     Star (*) expansion for StructTypes - Makes it easier to nest
>>>>>>>    and unest arbitrary numbers of columns
>>>>>>>    - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>
>>>>>>>    , SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149>
>>>>>>>     In-memory Columnar Cache Performance - Significant (up to 14x)
>>>>>>>    speed up when caching data that contains complex types in DataFrames or SQL.
>>>>>>>    - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111>
>>>>>>>     Fast null-safe joins - Joins using null-safe equality (<=>)
>>>>>>>    will now execute using SortMergeJoin instead of computing a cartisian
>>>>>>>    product.
>>>>>>>    - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389>
>>>>>>>     SQL Execution Using Off-Heap Memory - Support for configuring
>>>>>>>    query execution to occur using off-heap memory to avoid GC overhead
>>>>>>>    - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978>
>>>>>>>     Datasource API Avoid Double Filter - When implemeting a
>>>>>>>    datasource with filter pushdown, developers can now tell Spark SQL to avoid
>>>>>>>    double evaluating a pushed-down filter.
>>>>>>>    - SPARK-4849  <https://issues.apache.org/jira/browse/SPARK-4849> Advanced
>>>>>>>    Layout of Cached Data - storing partitioning and ordering
>>>>>>>    schemes in In-memory table scan, and adding distributeBy and localSort to
>>>>>>>    DF API
>>>>>>>    - SPARK-9858  <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive
>>>>>>>    query execution - Intial support for automatically selecting the
>>>>>>>    number of reducers for joins and aggregations.
>>>>>>>    - SPARK-9241  <https://issues.apache.org/jira/browse/SPARK-9241> Improved
>>>>>>>    query planner for queries having distinct aggregations - Query
>>>>>>>    plans of distinct aggregations are more robust when distinct columns have
>>>>>>>    high cardinality.
>>>>>>>
>>>>>>> Spark Streaming
>>>>>>>
>>>>>>>    - API Updates
>>>>>>>       - SPARK-2629
>>>>>>>       <https://issues.apache.org/jira/browse/SPARK-2629> New
>>>>>>>       improved state management - mapWithState - a DStream
>>>>>>>       transformation for stateful stream processing, supercedes
>>>>>>>       updateStateByKey in functionality and performance.
>>>>>>>       - SPARK-11198
>>>>>>>       <https://issues.apache.org/jira/browse/SPARK-11198> Kinesis
>>>>>>>       record deaggregation - Kinesis streams have been upgraded to
>>>>>>>       use KCL 1.4.0 and supports transparent deaggregation of KPL-aggregated
>>>>>>>       records.
>>>>>>>       - SPARK-10891
>>>>>>>       <https://issues.apache.org/jira/browse/SPARK-10891> Kinesis
>>>>>>>       message handler function - Allows arbitraray function to be
>>>>>>>       applied to a Kinesis record in the Kinesis receiver before to customize
>>>>>>>       what data is to be stored in memory.
>>>>>>>       - SPARK-6328
>>>>>>>       <https://issues.apache.org/jira/browse/SPARK-6328> Python
>>>>>>>       Streamng Listener API - Get streaming statistics (scheduling
>>>>>>>       delays, batch processing times, etc.) in streaming.
>>>>>>>
>>>>>>>
>>>>>>>    - UI Improvements
>>>>>>>       - Made failures visible in the streaming tab, in the
>>>>>>>       timelines, batch list, and batch details page.
>>>>>>>       - Made output operations visible in the streaming tab as
>>>>>>>       progress bars.
>>>>>>>
>>>>>>> MLlib New algorithms/models
>>>>>>>
>>>>>>>    - SPARK-8518  <https://issues.apache.org/jira/browse/SPARK-8518> Survival
>>>>>>>    analysis - Log-linear model for survival analysis
>>>>>>>    - SPARK-9834  <https://issues.apache.org/jira/browse/SPARK-9834> Normal
>>>>>>>    equation for least squares - Normal equation solver, providing
>>>>>>>    R-like model summary statistics
>>>>>>>    - SPARK-3147  <https://issues.apache.org/jira/browse/SPARK-3147> Online
>>>>>>>    hypothesis testing - A/B testing in the Spark Streaming framework
>>>>>>>    - SPARK-9930  <https://issues.apache.org/jira/browse/SPARK-9930> New
>>>>>>>    feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
>>>>>>>    transformer
>>>>>>>    - SPARK-6517  <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting
>>>>>>>    K-Means clustering - Fast top-down clustering variant of K-Means
>>>>>>>
>>>>>>> API improvements
>>>>>>>
>>>>>>>    - ML Pipelines
>>>>>>>       - SPARK-6725
>>>>>>>       <https://issues.apache.org/jira/browse/SPARK-6725> Pipeline
>>>>>>>       persistence - Save/load for ML Pipelines, with partial
>>>>>>>       coverage of spark.mlalgorithms
>>>>>>>       - SPARK-5565
>>>>>>>       <https://issues.apache.org/jira/browse/SPARK-5565> LDA in ML
>>>>>>>       Pipelines - API for Latent Dirichlet Allocation in ML
>>>>>>>       Pipelines
>>>>>>>    - R API
>>>>>>>       - SPARK-9836
>>>>>>>       <https://issues.apache.org/jira/browse/SPARK-9836> R-like
>>>>>>>       statistics for GLMs - (Partial) R-like stats for ordinary
>>>>>>>       least squares via summary(model)
>>>>>>>       - SPARK-9681
>>>>>>>       <https://issues.apache.org/jira/browse/SPARK-9681> Feature
>>>>>>>       interactions in R formula - Interaction operator ":" in R
>>>>>>>       formula
>>>>>>>    - Python API - Many improvements to Python API to approach
>>>>>>>    feature parity
>>>>>>>
>>>>>>> Misc improvements
>>>>>>>
>>>>>>>    - SPARK-7685  <https://issues.apache.org/jira/browse/SPARK-7685>
>>>>>>>    , SPARK-9642  <https://issues.apache.org/jira/browse/SPARK-9642> Instance
>>>>>>>    weights for GLMs - Logistic and Linear Regression can take
>>>>>>>    instance weights
>>>>>>>    - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>
>>>>>>>    , SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385>
>>>>>>>     Univariate and bivariate statistics in DataFrames - Variance,
>>>>>>>    stddev, correlations, etc.
>>>>>>>    - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117>
>>>>>>>     LIBSVM data source - LIBSVM as a SQL data source Documentation
>>>>>>>    improvements
>>>>>>>    - SPARK-7751  <https://issues.apache.org/jira/browse/SPARK-7751> @since
>>>>>>>    versions - Documentation includes initial version when classes
>>>>>>>    and methods were added
>>>>>>>    - SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337>
>>>>>>>     Testable example code - Automated testing for code in user
>>>>>>>    guide examples
>>>>>>>
>>>>>>> Deprecations
>>>>>>>
>>>>>>>    - In spark.mllib.clustering.KMeans, the "runs" parameter has
>>>>>>>    been deprecated.
>>>>>>>    - In spark.ml.classification.LogisticRegressionModel and
>>>>>>>    spark.ml.regression.LinearRegressionModel, the "weights" field has been
>>>>>>>    deprecated, in favor of the new name "coefficients." This helps
>>>>>>>    disambiguate from instance (row) weights given to algorithms.
>>>>>>>
>>>>>>> Changes of behavior
>>>>>>>
>>>>>>>    - spark.mllib.tree.GradientBoostedTrees validationTol has
>>>>>>>    changed semantics in 1.6. Previously, it was a threshold for absolute
>>>>>>>    change in error. Now, it resembles the behavior of GradientDescent
>>>>>>>    convergenceTol: For large errors, it uses relative error (relative to the
>>>>>>>    previous error); for small errors (< 0.01), it uses absolute error.
>>>>>>>    - spark.ml.feature.RegexTokenizer: Previously, it did not
>>>>>>>    convert strings to lowercase before tokenizing. Now, it converts to
>>>>>>>    lowercase by default, with an option not to. This matches the behavior of
>>>>>>>    the simpler Tokenizer transformer.
>>>>>>>    - Spark SQL's partition discovery has been changed to only
>>>>>>>    discover partition directories that are children of the given path. (i.e.
>>>>>>>    if path="/my/data/x=1" then x=1 will no longer be considered a
>>>>>>>    partition but only children of x=1.) This behavior can be
>>>>>>>    overridden by manually specifying the basePath that partitioning
>>>>>>>    discovery should start with (SPARK-11678
>>>>>>>    <https://issues.apache.org/jira/browse/SPARK-11678>).
>>>>>>>    - When casting a value of an integral type to timestamp (e.g.
>>>>>>>    casting a long value to timestamp), the value is treated as being in
>>>>>>>    seconds instead of milliseconds (SPARK-11724
>>>>>>>    <https://issues.apache.org/jira/browse/SPARK-11724>).
>>>>>>>    - With the improved query planner for queries having distinct
>>>>>>>    aggregations (SPARK-9241
>>>>>>>    <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of
>>>>>>>    a query having a single distinct aggregation has been changed to a more
>>>>>>>    robust version. To switch back to the plan generated by Spark 1.5's
>>>>>>>    planner, please set spark.sql.specializeSingleDistinctAggPlanning
>>>>>>>     to true (SPARK-12077
>>>>>>>    <https://issues.apache.org/jira/browse/SPARK-12077>).
>>>>>>>
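Sketches for the behavior changes above, in list order. First, where
validationTol lives; the tolerance value and RDD names are illustrative:

    import org.apache.spark.mllib.tree.GradientBoostedTrees
    import org.apache.spark.mllib.tree.configuration.BoostingStrategy

    val boostingStrategy = BoostingStrategy.defaultParams("Regression")
    // 1.6 semantics: relative change for large errors, absolute below 0.01
    boostingStrategy.validationTol = 1e-3
    // then: new GradientBoostedTrees(boostingStrategy)
    //         .runWithValidation(trainingRDD, validationRDD)
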
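Opting out of the new lowercasing default on RegexTokenizer; the column
names are illustrative:

    import org.apache.spark.ml.feature.RegexTokenizer

    val tokenizer = new RegexTokenizer()
      .setInputCol("sentence")
      .setOutputCol("words")
      .setToLowercase(false)   // restores the pre-1.6 behavior
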
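The basePath override for partition discovery (SPARK-11678), assuming an
existing sqlContext and the /my/data/x=1 layout from the item above:

    // Reading only /my/data/x=1 no longer yields an x partition column by
    // itself; pointing basePath at the table root restores it
    val df = sqlContext.read
      .option("basePath", "/my/data")
      .parquet("/my/data/x=1")
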
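The integral-to-timestamp cast change (SPARK-11724); the literal is
illustrative:

    // 1.6 reads the long as seconds since the epoch; 1.5 read the same
    // value as milliseconds
    sqlContext.sql("SELECT CAST(1450000000 AS TIMESTAMP)").show()
    // -> a timestamp in mid-December 2015, in the session time zone
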
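And the switch back to the 1.5 planner for single distinct aggregations
(SPARK-12077):

    sqlContext.setConf("spark.sql.specializeSingleDistinctAggPlanning",
      "true")
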
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best Regards
>>>>>
>>>>> Jeff Zhang
>>>>>
>>>>
>>>>
>>>
>
>

Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

Posted by Cheng Lian <li...@gmail.com>.
+1

On 12/23/15 12:39 PM, Yin Huai wrote:
> +1
>
> On Tue, Dec 22, 2015 at 8:10 PM, Denny Lee <denny.g.lee@gmail.com 
> <ma...@gmail.com>> wrote:
>
>     +1
>
>     On Tue, Dec 22, 2015 at 7:05 PM Aaron Davidson <ilikerps@gmail.com
>     <ma...@gmail.com>> wrote:
>
>         +1
>
>         On Tue, Dec 22, 2015 at 7:01 PM, Josh Rosen
>         <joshrosen@databricks.com <ma...@databricks.com>>
>         wrote:
>
>             +1
>
>             On Tue, Dec 22, 2015 at 7:00 PM, Jeff Zhang
>             <zjffdu@gmail.com <ma...@gmail.com>> wrote:
>
>                 +1
>
>                 On Wed, Dec 23, 2015 at 7:36 AM, Mark Hamstra
>                 <mark@clearstorydata.com
>                 <ma...@clearstorydata.com>> wrote:
>
>                     +1
>
>                     On Tue, Dec 22, 2015 at 12:10 PM, Michael Armbrust
>                     <michael@databricks.com
>                     <ma...@databricks.com>> wrote:
>
>                 -- 
>                 Best Regards
>
>                 Jeff Zhang
>
>
>
>


Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

Posted by Yin Huai <yh...@databricks.com>.
+1

On Tue, Dec 22, 2015 at 8:10 PM, Denny Lee <de...@gmail.com> wrote:

> +1
>
> On Tue, Dec 22, 2015 at 7:05 PM Aaron Davidson <il...@gmail.com> wrote:
>
>> +1
>>
>> On Tue, Dec 22, 2015 at 7:01 PM, Josh Rosen <jo...@databricks.com>
>> wrote:
>>
>>> +1
>>>
>>> On Tue, Dec 22, 2015 at 7:00 PM, Jeff Zhang <zj...@gmail.com> wrote:
>>>
>>>> +1
>>>>
>>>> On Wed, Dec 23, 2015 at 7:36 AM, Mark Hamstra <ma...@clearstorydata.com>
>>>> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> On Tue, Dec 22, 2015 at 12:10 PM, Michael Armbrust <
>>>>> michael@databricks.com> wrote:
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards
>>>>
>>>> Jeff Zhang
>>>>
>>>
>>>
>>

Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

Posted by Denny Lee <de...@gmail.com>.
+1

On Tue, Dec 22, 2015 at 7:05 PM Aaron Davidson <il...@gmail.com> wrote:

> +1
>
> On Tue, Dec 22, 2015 at 7:01 PM, Josh Rosen <jo...@databricks.com>
> wrote:
>
>> +1
>>
>> On Tue, Dec 22, 2015 at 7:00 PM, Jeff Zhang <zj...@gmail.com> wrote:
>>
>>> +1
>>>
>>> On Wed, Dec 23, 2015 at 7:36 AM, Mark Hamstra <ma...@clearstorydata.com>
>>> wrote:
>>>
>>>> +1
>>>>
>>>> On Tue, Dec 22, 2015 at 12:10 PM, Michael Armbrust <
>>>> michael@databricks.com> wrote:
>>>>
>>>>
>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>>
>

Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

Posted by Aaron Davidson <il...@gmail.com>.
+1

On Tue, Dec 22, 2015 at 7:01 PM, Josh Rosen <jo...@databricks.com>
wrote:

> +1
>
> On Tue, Dec 22, 2015 at 7:00 PM, Jeff Zhang <zj...@gmail.com> wrote:
>
>> +1
>>
>> On Wed, Dec 23, 2015 at 7:36 AM, Mark Hamstra <ma...@clearstorydata.com>
>> wrote:
>>
>>> +1
>>>
>>> On Tue, Dec 22, 2015 at 12:10 PM, Michael Armbrust <
>>> michael@databricks.com> wrote:
>>>
>>>> Please vote on releasing the following candidate as Apache Spark
>>>> version 1.6.0!
>>>>
>>>> The vote is open until Friday, December 25, 2015 at 18:00 UTC and
>>>> passes if a majority of at least 3 +1 PMC votes are cast.
>>>>
>>>> [ ] +1 Release this package as Apache Spark 1.6.0
>>>> [ ] -1 Do not release this package because ...
>>>>
>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>
>>>> The tag to be voted on is *v1.6.0-rc4
>>>> (4062cda3087ae42c6c3cb24508fc1d3a931accdf)
>>>> <https://github.com/apache/spark/tree/v1.6.0-rc4>*
>>>>
>>>> The release files, including signatures, digests, etc. can be found at:
>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-bin/
>>>>
>>>> Release artifacts are signed with the following key:
>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>
>>>> The staging repository for this release can be found at:
>>>> https://repository.apache.org/content/repositories/orgapachespark-1176/
>>>>
>>>> The test repository (versioned as v1.6.0-rc4) for this release can be
>>>> found at:
>>>> https://repository.apache.org/content/repositories/orgapachespark-1175/
>>>>
>>>> The documentation corresponding to this release can be found at:
>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-docs/
>>>>
>>>> =======================================
>>>> == How can I help test this release? ==
>>>> =======================================
>>>> If you are a Spark user, you can help us test this release by taking an
>>>> existing Spark workload and running on this release candidate, then
>>>> reporting any regressions.
>>>>
>>>> ================================================
>>>> == What justifies a -1 vote for this release? ==
>>>> ================================================
>>>> This vote is happening towards the end of the 1.6 QA period, so -1
>>>> votes should only occur for significant regressions from 1.5. Bugs already
>>>> present in 1.5, minor regressions, or bugs related to new features will not
>>>> block this release.
>>>>
>>>> ===============================================================
>>>> == What should happen to JIRA tickets still targeting 1.6.0? ==
>>>> ===============================================================
>>>> 1. It is OK for documentation patches to target 1.6.0 and still go into
>>>> branch-1.6, since documentations will be published separately from the
>>>> release.
>>>> 2. New features for non-alpha-modules should target 1.7+.
>>>> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the
>>>> target version.
>>>>
>>>>
>>>> ==================================================
>>>> == Major changes to help you focus your testing ==
>>>> ==================================================
>>>>
>>>> Notable changes since 1.6 RC3
>>>>
>>>>   - SPARK-12404 - Fix serialization error for Datasets with
>>>> Timestamps/Arrays/Decimal
>>>>   - SPARK-12218 - Fix incorrect pushdown of filters to parquet
>>>>   - SPARK-12395 - Fix join columns of outer join for DataFrame using
>>>>   - SPARK-12413 - Fix mesos HA
>>>>
>>>> Notable changes since 1.6 RC2
>>>> - SPARK_VERSION has been set correctly
>>>> - SPARK-12199 ML Docs are publishing correctly
>>>> - SPARK-12345 Mesos cluster mode has been fixed
>>>>
>>>> Notable changes since 1.6 RC1
>>>> Spark Streaming
>>>>
>>>>    - SPARK-2629  <https://issues.apache.org/jira/browse/SPARK-2629>
>>>>    trackStateByKey has been renamed to mapWithState
>>>>
>>>> Spark SQL
>>>>
>>>>    - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
>>>>    SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix
>>>>    bugs in eviction of storage memory by execution.
>>>>    - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258> correct
>>>>    passing null into ScalaUDF
>>>>
>>>> Notable Features Since 1.5
>>>>
>>>> Spark SQL
>>>>
>>>>    - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> Parquet
>>>>    Performance - Improve Parquet scan performance when using flat
>>>>    schemas.
>>>>    - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
>>>>    Session Management - Isolated default database (i.e. USE mydb) even
>>>>    on shared clusters.
>>>>    - SPARK-9999  <https://issues.apache.org/jira/browse/SPARK-9999> Dataset
>>>>    API - A type-safe API (similar to RDDs) that performs many
>>>>    operations directly on serialized binary data and uses code
>>>>    generation (i.e. Project Tungsten).
>>>>    - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000> Unified
>>>>    Memory Management - Shared memory for execution and caching instead
>>>>    of exclusive division of the regions.
>>>>    - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
>>>>    Queries on Files - Concise syntax for running SQL queries over
>>>>    files of any supported format without registering a table (see the
>>>>    sketch after this list).
>>>>    - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745> Reading
>>>>    non-standard JSON files - Added options to read non-standard JSON
>>>>    files (e.g. single-quotes, unquoted attributes)
>>>>    - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412> Per-operator
>>>>    Metrics for SQL Execution - Display statistics on a per-operator
>>>>    basis for memory usage and spilled data size.
>>>>    - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
>>>>    (*) expansion for StructTypes - Makes it easier to nest and unnest
>>>>    arbitrary numbers of columns
>>>>    - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
>>>>    SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149> In-memory
>>>>    Columnar Cache Performance - Significant (up to 14x) speed up when
>>>>    caching data that contains complex types in DataFrames or SQL.
>>>>    - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
>>>>    null-safe joins - Joins using null-safe equality (<=>) will now
>>>>    execute using SortMergeJoin instead of computing a Cartesian product.
>>>>    - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
>>>>    Execution Using Off-Heap Memory - Support for configuring query
>>>>    execution to occur using off-heap memory to avoid GC overhead
>>>>    - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978> Datasource
>>>>    API Avoid Double Filter - When implementing a datasource with filter
>>>>    pushdown, developers can now tell Spark SQL to avoid double evaluating a
>>>>    pushed-down filter.
>>>>    - SPARK-4849  <https://issues.apache.org/jira/browse/SPARK-4849> Advanced
>>>>    Layout of Cached Data - Stores partitioning and ordering schemes in
>>>>    the in-memory table scan, and adds distributeBy and localSort to the
>>>>    DataFrame API
>>>>    - SPARK-9858  <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive
>>>>    query execution - Initial support for automatically selecting the
>>>>    number of reducers for joins and aggregations.
>>>>    - SPARK-9241  <https://issues.apache.org/jira/browse/SPARK-9241> Improved
>>>>    query planner for queries having distinct aggregations - Query
>>>>    plans of distinct aggregations are more robust when distinct columns have
>>>>    high cardinality.
>>>>
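>>>> For a quick smoke test of the SQL Queries on Files syntax (SPARK-11197)
>>>> and the non-standard JSON options (SPARK-11745) above, here is a minimal
>>>> sketch in Scala, assuming a 1.6-era spark-shell session (where sqlContext
>>>> is predefined) and a hypothetical local file /tmp/people.json:
>>>>
>>>>     // Query the JSON file directly by format and path, without
>>>>     // registering a table first.
>>>>     val direct = sqlContext.sql("SELECT name FROM json.`/tmp/people.json`")
>>>>
>>>>     // Read JSON that uses single quotes and unquoted field names.
>>>>     val relaxed = sqlContext.read
>>>>       .option("allowSingleQuotes", "true")
>>>>       .option("allowUnquotedFieldNames", "true")
>>>>       .json("/tmp/people.json")
>>>>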
>>>> Spark Streaming
>>>>
>>>>    - API Updates
>>>>       - SPARK-2629  <https://issues.apache.org/jira/browse/SPARK-2629> New
>>>>       improved state management - mapWithState - a DStream
>>>>       transformation for stateful stream processing that supersedes
>>>>       updateStateByKey in functionality and performance (sketched
>>>>       after this list).
>>>>       - SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198>
>>>>        Kinesis record deaggregation - Kinesis streams have been
>>>>       upgraded to use KCL 1.4.0 and support transparent deaggregation of
>>>>       KPL-aggregated records.
>>>>       - SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891>
>>>>        Kinesis message handler function - Allows an arbitrary function
>>>>       to be applied to a Kinesis record in the Kinesis receiver to
>>>>       customize what data is stored in memory.
>>>>       - SPARK-6328  <https://issues.apache.org/jira/browse/SPARK-6328> Python
>>>>       Streaming Listener API - Get streaming statistics (scheduling
>>>>       delays, batch processing times, etc.) in streaming.
>>>>
>>>>
>>>>    - UI Improvements
>>>>       - Made failures visible in the streaming tab, in the timelines,
>>>>       batch list, and batch details page.
>>>>       - Made output operations visible in the streaming tab as
>>>>       progress bars.
>>>>
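>>>> As a sketch of the mapWithState API (SPARK-2629) referenced above, here
>>>> is the usual running-word-count pattern in Scala; a rough sketch only,
>>>> assuming a DStream[(String, Int)] named wordDstream has already been
>>>> built from a StreamingContext:
>>>>
>>>>     import org.apache.spark.streaming.{State, StateSpec}
>>>>
>>>>     // Fold each new count into the per-key state and emit the running total.
>>>>     val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
>>>>       val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
>>>>       state.update(sum)
>>>>       (word, sum)
>>>>     }
>>>>
>>>>     val runningCounts = wordDstream.mapWithState(StateSpec.function(mappingFunc))
>>>>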
>>>> MLlib
>>>>
>>>> New algorithms/models
>>>>
>>>>    - SPARK-8518  <https://issues.apache.org/jira/browse/SPARK-8518> Survival
>>>>    analysis - Log-linear model for survival analysis
>>>>    - SPARK-9834  <https://issues.apache.org/jira/browse/SPARK-9834> Normal
>>>>    equation for least squares - Normal equation solver, providing
>>>>    R-like model summary statistics
>>>>    - SPARK-3147  <https://issues.apache.org/jira/browse/SPARK-3147> Online
>>>>    hypothesis testing - A/B testing in the Spark Streaming framework
>>>>    - SPARK-9930  <https://issues.apache.org/jira/browse/SPARK-9930> New
>>>>    feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
>>>>    transformer
>>>>    - SPARK-6517  <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting
>>>>    K-Means clustering - Fast top-down clustering variant of K-Means
>>>>    (see the sketch after this list)
>>>>
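>>>> To try the bisecting K-Means implementation (SPARK-6517) above, a
>>>> minimal sketch, assuming an RDD[Vector] named data has already been
>>>> assembled elsewhere (e.g. via Vectors.dense):
>>>>
>>>>     import org.apache.spark.mllib.clustering.BisectingKMeans
>>>>
>>>>     // Divisive (top-down) clustering into four leaf clusters.
>>>>     val model = new BisectingKMeans().setK(4).run(data)
>>>>     model.clusterCenters.foreach(println)
>>>>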
>>>> API improvements
>>>>
>>>>    - ML Pipelines
>>>>       - SPARK-6725  <https://issues.apache.org/jira/browse/SPARK-6725> Pipeline
>>>>       persistence - Save/load for ML Pipelines, with partial coverage
>>>>       of spark.ml algorithms (see the sketch after this list)
>>>>       - SPARK-5565  <https://issues.apache.org/jira/browse/SPARK-5565> LDA
>>>>       in ML Pipelines - API for Latent Dirichlet Allocation in ML
>>>>       Pipelines
>>>>    - R API
>>>>       - SPARK-9836  <https://issues.apache.org/jira/browse/SPARK-9836> R-like
>>>>       statistics for GLMs - (Partial) R-like stats for ordinary least
>>>>       squares via summary(model)
>>>>       - SPARK-9681  <https://issues.apache.org/jira/browse/SPARK-9681> Feature
>>>>       interactions in R formula - Interaction operator ":" in R formula
>>>>    - Python API - Many improvements to Python API to approach feature
>>>>    parity
>>>>
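>>>> For the pipeline persistence item (SPARK-6725) above, a minimal
>>>> save/load round trip; a sketch only, assuming a fitted PipelineModel
>>>> named model whose stages are all covered by 1.6 persistence, and a
>>>> hypothetical output path:
>>>>
>>>>     import org.apache.spark.ml.PipelineModel
>>>>
>>>>     // Persist the fitted pipeline and load it back (path is hypothetical).
>>>>     model.save("/tmp/spark-pipeline-model")
>>>>     val restored = PipelineModel.load("/tmp/spark-pipeline-model")
>>>>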
>>>> Misc improvements
>>>>
>>>>    - SPARK-7685  <https://issues.apache.org/jira/browse/SPARK-7685>,
>>>>    SPARK-9642  <https://issues.apache.org/jira/browse/SPARK-9642> Instance
>>>>    weights for GLMs - Logistic and Linear Regression can take instance
>>>>    weights
>>>>    - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
>>>>    SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate
>>>>    and bivariate statistics in DataFrames - Variance, stddev,
>>>>    correlations, etc.
>>>>    - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM
>>>>    data source - LIBSVM as a SQL data source (sketched below)
>>>>
>>>> Documentation improvements
>>>>
>>>>    - SPARK-7751  <https://issues.apache.org/jira/browse/SPARK-7751> @since
>>>>    versions - Documentation includes initial version when classes and
>>>>    methods were added
>>>>    - SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable
>>>>    example code - Automated testing for code in user guide examples
>>>>
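>>>> As a sketch of the LIBSVM data source (SPARK-10117) above: loading a
>>>> LIBSVM file yields a DataFrame with label and features columns.
>>>> Assuming a spark-shell session and the sample file shipped in the Spark
>>>> source tree:
>>>>
>>>>     // Load LIBSVM-formatted data as a DataFrame.
>>>>     val training = sqlContext.read
>>>>       .format("libsvm")
>>>>       .load("data/mllib/sample_libsvm_data.txt")
>>>>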
>>>> Deprecations
>>>>
>>>>    - In spark.mllib.clustering.KMeans, the "runs" parameter has been
>>>>    deprecated.
>>>>    - In spark.ml.classification.LogisticRegressionModel and
>>>>    spark.ml.regression.LinearRegressionModel, the "weights" field has been
>>>>    deprecated, in favor of the new name "coefficients." This helps
>>>>    disambiguate from instance (row) weights given to algorithms.
>>>>
>>>> Changes of behavior
>>>>
>>>>    - spark.mllib.tree.GradientBoostedTrees validationTol has changed
>>>>    semantics in 1.6. Previously, it was a threshold for absolute change in
>>>>    error. Now, it resembles the behavior of GradientDescent convergenceTol:
>>>>    For large errors, it uses relative error (relative to the previous error);
>>>>    for small errors (< 0.01), it uses absolute error.
>>>>    - spark.ml.feature.RegexTokenizer: Previously, it did not convert
>>>>    strings to lowercase before tokenizing. Now, it converts to lowercase by
>>>>    default, with an option not to. This matches the behavior of the simpler
>>>>    Tokenizer transformer.
>>>>    - Spark SQL's partition discovery has been changed to only discover
>>>>    partition directories that are children of the given path. (i.e. if
>>>>    path="/my/data/x=1" then x=1 will no longer be considered a
>>>>    partition but only children of x=1.) This behavior can be
>>>>    overridden by manually specifying the basePath that partitioning
>>>>    discovery should start with (SPARK-11678
>>>>    <https://issues.apache.org/jira/browse/SPARK-11678>), as sketched
>>>>    after this list.
>>>>    - When casting a value of an integral type to timestamp (e.g.
>>>>    casting a long value to timestamp), the value is treated as being in
>>>>    seconds instead of milliseconds (SPARK-11724
>>>>    <https://issues.apache.org/jira/browse/SPARK-11724>).
>>>>    - With the improved query planner for queries having distinct
>>>>    aggregations (SPARK-9241
>>>>    <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
>>>>    query having a single distinct aggregation has been changed to a more
>>>>    robust version. To switch back to the plan generated by Spark 1.5's
>>>>    planner, please set spark.sql.specializeSingleDistinctAggPlanning
>>>>     to true (SPARK-12077
>>>>    <https://issues.apache.org/jira/browse/SPARK-12077>).
>>>>
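>>>> Two of the behavior changes above can be sanity-checked directly; a
>>>> minimal sketch, assuming a spark-shell session and hypothetical paths
>>>> under /my/data:
>>>>
>>>>     // Partition discovery: restore x=1 as a partition column by naming
>>>>     // the table root explicitly, even when loading one subdirectory.
>>>>     val partitioned = sqlContext.read
>>>>       .option("basePath", "/my/data")
>>>>       .parquet("/my/data/x=1")
>>>>
>>>>     // Integral-to-timestamp casts now treat the value as seconds.
>>>>     sqlContext.sql("SELECT CAST(1450000000 AS TIMESTAMP)").show()
>>>>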
>>>>
>>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>

Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

Posted by Josh Rosen <jo...@databricks.com>.
+1

On Tue, Dec 22, 2015 at 7:00 PM, Jeff Zhang <zj...@gmail.com> wrote:

> +1
>
> On Wed, Dec 23, 2015 at 7:36 AM, Mark Hamstra <ma...@clearstorydata.com>
> wrote:
>
>> +1
>>

Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

Posted by Jeff Zhang <zj...@gmail.com>.
+1

On Wed, Dec 23, 2015 at 7:36 AM, Mark Hamstra <ma...@clearstorydata.com>
wrote:

> +1
>


-- 
Best Regards

Jeff Zhang

Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

Posted by Mark Hamstra <ma...@clearstorydata.com>.
+1


Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

Posted by Ted Yu <yu...@gmail.com>.
Running the test suite, there was a timeout in the hive-thriftserver module.

This has been fixed by SPARK-11823, so I assume this is a test issue.

lgtm

On Tue, Dec 22, 2015 at 2:28 PM, Benjamin Fradet <be...@gmail.com>
wrote:

> +1
> On 22 Dec 2015 9:54 p.m., "Andrew Or" <an...@databricks.com> wrote:
>
>> +1
>>
>> 2015-12-22 12:43 GMT-08:00 Reynold Xin <rx...@databricks.com>:
>>
>>> +1
>>>
>>>
>>> On Tue, Dec 22, 2015 at 12:29 PM, Michael Armbrust <
>>> michael@databricks.com> wrote:
>>>
>>>> I'll kick the voting off with a +1.
>>>>
>>>> On Tue, Dec 22, 2015 at 12:10 PM, Michael Armbrust <
>>>> michael@databricks.com> wrote:
>>>>
>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>> version 1.6.0!
>>>>>
>>>>> The vote is open until Friday, December 25, 2015 at 18:00 UTC and
>>>>> passes if a majority of at least 3 +1 PMC votes are cast.
>>>>>
>>>>> [ ] +1 Release this package as Apache Spark 1.6.0
>>>>> [ ] -1 Do not release this package because ...
>>>>>
>>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>>
>>>>> The tag to be voted on is *v1.6.0-rc4
>>>>> (4062cda3087ae42c6c3cb24508fc1d3a931accdf)
>>>>> <https://github.com/apache/spark/tree/v1.6.0-rc4>*
>>>>>
>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-bin/
>>>>>
>>>>> Release artifacts are signed with the following key:
>>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>>
>>>>> The staging repository for this release can be found at:
>>>>> https://repository.apache.org/content/repositories/orgapachespark-1176/
>>>>>
>>>>> The test repository (versioned as v1.6.0-rc4) for this release can be
>>>>> found at:
>>>>> https://repository.apache.org/content/repositories/orgapachespark-1175/
>>>>>
>>>>> The documentation corresponding to this release can be found at:
>>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-docs/
>>>>>
>>>>> =======================================
>>>>> == How can I help test this release? ==
>>>>> =======================================
>>>>> If you are a Spark user, you can help us test this release by taking
>>>>> an existing Spark workload and running on this release candidate, then
>>>>> reporting any regressions.
>>>>>
>>>>> ================================================
>>>>> == What justifies a -1 vote for this release? ==
>>>>> ================================================
>>>>> This vote is happening towards the end of the 1.6 QA period, so -1
>>>>> votes should only occur for significant regressions from 1.5. Bugs already
>>>>> present in 1.5, minor regressions, or bugs related to new features will not
>>>>> block this release.
>>>>>
>>>>> ===============================================================
>>>>> == What should happen to JIRA tickets still targeting 1.6.0? ==
>>>>> ===============================================================
>>>>> 1. It is OK for documentation patches to target 1.6.0 and still go
>>>>> into branch-1.6, since documentations will be published separately from the
>>>>> release.
>>>>> 2. New features for non-alpha-modules should target 1.7+.
>>>>> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the
>>>>> target version.
>>>>>
>>>>>
>>>>> ==================================================
>>>>> == Major changes to help you focus your testing ==
>>>>> ==================================================
>>>>>
>>>>> Notable changes since 1.6 RC3
>>>>>
>>>>>   - SPARK-12404 - Fix serialization error for Datasets with
>>>>> Timestamps/Arrays/Decimal
>>>>>   - SPARK-12218 - Fix incorrect pushdown of filters to parquet
>>>>>   - SPARK-12395 - Fix join columns of outer join for DataFrame using
>>>>>   - SPARK-12413 - Fix mesos HA
>>>>>
>>>>> Notable changes since 1.6 RC2
>>>>> - SPARK_VERSION has been set correctly
>>>>> - SPARK-12199 ML Docs are publishing correctly
>>>>> - SPARK-12345 Mesos cluster mode has been fixed
>>>>>
>>>>> Notable changes since 1.6 RC1
>>>>> Spark Streaming
>>>>>
>>>>>    - SPARK-2629  <https://issues.apache.org/jira/browse/SPARK-2629>
>>>>>    trackStateByKey has been renamed to mapWithState
>>>>>
>>>>> Spark SQL
>>>>>
>>>>>    - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
>>>>>    SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix
>>>>>    bugs in eviction of storage memory by execution.
>>>>>    - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258> correct
>>>>>    passing null into ScalaUDF
>>>>>
>>>>> Notable Features Since 1.5Spark SQL
>>>>>
>>>>>    - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> Parquet
>>>>>    Performance - Improve Parquet scan performance when using flat
>>>>>    schemas.
>>>>>    - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
>>>>>    Session Management - Isolated devault database (i.e USE mydb) even
>>>>>    on shared clusters.
>>>>>    - SPARK-9999  <https://issues.apache.org/jira/browse/SPARK-9999> Dataset
>>>>>    API - A type-safe API (similar to RDDs) that performs many
>>>>>    operations on serialized binary data and code generation (i.e. Project
>>>>>    Tungsten).
>>>>>    - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000> Unified
>>>>>    Memory Management - Shared memory for execution and caching
>>>>>    instead of exclusive division of the regions.
>>>>>    - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
>>>>>    Queries on Files - Concise syntax for running SQL queries over
>>>>>    files of any supported format without registering a table (see the
>>>>>    sketch after this list).
>>>>>    - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745> Reading
>>>>>    non-standard JSON files - Added options to read non-standard JSON
>>>>>    files (e.g. single-quotes, unquoted attributes)
>>>>>    - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412> Per-operator
>>>>>    Metrics for SQL Execution - Display statistics on a per-operator
>>>>>    basis for memory usage and spilled data size.
>>>>>    - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
>>>>>    (*) expansion for StructTypes - Makes it easier to nest and unnest
>>>>>    arbitrary numbers of columns
>>>>>    - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
>>>>>    SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149> In-memory
>>>>>    Columnar Cache Performance - Significant (up to 14x) speed up when
>>>>>    caching data that contains complex types in DataFrames or SQL.
>>>>>    - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
>>>>>    null-safe joins - Joins using null-safe equality (<=>) will now
>>>>>    execute using SortMergeJoin instead of computing a cartesian product.
>>>>>    - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
>>>>>    Execution Using Off-Heap Memory - Support for configuring query
>>>>>    execution to occur using off-heap memory to avoid GC overhead
>>>>>    - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978> Datasource
>>>>>    API Avoid Double Filter - When implementing a datasource with
>>>>>    filter pushdown, developers can now tell Spark SQL to avoid double
>>>>>    evaluating a pushed-down filter.
>>>>>    - SPARK-4849  <https://issues.apache.org/jira/browse/SPARK-4849> Advanced
>>>>>    Layout of Cached Data - Store partitioning and ordering schemes in
>>>>>    the in-memory table scan, and add distributeBy and localSort to the
>>>>>    DataFrame API
>>>>>    - SPARK-9858  <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive
>>>>>    query execution - Initial support for automatically selecting the
>>>>>    number of reducers for joins and aggregations.
>>>>>    - SPARK-9241  <https://issues.apache.org/jira/browse/SPARK-9241> Improved
>>>>>    query planner for queries having distinct aggregations - Query
>>>>>    plans of distinct aggregations are more robust when distinct columns have
>>>>>    high cardinality.
>>>>>
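>>>>> As a quick illustration of the Dataset API and the new SQL-on-files
>>>>> syntax above, here is a minimal Scala sketch. It assumes a running
>>>>> SparkContext named sc; the case class, the data, and the Parquet
>>>>> path are hypothetical:
>>>>>
>>>>>   import org.apache.spark.sql.SQLContext
>>>>>
>>>>>   val sqlContext = new SQLContext(sc)
>>>>>   import sqlContext.implicits._
>>>>>
>>>>>   // Dataset API: a typed, compile-time-checked view over
>>>>>   // Tungsten's serialized binary rows
>>>>>   case class Person(name: String, age: Long)
>>>>>   val ds = Seq(Person("Ann", 30L), Person("Bob", 17L)).toDS()
>>>>>   val adults = ds.filter(_.age >= 18)  // lambda checked by the compiler
>>>>>
>>>>>   // SQL directly over a file, without registering a table (SPARK-11197)
>>>>>   val people = sqlContext.sql(
>>>>>     "SELECT * FROM parquet.`/tmp/people.parquet`")
>>>>>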
>>>>> Spark Streaming
>>>>>
>>>>>    - API Updates
>>>>>       - SPARK-2629  <https://issues.apache.org/jira/browse/SPARK-2629>
>>>>>        New improved state management - mapWithState - a DStream
>>>>>       transformation for stateful stream processing; supersedes
>>>>>       updateStateByKey in functionality and performance (see the
>>>>>       sketch after this list).
>>>>>       - SPARK-11198
>>>>>       <https://issues.apache.org/jira/browse/SPARK-11198> Kinesis
>>>>>       record deaggregation - Kinesis streams have been upgraded to
>>>>>       use KCL 1.4.0 and support transparent deaggregation of
>>>>>       KPL-aggregated records.
>>>>>       - SPARK-10891
>>>>>       <https://issues.apache.org/jira/browse/SPARK-10891> Kinesis
>>>>>       message handler function - Allows an arbitrary function to be
>>>>>       applied to each Kinesis record in the Kinesis receiver to
>>>>>       customize what data is stored in memory.
>>>>>       - SPARK-6328  <https://issues.apache.org/jira/browse/SPARK-6328>
>>>>>        Python Streaming Listener API - Get streaming statistics
>>>>>       (scheduling delays, batch processing times, etc.) in streaming.
>>>>>
>>>>>
>>>>>    - UI Improvements
>>>>>       - Made failures visible in the streaming tab, in the timelines,
>>>>>       batch list, and batch details page.
>>>>>       - Made output operations visible in the streaming tab as
>>>>>       progress bars.
>>>>>
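>>>>> For the mapWithState item above, a minimal Scala sketch. It assumes
>>>>> an existing StreamingContext named ssc with checkpointing enabled,
>>>>> plus a hypothetical DStream[(String, Int)] named events; it keeps a
>>>>> running sum per key:
>>>>>
>>>>>   import org.apache.spark.streaming.{State, StateSpec}
>>>>>
>>>>>   // Mapping function: merge each new value into the per-key state
>>>>>   // and emit the updated (key, sum) pair downstream.
>>>>>   val spec = StateSpec.function(
>>>>>     (key: String, value: Option[Int], state: State[Int]) => {
>>>>>       val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
>>>>>       state.update(sum)
>>>>>       (key, sum)
>>>>>     })
>>>>>
>>>>>   val runningCounts = events.mapWithState(spec)
>>>>>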
>>>>> MLlib
>>>>>
>>>>> New algorithms/models
>>>>>
>>>>>    - SPARK-8518  <https://issues.apache.org/jira/browse/SPARK-8518> Survival
>>>>>    analysis - Log-linear model for survival analysis
>>>>>    - SPARK-9834  <https://issues.apache.org/jira/browse/SPARK-9834> Normal
>>>>>    equation for least squares - Normal equation solver, providing
>>>>>    R-like model summary statistics
>>>>>    - SPARK-3147  <https://issues.apache.org/jira/browse/SPARK-3147> Online
>>>>>    hypothesis testing - A/B testing in the Spark Streaming framework
>>>>>    - SPARK-9930  <https://issues.apache.org/jira/browse/SPARK-9930> New
>>>>>    feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
>>>>>    transformer (see the sketch after this list)
>>>>>    - SPARK-6517  <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting
>>>>>    K-Means clustering - Fast top-down clustering variant of K-Means
>>>>>
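>>>>> As an example of the new feature transformers, a minimal
>>>>> QuantileDiscretizer sketch in Scala; the DataFrame df and its
>>>>> numeric column "hour" are hypothetical:
>>>>>
>>>>>   import org.apache.spark.ml.feature.QuantileDiscretizer
>>>>>
>>>>>   val discretizer = new QuantileDiscretizer()
>>>>>     .setInputCol("hour")
>>>>>     .setOutputCol("hourBucket")
>>>>>     .setNumBuckets(3)  // approximate, quantile-based bin edges
>>>>>
>>>>>   // fit() derives the splits from the data; transform() applies them
>>>>>   val bucketed = discretizer.fit(df).transform(df)
>>>>>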
>>>>> API improvements
>>>>>
>>>>>    - ML Pipelines
>>>>>       - SPARK-6725  <https://issues.apache.org/jira/browse/SPARK-6725>
>>>>>        Pipeline persistence - Save/load for ML Pipelines, with
>>>>>       partial coverage of spark.ml algorithms (see the sketch after
>>>>>       this list)
>>>>>       - SPARK-5565  <https://issues.apache.org/jira/browse/SPARK-5565>
>>>>>        LDA in ML Pipelines - API for Latent Dirichlet Allocation in
>>>>>       ML Pipelines
>>>>>    - R API
>>>>>       - SPARK-9836  <https://issues.apache.org/jira/browse/SPARK-9836>
>>>>>        R-like statistics for GLMs - (Partial) R-like stats for
>>>>>       ordinary least squares via summary(model)
>>>>>       - SPARK-9681  <https://issues.apache.org/jira/browse/SPARK-9681>
>>>>>        Feature interactions in R formula - Interaction operator ":"
>>>>>       in R formula
>>>>>    - Python API - Many improvements to Python API to approach feature
>>>>>    parity
>>>>>
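>>>>> A minimal Scala sketch of pipeline persistence; the stages and the
>>>>> path are hypothetical, and since coverage in 1.6 is partial, not
>>>>> every stage supports save/load yet:
>>>>>
>>>>>   import org.apache.spark.ml.Pipeline
>>>>>   import org.apache.spark.ml.classification.LogisticRegression
>>>>>   import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
>>>>>
>>>>>   val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
>>>>>   val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
>>>>>   val lr = new LogisticRegression().setMaxIter(10)
>>>>>   val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
>>>>>
>>>>>   // Persist the pipeline definition and read it back (SPARK-6725)
>>>>>   pipeline.save("/tmp/lr-pipeline")
>>>>>   val restored = Pipeline.load("/tmp/lr-pipeline")
>>>>>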
>>>>> Misc improvements
>>>>>
>>>>>    - SPARK-7685  <https://issues.apache.org/jira/browse/SPARK-7685>,
>>>>>    SPARK-9642  <https://issues.apache.org/jira/browse/SPARK-9642> Instance
>>>>>    weights for GLMs - Logistic and Linear Regression can take
>>>>>    instance weights
>>>>>    - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
>>>>>    SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate
>>>>>    and bivariate statistics in DataFrames - Variance, stddev,
>>>>>    correlations, etc. (a short sketch follows below)
>>>>>    - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM
>>>>>    data source - LIBSVM as a SQL data source
>>>>>
>>>>> Documentation improvements
>>>>>
>>>>>    - SPARK-7751  <https://issues.apache.org/jira/browse/SPARK-7751> @since
>>>>>    versions - Documentation includes initial version when classes and
>>>>>    methods were added
>>>>>    - SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable
>>>>>    example code - Automated testing for code in user guide examples
>>>>>
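>>>>> A minimal Scala sketch of the DataFrame statistics item above; the
>>>>> DataFrame df and its numeric columns "x" and "y" are hypothetical:
>>>>>
>>>>>   import org.apache.spark.sql.functions.{variance, stddev, corr}
>>>>>
>>>>>   // Univariate and bivariate statistics as ordinary aggregates
>>>>>   df.select(variance("x"), stddev("x"), corr("x", "y")).show()
>>>>>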
>>>>> Deprecations
>>>>>
>>>>>    - In spark.mllib.clustering.KMeans, the "runs" parameter has been
>>>>>    deprecated.
>>>>>    - In spark.ml.classification.LogisticRegressionModel and
>>>>>    spark.ml.regression.LinearRegressionModel, the "weights" field has been
>>>>>    deprecated, in favor of the new name "coefficients." This helps
>>>>>    disambiguate from instance (row) weights given to algorithms.
>>>>>
>>>>> Changes of behavior
>>>>>
>>>>>    - spark.mllib.tree.GradientBoostedTrees validationTol has changed
>>>>>    semantics in 1.6. Previously, it was a threshold for absolute change in
>>>>>    error. Now, it resembles the behavior of GradientDescent convergenceTol:
>>>>>    For large errors, it uses relative error (relative to the previous error);
>>>>>    for small errors (< 0.01), it uses absolute error.
>>>>>    - spark.ml.feature.RegexTokenizer: Previously, it did not convert
>>>>>    strings to lowercase before tokenizing. Now, it converts to lowercase by
>>>>>    default, with an option not to. This matches the behavior of the simpler
>>>>>    Tokenizer transformer.
>>>>>    - Spark SQL's partition discovery now only discovers partition
>>>>>    directories that are children of the given path (i.e. if
>>>>>    path="/my/data/x=1", then x=1 is no longer treated as a partition
>>>>>    directory itself; only its children are). This behavior can be
>>>>>    overridden by manually specifying the basePath from which
>>>>>    partition discovery should start (SPARK-11678
>>>>>    <https://issues.apache.org/jira/browse/SPARK-11678>); see the
>>>>>    sketch after this list.
>>>>>    - When casting a value of an integral type to timestamp (e.g.
>>>>>    casting a long value to timestamp), the value is treated as being in
>>>>>    seconds instead of milliseconds (SPARK-11724
>>>>>    <https://issues.apache.org/jira/browse/SPARK-11724>).
>>>>>    - With the improved query planner for queries having distinct
>>>>>    aggregations (SPARK-9241
>>>>>    <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
>>>>>    query having a single distinct aggregation has been changed to a more
>>>>>    robust version. To switch back to the plan generated by Spark 1.5's
>>>>>    planner, please set spark.sql.specializeSingleDistinctAggPlanning
>>>>>     to true (SPARK-12077
>>>>>    <https://issues.apache.org/jira/browse/SPARK-12077>).
>>>>>
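>>>>> To make the partition-discovery and timestamp-cast changes above
>>>>> concrete, a minimal Scala sketch; the paths are hypothetical and a
>>>>> SQLContext named sqlContext is assumed:
>>>>>
>>>>>   // Override child-only partition discovery by pointing basePath
>>>>>   // at the table root (SPARK-11678)
>>>>>   val df = sqlContext.read
>>>>>     .option("basePath", "/my/data")
>>>>>     .parquet("/my/data/x=1")
>>>>>
>>>>>   // Integral-to-timestamp casts are now interpreted as seconds,
>>>>>   // not milliseconds (SPARK-11724)
>>>>>   sqlContext.sql("SELECT CAST(1450000000 AS TIMESTAMP)").show()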
>>>>>
>>>>
>>>
>>

Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

Posted by Benjamin Fradet <be...@gmail.com>.
+1
On 22 Dec 2015 9:54 p.m., "Andrew Or" <an...@databricks.com> wrote:

> +1
>
> 2015-12-22 12:43 GMT-08:00 Reynold Xin <rx...@databricks.com>:
>
>> +1
>>
>>
>> On Tue, Dec 22, 2015 at 12:29 PM, Michael Armbrust <
>> michael@databricks.com> wrote:
>>
>>> I'll kick the voting off with a +1.

Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

Posted by Andrew Or <an...@databricks.com>.
+1

2015-12-22 12:43 GMT-08:00 Reynold Xin <rx...@databricks.com>:

> +1
>
>
> On Tue, Dec 22, 2015 at 12:29 PM, Michael Armbrust <michael@databricks.com
> > wrote:
>
>> I'll kick the voting off with a +1.

Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

Posted by Reynold Xin <rx...@databricks.com>.
+1


On Tue, Dec 22, 2015 at 12:29 PM, Michael Armbrust <mi...@databricks.com>
wrote:

> I'll kick the voting off with a +1.

Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

Posted by Michael Armbrust <mi...@databricks.com>.
I'll kick the voting off with a +1.

On Tue, Dec 22, 2015 at 12:10 PM, Michael Armbrust <mi...@databricks.com>
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.6.0!
>
> The vote is open until Friday, December 25, 2015 at 18:00 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is *v1.6.0-rc4
> (4062cda3087ae42c6c3cb24508fc1d3a931accdf)
> <https://github.com/apache/spark/tree/v1.6.0-rc4>*
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1176/
>
> The test repository (versioned as v1.6.0-rc4) for this release can be
> found at:
> https://repository.apache.org/content/repositories/orgapachespark-1175/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-docs/
>
> =======================================
> == How can I help test this release? ==
> =======================================
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> ================================================
> == What justifies a -1 vote for this release? ==
> ================================================
> This vote is happening towards the end of the 1.6 QA period, so -1 votes
> should only occur for significant regressions from 1.5. Bugs already
> present in 1.5, minor regressions, or bugs related to new features will not
> block this release.
>
> ===============================================================
> == What should happen to JIRA tickets still targeting 1.6.0? ==
> ===============================================================
> 1. It is OK for documentation patches to target 1.6.0 and still go into
> branch-1.6, since documentations will be published separately from the
> release.
> 2. New features for non-alpha-modules should target 1.7+.
> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
> version.
>
>
> ==================================================
> == Major changes to help you focus your testing ==
> ==================================================
>
> Notable changes since 1.6 RC3
>
>   - SPARK-12404 - Fix serialization error for Datasets with
> Timestamps/Arrays/Decimal
>   - SPARK-12218 - Fix incorrect pushdown of filters to parquet
>   - SPARK-12395 - Fix join columns of outer join for DataFrame using
>   - SPARK-12413 - Fix mesos HA
>
> Notable changes since 1.6 RC2
> - SPARK_VERSION has been set correctly
> - SPARK-12199 ML Docs are publishing correctly
> - SPARK-12345 Mesos cluster mode has been fixed
>
> Notable changes since 1.6 RC1
> Spark Streaming
>
>    - SPARK-2629  <https://issues.apache.org/jira/browse/SPARK-2629>
>    trackStateByKey has been renamed to mapWithState
>
> Spark SQL
>
>    - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
>    SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix
>    bugs in eviction of storage memory by execution.
>    - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258> correct
>    passing null into ScalaUDF
>
> Notable Features Since 1.5Spark SQL
>
>    - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> Parquet
>    Performance - Improve Parquet scan performance when using flat schemas.
>    - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
>    Session Management - Isolated devault database (i.e USE mydb) even on
>    shared clusters.
>    - SPARK-9999  <https://issues.apache.org/jira/browse/SPARK-9999> Dataset
>    API - A type-safe API (similar to RDDs) that performs many operations
>    on serialized binary data and code generation (i.e. Project Tungsten).
>    - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000> Unified
>    Memory Management - Shared memory for execution and caching instead of
>    exclusive division of the regions.
>    - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
>    Queries on Files - Concise syntax for running SQL queries over files
>    of any supported format without registering a table.
>    - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745> Reading
>    non-standard JSON files - Added options to read non-standard JSON
>    files (e.g. single-quotes, unquoted attributes)
>    - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412> Per-operator
>    Metrics for SQL Execution - Display statistics on a peroperator basis
>    for memory usage and spilled data size.
>    - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
>    (*) expansion for StructTypes - Makes it easier to nest and unest
>    arbitrary numbers of columns
>    - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
>    SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149> In-memory
>    Columnar Cache Performance - Significant (up to 14x) speed up when
>    caching data that contains complex types in DataFrames or SQL.
>    - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
>    null-safe joins - Joins using null-safe equality (<=>) will now
>    execute using SortMergeJoin instead of computing a cartisian product.
>    - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
>    Execution Using Off-Heap Memory - Support for configuring query
>    execution to occur using off-heap memory to avoid GC overhead
>    - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978> Datasource
>    API Avoid Double Filter - When implemeting a datasource with filter
>    pushdown, developers can now tell Spark SQL to avoid double evaluating a
>    pushed-down filter.
>    - SPARK-4849  <https://issues.apache.org/jira/browse/SPARK-4849> Advanced
>    Layout of Cached Data - Store partitioning and ordering schemes in the
>    in-memory table scan, and add distributeBy and localSort to the
>    DataFrame API
>    - SPARK-9858  <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive
>    query execution - Initial support for automatically selecting the
>    number of reducers for joins and aggregations.
>    - SPARK-9241  <https://issues.apache.org/jira/browse/SPARK-9241> Improved
>    query planner for queries having distinct aggregations - Query plans
>    of distinct aggregations are more robust when distinct columns have high
>    cardinality.
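>
> As a quick illustration of two of the features above: first, a minimal
> Scala sketch of the Dataset API (SPARK-9999). The Person case class is
> hypothetical, and sqlContext is assumed to be an existing SQLContext:
>
>    case class Person(name: String, age: Long)
>
>    import sqlContext.implicits._
>
>    // Datasets use encoders to operate on serialized binary data with
>    // generated code (Project Tungsten) behind a typed API.
>    val ds = Seq(Person("Andy", 32), Person("Ben", 25)).toDS()
>    val names = ds.filter(_.age > 30).map(_.name).collect()
>
> Second, a sketch of SQL queries directly over files (SPARK-11197); the
> file path is hypothetical:
>
>    // Query a Parquet file without registering a temporary table first.
>    val events = sqlContext.sql(
>      "SELECT * FROM parquet.`/path/to/events.parquet`")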
>
> Spark Streaming
>
>    - API Updates
>       - SPARK-2629  <https://issues.apache.org/jira/browse/SPARK-2629> New
>       improved state management - mapWithState - a DStream transformation
>       for stateful stream processing that supersedes updateStateByKey in
>       functionality and performance; see the sketch at the end of this
>       section.
>       - SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198> Kinesis
>       record deaggregation - Kinesis streams have been upgraded to use
>       KCL 1.4.0 and support transparent deaggregation of KPL-aggregated records.
>       - SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891> Kinesis
>       message handler function - Allows an arbitrary function to be applied
>       to each Kinesis record in the Kinesis receiver to customize what data
>       is stored in memory.
>       - SPARK-6328  <https://issues.apache.org/jira/browse/SPARK-6328> Python
>       Streaming Listener API - Get streaming statistics (scheduling
>       delays, batch processing times, etc.) from Python.
>
>
>    - UI Improvements
>       - Made failures visible in the streaming tab, in the timelines,
>       batch list, and batch details page.
>       - Made output operations visible in the streaming tab as progress
>       bars.
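>
> For illustration, a minimal Scala sketch of mapWithState keeping a
> running count per key; pairs is assumed to be an existing
> DStream[(String, Int)], and checkpointing must be enabled on the
> StreamingContext:
>
>    import org.apache.spark.streaming.{State, StateSpec}
>
>    // The mapping function gets the key, the new value (if any), and
>    // the keyed state; it updates the state and emits a mapped record.
>    val spec = StateSpec.function(
>      (key: String, value: Option[Int], state: State[Int]) => {
>        val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
>        state.update(sum)
>        (key, sum)
>      })
>
>    val runningCounts = pairs.mapWithState(spec)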
>
> MLlib
>
> New algorithms/models
>
>    - SPARK-8518  <https://issues.apache.org/jira/browse/SPARK-8518> Survival
>    analysis - Log-linear model for survival analysis
>    - SPARK-9834  <https://issues.apache.org/jira/browse/SPARK-9834> Normal
>    equation for least squares - Normal equation solver, providing R-like
>    model summary statistics
>    - SPARK-3147  <https://issues.apache.org/jira/browse/SPARK-3147> Online
>    hypothesis testing - A/B testing in the Spark Streaming framework
>    - SPARK-9930  <https://issues.apache.org/jira/browse/SPARK-9930> New
>    feature transformers - ChiSqSelector, QuantileDiscretizer,
>    SQLTransformer; see the sketch after this list.
>    - SPARK-6517  <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting
>    K-Means clustering - Fast top-down clustering variant of K-Means
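>
> For illustration, a minimal Scala sketch of the new QuantileDiscretizer;
> df is assumed to be an existing DataFrame, and the column names are
> hypothetical:
>
>    import org.apache.spark.ml.feature.QuantileDiscretizer
>
>    // Fit bucket boundaries from approximate quantiles of the numeric
>    // "hour" column, then bin each value into one of three buckets.
>    val discretizer = new QuantileDiscretizer()
>      .setInputCol("hour")
>      .setOutputCol("bucket")
>      .setNumBuckets(3)
>
>    val bucketed = discretizer.fit(df).transform(df)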
>
> API improvements
>
>    - ML Pipelines
>       - SPARK-6725  <https://issues.apache.org/jira/browse/SPARK-6725> Pipeline
>       persistence - Save/load for ML Pipelines, with partial coverage of
>       spark.ml algorithms; see the sketch after this list.
>       - SPARK-5565  <https://issues.apache.org/jira/browse/SPARK-5565> LDA
>       in ML Pipelines - API for Latent Dirichlet Allocation in ML
>       Pipelines
>    - R API
>       - SPARK-9836  <https://issues.apache.org/jira/browse/SPARK-9836> R-like
>       statistics for GLMs - (Partial) R-like stats for ordinary least
>       squares via summary(model)
>       - SPARK-9681  <https://issues.apache.org/jira/browse/SPARK-9681> Feature
>       interactions in R formula - Interaction operator ":" in R formula
>    - Python API - Many improvements to Python API to approach feature
>    parity
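>
> For illustration, a minimal Scala sketch of pipeline persistence; the
> pipeline and training values are assumed to exist, the paths are
> hypothetical, and only the spark.ml algorithms covered by save/load in
> 1.6 will work:
>
>    import org.apache.spark.ml.{Pipeline, PipelineModel}
>
>    val model = pipeline.fit(training)
>
>    // Save both the unfit pipeline and the fitted model ...
>    pipeline.save("/tmp/unfit-pipeline")
>    model.save("/tmp/fitted-model")
>
>    // ... and load the fitted model back later, e.g. in production.
>    val sameModel = PipelineModel.load("/tmp/fitted-model")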
>
> Misc improvements
>
>    - SPARK-7685  <https://issues.apache.org/jira/browse/SPARK-7685>,
>    SPARK-9642  <https://issues.apache.org/jira/browse/SPARK-9642> Instance
>    weights for GLMs - Logistic and Linear Regression can take instance
>    weights
>    - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
>    SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate
>    and bivariate statistics in DataFrames - Variance, stddev,
>    correlations, etc.
>    - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM
>    data source - LIBSVM as a SQL data source; see the sketch below.
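>
> For illustration, a minimal Scala sketch of the LIBSVM data source; the
> file path is hypothetical:
>
>    // Yields a DataFrame with "label" and "features" columns.
>    val data = sqlContext.read
>      .format("libsvm")
>      .load("/path/to/sample_libsvm_data.txt")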
>
> Documentation improvements
>
>    - SPARK-7751  <https://issues.apache.org/jira/browse/SPARK-7751> @since
>    versions - Documentation includes the initial version when classes and
>    methods were added
>    - SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable
>    example code - Automated testing for code in user guide examples
>
> Deprecations
>
>    - In spark.mllib.clustering.KMeans, the "runs" parameter has been
>    deprecated.
>    - In spark.ml.classification.LogisticRegressionModel and
>    spark.ml.regression.LinearRegressionModel, the "weights" field has been
>    deprecated in favor of the new name "coefficients." This helps
>    disambiguate it from instance (row) weights given to algorithms.
>
> Changes of behavior
>
>    - spark.mllib.tree.GradientBoostedTrees validationTol has changed
>    semantics in 1.6. Previously, it was a threshold for absolute change in
>    error. Now, it resembles the behavior of GradientDescent convergenceTol:
>    For large errors, it uses relative error (relative to the previous error);
>    for small errors (< 0.01), it uses absolute error.
>    - spark.ml.feature.RegexTokenizer: Previously, it did not convert
>    strings to lowercase before tokenizing. Now, it converts to lowercase by
>    default, with an option not to. This matches the behavior of the simpler
>    Tokenizer transformer.
>    - Spark SQL's partition discovery has been changed to only discover
>    partition directories that are children of the given path (i.e. if
>    path="/my/data/x=1", then x=1 will no longer be considered a partition
>    column, but children of x=1 will). This behavior can be overridden by
>    manually specifying the basePath from which partition discovery should
>    start (SPARK-11678
>    <https://issues.apache.org/jira/browse/SPARK-11678>); see the sketch
>    after this list.
>    - When casting a value of an integral type to timestamp (e.g. casting
>    a long value to timestamp), the value is treated as being in seconds
>    instead of milliseconds (SPARK-11724
>    <https://issues.apache.org/jira/browse/SPARK-11724>).
>    - With the improved query planner for queries having distinct
>    aggregations (SPARK-9241
>    <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
>    query having a single distinct aggregation has been changed to a more
>    robust version. To switch back to the plan generated by Spark 1.5's
>    planner, please set spark.sql.specializeSingleDistinctAggPlanning to
>    true (SPARK-12077 <https://issues.apache.org/jira/browse/SPARK-12077>).
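>
> For illustration, a minimal Scala sketch of the basePath option, reusing
> the paths from the partition discovery item above:
>
>    // Reading "/my/data/x=1" directly no longer yields x as a partition
>    // column; starting discovery at "/my/data" preserves it.
>    val df = sqlContext.read
>      .option("basePath", "/my/data")
>      .parquet("/my/data/x=1")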
>
>