Posted to dev@spark.apache.org by Michael Armbrust <mi...@databricks.com> on 2015/12/12 18:39:21 UTC

[VOTE] Release Apache Spark 1.6.0 (RC2)

Please vote on releasing the following candidate as Apache Spark version
1.6.0!

The vote is open until Tuesday, December 15, 2015 at 6:00 UTC and passes if
a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.6.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v1.6.0-rc2
(23f8dfd45187cb8f2216328ab907ddb5fbdffd0b)
<https://github.com/apache/spark/tree/v1.6.0-rc2>

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1169/

The test repository (versioned as v1.6.0-rc2) for this release can be found
at:
https://repository.apache.org/content/repositories/orgapachespark-1168/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-docs/

=======================================
== How can I help test this release? ==
=======================================
If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running it on this release candidate, then
reporting any regressions.

================================================
== What justifies a -1 vote for this release? ==
================================================
This vote is happening towards the end of the 1.6 QA period, so -1 votes
should only occur for significant regressions from 1.5. Bugs already
present in 1.5, minor regressions, or bugs related to new features will not
block this release.

===============================================================
== What should happen to JIRA tickets still targeting 1.6.0? ==
===============================================================
1. It is OK for documentation patches to target 1.6.0 and still go into
branch-1.6, since documentation will be published separately from the
release.
2. New features for non-alpha modules should target 1.7+.
3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
version.


==================================================
== Major changes to help you focus your testing ==
==================================================

Spark 1.6.0 Preview

Notable changes since 1.6 RC1

Spark Streaming

   - SPARK-2629  <https://issues.apache.org/jira/browse/SPARK-2629>
   trackStateByKey has been renamed to mapWithState

Spark SQL

   - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
   SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix bugs
   in eviction of storage memory by execution.
   - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258> Correct
   handling of null passed into ScalaUDF

Notable Features Since 1.5

Spark SQL

   - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> Parquet
   Performance - Improve Parquet scan performance when using flat schemas.
   - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
   Session Management - Isolated default database (i.e. USE mydb) even on
   shared clusters.
   - SPARK-9999  <https://issues.apache.org/jira/browse/SPARK-9999> Dataset
   API - A type-safe API (similar to RDDs) that performs many operations on
   serialized binary data, with code generation (i.e. Project Tungsten); see
   the sketch after this list.
   - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000> Unified
   Memory Management - Shared memory for execution and caching instead of
   exclusive division of the regions.
   - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
   Queries on Files - Concise syntax for running SQL queries over files of
   any supported format without registering a table (also shown in the
   sketch after this list).
   - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745> Reading
   non-standard JSON files - Added options to read non-standard JSON files
   (e.g. single-quotes, unquoted attributes)
   - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412>
   Per-operator Metrics for SQL Execution - Display statistics on a
   per-operator basis for memory usage and spilled data size.
   - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
   (*) expansion for StructTypes - Makes it easier to nest and unnest
   arbitrary numbers of columns.
   - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
   SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149> In-memory
   Columnar Cache Performance - Significant (up to 14x) speed up when
   caching data that contains complex types in DataFrames or SQL.
   - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
   null-safe joins - Joins using null-safe equality (<=>) will now execute
   using SortMergeJoin instead of computing a cartesian product.
   - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
   Execution Using Off-Heap Memory - Support for configuring query
   execution to occur using off-heap memory to avoid GC overhead
   - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978> Datasource
   API Avoid Double Filter - When implementing a datasource with filter
   pushdown, developers can now tell Spark SQL to avoid double evaluating a
   pushed-down filter.
   - SPARK-4849  <https://issues.apache.org/jira/browse/SPARK-4849> Advanced
   Layout of Cached Data - Store partitioning and ordering schemes in the
   in-memory table scan, and add distributeBy and localSort to the DataFrame
   API.
   - SPARK-9858  <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive
   query execution - Initial support for automatically selecting the number
   of reducers for joins and aggregations.
   - SPARK-9241  <https://issues.apache.org/jira/browse/SPARK-9241> Improved
   query planner for queries having distinct aggregations - Query plans of
   distinct aggregations are more robust when distinct columns have high
   cardinality.
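
As a quick illustration of two items above, the Dataset API (SPARK-9999)
and SQL queries on files (SPARK-11197), here is a minimal sketch against
the 1.6 API. The Person case class, the sample data, and the commented-out
parquet path are hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Hypothetical record type for the example.
    case class Person(name: String, age: Long)

    object DatasetSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("dataset-sketch").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // Dataset API: typed transformations that run over Tungsten-encoded
        // binary rows rather than deserialized objects.
        val people = Seq(Person("alice", 30), Person("bob", 17)).toDS()
        people.filter(_.age >= 18).map(_.name).show()

        // SQL directly over a file, no table registration (placeholder path):
        // sqlContext.sql("SELECT * FROM parquet.`/data/events.parquet`").show()

        sc.stop()
      }
    }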

Spark Streaming

   - API Updates
      - SPARK-2629  <https://issues.apache.org/jira/browse/SPARK-2629> New
      improved state management - mapWithState - a DStream transformation
      for stateful stream processing that supersedes updateStateByKey in
      functionality and performance (see the sketch after this list).
      - SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198> Kinesis
      record deaggregation - Kinesis streams have been upgraded to use KCL
      1.4.0 and support transparent deaggregation of KPL-aggregated records.
      - SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891> Kinesis
      message handler function - Allows an arbitrary function to be applied
      to a Kinesis record in the Kinesis receiver to customize what data is
      stored in memory.
      - SPARK-6328  <https://issues.apache.org/jira/browse/SPARK-6328> Python
      Streaming Listener API - Get streaming statistics (scheduling delays,
      batch processing times, etc.) in streaming.


   - UI Improvements
      - Made failures visible in the streaming tab, in the timelines, batch
      list, and batch details page.
      - Made output operations visible in the streaming tab as progress
      bars.
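
As a minimal sketch of the mapWithState item above (SPARK-2629), here is a
running word count modeled on the 1.6 streaming API; the socket source and
checkpoint directory are placeholder assumptions:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

    object MapWithStateSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("map-with-state-sketch").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(1))
        ssc.checkpoint("/tmp/checkpoint")  // hypothetical checkpoint directory

        val words = ssc.socketTextStream("localhost", 9999)  // hypothetical source
          .flatMap(_.split(" "))
          .map(w => (w, 1))

        // Keep a running count per word; this supersedes the updateStateByKey pattern.
        val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
          val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
          state.update(sum)
          (word, sum)
        }

        val counts = words.mapWithState(StateSpec.function(mappingFunc))
        counts.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }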

MLlib

New algorithms/models

   - SPARK-8518  <https://issues.apache.org/jira/browse/SPARK-8518> Survival
   analysis - Log-linear model for survival analysis
   - SPARK-9834  <https://issues.apache.org/jira/browse/SPARK-9834> Normal
   equation for least squares - Normal equation solver, providing R-like
   model summary statistics
   - SPARK-3147  <https://issues.apache.org/jira/browse/SPARK-3147> Online
   hypothesis testing - A/B testing in the Spark Streaming framework
   - SPARK-9930  <https://issues.apache.org/jira/browse/SPARK-9930> New
   feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
   transformer
   - SPARK-6517  <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting
   K-Means clustering - Fast top-down clustering variant of K-Means (see the
   sketch after this list)
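
As a small sketch of the bisecting K-Means item above (SPARK-6517), using
the 1.6 spark.mllib API; the toy points are made up for the example:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.BisectingKMeans
    import org.apache.spark.mllib.linalg.Vectors

    object BisectingKMeansSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("bkm-sketch").setMaster("local[*]"))

        // Toy 2-D points forming two obvious clusters.
        val data = sc.parallelize(Seq(
          Vectors.dense(0.1, 0.1), Vectors.dense(0.2, 0.2),
          Vectors.dense(9.0, 9.0), Vectors.dense(9.2, 9.1)))

        // Top-down (divisive) clustering: recursively bisect until k leaf
        // clusters remain.
        val model = new BisectingKMeans().setK(2).run(data)
        model.clusterCenters.foreach(println)

        sc.stop()
      }
    }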

API improvements

   - ML Pipelines
      - SPARK-6725  <https://issues.apache.org/jira/browse/SPARK-6725> Pipeline
      persistence - Save/load for ML Pipelines, with partial coverage of
      spark.ml algorithms (see the sketch after this list)
      - SPARK-5565  <https://issues.apache.org/jira/browse/SPARK-5565> LDA
      in ML Pipelines - API for Latent Dirichlet Allocation in ML Pipelines
   - R API
      - SPARK-9836  <https://issues.apache.org/jira/browse/SPARK-9836> R-like
      statistics for GLMs - (Partial) R-like stats for ordinary least
      squares via summary(model)
      - SPARK-9681  <https://issues.apache.org/jira/browse/SPARK-9681> Feature
      interactions in R formula - Interaction operator ":" in R formula
   - Python API - Many improvements to Python API to approach feature parity
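
As a minimal sketch of the pipeline persistence item above (SPARK-6725),
using the 1.6 spark.ml API; the save path is hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    object PipelinePersistenceSketch {
      def main(args: Array[String]): Unit = {
        // An active SparkContext is needed for the save/load machinery.
        val sc = new SparkContext(
          new SparkConf().setAppName("pipeline-sketch").setMaster("local[*]"))

        // Assemble a small text-classification pipeline.
        val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
        val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
        val lr = new LogisticRegression().setMaxIter(10)
        val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

        // Save the (unfitted) pipeline definition and reload it later.
        val path = "/tmp/spark-pipeline"  // hypothetical path
        pipeline.write.overwrite().save(path)
        val restored = Pipeline.load(path)
        println(restored.getStages.length)

        sc.stop()
      }
    }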

Misc improvements

   - SPARK-7685  <https://issues.apache.org/jira/browse/SPARK-7685>,
   SPARK-9642  <https://issues.apache.org/jira/browse/SPARK-9642> Instance
   weights for GLMs - Logistic and Linear Regression can take instance
   weights
   - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
   SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate
   and bivariate statistics in DataFrames - Variance, stddev, correlations,
   etc. (also shown in the sketch below)
   - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM
   data source - LIBSVM as a SQL data source (see the sketch below)

Documentation improvements

   - SPARK-7751  <https://issues.apache.org/jira/browse/SPARK-7751> @since
   versions - Documentation includes initial version when classes and
   methods were added
   - SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable
   example code - Automated testing for code in user guide examples
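
As a combined sketch of the LIBSVM data source (SPARK-10117) and the new
DataFrame statistics functions (SPARK-10384, SPARK-10385) above, against
the 1.6 API; the input path is hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.{stddev, variance}

    object LibsvmStatsSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("libsvm-sketch").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)

        // LIBSVM as a SQL data source: yields "label" and "features" columns.
        val df = sqlContext.read.format("libsvm")
          .load("data/sample_libsvm_data.txt")  // hypothetical path
        df.printSchema()

        // Univariate statistics as DataFrame aggregate functions.
        df.agg(variance("label"), stddev("label")).show()

        sc.stop()
      }
    }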

Deprecations

   - In spark.mllib.clustering.KMeans, the "runs" parameter has been
   deprecated.
   - In spark.ml.classification.LogisticRegressionModel and
   spark.ml.regression.LinearRegressionModel, the "weights" field has been
   deprecated, in favor of the new name "coefficients." This helps
   disambiguate from instance (row) weights given to algorithms.

Changes of behavior

   - spark.mllib.tree.GradientBoostedTrees validationTol has changed
   semantics in 1.6. Previously, it was a threshold for absolute change in
   error. Now, it resembles the behavior of GradientDescent convergenceTol:
   For large errors, it uses relative error (relative to the previous error);
   for small errors (< 0.01), it uses absolute error.
   - spark.ml.feature.RegexTokenizer: Previously, it did not convert
   strings to lowercase before tokenizing. Now, it converts to lowercase by
   default, with an option not to. This matches the behavior of the simpler
   Tokenizer transformer.
   - Spark SQL's partition discovery has been changed to only discover
   partition directories that are children of the given path (i.e., if
   path="/my/data/x=1", then x=1 will no longer be treated as a partition
   column; only children of x=1 will be discovered). This behavior can be
   overridden by manually specifying the basePath that partition discovery
   should start with (SPARK-11678
   <https://issues.apache.org/jira/browse/SPARK-11678>); see the sketch
   after this list.
   - When casting a value of an integral type to timestamp (e.g. casting a
   long value to timestamp), the value is treated as being in seconds instead
   of milliseconds (SPARK-11724
   <https://issues.apache.org/jira/browse/SPARK-11724>).
   - With the improved query planner for queries having distinct
   aggregations (SPARK-9241
   <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a query
   having a single distinct aggregation has been changed to a more robust
   version. To switch back to the plan generated by Spark 1.5's planner,
   please set spark.sql.specializeSingleDistinctAggPlanning to true (
   SPARK-12077 <https://issues.apache.org/jira/browse/SPARK-12077>).
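
As a minimal sketch of the partition discovery change above (SPARK-11678),
against the 1.6 API; the paths are hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object BasePathSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("basepath-sketch").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)

        // Reading "/my/data/x=1" directly no longer yields x as a partition
        // column in 1.6. Pointing basePath at the partitioned root keeps it.
        val df = sqlContext.read
          .option("basePath", "/my/data")
          .parquet("/my/data/x=1")
        df.printSchema()  // schema should include the partition column x

        sc.stop()
      }
    }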

Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

Posted by Sean Owen <so...@cloudera.com>.
With Java 7 / Ubuntu 15, and "-Pyarn -Phadoop-2.6 -Phive
-Phive-thriftserver", I still see the Docker tests fail every time. Is
anyone else seeing them fail (or running them)?

The Hive CliSuite also fails (stack trace at the bottom).

Same deal -- if people are running this test and it's not failing,
this is probably just flakiness of some form.

There's the aforementioned doc generation issue too.

Other than that it compiled and ran all tests for me.

JIRA score: 28 issues, of which 11 bugs, of which 5 critical (listed
below), of which 0 blockers. OK there.



Critical bugs:
SPARK-8447 Test external shuffle service with all shuffle managers
SPARK-10680 Flaky test:
network.RequestTimeoutIntegrationSuite.timeoutInactiveRequests
SPARK-11224 Flaky test: o.a.s.ExternalShuffleServiceSuite
SPARK-11266 Peak memory tests swallow failures
SPARK-11293 Spillable collections leak shuffle memory



- Simple commands *** FAILED ***
  =======================
  CliSuite failure output
  =======================
  Spark SQL CLI command line: ../../bin/spark-sql --master local
--driver-java-options -Dderby.system.durability=test --conf
spark.ui.enabled=false --hiveconf
javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=/home/srowen/spark-1.6.0/sql/hive-thriftserver/target/tmp/spark-240e9e22-8fe8-408b-a116-2a894b3cbf1f;create=true
--hiveconf hive.metastore.warehouse.dir=/home/srowen/spark-1.6.0/sql/hive-thriftserver/target/tmp/spark-c336bc67-8e51-4284-b574-e8b79d0d4fce
--hiveconf hive.exec.scratchdir=/home/srowen/spark-1.6.0/sql/hive-thriftserver/target/tmp/spark-3a4f9564-d9f1-467f-8016-d4c95389e568
  Exception: java.util.concurrent.TimeoutException: Futures timed out
after [3 minutes]
  Executed query 0 "CREATE TABLE hive_test(key INT, val STRING);",
  But failed to capture expected output "OK" within 3 minutes.

  2015-12-14 13:47:23.07 - stderr> SLF4J: Class path contains multiple
SLF4J bindings.
  2015-12-14 13:47:23.07 - stderr> SLF4J: Found binding in
[jar:file:/home/srowen/spark-1.6.0/assembly/target/scala-2.10/spark-assembly-1.6.0-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  2015-12-14 13:47:23.07 - stderr> SLF4J: Found binding in
[jar:file:/home/srowen/.m2/repository/org/slf4j/slf4j-log4j12/1.7.10/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  2015-12-14 13:47:23.07 - stderr> SLF4J: See
http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
  2015-12-14 13:47:23.074 - stderr> SLF4J: Actual binding is of type
[org.slf4j.impl.Log4jLoggerFactory]
  2015-12-14 13:47:39.36 - stdout> SET spark.sql.hive.version=1.2.1
  ===========================
  End CliSuite failure output
  =========================== (CliSuite.scala:151)




On Sat, Dec 12, 2015 at 5:39 PM, Michael Armbrust
<mi...@databricks.com> wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 1.6.0!
> [...]



Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

Posted by Yin Huai <yh...@databricks.com>.
+1

Critical and blocker issues of SQL have been addressed.

On Sat, Dec 12, 2015 at 9:39 AM, Michael Armbrust <mi...@databricks.com>
wrote:

> I'll kick off the voting with a +1.
>
> On Sat, Dec 12, 2015 at 9:39 AM, Michael Armbrust <mi...@databricks.com>
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.6.0!
>> [...]

Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

Posted by Michael Armbrust <mi...@databricks.com>.
>
> I'm surprised you're suggesting there's not a coupling between a release's
> code and the docs for that release. If a release happens and some time
> later docs come out, that has some effect on people's usage.
>

I'm only suggesting that we shouldn't delay testing of the actual bits, or
wait to iterate on another RC.  Ideally docs should come out with the
actual release announcement (and I'll do everything in my power to make
this happen).  They should also be updated regularly as small issues are
found.

> But if it can/will be fixed quickly, what's the hurry? I get it, people
> want releases sooner rather than later, all else being equal, but this is
> always true. It'd be nice to talk about what behaviors have led to being
> behind schedule and this perceived rush to finish now, since the same
> thing has happened in 1.5 and 1.4. I'd rather at least collect some
> opinions on it than invalidate the question.
>

I'm happy to debate concrete process suggestions on another thread.

Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

Posted by Sean Owen <so...@cloudera.com>.
(I can't -1 this.) I do agree that docs have been treated as if separate
from releases in the past. With more maturity in the release process, I'm
questioning that now, as I don't think it's normal. It would be a reason to
release or not release this particular tarball, so a vote thread is the
right place to discuss it.

I'm surprised you're suggesting there's not a coupling between a release's
code and the docs for that release. If a release happens and some time
later docs come out, that has some effect on people's usage. Surely, the
ideal is for docs for x.y to come from the bits for x.y and thus be
available at the same time.

Reality is something else, and your argument is a practical one: the
release is again behind schedule, so shouldn't we overlook this minor
problem to get it out? This particular problem has to get fixed, and soon;
we agree. It's minor by virtue of being hopefully temporary.

But if it can/will be fixed quickly, what's the hurry? I get it, people
want releases sooner rather than later, all else being equal, but this is
always true. It'd be nice to talk about what behaviors have led to being
behind schedule and this perceived rush to finish now, since the same
thing has happened in 1.5 and 1.4. I'd rather at least collect some
opinions on it than invalidate the question.

On Sat, Dec 12, 2015 at 11:17 PM, Michael Armbrust <mi...@databricks.com>
wrote:

> Sean, if you would like to -1 the release you are certainly entitled to,
> but in the past we have never held a release for documentation only
> issues.  If you'd like to change the policy of the project I'm not sure
> that a voting thread is the right place to do it.
>
> I think the right question here is: "How are users going to be affected by
> this temporary issue?"  Given that I'm pretty certain that no users build
> the documentation from the release themselves and instead consume it from
> the published documentation, the docs contained in the release seem less
> important as far as voting on the artifacts is concerned.
>
> In contrast, there have been several threads on the users list asking when
> the release is going to happen.  Should we make them wait longer for
> something that isn't going to affect their usage of the release?  I would
> vote no.  That doesn't mean that we shouldn't fix the documentation issue.
> It just means we shouldn't add unnecessary coupling where it has no benefit.
>
> On Sat, Dec 12, 2015 at 1:50 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> I've heard this argument before, but don't quite get it. Documentation is
>> part of a release, and I believe is something we're voting on here too, and
>> therefore needs to 'work' as documentation. We could not release this HTML
>> to the Apache site, so I think that does actually mean the artifacts
>> including docs don't work as a release.
>>
>> Yes, I can see that the non-code artifacts can be released a little bit
>> after the code artifacts with last minute fixes. But, the whole release can
>> just happen later too. Why wouldn't this be a valid reason to block the
>> release?
>>
>> On Sat, Dec 12, 2015 at 6:31 PM, Michael Armbrust <michael@databricks.com
>> > wrote:
>>
>>> Thanks Ben, but as I said in the first email, docs are published
>> separately from the release, so this isn't a valid reason to downvote the
>>> RC.  We just provide them to help with testing.
>>>
>>> I'll ask the mllib guys to take a look at that patch though.
>>> On Dec 12, 2015 9:44 AM, "Benjamin Fradet" <be...@gmail.com>
>>> wrote:
>>>
>>>> -1
>>>>
>>>> For me the docs are not displaying except for the first page, for
>>>> example
>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-docs/mllib-guide.html is
>>>> a blank page.
>>>> This is because of SPARK-12199
>>>> <https://github.com/apache/spark/pull/10193>:
>>>> Element[W|w]iseProductExample.scala: the file name referenced in the docs
>>>> does not match the actual file name.
>>>>
>>>> On Sat, Dec 12, 2015 at 6:39 PM, Michael Armbrust <
>>>> michael@databricks.com> wrote:
>>>>
>>>>> I'll kick off the voting with a +1.
>>>>>
>>>>> On Sat, Dec 12, 2015 at 9:39 AM, Michael Armbrust <
>>>>> michael@databricks.com> wrote:
>>>>>
>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>> version 1.6.0!
>>>>>> [...]
>>>>
>>>> --
>>>> Ben Fradet.

Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

Posted by Michael Armbrust <mi...@databricks.com>.
Sean, if you would like to -1 the release you are certainly entitled to do
so, but in the past we have never held a release for documentation-only
issues.  If you'd like to change the policy of the project, I'm not sure
that a voting thread is the right place to do it.

I think the right question here is "How are users going to be affected by
this temporary issue?"  Given that I'm pretty certain that no users build
the documentation from the release themselves and instead consume it from
the published documentation, the docs contained in the release seem less
important as far as voting on the artifacts is concerned.

In contrast, there have been several threads on the users list asking when
the release is going to happen.  Should we make them wait longer for
something that isn't going to affect their usage of the release?  I would
vote no.  That doesn't mean that we shouldn't fix the documentation issue.
It just means we shouldn't add unnecessary coupling where it has no benefit.

On Sat, Dec 12, 2015 at 1:50 PM, Sean Owen <so...@cloudera.com> wrote:

> I've heard this argument before, but don't quite get it. Documentation is
> part of a release, and I believe is something we're voting on here too, and
> therefore needs to 'work' as documentation. We could not release this HTML
> to the Apache site, so I think that does actually mean the artifacts
> including docs don't work as a release.
>
> Yes, I can see that the non-code artifacts can be released a little bit
> after the code artifacts with last minute fixes. But, the whole release can
> just happen later too. Why wouldn't this be a valid reason to block the
> release?
>
> On Sat, Dec 12, 2015 at 6:31 PM, Michael Armbrust <mi...@databricks.com>
> wrote:
>
>> Thanks Ben, but as I said in the first email, docs are published
>> separately from the release, so this isn't a valid reason to down vote the
>> RC.  We just provide them to help with testing.
>>
>> I'll ask the mllib guys to take a look at that patch though.
>> On Dec 12, 2015 9:44 AM, "Benjamin Fradet" <be...@gmail.com>
>> wrote:
>>
>>> -1
>>>
>>> For me the docs are not displaying except for the first page, for
>>> example
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-docs/mllib-guide.html is
>>> a blank page.
>>> This is because of SPARK-12199
>>> <https://github.com/apache/spark/pull/10193>:
>>> Element[W|w]iseProductExample.scala is not the same in the docs and the
>>> actual file name.
>>>
>>> On Sat, Dec 12, 2015 at 6:39 PM, Michael Armbrust <
>>> michael@databricks.com> wrote:
>>>
>>>> I'll kick off the voting with a +1.
>>>>
>>>
>>>
>>> --
>>> Ben Fradet.
>>>
>>
>

Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

Posted by Sean Owen <so...@cloudera.com>.
I've heard this argument before, but don't quite get it. Documentation is
part of a release, and I believe is something we're voting on here too, and
therefore needs to 'work' as documentation. We could not release this HTML
to the Apache site, so I think that does actually mean the artifacts
including docs don't work as a release.

Yes, I can see that the non-code artifacts can be released a little bit
after the code artifacts with last minute fixes. But, the whole release can
just happen later too. Why wouldn't this be a valid reason to block the
release?

On Sat, Dec 12, 2015 at 6:31 PM, Michael Armbrust <mi...@databricks.com>
wrote:

> Thanks Ben, but as I said in the first email, docs are published
> separately from the release, so this isn't a valid reason to down vote the
> RC.  We just provide them to help with testing.
>
> I'll ask the mllib guys to take a look at that patch though.
> On Dec 12, 2015 9:44 AM, "Benjamin Fradet" <be...@gmail.com>
> wrote:
>
>> -1
>>
>> For me the docs are not displaying except for the first page, for example
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-docs/mllib-guide.html is
>> a blank page.
>> This is because of SPARK-12199
>> <https://github.com/apache/spark/pull/10193>:
>> Element[W|w]iseProductExample.scala is not the same in the docs and the
>> actual file name.
>>
>> On Sat, Dec 12, 2015 at 6:39 PM, Michael Armbrust <michael@databricks.com
>> > wrote:
>>
>>> I'll kick off the voting with a +1.
>>>
>>
>>
>> --
>> Ben Fradet.
>>
>

Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

Posted by Michael Armbrust <mi...@databricks.com>.
Thanks Ben, but as I said in the first email, docs are published separately
from the release, so this isn't a valid reason to downvote the RC.  We
just provide them to help with testing.

I'll ask the mllib guys to take a look at that patch though.
On Dec 12, 2015 9:44 AM, "Benjamin Fradet" <be...@gmail.com>
wrote:

> -1
>
> For me the docs are not displaying except for the first page, for example
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-docs/mllib-guide.html is
> a blank page.
> This is because of SPARK-12199
> <https://github.com/apache/spark/pull/10193>:
> Element[W|w]iseProductExample.scala is not the same in the docs and the
> actual file name.
>
> On Sat, Dec 12, 2015 at 6:39 PM, Michael Armbrust <mi...@databricks.com>
> wrote:
>
>> I'll kick off the voting with a +1.
>>
>
>
> --
> Ben Fradet.
>

Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

Posted by Benjamin Fradet <be...@gmail.com>.
-1

For me the docs are not displaying except for the first page; for example,
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-docs/mllib-guide.html
is a blank page.
This is because of SPARK-12199 <https://github.com/apache/spark/pull/10193>:
the Element[W|w]iseProductExample.scala name referenced in the docs does not
match the actual file name.

On Sat, Dec 12, 2015 at 6:39 PM, Michael Armbrust <mi...@databricks.com>
wrote:

> I'll kick off the voting with a +1.
>
> On Sat, Dec 12, 2015 at 9:39 AM, Michael Armbrust <mi...@databricks.com>
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.6.0!
>>
>> The vote is open until Tuesday, December 15, 2015 at 6:00 UTC and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.6.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is *v1.6.0-rc2
>> (23f8dfd45187cb8f2216328ab907ddb5fbdffd0b)
>> <https://github.com/apache/spark/tree/v1.6.0-rc2>*
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1169/
>>
>> The test repository (versioned as v1.6.0-rc2) for this release can be
>> found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1168/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-docs/
>>
>> =======================================
>> == How can I help test this release? ==
>> =======================================
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> ================================================
>> == What justifies a -1 vote for this release? ==
>> ================================================
>> This vote is happening towards the end of the 1.6 QA period, so -1 votes
>> should only occur for significant regressions from 1.5. Bugs already
>> present in 1.5, minor regressions, or bugs related to new features will not
>> block this release.
>>
>> ===============================================================
>> == What should happen to JIRA tickets still targeting 1.6.0? ==
>> ===============================================================
>> 1. It is OK for documentation patches to target 1.6.0 and still go into
>> branch-1.6, since documentations will be published separately from the
>> release.
>> 2. New features for non-alpha-modules should target 1.7+.
>> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
>> version.
>>
>>
>> ==================================================
>> == Major changes to help you focus your testing ==
>> ==================================================
>>
>> Spark 1.6.0 PreviewNotable changes since 1.6 RC1Spark Streaming
>>
>>    - SPARK-2629  <https://issues.apache.org/jira/browse/SPARK-2629>
>>    trackStateByKey has been renamed to mapWithState
>>
>> Spark SQL
>>
>>    - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
>>    SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix
>>    bugs in eviction of storage memory by execution.
>>    - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258> correct
>>    passing null into ScalaUDF
>>
>> Notable Features Since 1.5Spark SQL
>>
>>    - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> Parquet
>>    Performance - Improve Parquet scan performance when using flat
>>    schemas.
>>    - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
>>    Session Management - Isolated devault database (i.e USE mydb) even on
>>    shared clusters.
>>    - SPARK-9999  <https://issues.apache.org/jira/browse/SPARK-9999> Dataset
>>    API - A type-safe API (similar to RDDs) that performs many operations
>>    on serialized binary data and code generation (i.e. Project Tungsten).
>>    - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000> Unified
>>    Memory Management - Shared memory for execution and caching instead
>>    of exclusive division of the regions.
>>    - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
>>    Queries on Files - Concise syntax for running SQL queries over files
>>    of any supported format without registering a table.
>>    - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745> Reading
>>    non-standard JSON files - Added options to read non-standard JSON
>>    files (e.g. single-quotes, unquoted attributes)
>>    - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412> Per-operator
>>    Metrics for SQL Execution - Display statistics on a peroperator basis
>>    for memory usage and spilled data size.
>>    - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
>>    (*) expansion for StructTypes - Makes it easier to nest and unest
>>    arbitrary numbers of columns
>>    - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
>>    SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149> In-memory
>>    Columnar Cache Performance - Significant (up to 14x) speed up when
>>    caching data that contains complex types in DataFrames or SQL.
>>    - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
>>    null-safe joins - Joins using null-safe equality (<=>) will now
>>    execute using SortMergeJoin instead of computing a cartisian product.
>>    - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
>>    Execution Using Off-Heap Memory - Support for configuring query
>>    execution to occur using off-heap memory to avoid GC overhead
>>    - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978> Datasource
>>    API Avoid Double Filter - When implemeting a datasource with filter
>>    pushdown, developers can now tell Spark SQL to avoid double evaluating a
>>    pushed-down filter.
>>    - SPARK-4849  <https://issues.apache.org/jira/browse/SPARK-4849> Advanced
>>    Layout of Cached Data - storing partitioning and ordering schemes in
>>    In-memory table scan, and adding distributeBy and localSort to DF API
>>    - SPARK-9858  <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive
>>    query execution - Intial support for automatically selecting the
>>    number of reducers for joins and aggregations.
>>    - SPARK-9241  <https://issues.apache.org/jira/browse/SPARK-9241> Improved
>>    query planner for queries having distinct aggregations - Query plans
>>    of distinct aggregations are more robust when distinct columns have high
>>    cardinality.
>>
>> Spark Streaming
>>
>>    - API Updates
>>       - SPARK-2629  <https://issues.apache.org/jira/browse/SPARK-2629> New
>>       improved state management - mapWithState - a DStream
>>       transformation for stateful stream processing, supercedes
>>       updateStateByKey in functionality and performance.
>>       - SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198> Kinesis
>>       record deaggregation - Kinesis streams have been upgraded to use
>>       KCL 1.4.0 and supports transparent deaggregation of KPL-aggregated records.
>>       - SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891> Kinesis
>>       message handler function - Allows arbitraray function to be
>>       applied to a Kinesis record in the Kinesis receiver before to customize
>>       what data is to be stored in memory.
>>       - SPARK-6328  <https://issues.apache.org/jira/browse/SPARK-6328> Python
>>       Streamng Listener API - Get streaming statistics (scheduling
>>       delays, batch processing times, etc.) in streaming.
>>
>>
>>    - UI Improvements
>>       - Made failures visible in the streaming tab, in the timelines,
>>       batch list, and batch details page.
>>       - Made output operations visible in the streaming tab as progress
>>       bars.
>>
>> MLlibNew algorithms/models
>>
>>    - SPARK-8518  <https://issues.apache.org/jira/browse/SPARK-8518> Survival
>>    analysis - Log-linear model for survival analysis
>>    - SPARK-9834  <https://issues.apache.org/jira/browse/SPARK-9834> Normal
>>    equation for least squares - Normal equation solver, providing R-like
>>    model summary statistics
>>    - SPARK-3147  <https://issues.apache.org/jira/browse/SPARK-3147> Online
>>    hypothesis testing - A/B testing in the Spark Streaming framework
>>    - SPARK-9930  <https://issues.apache.org/jira/browse/SPARK-9930> New
>>    feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
>>    transformer
>>    - SPARK-6517  <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting
>>    K-Means clustering - Fast top-down clustering variant of K-Means
>>
>> API improvements
>>
>>    - ML Pipelines
>>       - SPARK-6725  <https://issues.apache.org/jira/browse/SPARK-6725> Pipeline
>>       persistence - Save/load for ML Pipelines, with partial coverage of
>>       spark.ml algorithms
>>       - SPARK-5565  <https://issues.apache.org/jira/browse/SPARK-5565> LDA
>>       in ML Pipelines - API for Latent Dirichlet Allocation in ML
>>       Pipelines
>>    - R API
>>       - SPARK-9836  <https://issues.apache.org/jira/browse/SPARK-9836> R-like
>>       statistics for GLMs - (Partial) R-like stats for ordinary least
>>       squares via summary(model)
>>       - SPARK-9681  <https://issues.apache.org/jira/browse/SPARK-9681> Feature
>>       interactions in R formula - Interaction operator ":" in R formula
>>    - Python API - Many improvements to Python API to approach feature
>>    parity
>>
>> Misc improvements
>>
>>    - SPARK-7685  <https://issues.apache.org/jira/browse/SPARK-7685>,
>>    SPARK-9642  <https://issues.apache.org/jira/browse/SPARK-9642> Instance
>>    weights for GLMs - Logistic and Linear Regression can take instance
>>    weights
>>    - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
>>    SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate
>>    and bivariate statistics in DataFrames - Variance, stddev,
>>    correlations, etc.
>>    - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM
>>    data source - LIBSVM as a SQL data sourceDocumentation improvements
>>    - SPARK-7751  <https://issues.apache.org/jira/browse/SPARK-7751> @since
>>    versions - Documentation includes initial version when classes and
>>    methods were added
>>    - SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable
>>    example code - Automated testing for code in user guide examples
>>
>> Deprecations
>>
>>    - In spark.mllib.clustering.KMeans, the "runs" parameter has been
>>    deprecated.
>>    - In spark.ml.classification.LogisticRegressionModel and
>>    spark.ml.regression.LinearRegressionModel, the "weights" field has been
>>    deprecated, in favor of the new name "coefficients." This helps
>>    disambiguate from instance (row) weights given to algorithms.
>>
>> Changes of behavior
>>
>>    - spark.mllib.tree.GradientBoostedTrees validationTol has changed
>>    semantics in 1.6. Previously, it was a threshold for absolute change in
>>    error. Now, it resembles the behavior of GradientDescent convergenceTol:
>>    For large errors, it uses relative error (relative to the previous error);
>>    for small errors (< 0.01), it uses absolute error.
>>    - spark.ml.feature.RegexTokenizer: Previously, it did not convert
>>    strings to lowercase before tokenizing. Now, it converts to lowercase by
>>    default, with an option not to. This matches the behavior of the simpler
>>    Tokenizer transformer.
>>    - Spark SQL's partition discovery now only discovers partition
>>    directories that are children of the given path (i.e., if
>>    path="/my/data/x=1" then x=1 itself is no longer treated as a
>>    partition; only the children of x=1 are considered). This behavior
>>    can be overridden by manually specifying the basePath that partition
>>    discovery should start from (SPARK-11678
>>    <https://issues.apache.org/jira/browse/SPARK-11678>), as illustrated
>>    in the sketch after this list.
>>    - When casting a value of an integral type to timestamp (e.g. casting
>>    a long value to timestamp), the value is treated as being in seconds
>>    instead of milliseconds (SPARK-11724
>>    <https://issues.apache.org/jira/browse/SPARK-11724>).
>>    - With the improved query planner for queries having distinct
>>    aggregations (SPARK-9241
>>    <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
>>    query having a single distinct aggregation has been changed to a more
>>    robust version. To switch back to the plan generated by Spark 1.5's
>>    planner, please set spark.sql.specializeSingleDistinctAggPlanning to
>>    true (SPARK-12077 <https://issues.apache.org/jira/browse/SPARK-12077>
>>    ).
>>
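Two of these behavior changes are easy to verify from spark-shell; the paths
and the epoch value below are illustrative (assumes `sqlContext` in scope):

// With basePath set, x is still discovered as a partition column even
// though the read points at the child directory /my/data/x=1.
val df = sqlContext.read
  .option("basePath", "/my/data")
  .parquet("/my/data/x=1")

// Integral-to-timestamp casts are now in seconds, not milliseconds:
// 1450137600 seconds is 2015-12-15 00:00:00 UTC.
sqlContext.sql("SELECT CAST(1450137600 AS TIMESTAMP)").show()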
>>
>>
>


-- 
Ben Fradet.

Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

Posted by Michael Armbrust <mi...@databricks.com>.
I'll kick off the voting with a +1.

On Sat, Dec 12, 2015 at 9:39 AM, Michael Armbrust <mi...@databricks.com>
wrote:

> [original vote email quoted in full; snipped]

Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

Posted by Ricardo Almeida <ri...@actnowib.com>.
+1 (non-binding)

Tested our workloads on a standalone cluster:
- Spark Core
- Spark SQL
- Spark MLlib
- Python API



On 12 December 2015 at 18:39, Michael Armbrust <mi...@databricks.com>
wrote:

> [original vote email quoted in full; snipped]

Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

Posted by Iulian Dragoș <iu...@typesafe.com>.
-1 (non-binding)

Cluster mode on Mesos is broken (regression compared to 1.5.2). It seems to
be related to the way SPARK_HOME is handled. In the driver logs I see:

I1215 15:00:39.411212 28032 exec.cpp:134] Version: 0.25.0
I1215 15:00:39.413512 28037 exec.cpp:208] Executor registered on slave
130bdc39-44e7-4256-8c22-602040d337f1-S1
bin/spark-submit: line 27:
/Users/dragos/workspace/Spark/dev/rc-tests/spark-1.6.0-bin-hadoop2.6/bin/spark-class:
No such file or directory

The path is my local SPARK_HOME, but that’s of course not the one in the
Mesos slave.

iulian

On Tue, Dec 15, 2015 at 6:31 AM, Mark Hamstra <ma...@clearstorydata.com>
wrote:

I'm afraid you're correct, Krishna:
>
> core/src/main/scala/org/apache/spark/package.scala:  val SPARK_VERSION =
> "1.6.0-SNAPSHOT"
> docs/_config.yml:SPARK_VERSION: 1.6.0-SNAPSHOT
>
> On Mon, Dec 14, 2015 at 6:51 PM, Krishna Sankar <ks...@gmail.com>
> wrote:
>
>> Guys,
>>    The sc.version gives 1.6.0-SNAPSHOT. It needs to change to 1.6.0. Can
>> you please verify?
>> Cheers
>> <k/>
>>
>> On Sat, Dec 12, 2015 at 9:39 AM, Michael Armbrust <michael@databricks.com
>> > wrote:
>>
>>> [original vote email quoted in full; snipped]
>>
>
-- 
Iulian Dragos

------
Reactive Apps on the JVM
www.typesafe.com

Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

Posted by Iulian Dragoș <iu...@typesafe.com>.
Thanks for the heads up.

On Tue, Dec 15, 2015 at 11:40 PM, Michael Armbrust <mi...@databricks.com>
wrote:

> This vote is canceled due to the incorrect version string. The issue will
> be fixed by https://github.com/apache/spark/pull/10317
>
> We can wait a little bit for a fix to
> https://issues.apache.org/jira/browse/SPARK-12345. However, if it looks
> like there is no easy fix coming soon, I'm planning to move forward with
> RC3.
>
> On Mon, Dec 14, 2015 at 9:31 PM, Mark Hamstra <ma...@clearstorydata.com>
> wrote:
>
>> I'm afraid you're correct, Krishna:
>>
>> core/src/main/scala/org/apache/spark/package.scala:  val SPARK_VERSION =
>> "1.6.0-SNAPSHOT"
>> docs/_config.yml:SPARK_VERSION: 1.6.0-SNAPSHOT
>>
>> On Mon, Dec 14, 2015 at 6:51 PM, Krishna Sankar <ks...@gmail.com>
>> wrote:
>>
>>> Guys,
>>>    The sc.version gives 1.6.0-SNAPSHOT. It needs to change to 1.6.0. Can
>>> you please verify?
>>> Cheers
>>> <k/>
>>>
>>> On Sat, Dec 12, 2015 at 9:39 AM, Michael Armbrust <
>>> michael@databricks.com> wrote:
>>>
>>>> [original vote email quoted in full; snipped]
>>>
>>
>


-- 
Iulian Dragos

------
Reactive Apps on the JVM
www.typesafe.com

Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

Posted by Michael Armbrust <mi...@databricks.com>.
This vote is canceled due to the incorrect version string. The issue will
be fixed by https://github.com/apache/spark/pull/10317

We can wait a little bit for a fix to
https://issues.apache.org/jira/browse/SPARK-12345. However, if it looks
like there is no easy fix coming soon, I'm planning to move forward with
RC3.

On Mon, Dec 14, 2015 at 9:31 PM, Mark Hamstra <ma...@clearstorydata.com>
wrote:

> I'm afraid you're correct, Krishna:
>
> core/src/main/scala/org/apache/spark/package.scala:  val SPARK_VERSION =
> "1.6.0-SNAPSHOT"
> docs/_config.yml:SPARK_VERSION: 1.6.0-SNAPSHOT
>
> On Mon, Dec 14, 2015 at 6:51 PM, Krishna Sankar <ks...@gmail.com>
> wrote:
>
>> Guys,
>>    The sc.version gives 1.6.0-SNAPSHOT. It needs to change to 1.6.0. Can
>> you please verify?
>> Cheers
>> <k/>
>>
>> On Sat, Dec 12, 2015 at 9:39 AM, Michael Armbrust <michael@databricks.com
>> > wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 1.6.0!
>>>
>>> The vote is open until Tuesday, December 15, 2015 at 6:00 UTC and
>>> passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 1.6.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is *v1.6.0-rc2
>>> (23f8dfd45187cb8f2216328ab907ddb5fbdffd0b)
>>> <https://github.com/apache/spark/tree/v1.6.0-rc2>*
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1169/
>>>
>>> The test repository (versioned as v1.6.0-rc2) for this release can be
>>> found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1168/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-docs/
>>>
>>> =======================================
>>> == How can I help test this release? ==
>>> =======================================
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> ================================================
>>> == What justifies a -1 vote for this release? ==
>>> ================================================
>>> This vote is happening towards the end of the 1.6 QA period, so -1 votes
>>> should only occur for significant regressions from 1.5. Bugs already
>>> present in 1.5, minor regressions, or bugs related to new features will not
>>> block this release.
>>>
>>> ===============================================================
>>> == What should happen to JIRA tickets still targeting 1.6.0? ==
>>> ===============================================================
>>> 1. It is OK for documentation patches to target 1.6.0 and still go into
>>> branch-1.6, since documentations will be published separately from the
>>> release.
>>> 2. New features for non-alpha-modules should target 1.7+.
>>> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the
>>> target version.
>>>
>>>
>>> ==================================================
>>> == Major changes to help you focus your testing ==
>>> ==================================================
>>>
>>> Spark 1.6.0 Preview
>>>
>>> Notable changes since 1.6 RC1
>>>
>>> Spark Streaming
>>>
>>>    - SPARK-2629  <https://issues.apache.org/jira/browse/SPARK-2629>
>>>    trackStateByKey has been renamed to mapWithState
>>>
>>> Spark SQL
>>>
>>>    - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
>>>    SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix
>>>    bugs in eviction of storage memory by execution.
>>>    - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258> correct
>>>    passing null into ScalaUDF
>>>
>>> Notable Features Since 1.5
>>>
>>> Spark SQL
>>>
>>>    - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> Parquet
>>>    Performance - Improve Parquet scan performance when using flat
>>>    schemas.
>>>    - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
>>>    Session Management - Isolated default database (i.e. USE mydb) even
>>>    on shared clusters.
>>>    - SPARK-9999  <https://issues.apache.org/jira/browse/SPARK-9999> Dataset
>>>    API - A type-safe API (similar to RDDs) that performs many
>>>    operations directly on serialized binary data and uses code generation
>>>    (i.e. Project Tungsten). See the first sketch after this list.
>>>    - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000> Unified
>>>    Memory Management - Shared memory for execution and caching instead
>>>    of exclusive division of the regions.
>>>    - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
>>>    Queries on Files - Concise syntax for running SQL queries over files
>>>    of any supported format without registering a table. See the second
>>>    sketch after this list.
>>>    - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745> Reading
>>>    non-standard JSON files - Added options to read non-standard JSON
>>>    files (e.g. single-quotes, unquoted attributes)
>>>    - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412> Per-operator
>>>    Metrics for SQL Execution - Display statistics on a per-operator
>>>    basis for memory usage and spilled data size.
>>>    - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
>>>    (*) expansion for StructTypes - Makes it easier to nest and unnest
>>>    arbitrary numbers of columns
>>>    - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
>>>    SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149> In-memory
>>>    Columnar Cache Performance - Significant (up to 14x) speed up when
>>>    caching data that contains complex types in DataFrames or SQL.
>>>    - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
>>>    null-safe joins - Joins using null-safe equality (<=>) will now
>>>    execute using SortMergeJoin instead of computing a cartesian product.
>>>    - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
>>>    Execution Using Off-Heap Memory - Support for configuring query
>>>    execution to occur using off-heap memory to avoid GC overhead
>>>    - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978> Datasource
>>>    API Avoid Double Filter - When implementing a datasource with filter
>>>    pushdown, developers can now tell Spark SQL to avoid evaluating a
>>>    pushed-down filter twice.
>>>    - SPARK-4849  <https://issues.apache.org/jira/browse/SPARK-4849> Advanced
>>>    Layout of Cached Data - Storing partitioning and ordering schemes in
>>>    the in-memory table scan, and adding distributeBy and localSort to the
>>>    DataFrame API.
>>>    - SPARK-9858  <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive
>>>    query execution - Initial support for automatically selecting the
>>>    number of reducers for joins and aggregations.
>>>    - SPARK-9241  <https://issues.apache.org/jira/browse/SPARK-9241> Improved
>>>    query planner for queries having distinct aggregations - Query plans
>>>    of distinct aggregations are more robust when distinct columns have high
>>>    cardinality.
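
For illustration, a minimal sketch of the new Dataset API against the 1.6
Scala API (the Person case class and the input path are invented for the
example):

    import sqlContext.implicits._   // assumes an existing SQLContext in scope

    case class Person(name: String, age: Long)

    // Read JSON into a DataFrame, then view it as a typed Dataset[Person].
    val people = sqlContext.read.json("/tmp/people.json").as[Person]

    // Lambdas work on typed objects, while execution still operates on
    // Tungsten's serialized binary rows via encoders and code generation.
    val names = people.filter(_.age > 21).map(_.name).collect()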
>>>
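
Likewise, a sketch of the run-SQL-directly-on-files syntax (the file path is
invented):

    // Query a Parquet file without registering a temporary table first; the
    // data source name qualifies the backtick-quoted path.
    val df = sqlContext.sql("SELECT * FROM parquet.`/tmp/events.parquet`")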
>>> Spark Streaming
>>>
>>>    - API Updates
>>>       - SPARK-2629  <https://issues.apache.org/jira/browse/SPARK-2629> New
>>>       improved state management - mapWithState - a DStream
>>>       transformation for stateful stream processing that supersedes
>>>       updateStateByKey in functionality and performance. See the sketch
>>>       after this section.
>>>       - SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198> Kinesis
>>>       record deaggregation - Kinesis streams have been upgraded to use
>>>       KCL 1.4.0 and support transparent deaggregation of KPL-aggregated records.
>>>       - SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891> Kinesis
>>>       message handler function - Allows an arbitrary function to be
>>>       applied to each Kinesis record in the Kinesis receiver, to customize
>>>       what data is stored in memory.
>>>       - SPARK-6328  <https://issues.apache.org/jira/browse/SPARK-6328> Python
>>>       Streaming Listener API - Get streaming statistics (scheduling
>>>       delays, batch processing times, etc.) from Python.
>>>
>>>
>>>    - UI Improvements
>>>       - Made failures visible in the streaming tab, in the timelines,
>>>       batch list, and batch details page.
>>>       - Made output operations visible in the streaming tab as progress
>>>       bars.
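
As referenced in the mapWithState item above, a minimal sketch of the new
transformation (the input DStream, state type, and function name are
invented; this assumes the 1.6 StateSpec API):

    import org.apache.spark.streaming.{State, StateSpec}

    // `words` is assumed to be a DStream[(String, Int)] of (word, 1) pairs.
    // Keep a running count per word and emit "word -> count" strings.
    def trackCount(word: String, one: Option[Int], state: State[Int]): String = {
      val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
      state.update(sum)
      s"$word -> $sum"
    }

    val counts = words.mapWithState(StateSpec.function(trackCount _))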
>>>
>>> MLlib
>>>
>>> New algorithms/models
>>>
>>>    - SPARK-8518  <https://issues.apache.org/jira/browse/SPARK-8518> Survival
>>>    analysis - Log-linear model for survival analysis
>>>    - SPARK-9834  <https://issues.apache.org/jira/browse/SPARK-9834> Normal
>>>    equation for least squares - Normal equation solver, providing
>>>    R-like model summary statistics
>>>    - SPARK-3147  <https://issues.apache.org/jira/browse/SPARK-3147> Online
>>>    hypothesis testing - A/B testing in the Spark Streaming framework
>>>    - SPARK-9930  <https://issues.apache.org/jira/browse/SPARK-9930> New
>>>    feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
>>>    transformer
>>>    - SPARK-6517  <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting
>>>    K-Means clustering - Fast top-down clustering variant of K-Means. See
>>>    the sketch after this list.
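
As referenced in the bisecting K-Means item above, a sketch of the new
spark.mllib entry point (the input RDD and the k value are invented):

    import org.apache.spark.mllib.clustering.BisectingKMeans
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // `data` is assumed to be an RDD[Vector] of feature vectors.
    def cluster(data: RDD[Vector]) = {
      val model = new BisectingKMeans().setK(4).run(data)
      model.clusterCenters.foreach(println)   // one center per leaf cluster
      model
    }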
>>>
>>> API improvements
>>>
>>>    - ML Pipelines
>>>       - SPARK-6725  <https://issues.apache.org/jira/browse/SPARK-6725> Pipeline
>>>       persistence - Save/load for ML Pipelines, with partial coverage
>>>       of spark.ml algorithms. See the sketch after this section.
>>>       - SPARK-5565  <https://issues.apache.org/jira/browse/SPARK-5565> LDA
>>>       in ML Pipelines - API for Latent Dirichlet Allocation in ML
>>>       Pipelines
>>>    - R API
>>>       - SPARK-9836  <https://issues.apache.org/jira/browse/SPARK-9836> R-like
>>>       statistics for GLMs - (Partial) R-like stats for ordinary least
>>>       squares via summary(model)
>>>       - SPARK-9681  <https://issues.apache.org/jira/browse/SPARK-9681> Feature
>>>       interactions in R formula - Interaction operator ":" in R formula
>>>    - Python API - Many improvements to Python API to approach feature
>>>    parity
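
As referenced in the pipeline-persistence item above, save/load is expected
to look roughly like this (the path is invented, and `pipeline` is assumed to
be an already-constructed org.apache.spark.ml.Pipeline):

    import org.apache.spark.ml.Pipeline

    // Persist the pipeline definition, then load it back elsewhere.
    pipeline.write.overwrite().save("/tmp/my-pipeline")
    val restored = Pipeline.load("/tmp/my-pipeline")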
>>>
>>> Misc improvements
>>>
>>>    - SPARK-7685  <https://issues.apache.org/jira/browse/SPARK-7685>,
>>>    SPARK-9642  <https://issues.apache.org/jira/browse/SPARK-9642> Instance
>>>    weights for GLMs - Logistic and Linear Regression can take instance
>>>    weights
>>>    - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
>>>    SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate
>>>    and bivariate statistics in DataFrames - Variance, stddev,
>>>    correlations, etc.
>>>    - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM
>>>    data source - LIBSVM as a SQL data source. See the sketch after this
>>>    section.
>>>
>>> Documentation improvements
>>>
>>>    - SPARK-7751  <https://issues.apache.org/jira/browse/SPARK-7751> @since
>>>    versions - Documentation includes initial version when classes and
>>>    methods were added
>>>    - SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable
>>>    example code - Automated testing for code in user guide examples
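
As referenced in the LIBSVM item above, the data source plugs into the
generic DataFrame reader, roughly (the path is invented):

    // Yields a DataFrame with "label" and "features" columns.
    val training = sqlContext.read.format("libsvm")
      .load("/tmp/sample_libsvm_data.txt")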
>>>
>>> Deprecations
>>>
>>>    - In spark.mllib.clustering.KMeans, the "runs" parameter has been
>>>    deprecated.
>>>    - In spark.ml.classification.LogisticRegressionModel and
>>>    spark.ml.regression.LinearRegressionModel, the "weights" field has been
>>>    deprecated in favor of the new name "coefficients." This helps
>>>    disambiguate it from the instance (row) weights given to algorithms. See
>>>    the sketch after this list.
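
As referenced in the deprecation item above, migrating off the old field is a
one-line change (`lrModel` is assumed to be a fitted spark.ml
LogisticRegressionModel):

    val w = lrModel.coefficients   // preferred in 1.6; lrModel.weights is deprecated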
>>>
>>> Changes of behavior
>>>
>>>    - spark.mllib.tree.GradientBoostedTrees validationTol has changed
>>>    semantics in 1.6. Previously, it was a threshold for absolute change in
>>>    error. Now, it resembles the behavior of GradientDescent convergenceTol:
>>>    For large errors, it uses relative error (relative to the previous error);
>>>    for small errors (< 0.01), it uses absolute error.
>>>    - spark.ml.feature.RegexTokenizer: Previously, it did not convert
>>>    strings to lowercase before tokenizing. Now, it converts to lowercase by
>>>    default, with an option not to. This matches the behavior of the simpler
>>>    Tokenizer transformer. See the first sketch after this list.
>>>    - Spark SQL's partition discovery has been changed to only discover
>>>    partition directories that are children of the given path. (i.e. if
>>>    path="/my/data/x=1" then x=1 will no longer be considered a
>>>    partition but only children of x=1.) This behavior can be overridden
>>>    by manually specifying the basePath that partitioning discovery
>>>    should start with (SPARK-11678
>>>    <https://issues.apache.org/jira/browse/SPARK-11678>). See the second
>>>    sketch after this list.
>>>    - When casting a value of an integral type to timestamp (e.g.
>>>    casting a long value to timestamp), the value is treated as being in
>>>    seconds instead of milliseconds (SPARK-11724
>>>    <https://issues.apache.org/jira/browse/SPARK-11724>). See the third
>>>    sketch after this list.
>>>    - With the improved query planner for queries having distinct
>>>    aggregations (SPARK-9241
>>>    <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
>>>    query having a single distinct aggregation has been changed to a more
>>>    robust version. To switch back to the plan generated by Spark 1.5's
>>>    planner, please set spark.sql.specializeSingleDistinctAggPlanning to
>>>    true (SPARK-12077 <https://issues.apache.org/jira/browse/SPARK-12077>).
>>>    See the last sketch after this list.
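
As referenced in the RegexTokenizer item above, the pre-1.6 case-preserving
behavior can be restored through the new toggle (column names are invented):

    import org.apache.spark.ml.feature.RegexTokenizer

    val tokenizer = new RegexTokenizer()
      .setInputCol("text")
      .setOutputCol("words")
      .setToLowercase(false)   // opt out of the new lowercasing default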
>>>
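
As referenced in the partition-discovery item above, a sketch of overriding
discovery with basePath (paths follow the example in the note):

    // With basePath set to the parent directory, x=1 is treated as a
    // partition column again even though the load path points below it.
    val df = sqlContext.read
      .option("basePath", "/my/data")
      .parquet("/my/data/x=1")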
>>>
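
As referenced in the timestamp item above, the cast now counts seconds:

    // 1450137600 seconds since the epoch is 2015-12-15 00:00:00 UTC; before
    // 1.6 the same literal would have been interpreted as milliseconds.
    sqlContext.sql("SELECT CAST(1450137600 AS TIMESTAMP)").show()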
>>>
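
As referenced in the planner item above, the fallback is an ordinary SQL conf:

    // Revert to the Spark 1.5 plan for single distinct aggregations.
    sqlContext.setConf("spark.sql.specializeSingleDistinctAggPlanning", "true")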
>>
>

Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

Posted by Mark Hamstra <ma...@clearstorydata.com>.
I'm afraid you're correct, Krishna:

core/src/main/scala/org/apache/spark/package.scala:  val SPARK_VERSION =
"1.6.0-SNAPSHOT"
docs/_config.yml:SPARK_VERSION: 1.6.0-SNAPSHOT

On Mon, Dec 14, 2015 at 6:51 PM, Krishna Sankar <ks...@gmail.com> wrote:

> Guys,
>    The sc.version gives 1.6.0-SNAPSHOT. Need to change to 1.6.0. Can you
> please verify?
> Cheers
> <k/>
>
>

Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

Posted by Krishna Sankar <ks...@gmail.com>.
Guys,
   The sc.version gives 1.6.0-SNAPSHOT. Need to change to 1.6.0. Can you
please verify?
Cheers
<k/>
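
For anyone reproducing the check, the version string is visible directly in
the REPL (the output shown is the pre-fix snapshot string):

    scala> sc.version
    res0: String = 1.6.0-SNAPSHOT   // expected to read "1.6.0" for the release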

>

Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

Posted by Kousuke Saruta <sa...@oss.nttdata.co.jp>.
+1 (non-binding)

Tested some workloads using the basic API and the DataFrame API on my
4-node YARN cluster (1 master and 3 slaves).
I also tested the Web UI.

(I'm resending this mail just in case because it seems that I failed to 
send the mail to dev@)


Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

Posted by Michael Armbrust <mi...@databricks.com>.
Here is a fixed version of the docs for 1.6:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-docsfixed-docs

There might still be some minor rendering issues on the ML page, but people
are investigating.

On Sat, Dec 12, 2015 at 6:58 PM, Burak Yavuz <br...@gmail.com> wrote:

> +1 - tested Spark SQL and Streaming on some production-sized workloads
>
> On Sat, Dec 12, 2015 at 4:16 PM, Mark Hamstra <ma...@clearstorydata.com>
> wrote:
>
>> +1
>>
>> On Sat, Dec 12, 2015 at 9:39 AM, Michael Armbrust <michael@databricks.com
>> > wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 1.6.0!
>>>
>>> The vote is open until Tuesday, December 15, 2015 at 6:00 UTC and
>>> passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 1.6.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is *v1.6.0-rc2
>>> (23f8dfd45187cb8f2216328ab907ddb5fbdffd0b)
>>> <https://github.com/apache/spark/tree/v1.6.0-rc2>*
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1169/
>>>
>>> The test repository (versioned as v1.6.0-rc2) for this release can be
>>> found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1168/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-docs/
>>>
>>> =======================================
>>> == How can I help test this release? ==
>>> =======================================
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> ================================================
>>> == What justifies a -1 vote for this release? ==
>>> ================================================
>>> This vote is happening towards the end of the 1.6 QA period, so -1 votes
>>> should only occur for significant regressions from 1.5. Bugs already
>>> present in 1.5, minor regressions, or bugs related to new features will not
>>> block this release.
>>>
>>> ===============================================================
>>> == What should happen to JIRA tickets still targeting 1.6.0? ==
>>> ===============================================================
>>> 1. It is OK for documentation patches to target 1.6.0 and still go into
>>> branch-1.6, since documentations will be published separately from the
>>> release.
>>> 2. New features for non-alpha-modules should target 1.7+.
>>> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the
>>> target version.
>>>
>>>
>>> ==================================================
>>> == Major changes to help you focus your testing ==
>>> ==================================================
>>>
>>> Spark 1.6.0 Preview
>>>
>>> Notable changes since 1.6 RC1
>>>
>>> Spark Streaming
>>>
>>>    - SPARK-2629  <https://issues.apache.org/jira/browse/SPARK-2629>
>>>    trackStateByKey has been renamed to mapWithState
>>>
>>> Spark SQL
>>>
>>>    - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
>>>    SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix
>>>    bugs in eviction of storage memory by execution.
>>>    - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258> correct
>>>    passing null into ScalaUDF
>>>
>>> Notable Features Since 1.5
>>>
>>> Spark SQL
>>>
>>>    - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> Parquet
>>>    Performance - Improve Parquet scan performance when using flat
>>>    schemas.
>>>    - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
>>>    Session Management - Isolated default database (i.e. USE mydb) even
>>>    on shared clusters.
>>>    - SPARK-9999  <https://issues.apache.org/jira/browse/SPARK-9999> Dataset
>>>    API - A type-safe API (similar to RDDs) that performs many
>>>    operations directly on serialized binary data and uses code generation
>>>    (i.e. Project Tungsten).
>>>    - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000> Unified
>>>    Memory Management - Shared memory for execution and caching instead
>>>    of exclusive division of the regions.
>>>    - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
>>>    Queries on Files - Concise syntax for running SQL queries over files
>>>    of any supported format without registering a table.
>>>    - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745> Reading
>>>    non-standard JSON files - Added options to read non-standard JSON
>>>    files (e.g. single quotes, unquoted attributes).
>>>    - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412> Per-operator
>>>    Metrics for SQL Execution - Display statistics on a per-operator
>>>    basis for memory usage and spilled data size.
>>>    - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
>>>    (*) expansion for StructTypes - Makes it easier to nest and unnest
>>>    arbitrary numbers of columns.
>>>    - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
>>>    SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149> In-memory
>>>    Columnar Cache Performance - Significant (up to 14x) speed up when
>>>    caching data that contains complex types in DataFrames or SQL.
>>>    - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
>>>    null-safe joins - Joins using null-safe equality (<=>) will now
>>>    execute using SortMergeJoin instead of computing a cartesian product.
>>>    - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
>>>    Execution Using Off-Heap Memory - Support for configuring query
>>>    execution to occur using off-heap memory to avoid GC overhead.
>>>    - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978> Datasource
>>>    API Avoid Double Filter - When implementing a data source with filter
>>>    pushdown, developers can now tell Spark SQL to avoid double evaluating a
>>>    pushed-down filter.
>>>    - SPARK-4849  <https://issues.apache.org/jira/browse/SPARK-4849> Advanced
>>>    Layout of Cached Data - Stores partitioning and ordering schemes in
>>>    the in-memory table scan, and adds distributeBy and localSort to the
>>>    DataFrame API.
>>>    - SPARK-9858  <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive
>>>    query execution - Initial support for automatically selecting the
>>>    number of reducers for joins and aggregations.
>>>    - SPARK-9241  <https://issues.apache.org/jira/browse/SPARK-9241> Improved
>>>    query planner for queries having distinct aggregations - Query plans
>>>    of distinct aggregations are more robust when distinct columns have high
>>>    cardinality.
>>>
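>>> To make a few of these features concrete, here is a minimal sketch
>>> covering SPARK-9999, SPARK-11197, and SPARK-11745. It assumes an
>>> existing SQLContext named sqlContext and a hypothetical file
>>> people.json:
>>>
>>>     case class Person(name: String, age: Long)
>>>     import sqlContext.implicits._
>>>
>>>     // SPARK-9999: a type-safe Dataset derived from a DataFrame
>>>     val people = sqlContext.read.json("people.json").as[Person]
>>>     val adults = people.filter(_.age >= 18)
>>>
>>>     // SPARK-11197: run SQL over a file without registering a table
>>>     val df = sqlContext.sql("SELECT * FROM json.`people.json`")
>>>
>>>     // SPARK-11745: reader options for non-standard JSON input
>>>     val relaxed = sqlContext.read
>>>       .option("allowSingleQuotes", "true")
>>>       .option("allowUnquotedFieldNames", "true")
>>>       .json("people.json")
>>>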
>>> Spark Streaming
>>>
>>>    - API Updates
>>>       - SPARK-2629  <https://issues.apache.org/jira/browse/SPARK-2629> New
>>>       improved state management - mapWithState, a DStream
>>>       transformation for stateful stream processing that supersedes
>>>       updateStateByKey in both functionality and performance (see the
>>>       sketch after this list).
>>>       - SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198> Kinesis
>>>       record deaggregation - The Kinesis receiver has been upgraded to
>>>       KCL 1.4.0 and now supports transparent deaggregation of
>>>       KPL-aggregated records.
>>>       - SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891> Kinesis
>>>       message handler function - Allows an arbitrary function to be
>>>       applied to each Kinesis record in the receiver, to customize what
>>>       data is stored in memory.
>>>       - SPARK-6328  <https://issues.apache.org/jira/browse/SPARK-6328> Python
>>>       Streaming Listener API - Get streaming statistics (scheduling
>>>       delays, batch processing times, etc.) in streaming.
>>>
>>>
>>>    - UI Improvements
>>>       - Made failures visible in the streaming tab, in the timelines,
>>>       batch list, and batch details page.
>>>       - Made output operations visible in the streaming tab as progress
>>>       bars.
>>>
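>>> A minimal mapWithState sketch, assuming an existing
>>> DStream[(String, Int)] named wordCounts (a placeholder); the running
>>> total per key lives in State[Int]:
>>>
>>>     import org.apache.spark.streaming.{State, StateSpec}
>>>
>>>     // Mapping function: fold each new count into the keyed state
>>>     val spec = StateSpec.function(
>>>       (word: String, count: Option[Int], state: State[Int]) => {
>>>         val sum = count.getOrElse(0) + state.getOption.getOrElse(0)
>>>         state.update(sum)
>>>         (word, sum)
>>>       })
>>>
>>>     val runningCounts = wordCounts.mapWithState(spec)
>>>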
>>> MLlib
>>>
>>> New algorithms/models
>>>
>>>    - SPARK-8518  <https://issues.apache.org/jira/browse/SPARK-8518> Survival
>>>    analysis - Log-linear model for survival analysis
>>>    - SPARK-9834  <https://issues.apache.org/jira/browse/SPARK-9834> Normal
>>>    equation for least squares - Normal equation solver, providing
>>>    R-like model summary statistics
>>>    - SPARK-3147  <https://issues.apache.org/jira/browse/SPARK-3147> Online
>>>    hypothesis testing - A/B testing in the Spark Streaming framework
>>>    - SPARK-9930  <https://issues.apache.org/jira/browse/SPARK-9930> New
>>>    feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
>>>    transformer
>>>    - SPARK-6517  <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting
>>>    K-Means clustering - Fast top-down clustering variant of K-Means (see
>>>    the sketch after this list).
>>>
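>>> For example, bisecting k-means takes only a few lines. A sketch,
>>> assuming an existing RDD[Vector] named data (a placeholder):
>>>
>>>     import org.apache.spark.mllib.clustering.BisectingKMeans
>>>
>>>     // Top-down: repeatedly split clusters until k leaves remain
>>>     val model = new BisectingKMeans().setK(4).run(data)
>>>     model.clusterCenters.foreach(println)
>>>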
>>> API improvements
>>>
>>>    - ML Pipelines
>>>       - SPARK-6725  <https://issues.apache.org/jira/browse/SPARK-6725> Pipeline
>>>       persistence - Save/load for ML Pipelines, with partial coverage
>>>       of spark.ml algorithms (see the sketch after this list).
>>>       - SPARK-5565  <https://issues.apache.org/jira/browse/SPARK-5565> LDA
>>>       in ML Pipelines - API for Latent Dirichlet Allocation in ML
>>>       Pipelines
>>>    - R API
>>>       - SPARK-9836  <https://issues.apache.org/jira/browse/SPARK-9836> R-like
>>>       statistics for GLMs - (Partial) R-like stats for ordinary least
>>>       squares via summary(model)
>>>       - SPARK-9681  <https://issues.apache.org/jira/browse/SPARK-9681> Feature
>>>       interactions in R formula - Interaction operator ":" in R formula
>>>    - Python API - Many improvements to the Python API to approach
>>>    feature parity.
>>>
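>>> Pipeline persistence reduces to save/load. A sketch, where fittedModel
>>> stands in for a PipelineModel produced by pipeline.fit(training):
>>>
>>>     import org.apache.spark.ml.PipelineModel
>>>
>>>     // Persist the fitted pipeline, then restore it elsewhere
>>>     fittedModel.save("/tmp/spark-pipeline-model")
>>>     val restored = PipelineModel.load("/tmp/spark-pipeline-model")
>>>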
>>> Misc improvements
>>>
>>>    - SPARK-7685  <https://issues.apache.org/jira/browse/SPARK-7685>,
>>>    SPARK-9642  <https://issues.apache.org/jira/browse/SPARK-9642> Instance
>>>    weights for GLMs - Logistic and Linear Regression can take instance
>>>    weights
>>>    - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
>>>    SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate
>>>    and bivariate statistics in DataFrames - Variance, stddev,
>>>    correlations, etc. (see the sketch after this list).
>>>    - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM
>>>    data source - LIBSVM as a SQL data source.
>>>
>>> Documentation improvements
>>>
>>>    - SPARK-7751  <https://issues.apache.org/jira/browse/SPARK-7751> @since
>>>    versions - Documentation includes initial version when classes and
>>>    methods were added
>>>    - SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable
>>>    example code - Automated testing for code in user guide examples
>>>
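>>> The DataFrame statistics and the LIBSVM source are both one-liners. A
>>> sketch, assuming a DataFrame df with numeric columns "x" and "y" and a
>>> hypothetical file data.libsvm:
>>>
>>>     import org.apache.spark.sql.functions.{variance, stddev, corr}
>>>
>>>     df.select(variance(df("x")), stddev(df("x")), corr(df("x"), df("y")))
>>>       .show()
>>>
>>>     // SPARK-10117: read LIBSVM files through the data source API
>>>     val libsvm = sqlContext.read.format("libsvm").load("data.libsvm")
>>>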
>>> Deprecations
>>>
>>>    - In spark.mllib.clustering.KMeans, the "runs" parameter has been
>>>    deprecated.
>>>    - In spark.ml.classification.LogisticRegressionModel and
>>>    spark.ml.regression.LinearRegressionModel, the "weights" field has been
>>>    deprecated in favor of the new name "coefficients" (see the sketch
>>>    after this list). This helps disambiguate it from the instance (row)
>>>    weights given to algorithms.
>>>
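>>> The rename is mechanical. A sketch, assuming a DataFrame training with
>>> "label" and "features" columns (a placeholder):
>>>
>>>     import org.apache.spark.ml.classification.LogisticRegression
>>>
>>>     val model = new LogisticRegression().fit(training)
>>>     println(model.coefficients)  // preferred; model.weights is deprecated
>>>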
>>> Changes of behavior
>>>
>>>    - spark.mllib.tree.GradientBoostedTrees validationTol has changed
>>>    semantics in 1.6. Previously, it was a threshold for absolute change in
>>>    error. Now, it resembles the behavior of GradientDescent convergenceTol:
>>>    For large errors, it uses relative error (relative to the previous error);
>>>    for small errors (< 0.01), it uses absolute error.
>>>    - spark.ml.feature.RegexTokenizer: Previously, it did not convert
>>>    strings to lowercase before tokenizing. Now, it converts to lowercase by
>>>    default, with an option not to. This matches the behavior of the simpler
>>>    Tokenizer transformer.
>>>    - Spark SQL's partition discovery now only discovers partition
>>>    directories that are children of the given path (i.e. if
>>>    path="/my/data/x=1", then x=1 will no longer be treated as a
>>>    partition column; only children of x=1 will be). This behavior can
>>>    be overridden by manually specifying the basePath that partition
>>>    discovery should start from (SPARK-11678
>>>    <https://issues.apache.org/jira/browse/SPARK-11678>; see the sketch
>>>    after this list).
>>>    - When casting a value of an integral type to timestamp (e.g.
>>>    casting a long value to timestamp), the value is treated as being in
>>>    seconds instead of milliseconds (SPARK-11724
>>>    <https://issues.apache.org/jira/browse/SPARK-11724>).
>>>    - With the improved query planner for queries having distinct
>>>    aggregations (SPARK-9241
>>>    <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
>>>    query having a single distinct aggregation has been changed to a more
>>>    robust version. To switch back to the plan generated by Spark 1.5's
>>>    planner, please set spark.sql.specializeSingleDistinctAggPlanning to
>>>    true (SPARK-12077 <https://issues.apache.org/jira/browse/SPARK-12077>
>>>    ).
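>>>
>>> Two of these behavior changes in a short sketch; the paths and the
>>> timestamp literal are hypothetical, and sqlContext is an assumed
>>> SQLContext:
>>>
>>>     // SPARK-11678: keep x=1 as a partition column by anchoring discovery
>>>     val df = sqlContext.read
>>>       .option("basePath", "/my/data")
>>>       .parquet("/my/data/x=1")
>>>
>>>     // SPARK-11724: integral-to-timestamp casts now read the value as
>>>     // seconds, so this yields 2015-12-15 00:00:00 UTC
>>>     sqlContext.sql("SELECT CAST(1450137600 AS TIMESTAMP)")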
>>>
>>>
>>>
>>
>

Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

Posted by Joseph Bradley <jo...@databricks.com>.
+1
Ran all tests locally on Mac OS X, and MLlib with large workloads on a
cluster.

On Sat, Dec 12, 2015 at 6:58 PM, Burak Yavuz <br...@gmail.com> wrote:

> +1 tested SparkSQL and Streaming on some production sized workloads
>
> On Sat, Dec 12, 2015 at 4:16 PM, Mark Hamstra <ma...@clearstorydata.com>
> wrote:
>
>> +1

Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

Posted by Burak Yavuz <br...@gmail.com>.
+1 tested SparkSQL and Streaming on some production sized workloads

On Sat, Dec 12, 2015 at 4:16 PM, Mark Hamstra <ma...@clearstorydata.com>
wrote:

> +1

Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

Posted by Mark Hamstra <ma...@clearstorydata.com>.
+1


Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

Posted by Andrew Or <an...@databricks.com>.
+1

Ran PageRank in standalone mode with 4 nodes and noticed a speedup after
the specific commits that were in RC2 but not RC1:

c247b6a Dec 10 [SPARK-12155][SPARK-12253] Fix executor OOM in unified
memory management
05e441e Dec 9 [SPARK-12165][SPARK-12189] Fix bugs in eviction of storage
memory by execution

Also jobs that triggered these issues now run successfully.


2015-12-14 10:45 GMT-08:00 Reynold Xin <rx...@databricks.com>:

> +1
>
> Tested some dataframe operations on my Mac.
>
>
> On Saturday, December 12, 2015, Michael Armbrust <mi...@databricks.com>
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.6.0!
>>
>> The vote is open until Tuesday, December 15, 2015 at 6:00 UTC and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.6.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is *v1.6.0-rc2
>> (23f8dfd45187cb8f2216328ab907ddb5fbdffd0b)
>> <https://github.com/apache/spark/tree/v1.6.0-rc2>*
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1169/
>>
>> The test repository (versioned as v1.6.0-rc2) for this release can be
>> found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1168/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-docs/
>>
>> =======================================
>> == How can I help test this release? ==
>> =======================================
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> ================================================
>> == What justifies a -1 vote for this release? ==
>> ================================================
>> This vote is happening towards the end of the 1.6 QA period, so -1 votes
>> should only occur for significant regressions from 1.5. Bugs already
>> present in 1.5, minor regressions, or bugs related to new features will not
>> block this release.
>>
>> ===============================================================
>> == What should happen to JIRA tickets still targeting 1.6.0? ==
>> ===============================================================
>> 1. It is OK for documentation patches to target 1.6.0 and still go into
>> branch-1.6, since documentations will be published separately from the
>> release.
>> 2. New features for non-alpha-modules should target 1.7+.
>> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
>> version.
>>
>>
>> ==================================================
>> == Major changes to help you focus your testing ==
>> ==================================================
>>
>> Spark 1.6.0 PreviewNotable changes since 1.6 RC1Spark Streaming
>>
>>    - SPARK-2629  <https://issues.apache.org/jira/browse/SPARK-2629>
>>    trackStateByKey has been renamed to mapWithState
>>
>> Spark SQL
>>
>>    - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
>>    SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix
>>    bugs in eviction of storage memory by execution.
>>    - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258> correct
>>    passing null into ScalaUDF
>>
>> Notable Features Since 1.5Spark SQL
>>
>>    - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> Parquet
>>    Performance - Improve Parquet scan performance when using flat
>>    schemas.
>>    - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
>>    Session Management - Isolated devault database (i.e USE mydb) even on
>>    shared clusters.
>>    - SPARK-9999  <https://issues.apache.org/jira/browse/SPARK-9999> Dataset
>>    API - A type-safe API (similar to RDDs) that performs many operations
>>    on serialized binary data and code generation (i.e. Project Tungsten).
>>    - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000> Unified
>>    Memory Management - Shared memory for execution and caching instead
>>    of exclusive division of the regions.
>>    - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
>>    Queries on Files - Concise syntax for running SQL queries over files
>>    of any supported format without registering a table.
>>    - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745> Reading
>>    non-standard JSON files - Added options to read non-standard JSON
>>    files (e.g. single-quotes, unquoted attributes)
>>    - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412> Per-operator
>>    Metrics for SQL Execution - Display statistics on a peroperator basis
>>    for memory usage and spilled data size.
>>    - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
>>    (*) expansion for StructTypes - Makes it easier to nest and unest
>>    arbitrary numbers of columns
>>    - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
>>    SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149> In-memory
>>    Columnar Cache Performance - Significant (up to 14x) speed up when
>>    caching data that contains complex types in DataFrames or SQL.
>>    - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
>>    null-safe joins - Joins using null-safe equality (<=>) will now
>>    execute using SortMergeJoin instead of computing a cartisian product.
>>    - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
>>    Execution Using Off-Heap Memory - Support for configuring query
>>    execution to occur using off-heap memory to avoid GC overhead
>>    - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978> Datasource
>>    API Avoid Double Filter - When implemeting a datasource with filter
>>    pushdown, developers can now tell Spark SQL to avoid double evaluating a
>>    pushed-down filter.
>>    - SPARK-4849  <https://issues.apache.org/jira/browse/SPARK-4849> Advanced
>>    Layout of Cached Data - storing partitioning and ordering schemes in
>>    In-memory table scan, and adding distributeBy and localSort to DF API
>>    - SPARK-9858  <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive
>>    query execution - Intial support for automatically selecting the
>>    number of reducers for joins and aggregations.
>>    - SPARK-9241  <https://issues.apache.org/jira/browse/SPARK-9241> Improved
>>    query planner for queries having distinct aggregations - Query plans
>>    of distinct aggregations are more robust when distinct columns have high
>>    cardinality.
>>
>> Spark Streaming
>>
>>    - API Updates
>>       - SPARK-2629  <https://issues.apache.org/jira/browse/SPARK-2629> New
>>       improved state management - mapWithState - a DStream
>>       transformation for stateful stream processing, supercedes
>>       updateStateByKey in functionality and performance.
>>       - SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198> Kinesis
>>       record deaggregation - Kinesis streams have been upgraded to use
>>       KCL 1.4.0 and supports transparent deaggregation of KPL-aggregated records.
>>       - SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891> Kinesis
>>       message handler function - Allows arbitraray function to be
>>       applied to a Kinesis record in the Kinesis receiver before to customize
>>       what data is to be stored in memory.
>>       - SPARK-6328  <https://issues.apache.org/jira/browse/SPARK-6328> Python
>>       Streamng Listener API - Get streaming statistics (scheduling
>>       delays, batch processing times, etc.) in streaming.
>>
>>
>>    - UI Improvements
>>       - Made failures visible in the streaming tab, in the timelines,
>>       batch list, and batch details page.
>>       - Made output operations visible in the streaming tab as progress
>>       bars.
>>
>> MLlibNew algorithms/models
>>
>>    - SPARK-8518  <https://issues.apache.org/jira/browse/SPARK-8518> Survival
>>    analysis - Log-linear model for survival analysis
>>    - SPARK-9834  <https://issues.apache.org/jira/browse/SPARK-9834> Normal
>>    equation for least squares - Normal equation solver, providing R-like
>>    model summary statistics
>>    - SPARK-3147  <https://issues.apache.org/jira/browse/SPARK-3147> Online
>>    hypothesis testing - A/B testing in the Spark Streaming framework
>>    - SPARK-9930  <https://issues.apache.org/jira/browse/SPARK-9930> New
>>    feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
>>    transformer
>>    - SPARK-6517  <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting
>>    K-Means clustering - Fast top-down clustering variant of K-Means (see
>>    the sketch after this list)
>>
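A minimal sketch of the bisecting K-Means item above, on toy data; an
existing SparkContext named sc is assumed:

    import org.apache.spark.mllib.clustering.BisectingKMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Two well-separated groups of 2-D points.
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))

    // Top-down: repeatedly bisect clusters until k leaf clusters remain.
    val model = new BisectingKMeans().setK(2).run(points)
    model.clusterCenters.foreach(println)
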
>> API improvements
>>
>>    - ML Pipelines
>>       - SPARK-6725  <https://issues.apache.org/jira/browse/SPARK-6725> Pipeline
>>       persistence - Save/load for ML Pipelines, with partial coverage of
>>       spark.ml algorithms (see the sketch after this list)
>>       - SPARK-5565  <https://issues.apache.org/jira/browse/SPARK-5565> LDA
>>       in ML Pipelines - API for Latent Dirichlet Allocation in ML
>>       Pipelines
>>    - R API
>>       - SPARK-9836  <https://issues.apache.org/jira/browse/SPARK-9836> R-like
>>       statistics for GLMs - (Partial) R-like stats for ordinary least
>>       squares via summary(model)
>>       - SPARK-9681  <https://issues.apache.org/jira/browse/SPARK-9681> Feature
>>       interactions in R formula - Interaction operator ":" in R formula
>>    - Python API - Many improvements to the Python API to approach
>>    feature parity
>>
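A minimal sketch of the pipeline-persistence item above. Stage coverage is
partial in 1.6; this sketch assumes Tokenizer, HashingTF, and
LogisticRegression are among the supported stages, and the save path is
illustrative:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // A small text-classification pipeline, saved and reloaded.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    pipeline.write.save("/tmp/lr-pipeline")  // persist the (unfit) pipeline
    val restored = Pipeline.load("/tmp/lr-pipeline")
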
>> Misc improvements
>>
>>    - SPARK-7685  <https://issues.apache.org/jira/browse/SPARK-7685>,
>>    SPARK-9642  <https://issues.apache.org/jira/browse/SPARK-9642> Instance
>>    weights for GLMs - Logistic and Linear Regression can take instance
>>    weights (this and the next two items are illustrated in the sketch
>>    after this list)
>>    - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
>>    SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate
>>    and bivariate statistics in DataFrames - Variance, stddev,
>>    correlations, etc.
>>    - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM
>>    data source - LIBSVM as a SQL data source
>>
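A minimal sketch touching the three misc items above; an existing
sqlContext is assumed, and the file path and the weight column name are
illustrative:

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.sql.functions.{stddev, variance}

    // LIBSVM as a SQL data source (SPARK-10117).
    val data = sqlContext.read.format("libsvm").load("data/sample_libsvm_data.txt")

    // Univariate statistics as DataFrame functions (SPARK-10384/10385).
    data.select(variance("label"), stddev("label")).show()

    // Instance weights (SPARK-7685, SPARK-9642): point the estimator at a
    // weight column; a "weight" column must exist in the training data.
    val lr = new LogisticRegression().setWeightCol("weight")
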

Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

Posted by Reynold Xin <rx...@databricks.com>.
+1

Tested some dataframe operations on my Mac.

On Saturday, December 12, 2015, Michael Armbrust <mi...@databricks.com>
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.6.0!
>
> The vote is open until Tuesday, December 15, 2015 at 6:00 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is *v1.6.0-rc2
> (23f8dfd45187cb8f2216328ab907ddb5fbdffd0b)
> <https://github.com/apache/spark/tree/v1.6.0-rc2>*
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1169/
>
> The test repository (versioned as v1.6.0-rc2) for this release can be
> found at:
> https://repository.apache.org/content/repositories/orgapachespark-1168/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-docs/
>
> =======================================
> == How can I help test this release? ==
> =======================================
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running it on this release candidate, then
> reporting any regressions.
>
> ================================================
> == What justifies a -1 vote for this release? ==
> ================================================
> This vote is happening towards the end of the 1.6 QA period, so -1 votes
> should only occur for significant regressions from 1.5. Bugs already
> present in 1.5, minor regressions, or bugs related to new features will not
> block this release.
>
> ===============================================================
> == What should happen to JIRA tickets still targeting 1.6.0? ==
> ===============================================================
> 1. It is OK for documentation patches to target 1.6.0 and still go into
> branch-1.6, since documentation will be published separately from the
> release.
> 2. New features for non-alpha-modules should target 1.7+.
> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
> version.
>
>
> ==================================================
> == Major changes to help you focus your testing ==
> ==================================================
>
> Spark 1.6.0 Preview
>
> Notable changes since 1.6 RC1
>
> Spark Streaming
>
>    - SPARK-2629  <https://issues.apache.org/jira/browse/SPARK-2629>
>    trackStateByKey has been renamed to mapWithState
>
> Spark SQL
>
>    - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
>    SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix
>    bugs in eviction of storage memory by execution.
>    - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258> correct
>    passing null into ScalaUDF
>
> Notable Features Since 1.5
>
> Spark SQL
>
>    - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> Parquet
>    Performance - Improve Parquet scan performance when using flat schemas.
>    - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
>    Session Management - Isolated default database (i.e. USE mydb) even on
>    shared clusters.
>    - SPARK-9999  <https://issues.apache.org/jira/browse/SPARK-9999> Dataset
>    API - A type-safe API (similar to RDDs) that performs many operations
>    on serialized binary data and uses code generation (i.e. Project
>    Tungsten); this and the two file-reading items below are illustrated
>    in the sketch after this list.
>    - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000> Unified
>    Memory Management - Shared memory for execution and caching instead of
>    exclusive division of the regions.
>    - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
>    Queries on Files - Concise syntax for running SQL queries over files
>    of any supported format without registering a table.
>    - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745> Reading
>    non-standard JSON files - Added options to read non-standard JSON
>    files (e.g. single quotes, unquoted attribute names)
>    - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412> Per-operator
>    Metrics for SQL Execution - Display statistics on a per-operator basis
>    for memory usage and spilled data size.
>    - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
>    (*) expansion for StructTypes - Makes it easier to nest and unnest
>    arbitrary numbers of columns
>    - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
>    SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149> In-memory
>    Columnar Cache Performance - Significant (up to 14x) speedup when
>    caching data that contains complex types in DataFrames or SQL.
>    - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
>    null-safe joins - Joins using null-safe equality (<=>) now execute
>    using SortMergeJoin instead of computing a cartesian product.
>    - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
>    Execution Using Off-Heap Memory - Support for configuring query
>    execution to use off-heap memory and avoid GC overhead.
>    - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978> Datasource
>    API Avoid Double Filter - When implementing a datasource with filter
>    pushdown, developers can now tell Spark SQL to avoid evaluating a
>    pushed-down filter twice.
>    - SPARK-4849  <https://issues.apache.org/jira/browse/SPARK-4849> Advanced
>    Layout of Cached Data - Store partitioning and ordering schemes in the
>    in-memory table scan, and add distributeBy and localSort to the
>    DataFrame API.
>    - SPARK-9858  <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive
>    query execution - Initial support for automatically selecting the
>    number of reducers for joins and aggregations.
>    - SPARK-9241  <https://issues.apache.org/jira/browse/SPARK-9241> Improved
>    query planner for queries having distinct aggregations - Query plans
>    of distinct aggregations are more robust when distinct columns have high
>    cardinality.
>
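A minimal Scala sketch of the Dataset API and the two file-reading items
above; the file paths, the Person case class, and the option values are
illustrative, and an existing sqlContext is assumed:

    import sqlContext.implicits._

    // Dataset API (SPARK-9999): a typed view over the same engine.
    case class Person(name: String, age: Long)
    val people = sqlContext.read.json("people.json").as[Person]
    val adults = people.filter(_.age >= 18)

    // SQL directly over files (SPARK-11197): no table registration needed.
    val events = sqlContext.sql(
      "SELECT * FROM parquet.`/path/to/events.parquet`")

    // Non-standard JSON (SPARK-11745): opt in to relaxed parsing.
    val relaxed = sqlContext.read
      .option("allowSingleQuotes", "true")
      .option("allowUnquotedFieldNames", "true")
      .json("messy.json")
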
> Spark Streaming
>
>    - API Updates
>       - SPARK-2629  <https://issues.apache.org/jira/browse/SPARK-2629> New
>       improved state management - mapWithState - a DStream transformation
>       for stateful stream processing that supersedes updateStateByKey in
>       functionality and performance.
>       - SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198> Kinesis
>       record deaggregation - Kinesis streams have been upgraded to use
>       KCL 1.4.0 and support transparent deaggregation of KPL-aggregated records.
>       - SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891> Kinesis
>       message handler function - Allows an arbitrary function to be applied
>       to a Kinesis record in the Kinesis receiver to customize what data
>       is stored in memory.
>       - SPARK-6328  <https://issues.apache.org/jira/browse/SPARK-6328> Python
>       Streaming Listener API - Get streaming statistics (scheduling
>       delays, batch processing times, etc.) in streaming.
>
>
>    - UI Improvements
>       - Made failures visible in the streaming tab, in the timelines,
>       batch list, and batch details page.
>       - Made output operations visible in the streaming tab as progress
>       bars.
>
> MLlib
>
> New algorithms/models
>
>    - SPARK-8518  <https://issues.apache.org/jira/browse/SPARK-8518> Survival
>    analysis - Log-linear model for survival analysis
>    - SPARK-9834  <https://issues.apache.org/jira/browse/SPARK-9834> Normal
>    equation for least squares - Normal equation solver, providing R-like
>    model summary statistics
>    - SPARK-3147  <https://issues.apache.org/jira/browse/SPARK-3147> Online
>    hypothesis testing - A/B testing in the Spark Streaming framework
>    - SPARK-9930  <https://issues.apache.org/jira/browse/SPARK-9930> New
>    feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
>    transformer
>    - SPARK-6517  <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting
>    K-Means clustering - Fast top-down clustering variant of K-Means
>
> API improvements
>
>    - ML Pipelines
>       - SPARK-6725  <https://issues.apache.org/jira/browse/SPARK-6725> Pipeline
>       persistence - Save/load for ML Pipelines, with partial coverage of
>       spark.ml algorithms
>       - SPARK-5565  <https://issues.apache.org/jira/browse/SPARK-5565> LDA
>       in ML Pipelines - API for Latent Dirichlet Allocation in ML
>       Pipelines
>    - R API
>       - SPARK-9836  <https://issues.apache.org/jira/browse/SPARK-9836> R-like
>       statistics for GLMs - (Partial) R-like stats for ordinary least
>       squares via summary(model)
>       - SPARK-9681  <https://issues.apache.org/jira/browse/SPARK-9681> Feature
>       interactions in R formula - Interaction operator ":" in R formula
>    - Python API - Many improvements to the Python API to approach
>    feature parity
>
> Misc improvements
>
>    - SPARK-7685  <https://issues.apache.org/jira/browse/SPARK-7685>,
>    SPARK-9642  <https://issues.apache.org/jira/browse/SPARK-9642> Instance
>    weights for GLMs - Logistic and Linear Regression can take instance
>    weights
>    - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
>    SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate
>    and bivariate statistics in DataFrames - Variance, stddev,
>    correlations, etc.
>    - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM
>    data source - LIBSVM as a SQL data source
>
> Documentation improvements
>
>