Posted to dev@spark.apache.org by Michael Armbrust <mi...@databricks.com> on 2015/12/16 22:32:14 UTC
[VOTE] Release Apache Spark 1.6.0 (RC3)
Please vote on releasing the following candidate as Apache Spark version
1.6.0!
The vote is open until Saturday, December 19, 2015 at 18:00 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.
[ ] +1 Release this package as Apache Spark 1.6.0
[ ] -1 Do not release this package because ...
To learn more about Apache Spark, please see http://spark.apache.org/
The tag to be voted on is *v1.6.0-rc3
(168c89e07c51fa24b0bb88582c739cec0acb44d7)
<https://github.com/apache/spark/tree/v1.6.0-rc3>*
The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc
The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1174/
The test repository (versioned as v1.6.0-rc3) for this release can be found
at:
https://repository.apache.org/content/repositories/orgapachespark-1173/
The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
=======================================
== How can I help test this release? ==
=======================================
If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running it on this release candidate, then
reporting any regressions.
================================================
== What justifies a -1 vote for this release? ==
================================================
This vote is happening towards the end of the 1.6 QA period, so -1 votes
should only occur for significant regressions from 1.5. Bugs already
present in 1.5, minor regressions, or bugs related to new features will not
block this release.
===============================================================
== What should happen to JIRA tickets still targeting 1.6.0? ==
===============================================================
1. It is OK for documentation patches to target 1.6.0 and still go into
branch-1.6, since documentation will be published separately from the
release.
2. New features for non-alpha-modules should target 1.7+.
3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
version.
==================================================
== Major changes to help you focus your testing ==
==================================================
Notable changes since 1.6 RC2
- SPARK_VERSION has been set correctly
- SPARK-12199 ML Docs are publishing correctly
- SPARK-12345 Mesos cluster mode has been fixed
Notable changes since 1.6 RC1
Spark Streaming
- SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
trackStateByKey has been renamed to mapWithState
Spark SQL
- SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix bugs
in eviction of storage memory by execution.
- SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258> Correctly
pass null into ScalaUDF
Notable Features Since 1.5
Spark SQL
- SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> Parquet
Performance - Improve Parquet scan performance when using flat schemas.
- SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
Session Management - Isolated default database (i.e. USE mydb) even on
shared clusters.
- SPARK-9999 <https://issues.apache.org/jira/browse/SPARK-9999> Dataset
API - A type-safe API (similar to RDDs) that performs many operations
directly on serialized binary data and uses code generation (i.e. Project Tungsten).
- SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000> Unified
Memory Management - Shared memory for execution and caching instead of
exclusive division of the regions.
- SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
Queries on Files - Concise syntax for running SQL queries over files of
any supported format without registering a table.
- SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745> Reading
non-standard JSON files - Added options to read non-standard JSON files
(e.g. single-quotes, unquoted attributes)
- SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412>
Per-operator
Metrics for SQL Execution - Display statistics on a per-operator basis
for memory usage and spilled data size.
- SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
(*) expansion for StructTypes - Makes it easier to nest and unnest
arbitrary numbers of columns.
- SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149> In-memory
Columnar Cache Performance - Significant (up to 14x) speed up when
caching data that contains complex types in DataFrames or SQL.
- SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
null-safe joins - Joins using null-safe equality (<=>) will now execute
using SortMergeJoin instead of computing a cartesian product.
- SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
Execution Using Off-Heap Memory - Support for configuring query
execution to occur using off-heap memory to avoid GC overhead
- SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978> Datasource
API Avoid Double Filter - When implementing a datasource with filter
pushdown, developers can now tell Spark SQL to avoid double evaluating a
pushed-down filter.
- SPARK-4849 <https://issues.apache.org/jira/browse/SPARK-4849> Advanced
Layout of Cached Data - Store partitioning and ordering schemes for
in-memory table scans, and add distributeBy and localSort to the DataFrame API.
- SPARK-9858 <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive
query execution - Initial support for automatically selecting the number
of reducers for joins and aggregations.
- SPARK-9241 <https://issues.apache.org/jira/browse/SPARK-9241> Improved
query planner for queries having distinct aggregations - Query plans of
distinct aggregations are more robust when distinct columns have high
cardinality.
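The null-safe join change (SPARK-11111) is easiest to see from the semantics of the `<=>` operator itself. The sketch below models those semantics in plain Python (using None for NULL); the helper name is illustrative, not a Spark API. Because `<=>` always yields a plain boolean, never NULL, the join key can be compared like any other value and planned as a SortMergeJoin rather than a cartesian product.

```python
# Plain-Python model of SQL's null-safe equality (<=>), with None standing
# in for NULL. Unlike ordinary SQL equality, <=> never returns NULL:
# two NULLs compare equal, and NULL never equals a non-NULL value.

def null_safe_eq(a, b):
    """Return True iff a <=> b under SQL null-safe semantics."""
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b

print(null_safe_eq(None, None))  # True
print(null_safe_eq(None, 1))     # False
print(null_safe_eq(1, 1))        # True
```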
Spark Streaming
- API Updates
- SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629> New
improved state management - mapWithState, a DStream transformation
for stateful stream processing that supersedes updateStateByKey in
functionality and performance.
- SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198> Kinesis
record deaggregation - Kinesis streams have been upgraded to use KCL
1.4.0 and support transparent deaggregation of KPL-aggregated records.
- SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891> Kinesis
message handler function - Allows an arbitrary function to be applied
to each Kinesis record in the Kinesis receiver to customize what data
is stored in memory.
- SPARK-6328 <https://issues.apache.org/jira/browse/SPARK-6328> Python
Streaming Listener API - Get streaming statistics (scheduling delays,
batch processing times, etc.) in streaming.
- UI Improvements
- Made failures visible in the streaming tab, in the timelines, batch
list, and batch details page.
- Made output operations visible in the streaming tab as progress
bars.
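The idea behind mapWithState (SPARK-2629, above) can be sketched without Spark: for each batch of (key, value) events, an update function combines the incoming value with the key's existing state and emits a mapped record. The function and driver names below are illustrative only and do not match Spark's actual DStream API.

```python
# Toy model of the per-key stateful update behind mapWithState.
# update_fn(key, value, state) returns (new state, record to emit).

def update_word_count(key, value, state):
    new_state = state + value            # running count per key
    return new_state, (key, new_state)

def map_with_state(batches, update_fn):
    state = {}      # per-key state carried across batches
    emitted = []    # the mapped output stream
    for batch in batches:
        for key, value in batch:
            state[key], record = update_fn(key, value, state.get(key, 0))
            emitted.append(record)
    return state, emitted

state, out = map_with_state([[("a", 1), ("b", 1)], [("a", 2)]], update_word_count)
print(state)  # {'a': 3, 'b': 1}
print(out)    # [('a', 1), ('b', 1), ('a', 3)]
```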
MLlib
New algorithms/models
- SPARK-8518 <https://issues.apache.org/jira/browse/SPARK-8518> Survival
analysis - Log-linear model for survival analysis
- SPARK-9834 <https://issues.apache.org/jira/browse/SPARK-9834> Normal
equation for least squares - Normal equation solver, providing R-like
model summary statistics
- SPARK-3147 <https://issues.apache.org/jira/browse/SPARK-3147> Online
hypothesis testing - A/B testing in the Spark Streaming framework
- SPARK-9930 <https://issues.apache.org/jira/browse/SPARK-9930> New
feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
transformer
- SPARK-6517 <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting
K-Means clustering - Fast top-down clustering variant of K-Means
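The normal-equation solver (SPARK-9834, above) replaces iterative gradient descent with a direct solve of X'X beta = X'y. The snippet below shows that idea for the simplest case, a single feature with an intercept, where the 2x2 system has a closed form; it is a didactic sketch, not MLlib's implementation, which handles many features and adds regularization.

```python
# Fit y = intercept + slope * x by solving the 2x2 normal equations
# [[n, sx], [sx, sxx]] @ [intercept, slope] = [sy, sxy] directly.

def fit_normal_equation(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx          # assumes xs are not all identical
    intercept = (sy * sxx - sx * sxy) / det
    slope = (n * sxy - sx * sy) / det
    return intercept, slope

print(fit_normal_equation([0, 1, 2, 3], [1, 3, 5, 7]))  # (1.0, 2.0)
```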
API improvements
- ML Pipelines
- SPARK-6725 <https://issues.apache.org/jira/browse/SPARK-6725> Pipeline
persistence - Save/load for ML Pipelines, with partial coverage of
spark.ml algorithms
- SPARK-5565 <https://issues.apache.org/jira/browse/SPARK-5565> LDA
in ML Pipelines - API for Latent Dirichlet Allocation in ML Pipelines
- R API
- SPARK-9836 <https://issues.apache.org/jira/browse/SPARK-9836> R-like
statistics for GLMs - (Partial) R-like stats for ordinary least
squares via summary(model)
- SPARK-9681 <https://issues.apache.org/jira/browse/SPARK-9681> Feature
interactions in R formula - Interaction operator ":" in R formula
- Python API - Many improvements to the Python API to approach feature parity
Misc improvements
- SPARK-7685 <https://issues.apache.org/jira/browse/SPARK-7685>,
SPARK-9642 <https://issues.apache.org/jira/browse/SPARK-9642> Instance
weights for GLMs - Logistic and Linear Regression can take instance
weights
- SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate
and bivariate statistics in DataFrames - Variance, stddev, correlations,
etc.
- SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM
data source - LIBSVM as a SQL data source
Documentation improvements
- SPARK-7751 <https://issues.apache.org/jira/browse/SPARK-7751> @since
versions - Documentation includes initial version when classes and
methods were added
- SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable
example code - Automated testing for code in user guide examples
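For reference, the univariate statistics that SPARK-10384 exposes on DataFrames are the standard sample (n-1) definitions. A stand-alone sketch, with hypothetical helper names:

```python
# Sample variance and standard deviation, as typically exposed by
# DataFrame aggregates (n-1 in the denominator, matching R).
from math import sqrt

def variance(xs):
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def stddev(xs):
    return sqrt(variance(xs))

print(variance([2, 4, 4, 4, 5, 5, 7, 9]))  # 32/7, about 4.571
```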
Deprecations
- In spark.mllib.clustering.KMeans, the "runs" parameter has been
deprecated.
- In spark.ml.classification.LogisticRegressionModel and
spark.ml.regression.LinearRegressionModel, the "weights" field has been
deprecated, in favor of the new name "coefficients." This helps
disambiguate from instance (row) weights given to algorithms.
Changes of behavior
- spark.mllib.tree.GradientBoostedTrees validationTol has changed
semantics in 1.6. Previously, it was a threshold for absolute change in
error. Now, it resembles the behavior of GradientDescent convergenceTol:
For large errors, it uses relative error (relative to the previous error);
for small errors (< 0.01), it uses absolute error.
- spark.ml.feature.RegexTokenizer: Previously, it did not convert
strings to lowercase before tokenizing. Now, it converts to lowercase by
default, with an option not to. This matches the behavior of the simpler
Tokenizer transformer.
- Spark SQL's partition discovery has been changed to only discover
partition directories that are children of the given path. (i.e. if
path="/my/data/x=1" then x=1 will no longer be considered a partition
but only children of x=1.) This behavior can be overridden by manually
specifying the basePath that partitioning discovery should start with (
SPARK-11678 <https://issues.apache.org/jira/browse/SPARK-11678>).
- When casting a value of an integral type to timestamp (e.g. casting a
long value to timestamp), the value is treated as being in seconds instead
of milliseconds (SPARK-11724
<https://issues.apache.org/jira/browse/SPARK-11724>).
- With the improved query planner for queries having distinct
aggregations (SPARK-9241
<https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a query
having a single distinct aggregation has been changed to a more robust
version. To switch back to the plan generated by Spark 1.5's planner,
please set spark.sql.specializeSingleDistinctAggPlanning to true (
SPARK-12077 <https://issues.apache.org/jira/browse/SPARK-12077>).
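The new validationTol semantics for GradientBoostedTrees described above can be sketched as a single stopping check: relative improvement for large errors, absolute improvement once errors drop below 0.01. This is an illustrative model of the rule as stated, not MLlib's actual code, and the function name is hypothetical.

```python
# Early-stopping check modeling the 1.6 validationTol semantics:
# relative error when previous_error >= 0.01, absolute error below that.

def improvement_too_small(previous_error, current_error, tol):
    """Return True when boosting should stop early on the validation set."""
    delta = previous_error - current_error
    if previous_error < 0.01:
        return delta < tol                   # absolute criterion (small errors)
    return delta / previous_error < tol      # relative criterion (large errors)

print(improvement_too_small(0.5, 0.49, 0.05))  # True: 2% relative gain < 5% tol
print(improvement_too_small(0.5, 0.40, 0.05))  # False: 20% relative gain
```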
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Daniel Darabos <da...@lynxanalytics.com>.
+1 (non-binding)
It passes our tests after we registered 6 new classes with Kryo:
kryo.register(classOf[org.apache.spark.sql.catalyst.expressions.UnsafeRow])
kryo.register(classOf[Array[org.apache.spark.mllib.tree.model.Split]])
kryo.register(Class.forName("org.apache.spark.mllib.tree.model.Bin"))
kryo.register(Class.forName("[Lorg.apache.spark.mllib.tree.model.Bin;"))
kryo.register(Class.forName("org.apache.spark.mllib.tree.model.DummyLowSplit"))
kryo.register(Class.forName("org.apache.spark.mllib.tree.model.DummyHighSplit"))
It also spams "Managed memory leak detected; size = 15735058 bytes, TID =
847" for almost every task. I haven't yet figured out why.
On Fri, Dec 18, 2015 at 6:45 AM, Krishna Sankar <ks...@gmail.com> wrote:
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Krishna Sankar <ks...@gmail.com>.
+1 (non-binding, of course)
1. Compiled OSX 10.10 (Yosemite) OK Total time: 29:32 min
mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
2. Tested pyspark, mllib (iPython 4.0)
2.0 Spark version is 1.6.0
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
Center And Scale OK
2.5. RDD operations OK
State of the Union Texts - MapReduce, Filter,sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
Model evaluation/optimization (rank, numIter, lambda) with itertools
OK
3. Scala - MLlib
3.1. statistics (min,max,mean,Pearson,Spearman) OK
3.2. LinearRegressionWithSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
3.6. saveAsParquetFile OK
3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
registerTempTable, sql OK
3.8. result = sqlContext.sql("SELECT
OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
4.0. Spark SQL from Python OK
4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
5.0. Packages
5.1. com.databricks.spark.csv - read/write OK (--packages
com.databricks:spark-csv_2.10:1.3.0)
6.0. DataFrames
6.1. cast,dtypes OK
6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
6.3. All joins,sql,set operations,udf OK
Cheers & Good work guys
<k/>
On Wed, Dec 16, 2015 at 1:32 PM, Michael Armbrust <mi...@databricks.com>
wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 1.6.0!
>
> The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is *v1.6.0-rc3
> (168c89e07c51fa24b0bb88582c739cec0acb44d7)
> <https://github.com/apache/spark/tree/v1.6.0-rc3>*
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1174/
>
> The test repository (versioned as v1.6.0-rc3) for this release can be
> found at:
> https://repository.apache.org/content/repositories/orgapachespark-1173/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
>
> =======================================
> == How can I help test this release? ==
> =======================================
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> ================================================
> == What justifies a -1 vote for this release? ==
> ================================================
> This vote is happening towards the end of the 1.6 QA period, so -1 votes
> should only occur for significant regressions from 1.5. Bugs already
> present in 1.5, minor regressions, or bugs related to new features will not
> block this release.
>
> ===============================================================
> == What should happen to JIRA tickets still targeting 1.6.0? ==
> ===============================================================
> 1. It is OK for documentation patches to target 1.6.0 and still go into
> branch-1.6, since documentations will be published separately from the
> release.
> 2. New features for non-alpha-modules should target 1.7+.
> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
> version.
>
>
> ==================================================
> == Major changes to help you focus your testing ==
> ==================================================
>
> Notable changes since 1.6 RC2
> - SPARK_VERSION has been set correctly
> - SPARK-12199 ML Docs are publishing correctly
> - SPARK-12345 Mesos cluster mode has been fixed
>
> Notable changes since 1.6 RC1
> Spark Streaming
>
> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
> trackStateByKey has been renamed to mapWithState
>
> Spark SQL
>
> - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
> SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix
> bugs in eviction of storage memory by execution.
> - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258> correct
> passing null into ScalaUDF
>
> Notable Features Since 1.5Spark SQL
>
> - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> Parquet
> Performance - Improve Parquet scan performance when using flat schemas.
> - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
> Session Management - Isolated devault database (i.e USE mydb) even on
> shared clusters.
> - SPARK-9999 <https://issues.apache.org/jira/browse/SPARK-9999> Dataset
> API - A type-safe API (similar to RDDs) that performs many operations
> on serialized binary data and code generation (i.e. Project Tungsten).
> - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000> Unified
> Memory Management - Shared memory for execution and caching instead of
> exclusive division of the regions.
> - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
> Queries on Files - Concise syntax for running SQL queries over files
> of any supported format without registering a table.
> - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745> Reading
> non-standard JSON files - Added options to read non-standard JSON
> files (e.g. single-quotes, unquoted attributes)
> - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412> Per-operator
> Metrics for SQL Execution - Display statistics on a per-operator basis
> for memory usage and spilled data size.
> - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
> (*) expansion for StructTypes - Makes it easier to nest and unnest
> arbitrary numbers of columns.
> - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
> SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149> In-memory
> Columnar Cache Performance - Significant (up to 14x) speed up when
> caching data that contains complex types in DataFrames or SQL.
> - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
> null-safe joins - Joins using null-safe equality (<=>) will now
> execute using SortMergeJoin instead of computing a cartesian product.
> - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
> Execution Using Off-Heap Memory - Support for configuring query
> execution to occur using off-heap memory to avoid GC overhead.
> - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978> Datasource
> API Avoid Double Filter - When implementing a datasource with filter
> pushdown, developers can now tell Spark SQL to avoid double evaluating a
> pushed-down filter.
> - SPARK-4849 <https://issues.apache.org/jira/browse/SPARK-4849> Advanced
> Layout of Cached Data - Stores partitioning and ordering schemes for
> in-memory table scans, and adds distributeBy and localSort to the
> DataFrame API.
> - SPARK-9858 <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive
> query execution - Initial support for automatically selecting the
> number of reducers for joins and aggregations.
> - SPARK-9241 <https://issues.apache.org/jira/browse/SPARK-9241> Improved
> query planner for queries having distinct aggregations - Query plans
> of distinct aggregations are more robust when distinct columns have high
> cardinality.
>
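> For example, the new SQL-queries-on-files support (SPARK-11197) lets a
> query target a file path directly; a minimal sketch, assuming an
> existing SQLContext and a hypothetical Parquet path:
>
> ```scala
> // Query the file in place; no temporary table registration needed.
> val df = sqlContext.sql(
>   "SELECT * FROM parquet.`/data/events.parquet` WHERE year = 2015")
> ```
>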
> Spark Streaming
>
> - API Updates
> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629> New
> improved state management - mapWithState - a DStream transformation
> for stateful stream processing, supersedes updateStateByKey in
> functionality and performance.
> - SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198> Kinesis
> record deaggregation - Kinesis streams have been upgraded to use
> KCL 1.4.0 and support transparent deaggregation of KPL-aggregated records.
> - SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891> Kinesis
> message handler function - Allows an arbitrary function to be applied
> to each Kinesis record in the Kinesis receiver, to customize what data
> is stored in memory.
> - SPARK-6328 <https://issues.apache.org/jira/browse/SPARK-6328> Python
> Streaming Listener API - Get streaming statistics (scheduling
> delays, batch processing times, etc.) in streaming.
>
>
> - UI Improvements
> - Made failures visible in the streaming tab, in the timelines,
> batch list, and batch details page.
> - Made output operations visible in the streaming tab as progress
> bars.
>
> MLlib
>
> New algorithms/models
>
> - SPARK-8518 <https://issues.apache.org/jira/browse/SPARK-8518> Survival
> analysis - Log-linear model for survival analysis
> - SPARK-9834 <https://issues.apache.org/jira/browse/SPARK-9834> Normal
> equation for least squares - Normal equation solver, providing R-like
> model summary statistics
> - SPARK-3147 <https://issues.apache.org/jira/browse/SPARK-3147> Online
> hypothesis testing - A/B testing in the Spark Streaming framework
> - SPARK-9930 <https://issues.apache.org/jira/browse/SPARK-9930> New
> feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
> transformer
> - SPARK-6517 <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting
> K-Means clustering - Fast top-down clustering variant of K-Means
>
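> A brief sketch of one of the new transformers (QuantileDiscretizer),
> assuming a DataFrame df with a numeric column named "hour":
>
> ```scala
> import org.apache.spark.ml.feature.QuantileDiscretizer
>
> // Bin a continuous column into three quantile-based buckets.
> val discretizer = new QuantileDiscretizer()
>   .setInputCol("hour")
>   .setOutputCol("hourBucket")
>   .setNumBuckets(3)
> val binned = discretizer.fit(df).transform(df)
> ```
>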
> API improvements
>
> - ML Pipelines
> - SPARK-6725 <https://issues.apache.org/jira/browse/SPARK-6725> Pipeline
> persistence - Save/load for ML Pipelines, with partial coverage of
> spark.ml algorithms
> - SPARK-5565 <https://issues.apache.org/jira/browse/SPARK-5565> LDA
> in ML Pipelines - API for Latent Dirichlet Allocation in ML
> Pipelines
> - R API
> - SPARK-9836 <https://issues.apache.org/jira/browse/SPARK-9836> R-like
> statistics for GLMs - (Partial) R-like stats for ordinary least
> squares via summary(model)
> - SPARK-9681 <https://issues.apache.org/jira/browse/SPARK-9681> Feature
> interactions in R formula - Interaction operator ":" in R formula
> - Python API - Many improvements to the Python API to approach feature
> parity
>
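> Pipeline persistence can be exercised roughly as follows (the path is
> illustrative, the pipeline and training DataFrame are assumed to
> exist, and in 1.6 save/load covers only part of spark.ml):
>
> ```scala
> import org.apache.spark.ml.PipelineModel
>
> // Fit, save, and reload a pipeline model.
> val model = pipeline.fit(training)
> model.save("/tmp/lr-pipeline")
> val restored = PipelineModel.load("/tmp/lr-pipeline")
> ```
>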
> Misc improvements
>
> - SPARK-7685 <https://issues.apache.org/jira/browse/SPARK-7685>,
> SPARK-9642 <https://issues.apache.org/jira/browse/SPARK-9642> Instance
> weights for GLMs - Logistic and Linear Regression can take instance
> weights
> - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
> SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate
> and bivariate statistics in DataFrames - Variance, stddev,
> correlations, etc.
> - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM
> data source - LIBSVM as a SQL data source
>
> Documentation improvements
>
> - SPARK-7751 <https://issues.apache.org/jira/browse/SPARK-7751> @since
> versions - Documentation includes initial version when classes and
> methods were added
> - SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable
> example code - Automated testing for code in user guide examples
>
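> The LIBSVM data source mentioned above can be used roughly like this
> (path is illustrative; sqlContext is assumed to exist):
>
> ```scala
> // Load LIBSVM-formatted data as a DataFrame with "label" and
> // "features" columns.
> val data = sqlContext.read
>   .format("libsvm")
>   .load("data/mllib/sample_libsvm_data.txt")
> ```
>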
> Deprecations
>
> - In spark.mllib.clustering.KMeans, the "runs" parameter has been
> deprecated.
> - In spark.ml.classification.LogisticRegressionModel and
> spark.ml.regression.LinearRegressionModel, the "weights" field has been
> deprecated, in favor of the new name "coefficients." This helps
> disambiguate from instance (row) weights given to algorithms.
>
> Changes of behavior
>
> - spark.mllib.tree.GradientBoostedTrees validationTol has changed
> semantics in 1.6. Previously, it was a threshold for absolute change in
> error. Now, it resembles the behavior of GradientDescent convergenceTol:
> For large errors, it uses relative error (relative to the previous error);
> for small errors (< 0.01), it uses absolute error.
> - spark.ml.feature.RegexTokenizer: Previously, it did not convert
> strings to lowercase before tokenizing. Now, it converts to lowercase by
> default, with an option not to. This matches the behavior of the simpler
> Tokenizer transformer.
> - Spark SQL's partition discovery has been changed to only discover
> partition directories that are children of the given path. (i.e. if
> path="/my/data/x=1" then x=1 will no longer be considered a partition
> but only children of x=1.) This behavior can be overridden by manually
> specifying the basePath that partitioning discovery should start with (
> SPARK-11678 <https://issues.apache.org/jira/browse/SPARK-11678>).
> - When casting a value of an integral type to timestamp (e.g. casting
> a long value to timestamp), the value is treated as being in seconds
> instead of milliseconds (SPARK-11724
> <https://issues.apache.org/jira/browse/SPARK-11724>).
> - With the improved query planner for queries having distinct
> aggregations (SPARK-9241
> <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
> query having a single distinct aggregation has been changed to a more
> robust version. To switch back to the plan generated by Spark 1.5's
> planner, please set spark.sql.specializeSingleDistinctAggPlanning to
> true (SPARK-12077 <https://issues.apache.org/jira/browse/SPARK-12077>).
>
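> To illustrate the integral-to-timestamp change (SPARK-11724), a cast
> sketch assuming a DataFrame df with a long column named "epoch":
>
> ```scala
> import org.apache.spark.sql.functions.col
>
> // In 1.6 the long value is interpreted as seconds since the epoch
> // (previously milliseconds).
> val withTs = df.select(col("epoch").cast("timestamp").as("ts"))
> ```
>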
>
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Denny Lee <de...@gmail.com>.
+1 (non-binding)
Tested a number of tests surrounding DataFrames, Datasets, and ML.
On Wed, Dec 16, 2015 at 1:32 PM Michael Armbrust <mi...@databricks.com>
wrote:
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
+1 (non binding)
Tested in standalone and yarn with different samples.
Regards
JB
On 12/16/2015 10:32 PM, Michael Armbrust wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 1.6.0!
>
> The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is _v1.6.0-rc3
> (168c89e07c51fa24b0bb88582c739cec0acb44d7)
> <https://github.com/apache/spark/tree/v1.6.0-rc3>_
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1174/
>
> The test repository (versioned as v1.6.0-rc3) for this release can be
> found at:
> https://repository.apache.org/content/repositories/orgapachespark-1173/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
>
> =======================================
> == How can I help test this release? ==
> =======================================
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> ================================================
> == What justifies a -1 vote for this release? ==
> ================================================
> This vote is happening towards the end of the 1.6 QA period, so -1 votes
> should only occur for significant regressions from 1.5. Bugs already
> present in 1.5, minor regressions, or bugs related to new features will
> not block this release.
>
> ===============================================================
> == What should happen to JIRA tickets still targeting 1.6.0? ==
> ===============================================================
> 1. It is OK for documentation patches to target 1.6.0 and still go into
> branch-1.6, since documentations will be published separately from the
> release.
> 2. New features for non-alpha-modules should target 1.7+.
> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the
> target version.
>
>
> ==================================================
> == Major changes to help you focus your testing ==
> ==================================================
>
>
> Notable changes since 1.6 RC2
>
>
> - SPARK_VERSION has been set correctly
> - SPARK-12199 ML Docs are publishing correctly
> - SPARK-12345 Mesos cluster mode has been fixed
>
>
> Notable changes since 1.6 RC1
>
>
> Spark Streaming
>
> * SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
> |trackStateByKey| has been renamed to |mapWithState|
>
>
> Spark SQL
>
> * SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
> SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix
> bugs in eviction of storage memory by execution.
> * SPARK-12258
> <https://issues.apache.org/jira/browse/SPARK-12258> correct passing
> null into ScalaUDF
>
>
> Notable Features Since 1.5
>
>
> Spark SQL
>
> * SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787>
> Parquet Performance - Improve Parquet scan performance when using
> flat schemas.
> * SPARK-10810
> <https://issues.apache.org/jira/browse/SPARK-10810>Session
> Management - Isolated devault database (i.e |USE mydb|) even on
> shared clusters.
> * SPARK-9999 <https://issues.apache.org/jira/browse/SPARK-9999>
> Dataset API - A type-safe API (similar to RDDs) that performs many
> operations on serialized binary data and code generation (i.e.
> Project Tungsten).
> * SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000>
> Unified Memory Management - Shared memory for execution and caching
> instead of exclusive division of the regions.
> * SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
> Queries on Files - Concise syntax for running SQL queries over files
> of any supported format without registering a table.
> * SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745>
> Reading non-standard JSON files - Added options to read non-standard
> JSON files (e.g. single-quotes, unquoted attributes)
> * SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412>
> Per-operator Metrics for SQL Execution - Display statistics on a
> peroperator basis for memory usage and spilled data size.
> * SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
> (*) expansion for StructTypes - Makes it easier to nest and unest
> arbitrary numbers of columns
> * SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
> SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149>
> In-memory Columnar Cache Performance - Significant (up to 14x) speed
> up when caching data that contains complex types in DataFrames or SQL.
> * SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
> null-safe joins - Joins using null-safe equality (|<=>|) will now
> execute using SortMergeJoin instead of computing a cartisian product.
> * SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
> Execution Using Off-Heap Memory - Support for configuring query
> execution to occur using off-heap memory to avoid GC overhead
> * SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978>
> Datasource API Avoid Double Filter - When implemeting a datasource
> with filter pushdown, developers can now tell Spark SQL to avoid
> double evaluating a pushed-down filter.
> * SPARK-4849 <https://issues.apache.org/jira/browse/SPARK-4849>
> Advanced Layout of Cached Data - storing partitioning and ordering
> schemes in In-memory table scan, and adding distributeBy and
> localSort to DF API
> * SPARK-9858 <https://issues.apache.org/jira/browse/SPARK-9858>
> Adaptive query execution - Intial support for automatically
> selecting the number of reducers for joins and aggregations.
> * SPARK-9241 <https://issues.apache.org/jira/browse/SPARK-9241>
> Improved query planner for queries having distinct aggregations -
> Query plans of distinct aggregations are more robust when distinct
> columns have high cardinality.
>
>
> Spark Streaming
>
> * API Updates
> o SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
> New improved state management - |mapWithState| - a DStream
> transformation for stateful stream processing, supercedes
> |updateStateByKey| in functionality and performance.
> o SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198>
> Kinesis record deaggregation - Kinesis streams have been
> upgraded to use KCL 1.4.0 and supports transparent deaggregation
> of KPL-aggregated records.
> o SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891>
> Kinesis message handler function - Allows arbitraray function to
> be applied to a Kinesis record in the Kinesis receiver before to
> customize what data is to be stored in memory.
> o SPARK-6328 <https://issues.apache.org/jira/browse/SPARK-6328>
> Python Streamng Listener API - Get streaming statistics
> (scheduling delays, batch processing times, etc.) in streaming.
>
> * UI Improvements
> o Made failures visible in the streaming tab, in the timelines,
> batch list, and batch details page.
> o Made output operations visible in the streaming tab as progress
> bars.
>
>
> MLlib
>
>
> New algorithms/models
>
> * SPARK-8518 <https://issues.apache.org/jira/browse/SPARK-8518>
> Survival analysis - Log-linear model for survival analysis
> * SPARK-9834 <https://issues.apache.org/jira/browse/SPARK-9834> Normal
> equation for least squares - Normal equation solver, providing
> R-like model summary statistics
> * SPARK-3147 <https://issues.apache.org/jira/browse/SPARK-3147> Online
> hypothesis testing - A/B testing in the Spark Streaming framework
> * SPARK-9930 <https://issues.apache.org/jira/browse/SPARK-9930> New
> feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
> transformer
> * SPARK-6517 <https://issues.apache.org/jira/browse/SPARK-6517>
> Bisecting K-Means clustering - Fast top-down clustering variant of
> K-Means
>
>
> API improvements
>
> * ML Pipelines
> o SPARK-6725 <https://issues.apache.org/jira/browse/SPARK-6725>
> Pipeline persistence - Save/load for ML Pipelines, with partial
> coverage of spark.ml <http://spark.ml/>algorithms
> o SPARK-5565 <https://issues.apache.org/jira/browse/SPARK-5565>
> LDA in ML Pipelines - API for Latent Dirichlet Allocation in ML
> Pipelines
> * R API
> o SPARK-9836 <https://issues.apache.org/jira/browse/SPARK-9836>
> R-like statistics for GLMs - (Partial) R-like stats for ordinary
> least squares via summary(model)
> o SPARK-9681 <https://issues.apache.org/jira/browse/SPARK-9681>
> Feature interactions in R formula - Interaction operator ":" in
> R formula
> * Python API - Many improvements to Python API to approach feature parity
>
>
> Misc improvements
>
> * SPARK-7685 <https://issues.apache.org/jira/browse/SPARK-7685>,
> SPARK-9642 <https://issues.apache.org/jira/browse/SPARK-9642>
> Instance weights for GLMs - Logistic and Linear Regression can take
> instance weights
> * SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
> SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385>
> Univariate and bivariate statistics in DataFrames - Variance,
> stddev, correlations, etc.
> * SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117>
> LIBSVM data source - LIBSVM as a SQL data source
>
>
> Documentation improvements
>
> * SPARK-7751 <https://issues.apache.org/jira/browse/SPARK-7751> @since
> versions - Documentation includes initial version when classes and
> methods were added
> * SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337>
> Testable example code - Automated testing for code in user guide
> examples
>
>
> Deprecations
>
> * In spark.mllib.clustering.KMeans, the "runs" parameter has been
> deprecated.
> * In spark.ml.classification.LogisticRegressionModel and
> spark.ml.regression.LinearRegressionModel, the "weights" field has
> been deprecated, in favor of the new name "coefficients." This helps
> disambiguate from instance (row) weights given to algorithms.
>
>
> Changes of behavior
>
> * spark.mllib.tree.GradientBoostedTrees validationTol has changed
> semantics in 1.6. Previously, it was a threshold for absolute change
> in error. Now, it resembles the behavior of GradientDescent
> convergenceTol: For large errors, it uses relative error (relative
> to the previous error); for small errors (< 0.01), it uses absolute
> error.
> * spark.ml.feature.RegexTokenizer: Previously, it did not convert
> strings to lowercase before tokenizing. Now, it converts to
> lowercase by default, with an option not to. This matches the
> behavior of the simpler Tokenizer transformer.
> * Spark SQL's partition discovery has been changed to only discover
> partition directories that are children of the given path. (i.e. if
> |path="/my/data/x=1"| then |x=1| will no longer be considered a
> partition but only children of |x=1|.) This behavior can be
> overridden by manually specifying the |basePath| that partitioning
> discovery should start with (SPARK-11678
> <https://issues.apache.org/jira/browse/SPARK-11678>).
> * When casting a value of an integral type to timestamp (e.g. casting
> a long value to timestamp), the value is treated as being in seconds
> instead of milliseconds (SPARK-11724
> <https://issues.apache.org/jira/browse/SPARK-11724>).
> * With the improved query planner for queries having distinct
> aggregations (SPARK-9241
> <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
> query having a single distinct aggregation has been changed to a
> more robust version. To switch back to the plan generated by Spark
> 1.5's planner, please set
> |spark.sql.specializeSingleDistinctAggPlanning| to
> |true| (SPARK-12077
> <https://issues.apache.org/jira/browse/SPARK-12077>).
>
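The validationTol change described in the list above can be illustrated with a small Python sketch. This is purely illustrative; the function name, signature, and the default tolerance are assumptions, not Spark's actual GradientBoostedTrees code:

```python
def converged(previous_error: float, current_error: float,
              validation_tol: float = 1e-3) -> bool:
    """Sketch of the 1.6 validationTol semantics described above:
    relative error for large errors, absolute error below 0.01."""
    delta = abs(previous_error - current_error)
    if previous_error < 0.01:
        return delta < validation_tol                    # absolute error
    return delta / previous_error < validation_tol       # relative error

# A drop from 10.0 to 9.9 is a 1% relative change, above a 0.1% tolerance
print(converged(10.0, 9.9))      # False
# Near zero, the absolute drop of 0.0005 is compared directly against the tol
print(converged(0.005, 0.0045))  # True
```

Under the pre-1.6 semantics, both calls would instead have compared the raw 0.1 and 0.0005 deltas against the threshold.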
--
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Michael Armbrust <mi...@databricks.com>.
It's come to my attention that there have been several bug fixes merged
since RC3:
- SPARK-12404 - Fix serialization error for Datasets with
Timestamps/Arrays/Decimal
- SPARK-12218 - Fix incorrect pushdown of filters to parquet
- SPARK-12395 - Fix join columns of outer join for DataFrame using
- SPARK-12413 - Fix mesos HA
Normally, these would probably not be sufficient to hold the release;
however, with the holidays going on in the US this week, we don't have the
resources to finalize 1.6 until next Monday. Given this delay anyway, I
propose that we cut one final RC with the above fixes and plan for the
actual release first thing next week.
I'll post RC4 shortly and cancel this vote if there are no objections.
Since this vote nearly passed with no major issues, I don't anticipate any
problems with RC4.
Michael
On Sat, Dec 19, 2015 at 11:44 PM, Jeff Zhang <zj...@gmail.com> wrote:
> +1 (non-binding)
>
> All the test passed, and run it on HDP 2.3.2 sandbox successfully.
>
> On Sun, Dec 20, 2015 at 10:43 AM, Luciano Resende <lu...@gmail.com>
> wrote:
>
>> +1 (non-binding)
>>
>> Tested Standalone mode, SparkR and couple Stream Apps, all seem ok.
>>
>> On Wed, Dec 16, 2015 at 1:32 PM, Michael Armbrust <michael@databricks.com
>> > wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 1.6.0!
>>>
>>> The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
>>> passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 1.6.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is *v1.6.0-rc3
>>> (168c89e07c51fa24b0bb88582c739cec0acb44d7)
>>> <https://github.com/apache/spark/tree/v1.6.0-rc3>*
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1174/
>>>
>>> The test repository (versioned as v1.6.0-rc3) for this release can be
>>> found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1173/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
>>>
>>> =======================================
>>> == How can I help test this release? ==
>>> =======================================
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> ================================================
>>> == What justifies a -1 vote for this release? ==
>>> ================================================
>>> This vote is happening towards the end of the 1.6 QA period, so -1 votes
>>> should only occur for significant regressions from 1.5. Bugs already
>>> present in 1.5, minor regressions, or bugs related to new features will not
>>> block this release.
>>>
>>> ===============================================================
>>> == What should happen to JIRA tickets still targeting 1.6.0? ==
>>> ===============================================================
>>> 1. It is OK for documentation patches to target 1.6.0 and still go into
>>> branch-1.6, since documentation will be published separately from the
>>> release.
>>> 2. New features for non-alpha-modules should target 1.7+.
>>> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the
>>> target version.
>>>
>>>
>>> ==================================================
>>> == Major changes to help you focus your testing ==
>>> ==================================================
>>>
>>> Notable changes since 1.6 RC2
>>> - SPARK_VERSION has been set correctly
>>> - SPARK-12199 ML Docs are publishing correctly
>>> - SPARK-12345 Mesos cluster mode has been fixed
>>>
>>> Notable changes since 1.6 RC1
>>> Spark Streaming
>>>
>>> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
>>> trackStateByKey has been renamed to mapWithState
>>>
>>> Spark SQL
>>>
>>> - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
>>> SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix
>>> bugs in eviction of storage memory by execution.
>>> - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258> correct
>>> passing null into ScalaUDF
>>>
>>> Notable Features Since 1.5
>>>
>>> Spark SQL
>>>
>>> - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> Parquet
>>> Performance - Improve Parquet scan performance when using flat
>>> schemas.
>>> - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
>>> Session Management - Isolated default database (i.e. USE mydb) even
>>> on shared clusters.
>>> - SPARK-9999 <https://issues.apache.org/jira/browse/SPARK-9999> Dataset
>>> API - A type-safe API (similar to RDDs) that performs many
>>> operations on serialized binary data and code generation (i.e. Project
>>> Tungsten).
>>> - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000> Unified
>>> Memory Management - Shared memory for execution and caching instead
>>> of exclusive division of the regions.
>>> - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
>>> Queries on Files - Concise syntax for running SQL queries over files
>>> of any supported format without registering a table.
>>> - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745> Reading
>>> non-standard JSON files - Added options to read non-standard JSON
>>> files (e.g. single-quotes, unquoted attributes)
>>> - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412> Per-operator
>>> Metrics for SQL Execution - Display statistics on a per-operator
>>> basis for memory usage and spilled data size.
>>> - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
>>> (*) expansion for StructTypes - Makes it easier to nest and unnest
>>> arbitrary numbers of columns
>>> - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
>>> SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149> In-memory
>>> Columnar Cache Performance - Significant (up to 14x) speed up when
>>> caching data that contains complex types in DataFrames or SQL.
>>> - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
>>> null-safe joins - Joins using null-safe equality (<=>) will now
>>> execute using SortMergeJoin instead of computing a cartesian product.
>>> - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
>>> Execution Using Off-Heap Memory - Support for configuring query
>>> execution to occur using off-heap memory to avoid GC overhead
>>> - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978> Datasource
>>> API Avoid Double Filter - When implementing a datasource with filter
>>> pushdown, developers can now tell Spark SQL to avoid double evaluating a
>>> pushed-down filter.
>>> - SPARK-4849 <https://issues.apache.org/jira/browse/SPARK-4849> Advanced
>>> Layout of Cached Data - storing partitioning and ordering schemes in
>>> In-memory table scan, and adding distributeBy and localSort to DF API
>>> - SPARK-9858 <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive
>>> query execution - Initial support for automatically selecting the
>>> number of reducers for joins and aggregations.
>>> - SPARK-9241 <https://issues.apache.org/jira/browse/SPARK-9241> Improved
>>> query planner for queries having distinct aggregations - Query plans
>>> of distinct aggregations are more robust when distinct columns have high
>>> cardinality.
>>>
>>> Spark Streaming
>>>
>>> - API Updates
>>> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629> New
>>> improved state management - mapWithState - a DStream
>>> transformation for stateful stream processing, supersedes
>>> updateStateByKey in functionality and performance.
>>> - SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198> Kinesis
>>> record deaggregation - Kinesis streams have been upgraded to use
>>> KCL 1.4.0 and supports transparent deaggregation of KPL-aggregated records.
>>> - SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891> Kinesis
>>> message handler function - Allows an arbitrary function to be
>>> applied to a Kinesis record in the Kinesis receiver to customize
>>> what data is stored in memory.
>>> - SPARK-6328 <https://issues.apache.org/jira/browse/SPARK-6328> Python
>>> Streaming Listener API - Get streaming statistics (scheduling
>>> delays, batch processing times, etc.) in streaming.
>>>
>>>
>>> - UI Improvements
>>> - Made failures visible in the streaming tab, in the timelines,
>>> batch list, and batch details page.
>>> - Made output operations visible in the streaming tab as progress
>>> bars.
>>>
>>> MLlib
>>>
>>> New algorithms/models
>>>
>>> - SPARK-8518 <https://issues.apache.org/jira/browse/SPARK-8518> Survival
>>> analysis - Log-linear model for survival analysis
>>> - SPARK-9834 <https://issues.apache.org/jira/browse/SPARK-9834> Normal
>>> equation for least squares - Normal equation solver, providing
>>> R-like model summary statistics
>>> - SPARK-3147 <https://issues.apache.org/jira/browse/SPARK-3147> Online
>>> hypothesis testing - A/B testing in the Spark Streaming framework
>>> - SPARK-9930 <https://issues.apache.org/jira/browse/SPARK-9930> New
>>> feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
>>> transformer
>>> - SPARK-6517 <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting
>>> K-Means clustering - Fast top-down clustering variant of K-Means
>>>
>>> API improvements
>>>
>>> - ML Pipelines
>>> - SPARK-6725 <https://issues.apache.org/jira/browse/SPARK-6725> Pipeline
>>> persistence - Save/load for ML Pipelines, with partial coverage
>>> of spark.ml algorithms
>>> - SPARK-5565 <https://issues.apache.org/jira/browse/SPARK-5565> LDA
>>> in ML Pipelines - API for Latent Dirichlet Allocation in ML
>>> Pipelines
>>> - R API
>>> - SPARK-9836 <https://issues.apache.org/jira/browse/SPARK-9836> R-like
>>> statistics for GLMs - (Partial) R-like stats for ordinary least
>>> squares via summary(model)
>>> - SPARK-9681 <https://issues.apache.org/jira/browse/SPARK-9681> Feature
>>> interactions in R formula - Interaction operator ":" in R formula
>>> - Python API - Many improvements to Python API to approach feature
>>> parity
>>>
>>> Misc improvements
>>>
>>> - SPARK-7685 <https://issues.apache.org/jira/browse/SPARK-7685>,
>>> SPARK-9642 <https://issues.apache.org/jira/browse/SPARK-9642> Instance
>>> weights for GLMs - Logistic and Linear Regression can take instance
>>> weights
>>> - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
>>> SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate
>>> and bivariate statistics in DataFrames - Variance, stddev,
>>> correlations, etc.
>>> - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM
>>> data source - LIBSVM as a SQL data source
>>>
>>> Documentation improvements
>>> - SPARK-7751 <https://issues.apache.org/jira/browse/SPARK-7751> @since
>>> versions - Documentation includes initial version when classes and
>>> methods were added
>>> - SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable
>>> example code - Automated testing for code in user guide examples
>>>
>>> Deprecations
>>>
>>> - In spark.mllib.clustering.KMeans, the "runs" parameter has been
>>> deprecated.
>>> - In spark.ml.classification.LogisticRegressionModel and
>>> spark.ml.regression.LinearRegressionModel, the "weights" field has been
>>> deprecated, in favor of the new name "coefficients." This helps
>>> disambiguate from instance (row) weights given to algorithms.
>>>
>>> Changes of behavior
>>>
>>> - spark.mllib.tree.GradientBoostedTrees validationTol has changed
>>> semantics in 1.6. Previously, it was a threshold for absolute change in
>>> error. Now, it resembles the behavior of GradientDescent convergenceTol:
>>> For large errors, it uses relative error (relative to the previous error);
>>> for small errors (< 0.01), it uses absolute error.
>>> - spark.ml.feature.RegexTokenizer: Previously, it did not convert
>>> strings to lowercase before tokenizing. Now, it converts to lowercase by
>>> default, with an option not to. This matches the behavior of the simpler
>>> Tokenizer transformer.
>>> - Spark SQL's partition discovery has been changed to only discover
>>> partition directories that are children of the given path. (i.e. if
>>> path="/my/data/x=1" then x=1 will no longer be considered a
>>> partition but only children of x=1.) This behavior can be overridden
>>> by manually specifying the basePath that partitioning discovery
>>> should start with (SPARK-11678
>>> <https://issues.apache.org/jira/browse/SPARK-11678>).
>>> - When casting a value of an integral type to timestamp (e.g.
>>> casting a long value to timestamp), the value is treated as being in
>>> seconds instead of milliseconds (SPARK-11724
>>> <https://issues.apache.org/jira/browse/SPARK-11724>).
>>> - With the improved query planner for queries having distinct
>>> aggregations (SPARK-9241
>>> <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
>>> query having a single distinct aggregation has been changed to a more
>>> robust version. To switch back to the plan generated by Spark 1.5's
>>> planner, please set spark.sql.specializeSingleDistinctAggPlanning to
>>> true (SPARK-12077 <https://issues.apache.org/jira/browse/SPARK-12077>
>>> ).
>>>
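The integral-to-timestamp change quoted above amounts to a change of unit. A plain-Python sketch of the two interpretations (illustrative only; these helper names are not Spark APIs):

```python
from datetime import datetime, timezone

def cast_long_to_timestamp(value: int) -> datetime:
    # 1.6 semantics: the integral value is interpreted as seconds
    return datetime.fromtimestamp(value, tz=timezone.utc)

def cast_long_to_timestamp_pre_1_6(value: int) -> datetime:
    # Pre-1.6 semantics: the same value was treated as milliseconds
    return datetime.fromtimestamp(value / 1000, tz=timezone.utc)

v = 1450137600
print(cast_long_to_timestamp(v))          # 2015-12-15 00:00:00+00:00
print(cast_long_to_timestamp_pre_1_6(v))  # 1970-01-17 18:48:57.600000+00:00
```

The same long value thus lands roughly 46 years apart depending on the Spark version, which is why the change is called out as a behavior change.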
>>>
>>
>>
>> --
>> Luciano Resende
>> http://people.apache.org/~lresende
>> http://twitter.com/lresende1975
>> http://lresende.blogspot.com/
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Ricky <49...@qq.com>.
SizeBasedRollingPolicy prints too many log messages when spark.executor.logs.rolling.strategy is set to size, because shouldRollover uses logInfo:
def shouldRollover(bytesToBeWritten: Long): Boolean = {
  logInfo(s"$bytesToBeWritten + $bytesWrittenSinceRollover > $rolloverSizeBytes")
  bytesToBeWritten + bytesWrittenSinceRollover > rolloverSizeBytes
}
Rolling-log configuration:
spark.executor.logs.rolling.strategy size
spark.executor.logs.rolling.maxSize 134217728
spark.executor.logs.rolling.maxRetainedFiles 8
Could logDebug be used instead of logInfo?
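Since shouldRollover runs on every write, an info-level message there scales with write volume. A minimal Python sketch of the same size-based check (names are illustrative, not Spark's implementation), with the message at debug level as proposed:

```python
import logging

logger = logging.getLogger("rolling")

class SizeBasedRollingPolicy:
    """Illustrative sketch of a size-based rollover check."""

    def __init__(self, rollover_size_bytes: int) -> None:
        self.rollover_size_bytes = rollover_size_bytes
        self.bytes_written_since_rollover = 0

    def should_rollover(self, bytes_to_be_written: int) -> bool:
        # Logged at debug level because this check fires on every write
        logger.debug("%d + %d > %d", bytes_to_be_written,
                     self.bytes_written_since_rollover,
                     self.rollover_size_bytes)
        return (bytes_to_be_written
                + self.bytes_written_since_rollover) > self.rollover_size_bytes

policy = SizeBasedRollingPolicy(rollover_size_bytes=134217728)  # 128 MiB
print(policy.should_rollover(4096))  # False: far below the threshold
```

With debug-level logging, the per-write message only appears when the logger is explicitly configured for debug output, which addresses the log-volume complaint without losing the diagnostic.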
------------------
Best Regards
Ricky Yang
------------------ Original Message ------------------
From: "Jeff Zhang" <zj...@gmail.com>;
Date: Sunday, December 20, 2015, 3:44 PM
To: "Luciano Resende" <lu...@gmail.com>;
Cc: "Michael Armbrust" <mi...@databricks.com>; "dev@spark.apache.org" <de...@spark.apache.org>;
Subject: Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
+1 (non-binding)
All the test passed, and run it on HDP 2.3.2 sandbox successfully.
On Sun, Dec 20, 2015 at 10:43 AM, Luciano Resende < luckbr1975@gmail.com > wrote:
+1 (non-binding)
Tested Standalone mode, SparkR and couple Stream Apps, all seem ok.
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Jeff Zhang <zj...@gmail.com>.
+1 (non-binding)
All the tests passed, and it ran on the HDP 2.3.2 sandbox successfully.
On Sun, Dec 20, 2015 at 10:43 AM, Luciano Resende <lu...@gmail.com>
wrote:
> +1 (non-binding)
>
> Tested Standalone mode, SparkR and couple Stream Apps, all seem ok.
>
> On Wed, Dec 16, 2015 at 1:32 PM, Michael Armbrust <mi...@databricks.com>
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.6.0!
>>
>> The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.6.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is *v1.6.0-rc3
>> (168c89e07c51fa24b0bb88582c739cec0acb44d7)
>> <https://github.com/apache/spark/tree/v1.6.0-rc3>*
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1174/
>>
>> The test repository (versioned as v1.6.0-rc3) for this release can be
>> found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1173/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
>>
>> =======================================
>> == How can I help test this release? ==
>> =======================================
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> ================================================
>> == What justifies a -1 vote for this release? ==
>> ================================================
>> This vote is happening towards the end of the 1.6 QA period, so -1 votes
>> should only occur for significant regressions from 1.5. Bugs already
>> present in 1.5, minor regressions, or bugs related to new features will not
>> block this release.
>>
>> ===============================================================
>> == What should happen to JIRA tickets still targeting 1.6.0? ==
>> ===============================================================
>> 1. It is OK for documentation patches to target 1.6.0 and still go into
>> branch-1.6, since documentations will be published separately from the
>> release.
>> 2. New features for non-alpha-modules should target 1.7+.
>> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
>> version.
>>
>>
>> ==================================================
>> == Major changes to help you focus your testing ==
>> ==================================================
>>
>> Notable changes since 1.6 RC2
>> - SPARK_VERSION has been set correctly
>> - SPARK-12199 ML Docs are publishing correctly
>> - SPARK-12345 Mesos cluster mode has been fixed
>>
>> Notable changes since 1.6 RC1
>> Spark Streaming
>>
>> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
>> trackStateByKey has been renamed to mapWithState
>>
>> Spark SQL
>>
>> - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
>> SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix
>> bugs in eviction of storage memory by execution.
>> - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258> correct
>> passing null into ScalaUDF
>>
>> Notable Features Since 1.5Spark SQL
>>
>> - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> Parquet
>> Performance - Improve Parquet scan performance when using flat
>> schemas.
>> - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
>> Session Management - Isolated devault database (i.e USE mydb) even on
>> shared clusters.
>> - SPARK-9999 <https://issues.apache.org/jira/browse/SPARK-9999> Dataset
>> API - A type-safe API (similar to RDDs) that performs many operations
>> on serialized binary data and code generation (i.e. Project Tungsten).
>> - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000> Unified
>> Memory Management - Shared memory for execution and caching instead
>> of exclusive division of the regions.
>> - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
>> Queries on Files - Concise syntax for running SQL queries over files
>> of any supported format without registering a table.
>> - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745> Reading
>> non-standard JSON files - Added options to read non-standard JSON
>> files (e.g. single-quotes, unquoted attributes)
>> - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412> Per-operator
>> Metrics for SQL Execution - Display statistics on a peroperator basis
>> for memory usage and spilled data size.
>> - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
>> (*) expansion for StructTypes - Makes it easier to nest and unest
>> arbitrary numbers of columns
>> - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
>> SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149> In-memory
>> Columnar Cache Performance - Significant (up to 14x) speed up when
>> caching data that contains complex types in DataFrames or SQL.
>> - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
>> null-safe joins - Joins using null-safe equality (<=>) will now
>> execute using SortMergeJoin instead of computing a cartisian product.
>> - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
>> Execution Using Off-Heap Memory - Support for configuring query
>> execution to occur using off-heap memory to avoid GC overhead
>> - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978> Datasource
>> API Avoid Double Filter - When implemeting a datasource with filter
>> pushdown, developers can now tell Spark SQL to avoid double evaluating a
>> pushed-down filter.
>> - SPARK-4849 <https://issues.apache.org/jira/browse/SPARK-4849> Advanced
>> Layout of Cached Data - Storing partitioning and ordering schemes in
>> the in-memory table scan, and adding distributeBy and localSort to the DataFrame API
>> - SPARK-9858 <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive
>> query execution - Initial support for automatically selecting the
>> number of reducers for joins and aggregations.
>> - SPARK-9241 <https://issues.apache.org/jira/browse/SPARK-9241> Improved
>> query planner for queries having distinct aggregations - Query plans
>> of distinct aggregations are more robust when distinct columns have high
>> cardinality.
>>
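The fast null-safe join item above (SPARK-11111) hinges on the semantics of the SQL <=> operator, which treats NULL as an ordinary comparable value and therefore never itself returns NULL. A minimal Python sketch of just those semantics (the function name is illustrative, not a Spark API):

```python
def null_safe_eq(a, b):
    """Semantics of SQL's null-safe equality (<=>), with None playing NULL:
    NULL <=> NULL is true, NULL <=> x is false, otherwise plain equality."""
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b

# Because <=> always yields a definite true/false, join keys compared with
# it can be sorted and merged (SortMergeJoin) rather than forcing a
# cartesian product.
print(null_safe_eq(None, None))  # True
print(null_safe_eq(None, 1))     # False
print(null_safe_eq(1, 1))        # True
```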
>> Spark Streaming
>>
>> - API Updates
>> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629> New
>> improved state management - mapWithState - a DStream
>> transformation for stateful stream processing, supersedes
>> updateStateByKey in functionality and performance.
>> - SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198> Kinesis
>> record deaggregation - Kinesis streams have been upgraded to use
>> KCL 1.4.0 and support transparent deaggregation of KPL-aggregated records.
>> - SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891> Kinesis
>> message handler function - Allows an arbitrary function to be
>> applied to a Kinesis record in the Kinesis receiver to customize
>> what data is stored in memory.
>> - SPARK-6328 <https://issues.apache.org/jira/browse/SPARK-6328> Python
>> Streaming Listener API - Get streaming statistics (scheduling
>> delays, batch processing times, etc.) in streaming.
>>
>>
>> - UI Improvements
>> - Made failures visible in the streaming tab, in the timelines,
>> batch list, and batch details page.
>> - Made output operations visible in the streaming tab as progress
>> bars.
>>
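As a rough illustration of what mapWithState-style processing does (per-key state carried across micro-batches, touching only keys seen in the current batch), here is a plain-Python sketch; it mimics the idea only and is not the actual DStream API:

```python
def map_with_state(batches, update):
    """Fold batches of (key, value) pairs into per-key state.

    `update` takes (key, value, old_state) and returns
    (new_state, emitted_record). Unlike an updateStateByKey-style full
    scan, only keys present in the current batch are updated.
    """
    state = {}
    emitted = []
    for batch in batches:
        for key, value in batch:
            new_state, out = update(key, value, state.get(key))
            state[key] = new_state
            emitted.append(out)
    return state, emitted

# Running word counts over two micro-batches.
def count_update(key, value, old):
    total = (old or 0) + value
    return total, (key, total)

final_state, outputs = map_with_state(
    [[("a", 1), ("b", 1)], [("a", 2)]], count_update)
print(final_state)  # {'a': 3, 'b': 1}
```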
>> MLlib
>>
>> New algorithms/models
>>
>> - SPARK-8518 <https://issues.apache.org/jira/browse/SPARK-8518> Survival
>> analysis - Log-linear model for survival analysis
>> - SPARK-9834 <https://issues.apache.org/jira/browse/SPARK-9834> Normal
>> equation for least squares - Normal equation solver, providing R-like
>> model summary statistics
>> - SPARK-3147 <https://issues.apache.org/jira/browse/SPARK-3147> Online
>> hypothesis testing - A/B testing in the Spark Streaming framework
>> - SPARK-9930 <https://issues.apache.org/jira/browse/SPARK-9930> New
>> feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
>> transformer
>> - SPARK-6517 <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting
>> K-Means clustering - Fast top-down clustering variant of K-Means
>>
>> API improvements
>>
>> - ML Pipelines
>> - SPARK-6725 <https://issues.apache.org/jira/browse/SPARK-6725> Pipeline
>> persistence - Save/load for ML Pipelines, with partial coverage of
>> spark.ml algorithms
>> - SPARK-5565 <https://issues.apache.org/jira/browse/SPARK-5565> LDA
>> in ML Pipelines - API for Latent Dirichlet Allocation in ML
>> Pipelines
>> - R API
>> - SPARK-9836 <https://issues.apache.org/jira/browse/SPARK-9836> R-like
>> statistics for GLMs - (Partial) R-like stats for ordinary least
>> squares via summary(model)
>> - SPARK-9681 <https://issues.apache.org/jira/browse/SPARK-9681> Feature
>> interactions in R formula - Interaction operator ":" in R formula
>> - Python API - Many improvements to Python API to approach feature
>> parity
>>
>> Misc improvements
>>
>> - SPARK-7685 <https://issues.apache.org/jira/browse/SPARK-7685>,
>> SPARK-9642 <https://issues.apache.org/jira/browse/SPARK-9642> Instance
>> weights for GLMs - Logistic and Linear Regression can take instance
>> weights
>> - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
>> SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate
>> and bivariate statistics in DataFrames - Variance, stddev,
>> correlations, etc.
>> - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM
>> data source - LIBSVM as a SQL data source
>>
>> Documentation improvements
>>
>> - SPARK-7751 <https://issues.apache.org/jira/browse/SPARK-7751> @since
>> versions - Documentation includes initial version when classes and
>> methods were added
>> - SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable
>> example code - Automated testing for code in user guide examples
>>
>> Deprecations
>>
>> - In spark.mllib.clustering.KMeans, the "runs" parameter has been
>> deprecated.
>> - In spark.ml.classification.LogisticRegressionModel and
>> spark.ml.regression.LinearRegressionModel, the "weights" field has been
>> deprecated, in favor of the new name "coefficients." This helps
>> disambiguate from instance (row) weights given to algorithms.
>>
>> Changes of behavior
>>
>> - spark.mllib.tree.GradientBoostedTrees validationTol has changed
>> semantics in 1.6. Previously, it was a threshold for absolute change in
>> error. Now, it resembles the behavior of GradientDescent convergenceTol:
>> For large errors, it uses relative error (relative to the previous error);
>> for small errors (< 0.01), it uses absolute error.
>> - spark.ml.feature.RegexTokenizer: Previously, it did not convert
>> strings to lowercase before tokenizing. Now, it converts to lowercase by
>> default, with an option not to. This matches the behavior of the simpler
>> Tokenizer transformer.
>> - Spark SQL's partition discovery has been changed to only discover
>> partition directories that are children of the given path. (i.e. if
>> path="/my/data/x=1" then x=1 will no longer be considered a partition
>> but only children of x=1.) This behavior can be overridden by
>> manually specifying the basePath that partitioning discovery should
>> start with (SPARK-11678
>> <https://issues.apache.org/jira/browse/SPARK-11678>).
>> - When casting a value of an integral type to timestamp (e.g. casting
>> a long value to timestamp), the value is treated as being in seconds
>> instead of milliseconds (SPARK-11724
>> <https://issues.apache.org/jira/browse/SPARK-11724>).
>> - With the improved query planner for queries having distinct
>> aggregations (SPARK-9241
>> <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
>> query having a single distinct aggregation has been changed to a more
>> robust version. To switch back to the plan generated by Spark 1.5's
>> planner, please set spark.sql.specializeSingleDistinctAggPlanning to
>> true (SPARK-12077 <https://issues.apache.org/jira/browse/SPARK-12077>
>> ).
>>
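The integral-to-timestamp change above (SPARK-11724) means a long such as 1450310400 is now interpreted as seconds since the epoch rather than milliseconds. A small Python sketch of the before/after interpretation (this models the semantics only, not Spark's cast implementation):

```python
from datetime import datetime, timezone

def cast_long_to_timestamp(value, unit="s"):
    """Interpret an integral value as a UTC timestamp.

    unit="s"  -> Spark 1.6 semantics (value is seconds since the epoch)
    unit="ms" -> pre-1.6 semantics (value is milliseconds since the epoch)
    """
    seconds = value if unit == "s" else value / 1000.0
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

v = 1450310400
print(cast_long_to_timestamp(v))        # 1.6 semantics: a date in Dec 2015
print(cast_long_to_timestamp(v, "ms"))  # old semantics: a date in Jan 1970
```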
>>
>
>
> --
> Luciano Resende
> http://people.apache.org/~lresende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/
>
--
Best Regards
Jeff Zhang
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Luciano Resende <lu...@gmail.com>.
+1 (non-binding)
Tested standalone mode, SparkR, and a couple of streaming apps; all seem OK.
On Wed, Dec 16, 2015 at 1:32 PM, Michael Armbrust <mi...@databricks.com>
wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 1.6.0!
>
> The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.0
> [ ] -1 Do not release this package because ...
--
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Michael Armbrust <mi...@databricks.com>.
+1
On Wed, Dec 16, 2015 at 4:37 PM, Andrew Or <an...@databricks.com> wrote:
> +1
>
> Mesos cluster mode regression in RC2 is now fixed (SPARK-12345
> <https://issues.apache.org/jira/browse/SPARK-12345> / PR10332
> <https://github.com/apache/spark/pull/10332>).
>
> Also tested on standalone client and cluster mode. No problems.
>
> 2015-12-16 15:16 GMT-08:00 Rad Gruchalski <ra...@gruchalski.com>:
>
>> I also noticed that spark.replClassServer.host and
>> spark.replClassServer.port aren’t used anymore. The transport now happens
>> over the main RpcEnv.
>>
>> Kind regards,
>> Radek Gruchalski
>> radek@gruchalski.com <ra...@gruchalski.com>
>> de.linkedin.com/in/radgruchalski/
>>
>>
>> *Confidentiality:*This communication is intended for the above-named
>> person and may be confidential and/or legally privileged.
>> If it has come to you in error you must take no action based on it, nor
>> must you copy or show it to anyone; please delete/destroy and inform the
>> sender immediately.
>>
>> On Wednesday, 16 December 2015 at 23:43, Marcelo Vanzin wrote:
>>
>> I was going to say that spark.executor.port is not used anymore in
>> 1.6, but damn, there's still that akka backend hanging around there
>> even when netty is being used... we should fix this, should be a
>> simple one-liner.
>>
>> On Wed, Dec 16, 2015 at 2:35 PM, singinpirate <th...@gmail.com>
>> wrote:
>>
>> -0 (non-binding)
>>
>> I have observed that when we set spark.executor.port in 1.6, we get
>> thrown a
>> NPE in SparkEnv$.create(SparkEnv.scala:259). It used to work in 1.5.2. Is
>> anyone else seeing this?
>>
>>
>> --
>> Marcelo
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> For additional commands, e-mail: dev-help@spark.apache.org
>>
>>
>>
>
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Yin Huai <yh...@databricks.com>.
+1
On Wed, Dec 16, 2015 at 7:19 PM, Patrick Wendell <pw...@gmail.com> wrote:
> +1
>
> On Wed, Dec 16, 2015 at 6:15 PM, Ted Yu <yu...@gmail.com> wrote:
>
>> Ran test suite (minus docker-integration-tests)
>> All passed
>>
>> +1
>>
>> [INFO] Spark Project External ZeroMQ ...................... SUCCESS [
>> 13.647 s]
>> [INFO] Spark Project External Kafka ....................... SUCCESS [
>> 45.424 s]
>> [INFO] Spark Project Examples ............................. SUCCESS
>> [02:06 min]
>> [INFO] Spark Project External Kafka Assembly .............. SUCCESS [
>> 11.280 s]
>> [INFO]
>> ------------------------------------------------------------------------
>> [INFO] BUILD SUCCESS
>> [INFO]
>> ------------------------------------------------------------------------
>> [INFO] Total time: 01:49 h
>> [INFO] Finished at: 2015-12-16T17:06:58-08:00
>>
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Patrick Wendell <pw...@gmail.com>.
+1
On Wed, Dec 16, 2015 at 6:15 PM, Ted Yu <yu...@gmail.com> wrote:
> Ran test suite (minus docker-integration-tests)
> All passed
>
> +1
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Ted Yu <yu...@gmail.com>.
Ran test suite (minus docker-integration-tests)
All passed
+1
[INFO] Spark Project External ZeroMQ ...................... SUCCESS [
13.647 s]
[INFO] Spark Project External Kafka ....................... SUCCESS [
45.424 s]
[INFO] Spark Project Examples ............................. SUCCESS [02:06
min]
[INFO] Spark Project External Kafka Assembly .............. SUCCESS [
11.280 s]
[INFO]
------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO]
------------------------------------------------------------------------
[INFO] Total time: 01:49 h
[INFO] Finished at: 2015-12-16T17:06:58-08:00
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Iulian Dragoș <iu...@typesafe.com>.
-0 (non-binding)
Unfortunately the Mesos cluster regression is still there (see my comment
<https://github.com/apache/spark/pull/10332/files#r47902198> for
explanations). I'm not voting to delay the release any longer though.
We tested (and passed) Mesos in:
- client mode
- fine/coarse-grained
- with/without roles
iulian
--
Iulian Dragos
------
Reactive Apps on the JVM
www.typesafe.com
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Andrew Or <an...@databricks.com>.
+1
Mesos cluster mode regression in RC2 is now fixed (SPARK-12345
<https://issues.apache.org/jira/browse/SPARK-12345> / PR10332
<https://github.com/apache/spark/pull/10332>).
Also tested on standalone client and cluster mode. No problems.
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Rad Gruchalski <ra...@gruchalski.com>.
I also noticed that spark.replClassServer.host and spark.replClassServer.port aren’t used anymore. The transport now happens over the main RpcEnv.
Kind regards,
Radek Gruchalski
radek@gruchalski.com (mailto:radek@gruchalski.com)
(mailto:radek@gruchalski.com)
de.linkedin.com/in/radgruchalski/ (http://de.linkedin.com/in/radgruchalski/)
On Wednesday, 16 December 2015 at 23:43, Marcelo Vanzin wrote:
> I was going to say that spark.executor.port is not used anymore in
> 1.6, but damn, there's still that akka backend hanging around there
> even when netty is being used... we should fix this, should be a
> simple one-liner.
>
> On Wed, Dec 16, 2015 at 2:35 PM, singinpirate <thesinginpirate@gmail.com (mailto:thesinginpirate@gmail.com)> wrote:
> > -0 (non-binding)
> >
> > I have observed that when we set spark.executor.port in 1.6, we get thrown a
> > NPE in SparkEnv$.create(SparkEnv.scala:259). It used to work in 1.5.2. Is
> > anyone else seeing this?
> >
>
>
> --
> Marcelo
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org (mailto:dev-unsubscribe@spark.apache.org)
> For additional commands, e-mail: dev-help@spark.apache.org (mailto:dev-help@spark.apache.org)
>
>
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Marcelo Vanzin <va...@cloudera.com>.
I was going to say that spark.executor.port is not used anymore in
1.6, but damn, there's still that akka backend hanging around there
even when netty is being used... we should fix this, should be a
simple one-liner.
On Wed, Dec 16, 2015 at 2:35 PM, singinpirate <th...@gmail.com> wrote:
> -0 (non-binding)
>
> I have observed that when we set spark.executor.port in 1.6, we get thrown a
> NPE in SparkEnv$.create(SparkEnv.scala:259). It used to work in 1.5.2. Is
> anyone else seeing this?
--
Marcelo
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by singinpirate <th...@gmail.com>.
-0 (non-binding)
I have observed that when we set spark.executor.port in 1.6, we get thrown
an NPE in SparkEnv$.create(SparkEnv.scala:259). It used to work in 1.5.2. Is
anyone else seeing this?
On Wed, Dec 16, 2015 at 2:26 PM Jiří Syrový <sy...@gmail.com> wrote:
> +1 Tested in standalone mode and so far seems to be fairly stable.
>
> 2015-12-16 22:32 GMT+01:00 Michael Armbrust <mi...@databricks.com>:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.6.0!
>>
>> The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.6.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is *v1.6.0-rc3
>> (168c89e07c51fa24b0bb88582c739cec0acb44d7)
>> <https://github.com/apache/spark/tree/v1.6.0-rc3>*
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1174/
>>
>> The test repository (versioned as v1.6.0-rc3) for this release can be
>> found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1173/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
>>
>> =======================================
>> == How can I help test this release? ==
>> =======================================
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> ================================================
>> == What justifies a -1 vote for this release? ==
>> ================================================
>> This vote is happening towards the end of the 1.6 QA period, so -1 votes
>> should only occur for significant regressions from 1.5. Bugs already
>> present in 1.5, minor regressions, or bugs related to new features will not
>> block this release.
>>
>> ===============================================================
>> == What should happen to JIRA tickets still targeting 1.6.0? ==
>> ===============================================================
>> 1. It is OK for documentation patches to target 1.6.0 and still go into
>> branch-1.6, since documentations will be published separately from the
>> release.
>> 2. New features for non-alpha-modules should target 1.7+.
>> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
>> version.
>>
>>
>> ==================================================
>> == Major changes to help you focus your testing ==
>> ==================================================
>>
>> Notable changes since 1.6 RC2
>> - SPARK_VERSION has been set correctly
>> - SPARK-12199 ML Docs are publishing correctly
>> - SPARK-12345 Mesos cluster mode has been fixed
>>
>> Notable changes since 1.6 RC1
>> Spark Streaming
>>
>> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
>> trackStateByKey has been renamed to mapWithState
>>
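The core idea behind the renamed API (per-record access to mutable per-key state, rather than recomputing all state as updateStateByKey did) can be sketched in plain Python. This is an illustrative model only, not the Spark API; mapWithState itself is a DStream transformation and its real signature also carries a State handle and timeouts.

```python
# Illustrative model (plain Python, NOT the Spark API) of what mapWithState does:
# a user function sees each (key, value) record plus that key's previous state,
# updates the state, and emits one mapped output record per input record.
def map_with_state(batch, state, fn):
    """Apply fn(key, value, old_state) -> (new_state, output) to each record."""
    out = []
    for key, value in batch:
        new_state, mapped = fn(key, value, state.get(key))
        state[key] = new_state
        out.append(mapped)
    return out

# Running word counts: the state is the count so far; each record emits (word, total).
def count_fn(key, value, old):
    total = (old or 0) + value
    return total, (key, total)

state = {}
print(map_with_state([("a", 1), ("b", 1), ("a", 1)], state, count_fn))
# [('a', 1), ('b', 1), ('a', 2)]
```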
>> Spark SQL
>>
>> - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
>> SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix
>> bugs in eviction of storage memory by execution.
>> - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258> correct
>> passing null into ScalaUDF
>>
>> Notable Features Since 1.5
>>
>> Spark SQL
>>
>> - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> Parquet
>> Performance - Improve Parquet scan performance when using flat
>> schemas.
>> - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
>> Session Management - Isolated default database (i.e. USE mydb) even on
>> shared clusters.
>> - SPARK-9999 <https://issues.apache.org/jira/browse/SPARK-9999> Dataset
>> API - A type-safe API (similar to RDDs) that performs many operations
>> on serialized binary data and code generation (i.e. Project Tungsten).
>> - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000> Unified
>> Memory Management - Shared memory for execution and caching instead
>> of exclusive division of the regions.
>> - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
>> Queries on Files - Concise syntax for running SQL queries over files
>> of any supported format without registering a table.
>> - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745> Reading
>> non-standard JSON files - Added options to read non-standard JSON
>> files (e.g. single-quotes, unquoted attributes)
>> - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412> Per-operator
>> Metrics for SQL Execution - Display statistics on a per-operator basis
>> for memory usage and spilled data size.
>> - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
>> (*) expansion for StructTypes - Makes it easier to nest and unnest
>> arbitrary numbers of columns
>> - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
>> SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149> In-memory
>> Columnar Cache Performance - Significant (up to 14x) speed up when
>> caching data that contains complex types in DataFrames or SQL.
>> - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
>> null-safe joins - Joins using null-safe equality (<=>) will now
>> execute using SortMergeJoin instead of computing a cartesian product.
>> - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
>> Execution Using Off-Heap Memory - Support for configuring query
>> execution to occur using off-heap memory to avoid GC overhead
>> - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978> Datasource
>> API Avoid Double Filter - When implementing a datasource with filter
>> pushdown, developers can now tell Spark SQL to avoid double evaluating a
>> pushed-down filter.
>> - SPARK-4849 <https://issues.apache.org/jira/browse/SPARK-4849> Advanced
>> Layout of Cached Data - storing partitioning and ordering schemes in
>> In-memory table scan, and adding distributeBy and localSort to DF API
>> - SPARK-9858 <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive
>> query execution - Initial support for automatically selecting the
>> number of reducers for joins and aggregations.
>> - SPARK-9241 <https://issues.apache.org/jira/browse/SPARK-9241> Improved
>> query planner for queries having distinct aggregations - Query plans
>> of distinct aggregations are more robust when distinct columns have high
>> cardinality.
>>
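The fast null-safe joins item above (SPARK-11111) hinges on the semantics of the <=> operator. A minimal sketch in plain Python (illustrative only; Spark evaluates this inside SQL):

```python
# Semantics of SQL's null-safe equality operator <=> (SPARK-11111):
# unlike plain =, it treats two NULLs as equal and never returns NULL,
# which is what lets Spark plan it as a SortMergeJoin key instead of
# falling back to a Cartesian product.
def null_safe_eq(a, b):
    if a is None and b is None:
        return True        # NULL <=> NULL is true (plain = would yield NULL)
    if a is None or b is None:
        return False       # NULL <=> x is false
    return a == b

print(null_safe_eq(None, None))  # True
print(null_safe_eq(None, 1))     # False
print(null_safe_eq(1, 1))        # True
```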
>> Spark Streaming
>>
>> - API Updates
>> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629> New
>> improved state management - mapWithState - a DStream
>> transformation for stateful stream processing, supersedes
>> updateStateByKey in functionality and performance.
>> - SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198> Kinesis
>> record deaggregation - Kinesis streams have been upgraded to use
>> KCL 1.4.0 and support transparent deaggregation of KPL-aggregated records.
>> - SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891> Kinesis
>> message handler function - Allows an arbitrary function to be
>> applied to a Kinesis record in the Kinesis receiver to customize
>> what data is to be stored in memory.
>> - SPARK-6328 <https://issues.apache.org/jira/browse/SPARK-6328> Python
>> Streaming Listener API - Get streaming statistics (scheduling
>> delays, batch processing times, etc.) in streaming.
>>
>>
>> - UI Improvements
>> - Made failures visible in the streaming tab, in the timelines,
>> batch list, and batch details page.
>> - Made output operations visible in the streaming tab as progress
>> bars.
>>
>> MLlib
>>
>> New algorithms/models
>>
>> - SPARK-8518 <https://issues.apache.org/jira/browse/SPARK-8518> Survival
>> analysis - Log-linear model for survival analysis
>> - SPARK-9834 <https://issues.apache.org/jira/browse/SPARK-9834> Normal
>> equation for least squares - Normal equation solver, providing R-like
>> model summary statistics
>> - SPARK-3147 <https://issues.apache.org/jira/browse/SPARK-3147> Online
>> hypothesis testing - A/B testing in the Spark Streaming framework
>> - SPARK-9930 <https://issues.apache.org/jira/browse/SPARK-9930> New
>> feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
>> transformer
>> - SPARK-6517 <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting
>> K-Means clustering - Fast top-down clustering variant of K-Means
>>
>> API improvements
>>
>> - ML Pipelines
>> - SPARK-6725 <https://issues.apache.org/jira/browse/SPARK-6725> Pipeline
>> persistence - Save/load for ML Pipelines, with partial coverage of
>> spark.ml algorithms
>> - SPARK-5565 <https://issues.apache.org/jira/browse/SPARK-5565> LDA
>> in ML Pipelines - API for Latent Dirichlet Allocation in ML
>> Pipelines
>> - R API
>> - SPARK-9836 <https://issues.apache.org/jira/browse/SPARK-9836> R-like
>> statistics for GLMs - (Partial) R-like stats for ordinary least
>> squares via summary(model)
>> - SPARK-9681 <https://issues.apache.org/jira/browse/SPARK-9681> Feature
>> interactions in R formula - Interaction operator ":" in R formula
>> - Python API - Many improvements to Python API to approach feature
>> parity
>>
>> Misc improvements
>>
>> - SPARK-7685 <https://issues.apache.org/jira/browse/SPARK-7685>,
>> SPARK-9642 <https://issues.apache.org/jira/browse/SPARK-9642> Instance
>> weights for GLMs - Logistic and Linear Regression can take instance
>> weights
>> - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
>> SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate
>> and bivariate statistics in DataFrames - Variance, stddev,
>> correlations, etc.
>> - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM
>> data source - LIBSVM as a SQL data source
>>
>> Documentation improvements
>> - SPARK-7751 <https://issues.apache.org/jira/browse/SPARK-7751> @since
>> versions - Documentation includes initial version when classes and
>> methods were added
>> - SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable
>> example code - Automated testing for code in user guide examples
>>
>> Deprecations
>>
>> - In spark.mllib.clustering.KMeans, the "runs" parameter has been
>> deprecated.
>> - In spark.ml.classification.LogisticRegressionModel and
>> spark.ml.regression.LinearRegressionModel, the "weights" field has been
>> deprecated, in favor of the new name "coefficients." This helps
>> disambiguate from instance (row) weights given to algorithms.
>>
>> Changes of behavior
>>
>> - spark.mllib.tree.GradientBoostedTrees validationTol has changed
>> semantics in 1.6. Previously, it was a threshold for absolute change in
>> error. Now, it resembles the behavior of GradientDescent convergenceTol:
>> For large errors, it uses relative error (relative to the previous error);
>> for small errors (< 0.01), it uses absolute error.
>> - spark.ml.feature.RegexTokenizer: Previously, it did not convert
>> strings to lowercase before tokenizing. Now, it converts to lowercase by
>> default, with an option not to. This matches the behavior of the simpler
>> Tokenizer transformer.
>> - Spark SQL's partition discovery has been changed to only discover
>> partition directories that are children of the given path. (i.e. if
>> path="/my/data/x=1" then x=1 will no longer be considered a partition
>> but only children of x=1.) This behavior can be overridden by
>> manually specifying the basePath that partitioning discovery should
>> start with (SPARK-11678
>> <https://issues.apache.org/jira/browse/SPARK-11678>).
>> - When casting a value of an integral type to timestamp (e.g. casting
>> a long value to timestamp), the value is treated as being in seconds
>> instead of milliseconds (SPARK-11724
>> <https://issues.apache.org/jira/browse/SPARK-11724>).
>> - With the improved query planner for queries having distinct
>> aggregations (SPARK-9241
>> <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
>> query having a single distinct aggregation has been changed to a more
>> robust version. To switch back to the plan generated by Spark 1.5's
>> planner, please set spark.sql.specializeSingleDistinctAggPlanning to
>> true (SPARK-12077 <https://issues.apache.org/jira/browse/SPARK-12077>
>> ).
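The integral-to-timestamp change above (SPARK-11724) is easy to see with epoch arithmetic. A sketch in plain Python; the function names are illustrative, not Spark APIs:

```python
from datetime import datetime, timezone

# SPARK-11724: casting an integral value to timestamp now interprets it as
# seconds since the epoch; Spark 1.5 interpreted the same value as milliseconds.
def cast_long_to_timestamp_16(v):
    return datetime.fromtimestamp(v, tz=timezone.utc)           # 1.6: seconds

def cast_long_to_timestamp_15(v):
    return datetime.fromtimestamp(v / 1000.0, tz=timezone.utc)  # 1.5: milliseconds

print(cast_long_to_timestamp_16(86400))  # 1970-01-02 00:00:00+00:00
print(cast_long_to_timestamp_15(86400))  # 1970-01-01 00:01:26.400000+00:00
```

The same literal therefore yields timestamps a factor of 1000 apart across the two versions, which is worth checking in any workload that casts longs to timestamps.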
>>
>>
>
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Jiří Syrový <sy...@gmail.com>.
+1 Tested in standalone mode and so far seems to be fairly stable.
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Timothy O <to...@yahoo-inc.com.INVALID>.
+1
On Thursday, December 17, 2015 8:22 AM, Kousuke Saruta <sa...@oss.nttdata.co.jp> wrote:
+1
On 2015/12/17 6:32, Michael Armbrust wrote:
Please vote on releasing the following candidate as Apache Spark version 1.6.0!
The vote is open until Saturday, December 19, 2015 at 18:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast.
[ ] +1 Release this package as Apache Spark 1.6.0
[ ] -1 Do not release this package because ...
To learn more about Apache Spark, please see http://spark.apache.org/
The tag to be voted on is v1.6.0-rc3 (168c89e07c51fa24b0bb88582c739cec0acb44d7)
The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc
The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1174/
The test repository (versioned as v1.6.0-rc3) for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1173/
The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
=======================================
== How can I help test this release? ==
=======================================
If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions.
================================================
== What justifies a -1 vote for this release? ==
================================================
This vote is happening towards the end of the 1.6 QA period, so -1 votes should only occur for significant regressions from 1.5. Bugs already present in 1.5, minor regressions, or bugs related to new features will not block this release.
===============================================================
== What should happen to JIRA tickets still targeting 1.6.0? ==
===============================================================
1. It is OK for documentation patches to target 1.6.0 and still go into branch-1.6, since documentations will be published separately from the release.
2. New features for non-alpha-modules should target 1.7+.
3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target version.
==================================================
== Major changes to help you focus your testing ==
==================================================
Notable changes since 1.6 RC2
- SPARK_VERSION has been set correctly
- SPARK-12199 ML Docs are publishing correctly
- SPARK-12345 Mesos cluster mode has been fixed
Notable changes since 1.6 RC1
Spark Streaming
- SPARK-2629 trackStateByKey has been renamed to mapWithState
Spark SQL
- SPARK-12165 SPARK-12189 Fix bugs in eviction of storage memory by execution.
- SPARK-12258 correct passing null into ScalaUDF
Notable Features Since 1.5
Spark SQL
- SPARK-11787 Parquet Performance - Improve Parquet scan performance when using flat schemas.
- SPARK-10810 Session Management - Isolated default database (i.e. USE mydb) even on shared clusters.
- SPARK-9999 Dataset API - A type-safe API (similar to RDDs) that performs many operations on serialized binary data and code generation (i.e. Project Tungsten).
- SPARK-10000 Unified Memory Management - Shared memory for execution and caching instead of exclusive division of the regions.
- SPARK-11197 SQL Queries on Files - Concise syntax for running SQL queries over files of any supported format without registering a table.
- SPARK-11745 Reading non-standard JSON files - Added options to read non-standard JSON files (e.g. single-quotes, unquoted attributes)
- SPARK-10412 Per-operator Metrics for SQL Execution - Display statistics on a per-operator basis for memory usage and spilled data size.
- SPARK-11329 Star (*) expansion for StructTypes - Makes it easier to nest and unnest arbitrary numbers of columns.
- SPARK-10917, SPARK-11149 In-memory Columnar Cache Performance - Significant (up to 14x) speed up when caching data that contains complex types in DataFrames or SQL.
- SPARK-11111 Fast null-safe joins - Joins using null-safe equality (<=>) will now execute using SortMergeJoin instead of computing a cartesian product.
- SPARK-11389 SQL Execution Using Off-Heap Memory - Support for configuring query execution to occur using off-heap memory to avoid GC overhead.
- SPARK-10978 Datasource API Avoid Double Filter - When implementing a datasource with filter pushdown, developers can now tell Spark SQL to avoid double-evaluating a pushed-down filter.
- SPARK-4849 Advanced Layout of Cached Data - Stores partitioning and ordering schemes in the in-memory table scan, and adds distributeBy and localSort to the DataFrame API.
- SPARK-9858 Adaptive query execution - Initial support for automatically selecting the number of reducers for joins and aggregations.
- SPARK-9241 Improved query planner for queries having distinct aggregations - Query plans of distinct aggregations are more robust when distinct columns have high cardinality.
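As a refresher on the <=> semantics behind SPARK-11111, here is a plain-Python sketch, with None standing in for SQL NULL. This is an illustration of the semantics only, not Spark code:

```python
# None stands in for SQL NULL. Ordinary SQL equality (=) evaluates to NULL
# (unknown) when either side is NULL; null-safe equality (<=>) always
# evaluates to a plain boolean.
def sql_eq(a, b):
    if a is None or b is None:
        return None                      # unknown
    return a == b

def null_safe_eq(a, b):
    if a is None or b is None:
        return a is None and b is None   # NULL <=> NULL is true
    return a == b

print(sql_eq(None, None), null_safe_eq(None, None))  # None True
print(sql_eq(1, None),    null_safe_eq(1, None))     # None False
print(sql_eq(1, 1),       null_safe_eq(1, 1))        # True True
```

Because <=> always yields a definite boolean, a join keyed on it behaves like an ordinary equi-join, which is what lets the planner pick SortMergeJoin instead of a cartesian product.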
Spark Streaming
- API Updates
- SPARK-2629 New improved state management - mapWithState - a DStream transformation for stateful stream processing, superseding updateStateByKey in functionality and performance.
- SPARK-11198 Kinesis record deaggregation - Kinesis streams have been upgraded to use KCL 1.4.0 and support transparent deaggregation of KPL-aggregated records.
- SPARK-10891 Kinesis message handler function - Allows an arbitrary function to be applied to a Kinesis record in the Kinesis receiver to customize what data is stored in memory.
- SPARK-6328 Python Streaming Listener API - Get streaming statistics (scheduling delays, batch processing times, etc.) in streaming.
- UI Improvements
- Made failures visible in the streaming tab, in the timelines, batch list, and batch details page.
- Made output operations visible in the streaming tab as progress bars.
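The mapWithState semantics described above can be sketched per batch in plain Python. This is a toy model of the per-key contract (the user function sees the previous state for a key and returns the new state plus an output record), not the Spark API:

```python
# Toy model of mapWithState: for each (key, value) in a batch, a user
# function receives the key, the value, and the previous state for that
# key, and returns (new_state, output_record). State persists across batches.
def map_with_state(batch, states, fn):
    out = []
    for key, value in batch:
        new_state, record = fn(key, value, states.get(key))
        states[key] = new_state
        out.append(record)
    return out

# Running word count: the state per key is the count so far.
def count_fn(key, value, prev):
    total = (prev or 0) + value
    return total, (key, total)

states = {}
print(map_with_state([("a", 1), ("b", 1), ("a", 1)], states, count_fn))
# [('a', 1), ('b', 1), ('a', 2)]
print(states)  # {'a': 2, 'b': 1}
```

Unlike updateStateByKey, which touches every key's state on every batch, mapWithState only processes keys present in the current batch (plus timeouts), which is where most of the performance win comes from.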
MLlib
New algorithms/models
- SPARK-8518 Survival analysis - Log-linear model for survival analysis
- SPARK-9834 Normal equation for least squares - Normal equation solver, providing R-like model summary statistics
- SPARK-3147 Online hypothesis testing - A/B testing in the Spark Streaming framework
- SPARK-9930 New feature transformers - ChiSqSelector, QuantileDiscretizer, SQL transformer
- SPARK-6517 Bisecting K-Means clustering - Fast top-down clustering variant of K-Means
API improvements
- ML Pipelines
- SPARK-6725 Pipeline persistence - Save/load for ML Pipelines, with partial coverage of spark.ml algorithms
- SPARK-5565 LDA in ML Pipelines - API for Latent Dirichlet Allocation in ML Pipelines
- R API
- SPARK-9836 R-like statistics for GLMs - (Partial) R-like stats for ordinary least squares via summary(model)
- SPARK-9681 Feature interactions in R formula - Interaction operator ":" in R formula
- Python API - Many improvements to Python API to approach feature parity
Misc improvements
- SPARK-7685, SPARK-9642 Instance weights for GLMs - Logistic and Linear Regression can take instance weights
- SPARK-10384, SPARK-10385 Univariate and bivariate statistics in DataFrames - Variance, stddev, correlations, etc.
- SPARK-10117 LIBSVM data source - LIBSVM as a SQL data source
Documentation improvements
- SPARK-7751 @since versions - Documentation includes initial version when classes and methods were added
- SPARK-11337 Testable example code - Automated testing for code in user guide examples
Deprecations
- In spark.mllib.clustering.KMeans, the "runs" parameter has been deprecated.
- In spark.ml.classification.LogisticRegressionModel and spark.ml.regression.LinearRegressionModel, the "weights" field has been deprecated, in favor of the new name "coefficients." This helps disambiguate from instance (row) weights given to algorithms.
Changes of behavior
- spark.mllib.tree.GradientBoostedTrees validationTol has changed semantics in 1.6. Previously, it was a threshold for absolute change in error. Now, it resembles the behavior of GradientDescent convergenceTol: For large errors, it uses relative error (relative to the previous error); for small errors (< 0.01), it uses absolute error.
- spark.ml.feature.RegexTokenizer: Previously, it did not convert strings to lowercase before tokenizing. Now, it converts to lowercase by default, with an option not to. This matches the behavior of the simpler Tokenizer transformer.
- Spark SQL's partition discovery has been changed to only discover partition directories that are children of the given path. (i.e. if path="/my/data/x=1" then x=1 will no longer be considered a partition but only children of x=1.) This behavior can be overridden by manually specifying the basePath that partitioning discovery should start with (SPARK-11678).
- When casting a value of an integral type to timestamp (e.g. casting a long value to timestamp), the value is treated as being in seconds instead of milliseconds (SPARK-11724).
- With the improved query planner for queries having distinct aggregations (SPARK-9241), the plan of a query having a single distinct aggregation has been changed to a more robust version. To switch back to the plan generated by Spark 1.5's planner, please set spark.sql.specializeSingleDistinctAggPlanning to true (SPARK-12077).
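The integral-to-timestamp change (SPARK-11724) is easy to see with plain Python datetimes. This is a sketch of the two interpretations, not Spark code:

```python
from datetime import datetime, timezone

value = 1450000000  # an example long stored in a column

# Spark 1.6: the value is interpreted as seconds since the epoch.
as_seconds = datetime.fromtimestamp(value, tz=timezone.utc)
# Spark 1.5 read the same value as milliseconds since the epoch.
as_millis = datetime.fromtimestamp(value / 1000, tz=timezone.utc)

print(as_seconds.date())  # 2015-12-13
print(as_millis.date())   # 1970-01-17
```

If your data stores epoch milliseconds, divide by 1000 before casting when upgrading.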
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Michael Gummelt <mg...@mesosphere.io>.
The fix for the Mesos cluster regression has introduced another Mesos
cluster bug. Namely, the MesosClusterDispatcher crashes when trying to
write to ZK: https://issues.apache.org/jira/browse/SPARK-12413
I have a tentative fix here: https://github.com/apache/spark/pull/10366
On Thu, Dec 17, 2015 at 2:07 PM, Andrew Or <an...@databricks.com> wrote:
> That seems like an HDP-specific issue. I did a quick search on "spark bad
> substitution" and all the results have to do with people failing to run
> YARN cluster in HDP. Here is a workaround
> <https://mail-archives.apache.org/mod_mbox/spark-user/201509.mbox/%3CCABoKCLHrRLj6m3w+4Z2OQcOBr-aMZetut8AceNZgQLCs_OG_aw@mail.gmail.com%3E>
> that seems to have worked for multiple people.
>
> I would not block the release on this particular issue. First, this
> doesn't seem like a Spark issue and second, even if it is, this only
> affects a small number of users and there is a workaround for it. In my own
> testing the `extraJavaOptions` are propagated correctly in both YARN client
> and cluster modes.
>
> 2015-12-17 12:36 GMT-08:00 Sebastian YEPES FERNANDEZ <sy...@gmail.com>:
>
>> @Andrew
>> Thanks for the reply, did you run this in a Hortonworks or Cloudera
>> cluster?
>> I suspect the issue is coming from the extraJavaOptions as these are
>> necessary in HDP, the strange thing is that with exactly the same settings
>> 1.5 works.
>>
>> # jar -tf spark-assembly-1.6.0-SNAPSHOT-hadoop2.7.1.jar | grep
>> ApplicationMaster.class
>> org/apache/spark/deploy/yarn/ApplicationMaster.class
>>
>> ----
>> Exit code: 1
>> Exception message:
>> /hadoop/hdfs/disk02/hadoop/yarn/local/usercache/syepes/appcache/application_1445706872927_1593/container_e44_1445706872927_1593_02_000001/launch_container.sh:
>> line 24:
>> /usr/hdp/current/hadoop-client/lib/hadoop-lzo-0.6.0.2.3.2.0-2950.jar:$PWD:$PWD/__spark_conf__:$PWD/__spark__.jar:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure:
>> bad substitution
>> -----
>>
>> Regards,
>> Sebastian
>>
>> On Thu, Dec 17, 2015 at 9:14 PM, Andrew Or <an...@databricks.com> wrote:
>>
>>> @syepes
>>>
>>> I just run Spark 1.6 (881f254) on YARN with Hadoop 2.4.0. I was able to
>>> run a simple application in cluster mode successfully.
>>>
>>> Can you verify whether the org.apache.spark.yarn.ApplicationMaster class
>>> exists in your assembly jar?
>>>
>>> jar -tf assembly.jar | grep ApplicationMaster
>>>
>>> -Andrew
>>>
>>>
>>> 2015-12-17 7:44 GMT-08:00 syepes <sy...@gmail.com>:
>>>
>>>> -1 (YARN Cluster deployment mode not working)
>>>>
>>>> I have just tested 1.6 (d509194b) on our HDP 2.3 platform and the
>>>> cluster
>>>> mode does not seem work. It looks like some parameter are not being
>>>> passed
>>>> correctly.
>>>> This example works correctly with 1.5.
>>>>
>>>> # spark-submit --master yarn --deploy-mode cluster --num-executors 1
>>>> --properties-file $PWD/spark-props.conf --class
>>>> org.apache.spark.examples.SparkPi
>>>> /opt/spark/lib/spark-examples-1.6.0-SNAPSHOT-hadoop2.7.1.jar
>>>>
>>>> Error: Could not find or load main class
>>>> org.apache.spark.deploy.yarn.ApplicationMaster
>>>>
>>>> spark-props.conf
>>>> -----------------------------
>>>> spark.driver.extraJavaOptions -Dhdp.version=2.3.2.0-2950
>>>> spark.driver.extraLibraryPath
>>>>
>>>> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
>>>> spark.executor.extraJavaOptions -Dhdp.version=2.3.2.0-2950
>>>> spark.executor.extraLibraryPath
>>>>
>>>> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
>>>> -----------------------------
>>>>
>>>> I will try to do some more debugging on this issue.
>>>>
>>>>
>>>>
>>>>
>>>>
--
Michael Gummelt
Software Engineer
Mesosphere
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Andrew Or <an...@databricks.com>.
That seems like an HDP-specific issue. I did a quick search on "spark bad
substitution" and all the results have to do with people failing to run
YARN cluster in HDP. Here is a workaround
<https://mail-archives.apache.org/mod_mbox/spark-user/201509.mbox/%3CCABoKCLHrRLj6m3w+4Z2OQcOBr-aMZetut8AceNZgQLCs_OG_aw@mail.gmail.com%3E>
that seems to have worked for multiple people.
I would not block the release on this particular issue. First, this doesn't
seem like a Spark issue and second, even if it is, this only affects a
small number of users and there is a workaround for it. In my own testing
the `extraJavaOptions` are propagated correctly in both YARN client and
cluster modes.
2015-12-17 12:36 GMT-08:00 Sebastian YEPES FERNANDEZ <sy...@gmail.com>:
> @Andrew
> Thanks for the reply, did you run this in a Hortonworks or Cloudera
> cluster?
> I suspect the issue is coming from the extraJavaOptions as these are
> necessary in HDP, the strange thing is that with exactly the same settings
> 1.5 works.
>
> # jar -tf spark-assembly-1.6.0-SNAPSHOT-hadoop2.7.1.jar | grep
> ApplicationMaster.class
> org/apache/spark/deploy/yarn/ApplicationMaster.class
>
> ----
> Exit code: 1
> Exception message:
> /hadoop/hdfs/disk02/hadoop/yarn/local/usercache/syepes/appcache/application_1445706872927_1593/container_e44_1445706872927_1593_02_000001/launch_container.sh:
> line 24:
> /usr/hdp/current/hadoop-client/lib/hadoop-lzo-0.6.0.2.3.2.0-2950.jar:$PWD:$PWD/__spark_conf__:$PWD/__spark__.jar:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure:
> bad substitution
> -----
>
> Regards,
> Sebastian
>
> On Thu, Dec 17, 2015 at 9:14 PM, Andrew Or <an...@databricks.com> wrote:
>
>> @syepes
>>
>> I just run Spark 1.6 (881f254) on YARN with Hadoop 2.4.0. I was able to
>> run a simple application in cluster mode successfully.
>>
>> Can you verify whether the org.apache.spark.yarn.ApplicationMaster class
>> exists in your assembly jar?
>>
>> jar -tf assembly.jar | grep ApplicationMaster
>>
>> -Andrew
>>
>>
>> 2015-12-17 7:44 GMT-08:00 syepes <sy...@gmail.com>:
>>
>>> -1 (YARN Cluster deployment mode not working)
>>>
>>> I have just tested 1.6 (d509194b) on our HDP 2.3 platform and the cluster
>>> mode does not seem work. It looks like some parameter are not being
>>> passed
>>> correctly.
>>> This example works correctly with 1.5.
>>>
>>> # spark-submit --master yarn --deploy-mode cluster --num-executors 1
>>> --properties-file $PWD/spark-props.conf --class
>>> org.apache.spark.examples.SparkPi
>>> /opt/spark/lib/spark-examples-1.6.0-SNAPSHOT-hadoop2.7.1.jar
>>>
>>> Error: Could not find or load main class
>>> org.apache.spark.deploy.yarn.ApplicationMaster
>>>
>>> spark-props.conf
>>> -----------------------------
>>> spark.driver.extraJavaOptions -Dhdp.version=2.3.2.0-2950
>>> spark.driver.extraLibraryPath
>>>
>>> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
>>> spark.executor.extraJavaOptions -Dhdp.version=2.3.2.0-2950
>>> spark.executor.extraLibraryPath
>>>
>>> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
>>> -----------------------------
>>>
>>> I will try to do some more debugging on this issue.
>>>
>>>
>>>
>>>
>>>
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Sebastian YEPES FERNANDEZ <sy...@gmail.com>.
@Andrew
Thanks for the reply. Did you run this on a Hortonworks or a Cloudera cluster?
I suspect the issue is coming from the extraJavaOptions, as these are
necessary on HDP; the strange thing is that with exactly the same settings
1.5 works.
# jar -tf spark-assembly-1.6.0-SNAPSHOT-hadoop2.7.1.jar | grep
ApplicationMaster.class
org/apache/spark/deploy/yarn/ApplicationMaster.class
----
Exit code: 1
Exception message:
/hadoop/hdfs/disk02/hadoop/yarn/local/usercache/syepes/appcache/application_1445706872927_1593/container_e44_1445706872927_1593_02_000001/launch_container.sh:
line 24:
/usr/hdp/current/hadoop-client/lib/hadoop-lzo-0.6.0.2.3.2.0-2950.jar:$PWD:$PWD/__spark_conf__:$PWD/__spark__.jar:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure:
bad substitution
-----
Regards,
Sebastian
On Thu, Dec 17, 2015 at 9:14 PM, Andrew Or <an...@databricks.com> wrote:
> @syepes
>
> I just run Spark 1.6 (881f254) on YARN with Hadoop 2.4.0. I was able to
> run a simple application in cluster mode successfully.
>
> Can you verify whether the org.apache.spark.yarn.ApplicationMaster class
> exists in your assembly jar?
>
> jar -tf assembly.jar | grep ApplicationMaster
>
> -Andrew
>
>
> 2015-12-17 7:44 GMT-08:00 syepes <sy...@gmail.com>:
>
>> -1 (YARN Cluster deployment mode not working)
>>
>> I have just tested 1.6 (d509194b) on our HDP 2.3 platform and the cluster
>> mode does not seem work. It looks like some parameter are not being passed
>> correctly.
>> This example works correctly with 1.5.
>>
>> # spark-submit --master yarn --deploy-mode cluster --num-executors 1
>> --properties-file $PWD/spark-props.conf --class
>> org.apache.spark.examples.SparkPi
>> /opt/spark/lib/spark-examples-1.6.0-SNAPSHOT-hadoop2.7.1.jar
>>
>> Error: Could not find or load main class
>> org.apache.spark.deploy.yarn.ApplicationMaster
>>
>> spark-props.conf
>> -----------------------------
>> spark.driver.extraJavaOptions -Dhdp.version=2.3.2.0-2950
>> spark.driver.extraLibraryPath
>>
>> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
>> spark.executor.extraJavaOptions -Dhdp.version=2.3.2.0-2950
>> spark.executor.extraLibraryPath
>>
>> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
>> -----------------------------
>>
>> I will try to do some more debugging on this issue.
>>
>>
>>
>>
>>
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Andrew Or <an...@databricks.com>.
@syepes
I just ran Spark 1.6 (881f254) on YARN with Hadoop 2.4.0. I was able to run
a simple application in cluster mode successfully.
Can you verify whether the org.apache.spark.deploy.yarn.ApplicationMaster class
exists in your assembly jar?
jar -tf assembly.jar | grep ApplicationMaster
-Andrew
2015-12-17 7:44 GMT-08:00 syepes <sy...@gmail.com>:
> -1 (YARN Cluster deployment mode not working)
>
> I have just tested 1.6 (d509194b) on our HDP 2.3 platform and the cluster
> mode does not seem work. It looks like some parameter are not being passed
> correctly.
> This example works correctly with 1.5.
>
> # spark-submit --master yarn --deploy-mode cluster --num-executors 1
> --properties-file $PWD/spark-props.conf --class
> org.apache.spark.examples.SparkPi
> /opt/spark/lib/spark-examples-1.6.0-SNAPSHOT-hadoop2.7.1.jar
>
> Error: Could not find or load main class
> org.apache.spark.deploy.yarn.ApplicationMaster
>
> spark-props.conf
> -----------------------------
> spark.driver.extraJavaOptions -Dhdp.version=2.3.2.0-2950
> spark.driver.extraLibraryPath
>
> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
> spark.executor.extraJavaOptions -Dhdp.version=2.3.2.0-2950
> spark.executor.extraLibraryPath
>
> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
> -----------------------------
>
> I will try to do some more debugging on this issue.
>
>
>
>
>
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Vinay Shukla <vi...@gmail.com>.
One correction: the better way is to just create a file called java-opts in
.../spark/conf with the following config value in it: -Dhdp.version=<version
of HDP>.
One way to get the HDP version is to run the one-liner below on a node of
your HDP cluster:
hdp-select status hadoop-client | sed 's/hadoop-client - \(.*\)/\1/'
You can also specify the same value using SPARK_JAVA_OPTS, i.e. export
SPARK_JAVA_OPTS="-Dhdp.version=2.2.5.0-2644", or add the following options
to spark-defaults.conf:
spark.driver.extraJavaOptions -Dhdp.version=2.2.5.0-2644
spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.5.0-2644
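Putting the corrected recipe together as a runnable sketch. The conf directory and HDP version below are placeholders: point SPARK_CONF_DIR at your real .../spark/conf and substitute your cluster's version.

```shell
# Placeholder values; on a real node, SPARK_CONF_DIR is your Spark conf dir and
# HDP_VER comes from: hdp-select status hadoop-client | sed 's/hadoop-client - \(.*\)/\1/'
SPARK_CONF_DIR="${SPARK_CONF_DIR:-/tmp/spark-conf-demo}"
HDP_VER="${HDP_VER:-2.2.5.0-2644}"

mkdir -p "$SPARK_CONF_DIR"
# The Spark launcher reads conf/java-opts and appends each line to the JVM options.
echo "-Dhdp.version=${HDP_VER}" > "$SPARK_CONF_DIR/java-opts"
cat "$SPARK_CONF_DIR/java-opts"
```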
--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-6-0-RC3-tp15660p15701.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Vinay Shukla <vi...@gmail.com>.
Agree with Andrew; we shouldn't block the release for this.
This issue won't be there in the Spark distribution from Hortonworks, since we
set the HDP version.
If you want to use Apache Spark with HDP, you can modify mapred-site.xml
to replace the hdp.version property with the right value for your cluster.
You can find the right value by invoking the hdp-select script on a node
that has HDP installed. On my system running it returns the following:
hdp-select status hadoop-client
hadoop-client - 2.2.5.0-2644
Here is a one line script to get the version:
export HDP_VER=`hdp-select status hadoop-client | sed 's/hadoop-client -
\(.*\)/\1/'`
CAUTION - if you modify mapred-site.xml on a node of the cluster, this will
break rolling upgrades in certain scenarios where a program like Oozie
submitting a job from that node will use the hardcoded version instead of
the version specified by the client.
So what does the Hortonworks distribution do under the covers to support
hdp.version?
It creates a file called java-opts with the following config value in it:
-Dhdp.version=2.2.5.0-2644. You can also specify the same value using
SPARK_JAVA_OPTS, i.e. export SPARK_JAVA_OPTS="-Dhdp.version=2.2.5.0-2644",
or add the following options to spark-defaults.conf:
spark.driver.extraJavaOptions -Dhdp.version=2.2.5.0-2644
spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.5.0-2644
--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-6-0-RC3-tp15660p15699.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by syepes <sy...@gmail.com>.
-1 (YARN Cluster deployment mode not working)
I have just tested 1.6 (d509194b) on our HDP 2.3 platform and the cluster
mode does not seem to work. It looks like some parameters are not being passed
correctly.
This example works correctly with 1.5.
# spark-submit --master yarn --deploy-mode cluster --num-executors 1
--properties-file $PWD/spark-props.conf --class
org.apache.spark.examples.SparkPi
/opt/spark/lib/spark-examples-1.6.0-SNAPSHOT-hadoop2.7.1.jar
Error: Could not find or load main class
org.apache.spark.deploy.yarn.ApplicationMaster
spark-props.conf
-----------------------------
spark.driver.extraJavaOptions -Dhdp.version=2.3.2.0-2950
spark.driver.extraLibraryPath
/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.executor.extraJavaOptions -Dhdp.version=2.3.2.0-2950
spark.executor.extraLibraryPath
/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
-----------------------------
I will try to do some more debugging on this issue.
--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-6-0-RC3-tp15660p15692.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Kousuke Saruta <sa...@oss.nttdata.co.jp>.
+1
On 2015/12/17 6:32, Michael Armbrust wrote:
> Please vote on releasing the following candidate as Apache Spark
> version 1.6.0!
>
> The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is _v1.6.0-rc3
> (168c89e07c51fa24b0bb88582c739cec0acb44d7)
> <https://github.com/apache/spark/tree/v1.6.0-rc3>_
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
> <http://people.apache.org/%7Epwendell/spark-releases/spark-1.6.0-rc3-bin/>
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1174/
>
> The test repository (versioned as v1.6.0-rc3) for this release can be
> found at:
> https://repository.apache.org/content/repositories/orgapachespark-1173/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/ <http://people.apache.org/%7Epwendell/spark-releases/spark-1.6.0-rc3-docs/>
>
> =======================================
> == How can I help test this release? ==
> =======================================
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> ================================================
> == What justifies a -1 vote for this release? ==
> ================================================
> This vote is happening towards the end of the 1.6 QA period, so -1
> votes should only occur for significant regressions from 1.5. Bugs
> already present in 1.5, minor regressions, or bugs related to new
> features will not block this release.
>
> ===============================================================
> == What should happen to JIRA tickets still targeting 1.6.0? ==
> ===============================================================
> 1. It is OK for documentation patches to target 1.6.0 and still go
> into branch-1.6, since documentations will be published separately
> from the release.
> 2. New features for non-alpha-modules should target 1.7+.
> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the
> target version.
>
>
> ==================================================
> == Major changes to help you focus your testing ==
> ==================================================
>
>
> Notable changes since 1.6 RC2
>
>
> - SPARK_VERSION has been set correctly
> - SPARK-12199 ML Docs are publishing correctly
> - SPARK-12345 Mesos cluster mode has been fixed
>
>
> Notable changes since 1.6 RC1
>
>
> Spark Streaming
>
> * SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
> |trackStateByKey| has been renamed to |mapWithState|
>
>
> Spark SQL
>
> * SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
> SPARK-12189
> <https://issues.apache.org/jira/browse/SPARK-12189> Fix bugs in
> eviction of storage memory by execution.
> * SPARK-12258
> <https://issues.apache.org/jira/browse/SPARK-12258> correct
> passing null into ScalaUDF
>
>
> Notable Features Since 1.5
>
>
> Spark SQL
>
> * SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787>
> Parquet Performance - Improve Parquet scan performance when using
> flat schemas.
> * SPARK-10810
> <https://issues.apache.org/jira/browse/SPARK-10810>Session
> Management - Isolated devault database (i.e |USE mydb|) even on
> shared clusters.
> * SPARK-9999 <https://issues.apache.org/jira/browse/SPARK-9999>
> Dataset API - A type-safe API (similar to RDDs) that performs many
> operations on serialized binary data and code generation (i.e.
> Project Tungsten).
> * SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000>
> Unified Memory Management - Shared memory for execution and
> caching instead of exclusive division of the regions.
> * SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197>
> SQL Queries on Files - Concise syntax for running SQL queries over
> files of any supported format without registering a table.
> * SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745>
> Reading non-standard JSON files - Added options to read
> non-standard JSON files (e.g. single-quotes, unquoted attributes)
> * SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412>
> Per-operator Metrics for SQL Execution - Display statistics on a
> peroperator basis for memory usage and spilled data size.
> * SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329>
> Star (*) expansion for StructTypes - Makes it easier to nest and
> unest arbitrary numbers of columns
> * SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
> SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149>
> In-memory Columnar Cache Performance - Significant (up to 14x)
> speed up when caching data that contains complex types in
> DataFrames or SQL.
> * SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111>
> Fast null-safe joins - Joins using null-safe equality (|<=>|) will
> now execute using SortMergeJoin instead of computing a cartisian
> product.
> * SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389>
> SQL Execution Using Off-Heap Memory - Support for configuring
> query execution to occur using off-heap memory to avoid GC overhead
> * SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978>
> Datasource API Avoid Double Filter - When implemeting a datasource
> with filter pushdown, developers can now tell Spark SQL to avoid
> double evaluating a pushed-down filter.
> * SPARK-4849 <https://issues.apache.org/jira/browse/SPARK-4849>
> Advanced Layout of Cached Data - storing partitioning and ordering
> schemes in In-memory table scan, and adding distributeBy and
> localSort to DF API
> * SPARK-9858 <https://issues.apache.org/jira/browse/SPARK-9858>
> Adaptive query execution - Intial support for automatically
> selecting the number of reducers for joins and aggregations.
> * SPARK-9241 <https://issues.apache.org/jira/browse/SPARK-9241>
> Improved query planner for queries having distinct aggregations -
> Query plans of distinct aggregations are more robust when distinct
> columns have high cardinality.
>
>
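The null-safe equality operator (<=>) behind SPARK-11111 can be illustrated without Spark at all. A minimal plain-Python sketch (the function names are illustrative, not Spark APIs) contrasting it with ordinary SQL equality's three-valued logic:

```python
def sql_eq(a, b):
    # Ordinary SQL equality: any comparison involving NULL yields NULL
    # (modeled here as None, i.e. "unknown").
    if a is None or b is None:
        return None
    return a == b

def null_safe_eq(a, b):
    # Spark SQL's <=> operator: two NULLs compare as equal, and a NULL
    # never equals a non-NULL, so the result is always True or False.
    # That determinism is what lets the planner use a SortMergeJoin.
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b

print(sql_eq(None, None))        # None (unknown)
print(null_safe_eq(None, None))  # True
print(null_safe_eq(1, None))     # False
```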
> Spark Streaming
>
> * API Updates
> o SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
> New improved state management - |mapWithState| - a DStream
> transformation for stateful stream processing, supersedes
> |updateStateByKey| in functionality and performance.
> o SPARK-11198
> <https://issues.apache.org/jira/browse/SPARK-11198> Kinesis
> record deaggregation - Kinesis streams have been upgraded to
> use KCL 1.4.0 and support transparent deaggregation of
> KPL-aggregated records.
> o SPARK-10891
> <https://issues.apache.org/jira/browse/SPARK-10891> Kinesis
> message handler function - Allows an arbitrary function to be
> applied to each Kinesis record in the Kinesis receiver, to
> customize what data is stored in memory.
> o SPARK-6328 <https://issues.apache.org/jira/browse/SPARK-6328>
> Python Streaming Listener API - Get streaming statistics
> (scheduling delays, batch processing times, etc.) in streaming.
>
> * UI Improvements
> o Made failures visible in the streaming tab, in the timelines,
> batch list, and batch details page.
> o Made output operations visible in the streaming tab as
> progress bars.
>
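The mapWithState model above (SPARK-2629) keeps per-key state across micro-batches. The idea can be sketched as a toy model in plain Python (this is not the Spark API, just the shape of the computation):

```python
def map_with_state(batches, update):
    """Toy model of a mapWithState-style transformation: `update` maps
    (key, value, old_state) to (emitted_record, new_state), and state
    is carried per key from one micro-batch to the next."""
    state = {}
    emitted = []
    for batch in batches:
        for key, value in batch:
            record, state[key] = update(key, value, state.get(key))
            emitted.append(record)
    return emitted

# Running word counts across two micro-batches.
counts = map_with_state(
    [[("a", 1), ("b", 1)], [("a", 1)]],
    lambda k, v, s: ((k, (s or 0) + v), (s or 0) + v),
)
print(counts)  # [('a', 1), ('b', 1), ('a', 2)]
```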
>
> MLlib
>
>
> New algorithms/models
>
> * SPARK-8518 <https://issues.apache.org/jira/browse/SPARK-8518>
> Survival analysis - Log-linear model for survival analysis
> * SPARK-9834 <https://issues.apache.org/jira/browse/SPARK-9834>
> Normal equation for least squares - Normal equation solver,
> providing R-like model summary statistics
> * SPARK-3147 <https://issues.apache.org/jira/browse/SPARK-3147>
> Online hypothesis testing - A/B testing in the Spark Streaming
> framework
> * SPARK-9930 <https://issues.apache.org/jira/browse/SPARK-9930> New
> feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
> transformer
> * SPARK-6517 <https://issues.apache.org/jira/browse/SPARK-6517>
> Bisecting K-Means clustering - Fast top-down clustering variant of
> K-Means
>
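The normal-equation approach behind SPARK-9834 has a tiny closed-form sketch for one feature plus an intercept. Plain Python, no Spark or linear-algebra library; purely illustrative of what "solving the normal equations" means:

```python
def fit_line(xs, ys):
    # Solve the 2x2 normal equations (X^T X) beta = X^T y for a model
    # y = a*x + b, written out by hand via Cramer's rule.
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx
    a = (n * sxy - sx * sy) / det
    b = (sxx * sy - sx * sxy) / det
    return a, b

a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])  # data lies exactly on y = 2x + 1
print(a, b)  # 2.0 1.0
```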
>
> API improvements
>
> * ML Pipelines
> o SPARK-6725 <https://issues.apache.org/jira/browse/SPARK-6725>
> Pipeline persistence - Save/load for ML Pipelines, with
> partial coverage of spark.ml algorithms
> o SPARK-5565 <https://issues.apache.org/jira/browse/SPARK-5565>
> LDA in ML Pipelines - API for Latent Dirichlet Allocation in
> ML Pipelines
> * R API
> o SPARK-9836 <https://issues.apache.org/jira/browse/SPARK-9836>
> R-like statistics for GLMs - (Partial) R-like stats for
> ordinary least squares via summary(model)
> o SPARK-9681 <https://issues.apache.org/jira/browse/SPARK-9681>
> Feature interactions in R formula - Interaction operator ":"
> in R formula
> * Python API - Many improvements to Python API to approach feature
> parity
>
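For numeric columns, the ":" interaction operator from SPARK-9681 contributes a feature that is the elementwise product of the two inputs (factor columns cross their levels instead). A one-function sketch of the numeric case, illustrative rather than SparkR's implementation:

```python
def interact(col_a, col_b):
    # R-formula interaction a:b for two numeric columns: the new
    # feature is the elementwise product of the inputs.
    return [a * b for a, b in zip(col_a, col_b)]

print(interact([1, 2, 3], [4, 5, 6]))  # [4, 10, 18]
```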
>
> Misc improvements
>
> * SPARK-7685 <https://issues.apache.org/jira/browse/SPARK-7685>,
> SPARK-9642 <https://issues.apache.org/jira/browse/SPARK-9642>
> Instance weights for GLMs - Logistic and Linear Regression can
> take instance weights
> * SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
> SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385>
> Univariate and bivariate statistics in DataFrames - Variance,
> stddev, correlations, etc.
> * SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117>
> LIBSVM data source - LIBSVM as a SQL data source
>
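The univariate statistics added in SPARK-10384 follow the usual statistical conventions; for instance, sample standard deviation divides by n - 1 rather than n. Sketched in plain Python (assuming the sample convention, i.e. what a stddev_samp-style function computes):

```python
import math

def stddev_samp(xs):
    # Sample standard deviation: divide the squared deviations by n - 1
    # (Bessel's correction), not by n.
    n = len(xs)
    mean = sum(xs) / n
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))

print(stddev_samp([1.0, 2.0, 3.0, 4.0]))  # ~1.291
```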
>
> Documentation improvements
>
> * SPARK-7751 <https://issues.apache.org/jira/browse/SPARK-7751>
> @since versions - Documentation includes initial version when
> classes and methods were added
> * SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337>
> Testable example code - Automated testing for code in user guide
> examples
>
>
> Deprecations
>
> * In spark.mllib.clustering.KMeans, the "runs" parameter has been
> deprecated.
> * In spark.ml.classification.LogisticRegressionModel and
> spark.ml.regression.LinearRegressionModel, the "weights" field has
> been deprecated, in favor of the new name "coefficients." This
> helps disambiguate from instance (row) weights given to algorithms.
>
>
> Changes of behavior
>
> * spark.mllib.tree.GradientBoostedTrees validationTol has changed
> semantics in 1.6. Previously, it was a threshold for absolute
> change in error. Now, it resembles the behavior of GradientDescent
> convergenceTol: For large errors, it uses relative error (relative
> to the previous error); for small errors (< 0.01), it uses
> absolute error.
> * spark.ml.feature.RegexTokenizer: Previously, it did not convert
> strings to lowercase before tokenizing. Now, it converts to
> lowercase by default, with an option not to. This matches the
> behavior of the simpler Tokenizer transformer.
> * Spark SQL's partition discovery has been changed to only discover
> partition directories that are children of the given path. (i.e.
> if |path="/my/data/x=1"| then |x=1| will no longer be considered a
> partition but only children of |x=1|.) This behavior can be
> overridden by manually specifying the |basePath| that partitioning
> discovery should start with (SPARK-11678
> <https://issues.apache.org/jira/browse/SPARK-11678>).
> * When casting a value of an integral type to timestamp (e.g.
> casting a long value to timestamp), the value is treated as being
> in seconds instead of milliseconds (SPARK-11724
> <https://issues.apache.org/jira/browse/SPARK-11724>).
> * With the improved query planner for queries having distinct
> aggregations (SPARK-9241
> <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
> query having a single distinct aggregation has been changed to a
> more robust version. To switch back to the plan generated by Spark
> 1.5's planner, please set
> |spark.sql.specializeSingleDistinctAggPlanning| to
> |true| (SPARK-12077
> <https://issues.apache.org/jira/browse/SPARK-12077>).
>
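The integral-to-timestamp change above (SPARK-11724) is easy to misread. A plain-Python sketch of the 1.6 interpretation (illustrative only, not Spark's cast implementation):

```python
from datetime import datetime, timezone

def cast_long_to_timestamp(value):
    # 1.6 semantics: an integral value is taken as *seconds* since the
    # Unix epoch. Under the old milliseconds reading, the same long
    # would land roughly a thousand times closer to 1970.
    return datetime.fromtimestamp(value, tz=timezone.utc)

ts = cast_long_to_timestamp(1450000000)
print(ts.isoformat())  # 2015-12-13T09:46:40+00:00
```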
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Tom Graves <tg...@yahoo.com.INVALID>.
+1. Ran some regression tests on Spark on Yarn (hadoop 2.6 and 2.7).
Tom
On Wednesday, December 16, 2015 3:32 PM, Michael Armbrust <mi...@databricks.com> wrote:
Please vote on releasing the following candidate as Apache Spark version 1.6.0!
The vote is open until Saturday, December 19, 2015 at 18:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast.
[ ] +1 Release this package as Apache Spark 1.6.0
[ ] -1 Do not release this package because ...
To learn more about Apache Spark, please see http://spark.apache.org/
The tag to be voted on is v1.6.0-rc3 (168c89e07c51fa24b0bb88582c739cec0acb44d7)
The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc
The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1174/
The test repository (versioned as v1.6.0-rc3) for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1173/
The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
=======================================
== How can I help test this release? ==
=======================================
If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions.
================================================
== What justifies a -1 vote for this release? ==
================================================
This vote is happening towards the end of the 1.6 QA period, so -1 votes should only occur for significant regressions from 1.5. Bugs already present in 1.5, minor regressions, or bugs related to new features will not block this release.
===============================================================
== What should happen to JIRA tickets still targeting 1.6.0? ==
===============================================================
1. It is OK for documentation patches to target 1.6.0 and still go into branch-1.6, since documentation will be published separately from the release.
2. New features for non-alpha-modules should target 1.7+.
3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target version.
==================================================
== Major changes to help you focus your testing ==
==================================================
Notable changes since 1.6 RC2
- SPARK_VERSION has been set correctly
- SPARK-12199 ML Docs are publishing correctly
- SPARK-12345 Mesos cluster mode has been fixed
Notable changes since 1.6 RC1
Spark Streaming
- SPARK-2629 trackStateByKey has been renamed to mapWithState
Spark SQL
- SPARK-12165 SPARK-12189 Fix bugs in eviction of storage memory by execution.
- SPARK-12258 correct passing null into ScalaUDF
Notable Features Since 1.5
Spark SQL
- SPARK-11787 Parquet Performance - Improve Parquet scan performance when using flat schemas.
- SPARK-10810 Session Management - Isolated default database (i.e. USE mydb) even on shared clusters.
- SPARK-9999 Dataset API - A type-safe API (similar to RDDs) that performs many operations on serialized binary data, with code generation (i.e. Project Tungsten).
- SPARK-10000 Unified Memory Management - Shared memory for execution and caching instead of exclusive division of the regions.
- SPARK-11197 SQL Queries on Files - Concise syntax for running SQL queries over files of any supported format without registering a table.
- SPARK-11745 Reading non-standard JSON files - Added options to read non-standard JSON files (e.g. single-quotes, unquoted attributes)
- SPARK-10412 Per-operator Metrics for SQL Execution - Display statistics on a per-operator basis for memory usage and spilled data size.
- SPARK-11329 Star (*) expansion for StructTypes - Makes it easier to nest and unnest arbitrary numbers of columns
- SPARK-10917, SPARK-11149 In-memory Columnar Cache Performance - Significant (up to 14x) speed up when caching data that contains complex types in DataFrames or SQL.
- SPARK-11111 Fast null-safe joins - Joins using null-safe equality (<=>) will now execute using SortMergeJoin instead of computing a Cartesian product.
- SPARK-11389 SQL Execution Using Off-Heap Memory - Support for configuring query execution to occur using off-heap memory to avoid GC overhead
- SPARK-10978 Datasource API Avoid Double Filter - When implementing a datasource with filter pushdown, developers can now tell Spark SQL to avoid double evaluating a pushed-down filter.
- SPARK-4849 Advanced Layout of Cached Data - Partitioning and ordering schemes are now stored for in-memory table scans, and distributeBy and localSort have been added to the DataFrame API
- SPARK-9858 Adaptive query execution - Initial support for automatically selecting the number of reducers for joins and aggregations.
- SPARK-9241 Improved query planner for queries having distinct aggregations - Query plans of distinct aggregations are more robust when distinct columns have high cardinality.
Spark Streaming
- API Updates
- SPARK-2629 New improved state management - mapWithState - a DStream transformation for stateful stream processing, supersedes updateStateByKey in functionality and performance.
- SPARK-11198 Kinesis record deaggregation - Kinesis streams have been upgraded to use KCL 1.4.0 and support transparent deaggregation of KPL-aggregated records.
- SPARK-10891 Kinesis message handler function - Allows an arbitrary function to be applied to each Kinesis record in the Kinesis receiver, to customize what data is stored in memory.
- SPARK-6328 Python Streaming Listener API - Get streaming statistics (scheduling delays, batch processing times, etc.) in streaming.
- UI Improvements
- Made failures visible in the streaming tab, in the timelines, batch list, and batch details page.
- Made output operations visible in the streaming tab as progress bars.
MLlib
New algorithms/models
- SPARK-8518 Survival analysis - Log-linear model for survival analysis
- SPARK-9834 Normal equation for least squares - Normal equation solver, providing R-like model summary statistics
- SPARK-3147 Online hypothesis testing - A/B testing in the Spark Streaming framework
- SPARK-9930 New feature transformers - ChiSqSelector, QuantileDiscretizer, SQL transformer
- SPARK-6517 Bisecting K-Means clustering - Fast top-down clustering variant of K-Means
API improvements
- ML Pipelines
- SPARK-6725 Pipeline persistence - Save/load for ML Pipelines, with partial coverage of spark.ml algorithms
- SPARK-5565 LDA in ML Pipelines - API for Latent Dirichlet Allocation in ML Pipelines
- R API
- SPARK-9836 R-like statistics for GLMs - (Partial) R-like stats for ordinary least squares via summary(model)
- SPARK-9681 Feature interactions in R formula - Interaction operator ":" in R formula
- Python API - Many improvements to Python API to approach feature parity
Misc improvements
- SPARK-7685 , SPARK-9642 Instance weights for GLMs - Logistic and Linear Regression can take instance weights
- SPARK-10384, SPARK-10385 Univariate and bivariate statistics in DataFrames - Variance, stddev, correlations, etc.
- SPARK-10117 LIBSVM data source - LIBSVM as a SQL data source
Documentation improvements
- SPARK-7751 @since versions - Documentation includes initial version when classes and methods were added
- SPARK-11337 Testable example code - Automated testing for code in user guide examples
Deprecations
- In spark.mllib.clustering.KMeans, the "runs" parameter has been deprecated.
- In spark.ml.classification.LogisticRegressionModel and spark.ml.regression.LinearRegressionModel, the "weights" field has been deprecated, in favor of the new name "coefficients." This helps disambiguate from instance (row) weights given to algorithms.
Changes of behavior
- spark.mllib.tree.GradientBoostedTrees validationTol has changed semantics in 1.6. Previously, it was a threshold for absolute change in error. Now, it resembles the behavior of GradientDescent convergenceTol: For large errors, it uses relative error (relative to the previous error); for small errors (< 0.01), it uses absolute error.
- spark.ml.feature.RegexTokenizer: Previously, it did not convert strings to lowercase before tokenizing. Now, it converts to lowercase by default, with an option not to. This matches the behavior of the simpler Tokenizer transformer.
- Spark SQL's partition discovery has been changed to only discover partition directories that are children of the given path. (i.e. if path="/my/data/x=1" then x=1 will no longer be considered a partition but only children of x=1.) This behavior can be overridden by manually specifying the basePath that partitioning discovery should start with (SPARK-11678).
- When casting a value of an integral type to timestamp (e.g. casting a long value to timestamp), the value is treated as being in seconds instead of milliseconds (SPARK-11724).
- With the improved query planner for queries having distinct aggregations (SPARK-9241), the plan of a query having a single distinct aggregation has been changed to a more robust version. To switch back to the plan generated by Spark 1.5's planner, please set spark.sql.specializeSingleDistinctAggPlanning to true (SPARK-12077).
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Mark Grover <ma...@apache.org>.
Thanks Sean for sending me the logs offline.
Turns out the tests are failing again, for reasons unrelated to Spark. I
have filed https://issues.apache.org/jira/browse/SPARK-12426 for that with
some details. In the meantime, I agree with Sean that these tests should be
disabled. And, again, I don't think this failure warrants blocking the
release.
Mark
On Fri, Dec 18, 2015 at 9:32 AM, Sean Owen <so...@cloudera.com> wrote:
> Yes, that's what I mean. If they're not quite working, let's disable
> them, but first, we have to rule out that I'm just missing some
> requirement.
>
> Functionally, it's not worth blocking the release. It seems like bad
> form to release with tests that always fail for a non-trivial number
> of users, but we have to establish that. If it's something with an
> easy fix (or needs disabling) and another RC needs to be baked, might
> be worth including.
>
> Logs coming offline
>
> On Fri, Dec 18, 2015 at 5:30 PM, Mark Grover <ma...@apache.org> wrote:
> > Sean,
> > Are you referring to docker integration tests? If so, they were disabled
> > for the majority of the release and I recently worked on it (SPARK-11796)
> > and once it got committed, the tests were re-enabled in Spark builds. I am
> > not sure what OSs the test builds use, but it should be passing there too.
> >
> > During my work, I tested on Ubuntu Precise and they worked. If you could
> > share the logs with me offline, I could take a look. Alternatively, I can
> > try to see if I can get an Ubuntu 15 instance. However, given the history of
> > these tests, I personally don't think it makes sense to block the release
> > based on them not running on Ubuntu 15.
> >
> > On Fri, Dec 18, 2015 at 9:22 AM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> For me, mostly the same as before: tests are mostly passing, but I can
> >> never get the docker tests to pass. If anyone knows a special profile
> >> or package that needs to be enabled, I can try that and/or
> >> fix/document it. Just wondering if it's me.
> >>
> >> I'm on Java 7 + Ubuntu 15.10, with -Pyarn -Phive -Phive-thriftserver
> >> -Phadoop-2.6
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
>
>
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Sean Owen <so...@cloudera.com>.
Yes, that's what I mean. If they're not quite working, let's disable
them, but first, we have to rule out that I'm just missing some
requirement.
Functionally, it's not worth blocking the release. It seems like bad
form to release with tests that always fail for a non-trivial number
of users, but we have to establish that. If it's something with an
easy fix (or needs disabling) and another RC needs to be baked, might
be worth including.
Logs coming offline
On Fri, Dec 18, 2015 at 5:30 PM, Mark Grover <ma...@apache.org> wrote:
> Sean,
> Are you referring to docker integration tests? If so, they were disabled for
> the majority of the release and I recently worked on it (SPARK-11796) and once
> it got committed, the tests were re-enabled in Spark builds. I am not sure
> what OSs the test builds use, but it should be passing there too.
>
> During my work, I tested on Ubuntu Precise and they worked. If you could
> share the logs with me offline, I could take a look. Alternatively, I can
> try to see if I can get an Ubuntu 15 instance. However, given the history of
> these tests, I personally don't think it makes sense to block the release
> based on them not running on Ubuntu 15.
>
> On Fri, Dec 18, 2015 at 9:22 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>> For me, mostly the same as before: tests are mostly passing, but I can
>> never get the docker tests to pass. If anyone knows a special profile
>> or package that needs to be enabled, I can try that and/or
>> fix/document it. Just wondering if it's me.
>>
>> I'm on Java 7 + Ubuntu 15.10, with -Pyarn -Phive -Phive-thriftserver
>> -Phadoop-2.6
>>
>> On Wed, Dec 16, 2015 at 9:32 PM, Michael Armbrust
>> <mi...@databricks.com> wrote:
>> > Please vote on releasing the following candidate as Apache Spark version
>> > 1.6.0!
>> >
>> > The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
>> > passes
>> > if a majority of at least 3 +1 PMC votes are cast.
>> >
>> > [ ] +1 Release this package as Apache Spark 1.6.0
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see http://spark.apache.org/
>> >
>> > The tag to be voted on is v1.6.0-rc3
>> > (168c89e07c51fa24b0bb88582c739cec0acb44d7)
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
>> >
>> > Release artifacts are signed with the following key:
>> > https://people.apache.org/keys/committer/pwendell.asc
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1174/
>> >
>> > The test repository (versioned as v1.6.0-rc3) for this release can be
>> > found
>> > at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1173/
>> >
>> > The documentation corresponding to this release can be found at:
>> > http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
>> >
>> > =======================================
>> > == How can I help test this release? ==
>> > =======================================
>> > If you are a Spark user, you can help us test this release by taking an
>> > existing Spark workload and running on this release candidate, then
>> > reporting any regressions.
>> >
>> > ================================================
>> > == What justifies a -1 vote for this release? ==
>> > ================================================
>> > This vote is happening towards the end of the 1.6 QA period, so -1 votes
>> > should only occur for significant regressions from 1.5. Bugs already
>> > present
>> > in 1.5, minor regressions, or bugs related to new features will not
>> > block
>> > this release.
>> >
>> > ===============================================================
>> > == What should happen to JIRA tickets still targeting 1.6.0? ==
>> > ===============================================================
>> > 1. It is OK for documentation patches to target 1.6.0 and still go into
>> > branch-1.6, since documentation will be published separately from the
>> > release.
>> > 2. New features for non-alpha-modules should target 1.7+.
>> > 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the
>> > target
>> > version.
>> >
>> >
>> > ==================================================
>> > == Major changes to help you focus your testing ==
>> > ==================================================
>> >
>> > Notable changes since 1.6 RC2
>> >
>> >
>> > - SPARK_VERSION has been set correctly
>> > - SPARK-12199 ML Docs are publishing correctly
>> > - SPARK-12345 Mesos cluster mode has been fixed
>> >
>> > Notable changes since 1.6 RC1
>> >
>> > Spark Streaming
>> >
>> > SPARK-2629 trackStateByKey has been renamed to mapWithState
>> >
>> > Spark SQL
>> >
>> > SPARK-12165 SPARK-12189 Fix bugs in eviction of storage memory by
>> > execution.
>> > SPARK-12258 Correct passing null into ScalaUDF
>> >
>> > Notable Features Since 1.5
>> >
>> > Spark SQL
>> >
>> > SPARK-11787 Parquet Performance - Improve Parquet scan performance when
>> > using flat schemas.
>> > SPARK-10810 Session Management - Isolated default database (i.e. USE
>> > mydb) even on shared clusters.
>> > SPARK-9999 Dataset API - A type-safe API (similar to RDDs) that
>> > performs
>> > many operations on serialized binary data and code generation (i.e.
>> > Project
>> > Tungsten).
>> > SPARK-10000 Unified Memory Management - Shared memory for execution and
>> > caching instead of exclusive division of the regions.
>> > SPARK-11197 SQL Queries on Files - Concise syntax for running SQL
>> > queries
>> > over files of any supported format without registering a table.
>> > SPARK-11745 Reading non-standard JSON files - Added options to read
>> > non-standard JSON files (e.g. single-quotes, unquoted attributes)
>> > SPARK-10412 Per-operator Metrics for SQL Execution - Display statistics
>> > on a
>> > per-operator basis for memory usage and spilled data size.
>> > SPARK-11329 Star (*) expansion for StructTypes - Makes it easier to nest
>> > and
>> > unnest arbitrary numbers of columns
>> > SPARK-10917, SPARK-11149 In-memory Columnar Cache Performance -
>> > Significant
>> > (up to 14x) speed up when caching data that contains complex types in
>> > DataFrames or SQL.
>> > SPARK-11111 Fast null-safe joins - Joins using null-safe equality (<=>)
>> > will
>> > now execute using SortMergeJoin instead of computing a cartesian
>> > product.
>> > SPARK-11389 SQL Execution Using Off-Heap Memory - Support for
>> > configuring
>> > query execution to occur using off-heap memory to avoid GC overhead
>> > SPARK-10978 Datasource API Avoid Double Filter - When implementing a
>> > datasource with filter pushdown, developers can now tell Spark SQL to
>> > avoid
>> > double evaluating a pushed-down filter.
>> > SPARK-4849 Advanced Layout of Cached Data - storing partitioning and
>> > ordering schemes in In-memory table scan, and adding distributeBy and
>> > localSort to DF API
>> > SPARK-9858 Adaptive query execution - Initial support for automatically
>> > selecting the number of reducers for joins and aggregations.
>> > SPARK-9241 Improved query planner for queries having distinct
>> > aggregations
>> > - Query plans of distinct aggregations are more robust when distinct
>> > columns
>> > have high cardinality.
>> >
>> > Spark Streaming
>> >
>> > API Updates
>> >
>> > SPARK-2629 New improved state management - mapWithState - a DStream
>> > transformation for stateful stream processing, supersedes
>> > updateStateByKey
>> > in functionality and performance.
>> > SPARK-11198 Kinesis record deaggregation - Kinesis streams have been
>> > upgraded to use KCL 1.4.0 and supports transparent deaggregation of
>> > KPL-aggregated records.
>> > SPARK-10891 Kinesis message handler function - Allows an arbitrary
>> > function to be applied to a Kinesis record in the Kinesis receiver to
>> > customize what data is stored in memory.
>> > SPARK-6328 Python Streaming Listener API - Get streaming statistics
>> > (scheduling delays, batch processing times, etc.) in streaming.
>> >
>> > UI Improvements
>> >
>> > Made failures visible in the streaming tab, in the timelines, batch
>> > list,
>> > and batch details page.
>> > Made output operations visible in the streaming tab as progress bars.
>> >
>> > MLlib
>> >
>> > New algorithms/models
>> >
>> > SPARK-8518 Survival analysis - Log-linear model for survival analysis
>> > SPARK-9834 Normal equation for least squares - Normal equation solver,
>> > providing R-like model summary statistics
>> > SPARK-3147 Online hypothesis testing - A/B testing in the Spark
>> > Streaming
>> > framework
>> > SPARK-9930 New feature transformers - ChiSqSelector,
>> > QuantileDiscretizer,
>> > SQL transformer
>> > SPARK-6517 Bisecting K-Means clustering - Fast top-down clustering
>> > variant
>> > of K-Means
>> >
>> > API improvements
>> >
>> > ML Pipelines
>> >
>> > SPARK-6725 Pipeline persistence - Save/load for ML Pipelines, with
>> > partial
>> > coverage of spark.ml algorithms
>> > SPARK-5565 LDA in ML Pipelines - API for Latent Dirichlet Allocation in
>> > ML
>> > Pipelines
>> >
>> > R API
>> >
>> > SPARK-9836 R-like statistics for GLMs - (Partial) R-like stats for
>> > ordinary
>> > least squares via summary(model)
>> > SPARK-9681 Feature interactions in R formula - Interaction operator ":"
>> > in
>> > R formula
>> >
>> > Python API - Many improvements to Python API to approach feature parity
>> >
>> > Misc improvements
>> >
>> > SPARK-7685 , SPARK-9642 Instance weights for GLMs - Logistic and Linear
>> > Regression can take instance weights
>> > SPARK-10384, SPARK-10385 Univariate and bivariate statistics in
>> > DataFrames -
>> > Variance, stddev, correlations, etc.
>> > SPARK-10117 LIBSVM data source - LIBSVM as a SQL data source
>> >
>> > Documentation improvements
>> >
>> > SPARK-7751 @since versions - Documentation includes initial version
>> > when
>> > classes and methods were added
>> > SPARK-11337 Testable example code - Automated testing for code in user
>> > guide
>> > examples
>> >
>> > Deprecations
>> >
>> > In spark.mllib.clustering.KMeans, the "runs" parameter has been
>> > deprecated.
>> > In spark.ml.classification.LogisticRegressionModel and
>> > spark.ml.regression.LinearRegressionModel, the "weights" field has been
>> > deprecated, in favor of the new name "coefficients." This helps
>> > disambiguate
>> > from instance (row) weights given to algorithms.
>> >
>> > Changes of behavior
>> >
>> > spark.mllib.tree.GradientBoostedTrees validationTol has changed
>> > semantics in
>> > 1.6. Previously, it was a threshold for absolute change in error. Now,
>> > it
>> > resembles the behavior of GradientDescent convergenceTol: For large
>> > errors,
>> > it uses relative error (relative to the previous error); for small
>> > errors (<
>> > 0.01), it uses absolute error.
>> > spark.ml.feature.RegexTokenizer: Previously, it did not convert strings
>> > to
>> > lowercase before tokenizing. Now, it converts to lowercase by default,
>> > with
>> > an option not to. This matches the behavior of the simpler Tokenizer
>> > transformer.
>> > Spark SQL's partition discovery has been changed to only discover
>> > partition
>> > directories that are children of the given path. (i.e. if
>> > path="/my/data/x=1" then x=1 will no longer be considered a partition
>> > but
>> > only children of x=1.) This behavior can be overridden by manually
>> > specifying the basePath that partitioning discovery should start with
>> > (SPARK-11678).
>> > When casting a value of an integral type to timestamp (e.g. casting a
>> > long
>> > value to timestamp), the value is treated as being in seconds instead of
>> > milliseconds (SPARK-11724).
>> > With the improved query planner for queries having distinct aggregations
>> > (SPARK-9241), the plan of a query having a single distinct aggregation
>> > has
>> > been changed to a more robust version. To switch back to the plan
>> > generated
>> > by Spark 1.5's planner, please set
>> > spark.sql.specializeSingleDistinctAggPlanning to true (SPARK-12077).
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> For additional commands, e-mail: dev-help@spark.apache.org
>>
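An aside on the null-safe join item (SPARK-11111) quoted above: the `<=>` operator differs from plain `=` only in how it treats NULL, which is what makes it usable as a sort-merge join key. A minimal plain-Python sketch of the semantics, with `None` standing in for SQL NULL (function names are illustrative, not Spark API):

```python
def null_safe_eq(a, b):
    """<=> semantics: two NULLs compare equal; NULL vs non-NULL is False."""
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b

def sql_eq(a, b):
    """Ordinary SQL = semantics: any comparison involving NULL yields NULL."""
    if a is None or b is None:
        return None
    return a == b

print(null_safe_eq(None, None))  # True: <=> treats two NULLs as equal
print(sql_eq(None, None))        # None: plain = is NULL on NULL input
```

Because `<=>` always produces a definite True/False, rows with NULL keys can be matched deterministically, which is what lets the planner use SortMergeJoin instead of falling back to a cartesian product.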
>
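One of the quoted behavior changes (SPARK-11724) is easy to trip over when upgrading: an integral value cast to timestamp is now read as seconds since the epoch, not milliseconds. A plain-Python illustration of the difference (not Spark code; the function names are ours):

```python
from datetime import datetime, timezone

def cast_long_to_timestamp_1_6(value):
    """1.6 behavior: an integral value is seconds since the Unix epoch."""
    return datetime.fromtimestamp(value, tz=timezone.utc)

def cast_long_to_timestamp_1_5(value):
    """Pre-1.6 behavior: the same value was treated as milliseconds."""
    return datetime.fromtimestamp(value / 1000.0, tz=timezone.utc)

ts = 1450305134  # 2015-12-16T22:32:14 UTC, roughly when this vote opened
print(cast_long_to_timestamp_1_6(ts))  # 2015-12-16 22:32:14+00:00
print(cast_long_to_timestamp_1_5(ts))  # lands in mid-January 1970
```

Any code that relied on the old millisecond interpretation will silently produce dates near the epoch after upgrading, so it is worth auditing such casts.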
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Mark Grover <ma...@apache.org>.
Sean,
Are you referring to the docker integration tests? If so, they were disabled
for the majority of the release cycle. I recently worked on them (SPARK-11796),
and once that got committed, the tests were re-enabled in Spark builds. I am
not sure what OSs the test builds use, but they should be passing there too.
During my work, I tested on Ubuntu Precise and they worked. If you could
share the logs with me offline, I could take a look. Alternatively, I can
try to see if I can get an Ubuntu 15 instance. However, given the history of
these tests, I personally don't think it makes sense to block the release
based on them not running on Ubuntu 15.
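For anyone trying to reproduce the docker integration tests locally, an invocation along these lines should work from a Spark source checkout. The profile and module names below are a best guess for branch-1.6, not verified against it; check the pom.xml in your tree:

```shell
# Requires a local Docker daemon that the current user can access.
# Profile/module names are assumptions; adjust to your checkout.
build/mvn -Pyarn -Phive -Phive-thriftserver -Phadoop-2.6 \
  -Pdocker-integration-tests \
  -pl docker-integration-tests test
```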
On Fri, Dec 18, 2015 at 9:22 AM, Sean Owen <so...@cloudera.com> wrote:
> For me, mostly the same as before: tests are mostly passing, but I can
> never get the docker tests to pass. If anyone knows a special profile
> or package that needs to be enabled, I can try that and/or
> fix/document it. Just wondering if it's me.
>
> I'm on Java 7 + Ubuntu 15.10, with -Pyarn -Phive -Phive-thriftserver
> -Phadoop-2.6
>
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Sean Owen <so...@cloudera.com>.
For me, mostly the same as before: tests are mostly passing, but I can
never get the docker tests to pass. If anyone knows a special profile
or package that needs to be enabled, I can try that and/or
fix/document it. Just wondering if it's me.
I'm on Java 7 + Ubuntu 15.10, with -Pyarn -Phive -Phive-thriftserver
-Phadoop-2.6
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Zsolt Tóth <to...@gmail.com>.
+1 (non-binding)
Testing environment:
-CDH5.5 single node docker
-Prebuilt spark-1.6.0-hadoop2.6.tgz
-Yarn-cluster mode
Comparing outputs of Spark 1.5.x and 1.6.0-RC3:
Pyspark
OK?: K-Means (ml) - Note: our tests show a numerical diff here compared to
the 1.5.2 output. Since K-Means has a random factor, this can be expected
behaviour - is it because of SPARK-10779? If so, I think it should be
listed in the MLlib/ML docs.
OK: Logistic Regression (ml), Linear Regression (mllib)
OK: Nested Spark SQL query
SparkR
OK: Logistic Regression
OK: Nested Spark SQL query
Machine learning - Java:
OK: Decision Tree (mllib and ml): Gini, Entropy
OK. Random Forest (ml): Gini, Entropy
OK: Linear, Lasso, Ridge Regression (mllib)
OK: Logistic Regression (mllib): SGD, L-BFGS
OK: SVM (mllib)
I/O:
OK: Reading/Writing Parquet to/from DataFrame
OK: Reading/Writing Textfile to/from RDD
2015-12-19 2:09 GMT+01:00 Marcelo Vanzin <va...@cloudera.com>:
> +1 (non-binding)
>
> Tested the without-hadoop binaries (so didn't run Hive-related tests)
> with a test batch including standalone / client, yarn / client and
> cluster, including core, mllib and streaming (flume and kafka).
>
> On Wed, Dec 16, 2015 at 1:32 PM, Michael Armbrust
> <mi...@databricks.com> wrote:
> > Please vote on releasing the following candidate as Apache Spark version
> > 1.6.0!
> >
> > The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
> passes
> > if a majority of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 1.6.0
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > The tag to be voted on is v1.6.0-rc3
> > (168c89e07c51fa24b0bb88582c739cec0acb44d7)
> >
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1174/
> >
> > The test repository (versioned as v1.6.0-rc3) for this release can be
> found
> > at:
> > https://repository.apache.org/content/repositories/orgapachespark-1173/
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
> >
> > =======================================
> > == How can I help test this release? ==
> > =======================================
> > If you are a Spark user, you can help us test this release by taking an
> > existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > ================================================
> > == What justifies a -1 vote for this release? ==
> > ================================================
> > This vote is happening towards the end of the 1.6 QA period, so -1 votes
> > should only occur for significant regressions from 1.5. Bugs already
> present
> > in 1.5, minor regressions, or bugs related to new features will not block
> > this release.
> >
> > ===============================================================
> > == What should happen to JIRA tickets still targeting 1.6.0? ==
> > ===============================================================
> > 1. It is OK for documentation patches to target 1.6.0 and still go into
> > branch-1.6, since documentation will be published separately from the
> > release.
> > 2. New features for non-alpha-modules should target 1.7+.
> > 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
> > version.
> >
> >
> > ==================================================
> > == Major changes to help you focus your testing ==
> > ==================================================
> >
> > Notable changes since 1.6 RC2
> >
> >
> > - SPARK_VERSION has been set correctly
> > - SPARK-12199 ML Docs are publishing correctly
> > - SPARK-12345 Mesos cluster mode has been fixed
> >
> > Notable changes since 1.6 RC1
> >
> > Spark Streaming
> >
> > SPARK-2629 trackStateByKey has been renamed to mapWithState
> >
> > Spark SQL
> >
> > SPARK-12165 SPARK-12189 Fix bugs in eviction of storage memory by
> execution.
> > SPARK-12258 correct passing null into ScalaUDF
> >
> > Notable Features Since 1.5
> >
> > Spark SQL
> >
> > SPARK-11787 Parquet Performance - Improve Parquet scan performance when
> > using flat schemas.
> > SPARK-10810 Session Management - Isolated default database (i.e. USE mydb)
> > even on shared clusters.
> > SPARK-9999 Dataset API - A type-safe API (similar to RDDs) that performs
> > many operations on serialized binary data and code generation (i.e.
> Project
> > Tungsten).
> > SPARK-10000 Unified Memory Management - Shared memory for execution and
> > caching instead of exclusive division of the regions.
> > SPARK-11197 SQL Queries on Files - Concise syntax for running SQL queries
> > over files of any supported format without registering a table.
> > SPARK-11745 Reading non-standard JSON files - Added options to read
> > non-standard JSON files (e.g. single-quotes, unquoted attributes)
> > SPARK-10412 Per-operator Metrics for SQL Execution - Display statistics on a
> > per-operator basis for memory usage and spilled data size.
> > SPARK-11329 Star (*) expansion for StructTypes - Makes it easier to nest
> > and unnest arbitrary numbers of columns
> > SPARK-10917, SPARK-11149 In-memory Columnar Cache Performance -
> Significant
> > (up to 14x) speed up when caching data that contains complex types in
> > DataFrames or SQL.
> > SPARK-11111 Fast null-safe joins - Joins using null-safe equality (<=>)
> > will now execute using SortMergeJoin instead of computing a cartesian
> > product.
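Null-safe equality semantics can be sketched in a few lines (with None standing in for SQL NULL; this is a toy model of the operator, not Spark's implementation):

```python
# SQL's <=> operator: two NULLs compare equal and NULL vs non-NULL
# compares false, so the key can be sorted/hashed like any other value,
# which is what makes a sort-merge join possible.
def null_safe_eq(a, b):
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b

print(null_safe_eq(None, None))  # True  (regular SQL '=' would yield NULL)
print(null_safe_eq(None, 1))     # False
print(null_safe_eq(1, 1))        # True
```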
> > SPARK-11389 SQL Execution Using Off-Heap Memory - Support for configuring
> > query execution to occur using off-heap memory to avoid GC overhead
> > SPARK-10978 Datasource API Avoid Double Filter - When implementing a
> > datasource with filter pushdown, developers can now tell Spark SQL to
> > avoid double evaluating a pushed-down filter.
> > SPARK-4849 Advanced Layout of Cached Data - storing partitioning and
> > ordering schemes in in-memory table scans, and adding distributeBy and
> > localSort to the DF API
> > SPARK-9858 Adaptive query execution - Initial support for automatically
> > selecting the number of reducers for joins and aggregations.
> > SPARK-9241 Improved query planner for queries having distinct
> aggregations
> > - Query plans of distinct aggregations are more robust when distinct
> columns
> > have high cardinality.
> >
> > Spark Streaming
> >
> > API Updates
> >
> > SPARK-2629 New improved state management - mapWithState - a DStream
> > transformation for stateful stream processing, supersedes
> > updateStateByKey in functionality and performance.
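The per-key-state idea behind mapWithState can be sketched as a toy driver loop (illustrative only; the real API is a DStream transformation, and these names are not the Spark API):

```python
# Carry a per-key state across batches; an update function takes the
# previous state and a new value and returns (new_state, emitted_value).
def map_with_state(batches, update):
    state = {}
    for batch in batches:
        out = []
        for key, value in batch:
            state[key], emitted = update(state.get(key), value)
            out.append((key, emitted))
        yield out

# Running count per key, emitting the updated count for each record.
def count_update(prev, value):
    total = (prev or 0) + value
    return total, total

batches = [[("a", 1), ("b", 1)], [("a", 2)]]
print(list(map_with_state(batches, count_update)))
# [[('a', 1), ('b', 1)], [('a', 3)]]
```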
> > SPARK-11198 Kinesis record deaggregation - Kinesis streams have been
> > upgraded to use KCL 1.4.0 and support transparent deaggregation of
> > KPL-aggregated records.
> > SPARK-10891 Kinesis message handler function - Allows an arbitrary
> > function to be applied to a Kinesis record in the Kinesis receiver to
> > customize what data is to be stored in memory.
> > SPARK-6328 Python Streaming Listener API - Get streaming statistics
> > (scheduling delays, batch processing times, etc.) in streaming.
> >
> > UI Improvements
> >
> > Made failures visible in the streaming tab, in the timelines, batch list,
> > and batch details page.
> > Made output operations visible in the streaming tab as progress bars.
> >
> > MLlib
> >
> > New algorithms/models
> >
> > SPARK-8518 Survival analysis - Log-linear model for survival analysis
> > SPARK-9834 Normal equation for least squares - Normal equation solver,
> > providing R-like model summary statistics
> > SPARK-3147 Online hypothesis testing - A/B testing in the Spark
> Streaming
> > framework
> > SPARK-9930 New feature transformers - ChiSqSelector,
> QuantileDiscretizer,
> > SQL transformer
> > SPARK-6517 Bisecting K-Means clustering - Fast top-down clustering
> variant
> > of K-Means
> >
> > API improvements
> >
> > ML Pipelines
> >
> > SPARK-6725 Pipeline persistence - Save/load for ML Pipelines, with
> > partial coverage of spark.ml algorithms
> > SPARK-5565 LDA in ML Pipelines - API for Latent Dirichlet Allocation in
> ML
> > Pipelines
> >
> > R API
> >
> > SPARK-9836 R-like statistics for GLMs - (Partial) R-like stats for
> ordinary
> > least squares via summary(model)
> > SPARK-9681 Feature interactions in R formula - Interaction operator ":"
> in
> > R formula
> >
> > Python API - Many improvements to Python API to approach feature parity
> >
> > Misc improvements
> >
> > SPARK-7685 , SPARK-9642 Instance weights for GLMs - Logistic and Linear
> > Regression can take instance weights
> > SPARK-10384, SPARK-10385 Univariate and bivariate statistics in
> DataFrames -
> > Variance, stddev, correlations, etc.
> > SPARK-10117 LIBSVM data source - LIBSVM as a SQL data source
> >
> > Documentation improvements
> >
> > SPARK-7751 @since versions - Documentation includes initial version when
> > classes and methods were added
> > SPARK-11337 Testable example code - Automated testing for code in user
> guide
> > examples
> >
> > Deprecations
> >
> > In spark.mllib.clustering.KMeans, the "runs" parameter has been
> deprecated.
> > In spark.ml.classification.LogisticRegressionModel and
> > spark.ml.regression.LinearRegressionModel, the "weights" field has been
> > deprecated, in favor of the new name "coefficients." This helps
> disambiguate
> > from instance (row) weights given to algorithms.
> >
> > Changes of behavior
> >
> > spark.mllib.tree.GradientBoostedTrees validationTol has changed
> semantics in
> > 1.6. Previously, it was a threshold for absolute change in error. Now, it
> > resembles the behavior of GradientDescent convergenceTol: For large
> errors,
> > it uses relative error (relative to the previous error); for small
> errors (<
> > 0.01), it uses absolute error.
> > spark.ml.feature.RegexTokenizer: Previously, it did not convert strings
> to
> > lowercase before tokenizing. Now, it converts to lowercase by default,
> with
> > an option not to. This matches the behavior of the simpler Tokenizer
> > transformer.
> > Spark SQL's partition discovery has been changed to only discover
> partition
> > directories that are children of the given path. (i.e. if
> > path="/my/data/x=1" then x=1 will no longer be considered a partition but
> > only children of x=1.) This behavior can be overridden by manually
> > specifying the basePath that partitioning discovery should start with
> > (SPARK-11678).
> > When casting a value of an integral type to timestamp (e.g. casting a
> long
> > value to timestamp), the value is treated as being in seconds instead of
> > milliseconds (SPARK-11724).
> > With the improved query planner for queries having distinct aggregations
> > (SPARK-9241), the plan of a query having a single distinct aggregation
> has
> > been changed to a more robust version. To switch back to the plan
> generated
> > by Spark 1.5's planner, please set
> > spark.sql.specializeSingleDistinctAggPlanning to true (SPARK-12077).
>
>
>
> --
> Marcelo
>
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Marcelo Vanzin <va...@cloudera.com>.
+1 (non-binding)
Tested the without-hadoop binaries (so didn't run Hive-related tests)
with a test batch including standalone / client, yarn / client and
cluster, including core, mllib and streaming (flume and kafka).
--
Marcelo
Re: Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Saisai Shao <sa...@gmail.com>.
+1 (non-binding) after SPARK-12345 is merged.
On Thu, Dec 17, 2015 at 9:55 AM, Allen Zhang <al...@126.com> wrote:
> plus 1
>
>
>
>
>
>
> On 2015-12-17 09:39:39, "Joseph Bradley" <jo...@databricks.com> wrote:
>
> +1
>
> On Wed, Dec 16, 2015 at 5:26 PM, Reynold Xin <rx...@databricks.com> wrote:
>
>> +1
>>
>>
>> On Wed, Dec 16, 2015 at 5:24 PM, Mark Hamstra <ma...@clearstorydata.com>
>> wrote:
>>
>>> +1
>>>
>>> On Wed, Dec 16, 2015 at 1:32 PM, Michael Armbrust <
>>> michael@databricks.com> wrote:
>>>
>>>> Please vote on releasing the following candidate as Apache Spark
>>>> version 1.6.0!
>>>>
>>>> The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
>>>> passes if a majority of at least 3 +1 PMC votes are cast.
>>>>
>>>> [ ] +1 Release this package as Apache Spark 1.6.0
>>>> [ ] -1 Do not release this package because ...
>>>>
>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>
>>>> The tag to be voted on is *v1.6.0-rc3
>>>> (168c89e07c51fa24b0bb88582c739cec0acb44d7)
>>>> <https://github.com/apache/spark/tree/v1.6.0-rc3>*
>>>>
>>>> The release files, including signatures, digests, etc. can be found at:
>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
>>>>
>>>> Release artifacts are signed with the following key:
>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>
>>>> The staging repository for this release can be found at:
>>>> https://repository.apache.org/content/repositories/orgapachespark-1174/
>>>>
>>>> The test repository (versioned as v1.6.0-rc3) for this release can be
>>>> found at:
>>>> https://repository.apache.org/content/repositories/orgapachespark-1173/
>>>>
>>>> The documentation corresponding to this release can be found at:
>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
>>>>
>>>> =======================================
>>>> == How can I help test this release? ==
>>>> =======================================
>>>> If you are a Spark user, you can help us test this release by taking an
>>>> existing Spark workload and running on this release candidate, then
>>>> reporting any regressions.
>>>>
>>>> ================================================
>>>> == What justifies a -1 vote for this release? ==
>>>> ================================================
>>>> This vote is happening towards the end of the 1.6 QA period, so -1
>>>> votes should only occur for significant regressions from 1.5. Bugs already
>>>> present in 1.5, minor regressions, or bugs related to new features will not
>>>> block this release.
>>>>
>>>> ===============================================================
>>>> == What should happen to JIRA tickets still targeting 1.6.0? ==
>>>> ===============================================================
>>>> 1. It is OK for documentation patches to target 1.6.0 and still go into
>>>> branch-1.6, since documentations will be published separately from the
>>>> release.
>>>> 2. New features for non-alpha-modules should target 1.7+.
>>>> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the
>>>> target version.
>>>>
>>>>
>>>> ==================================================
>>>> == Major changes to help you focus your testing ==
>>>> ==================================================
>>>>
>>>> Notable changes since 1.6 RC2
>>>> - SPARK_VERSION has been set correctly
>>>> - SPARK-12199 ML Docs are publishing correctly
>>>> - SPARK-12345 Mesos cluster mode has been fixed
>>>>
>>>> Notable changes since 1.6 RC1
>>>> Spark Streaming
>>>>
>>>> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
>>>> trackStateByKey has been renamed to mapWithState
>>>>
>>>> Spark SQL
>>>>
>>>> - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
>>>> SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix
>>>> bugs in eviction of storage memory by execution.
>>>> - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258> correct
>>>> passing null into ScalaUDF
>>>>
>>>> Notable Features Since 1.5Spark SQL
>>>>
>>>> - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> Parquet
>>>> Performance - Improve Parquet scan performance when using flat
>>>> schemas.
>>>> - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
>>>> Session Management - Isolated devault database (i.e USE mydb) even
>>>> on shared clusters.
>>>> - SPARK-9999 <https://issues.apache.org/jira/browse/SPARK-9999> Dataset
>>>> API - A type-safe API (similar to RDDs) that performs many
>>>> operations on serialized binary data and code generation (i.e. Project
>>>> Tungsten).
>>>> - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000> Unified
>>>> Memory Management - Shared memory for execution and caching instead
>>>> of exclusive division of the regions.
>>>> - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
>>>> Queries on Files - Concise syntax for running SQL queries over
>>>> files of any supported format without registering a table.
>>>> - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745> Reading
>>>> non-standard JSON files - Added options to read non-standard JSON
>>>> files (e.g. single-quotes, unquoted attributes)
>>>> - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412> Per-operator
>>>> Metrics for SQL Execution - Display statistics on a peroperator
>>>> basis for memory usage and spilled data size.
>>>> - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
>>>> (*) expansion for StructTypes - Makes it easier to nest and unest
>>>> arbitrary numbers of columns
>>>> - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
>>>> SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149> In-memory
>>>> Columnar Cache Performance - Significant (up to 14x) speed up when
>>>> caching data that contains complex types in DataFrames or SQL.
>>>> - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
>>>> null-safe joins - Joins using null-safe equality (<=>) will now
>>>> execute using SortMergeJoin instead of computing a cartisian product.
>>>> - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
>>>> Execution Using Off-Heap Memory - Support for configuring query
>>>> execution to occur using off-heap memory to avoid GC overhead
>>>> - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978> Datasource
>>>> API Avoid Double Filter - When implemeting a datasource with filter
>>>> pushdown, developers can now tell Spark SQL to avoid double evaluating a
>>>> pushed-down filter.
>>>> - SPARK-4849 <https://issues.apache.org/jira/browse/SPARK-4849> Advanced
>>>> Layout of Cached Data - storing partitioning and ordering schemes
>>>> in In-memory table scan, and adding distributeBy and localSort to DF API
>>>> - SPARK-9858 <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive
>>>> query execution - Intial support for automatically selecting the
>>>> number of reducers for joins and aggregations.
>>>> - SPARK-9241 <https://issues.apache.org/jira/browse/SPARK-9241> Improved
>>>> query planner for queries having distinct aggregations - Query
>>>> plans of distinct aggregations are more robust when distinct columns have
>>>> high cardinality.
>>>>
>>>> Spark Streaming
>>>>
>>>> - API Updates
>>>> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629> New
>>>> improved state management - mapWithState - a DStream
>>>> transformation for stateful stream processing, supercedes
>>>> updateStateByKey in functionality and performance.
>>>> - SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198>
>>>> Kinesis record deaggregation - Kinesis streams have been
>>>> upgraded to use KCL 1.4.0 and supports transparent deaggregation of
>>>> KPL-aggregated records.
>>>> - SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891>
>>>> Kinesis message handler function - Allows arbitraray function
>>>> to be applied to a Kinesis record in the Kinesis receiver before to
>>>> customize what data is to be stored in memory.
>>>> - SPARK-6328 <https://issues.apache.org/jira/browse/SPARK-6328> Python
>>>> Streamng Listener API - Get streaming statistics (scheduling
>>>> delays, batch processing times, etc.) in streaming.
>>>>
>>>>
>>>> - UI Improvements
>>>> - Made failures visible in the streaming tab, in the timelines,
>>>> batch list, and batch details page.
>>>> - Made output operations visible in the streaming tab as
>>>> progress bars.
>>>>
>>>> MLlibNew algorithms/models
>>>>
>>>> - SPARK-8518 <https://issues.apache.org/jira/browse/SPARK-8518> Survival
>>>> analysis - Log-linear model for survival analysis
>>>> - SPARK-9834 <https://issues.apache.org/jira/browse/SPARK-9834> Normal
>>>> equation for least squares - Normal equation solver, providing
>>>> R-like model summary statistics
>>>> - SPARK-3147 <https://issues.apache.org/jira/browse/SPARK-3147> Online
>>>> hypothesis testing - A/B testing in the Spark Streaming framework
>>>> - SPARK-9930 <https://issues.apache.org/jira/browse/SPARK-9930> New
>>>> feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
>>>> transformer
>>>> - SPARK-6517 <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting
>>>> K-Means clustering - Fast top-down clustering variant of K-Means
>>>>
>>>> API improvements
>>>>
>>>> - ML Pipelines
>>>> - SPARK-6725 <https://issues.apache.org/jira/browse/SPARK-6725> Pipeline
>>>> persistence - Save/load for ML Pipelines, with partial coverage
>>>> of spark.mlalgorithms
>>>> - SPARK-5565 <https://issues.apache.org/jira/browse/SPARK-5565> LDA
>>>> in ML Pipelines - API for Latent Dirichlet Allocation in ML
>>>> Pipelines
>>>> - R API
>>>> - SPARK-9836 <https://issues.apache.org/jira/browse/SPARK-9836> R-like
>>>> statistics for GLMs - (Partial) R-like stats for ordinary least
>>>> squares via summary(model)
>>>> - SPARK-9681 <https://issues.apache.org/jira/browse/SPARK-9681> Feature
>>>> interactions in R formula - Interaction operator ":" in R formula
>>>> - Python API - Many improvements to Python API to approach feature
>>>> parity
>>>>
>>>> Misc improvements
>>>>
>>>> - SPARK-7685 <https://issues.apache.org/jira/browse/SPARK-7685>,
>>>> SPARK-9642 <https://issues.apache.org/jira/browse/SPARK-9642> Instance
>>>> weights for GLMs - Logistic and Linear Regression can take instance
>>>> weights
>>>> - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
>>>> SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate
>>>> and bivariate statistics in DataFrames - Variance, stddev,
>>>> correlations, etc.
>>>> - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM
>>>> data source - LIBSVM as a SQL data source
>>>>
>>>> Documentation improvements
>>>> - SPARK-7751 <https://issues.apache.org/jira/browse/SPARK-7751> @since
>>>> versions - Documentation includes initial version when classes and
>>>> methods were added
>>>> - SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable
>>>> example code - Automated testing for code in user guide examples
>>>>
>>>> Deprecations
>>>>
>>>> - In spark.mllib.clustering.KMeans, the "runs" parameter has been
>>>> deprecated.
>>>> - In spark.ml.classification.LogisticRegressionModel and
>>>> spark.ml.regression.LinearRegressionModel, the "weights" field has been
>>>> deprecated, in favor of the new name "coefficients." This helps
>>>> disambiguate from instance (row) weights given to algorithms.
>>>>
>>>> Changes of behavior
>>>>
>>>> - spark.mllib.tree.GradientBoostedTrees validationTol has changed
>>>> semantics in 1.6. Previously, it was a threshold for absolute change in
>>>> error. Now, it resembles the behavior of GradientDescent convergenceTol:
>>>> For large errors, it uses relative error (relative to the previous error);
>>>> for small errors (< 0.01), it uses absolute error.
>>>> - spark.ml.feature.RegexTokenizer: Previously, it did not convert
>>>> strings to lowercase before tokenizing. Now, it converts to lowercase by
>>>> default, with an option not to. This matches the behavior of the simpler
>>>> Tokenizer transformer.
>>>> - Spark SQL's partition discovery has been changed to only discover
>>>> partition directories that are children of the given path. (i.e. if
>>>> path="/my/data/x=1" then x=1 will no longer be considered a
>>>> partition but only children of x=1.) This behavior can be
>>>> overridden by manually specifying the basePath that partitioning
>>>> discovery should start with (SPARK-11678
>>>> <https://issues.apache.org/jira/browse/SPARK-11678>).
>>>> - When casting a value of an integral type to timestamp (e.g.
>>>> casting a long value to timestamp), the value is treated as being in
>>>> seconds instead of milliseconds (SPARK-11724
>>>> <https://issues.apache.org/jira/browse/SPARK-11724>).
>>>> - With the improved query planner for queries having distinct
>>>> aggregations (SPARK-9241
>>>> <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
>>>> query having a single distinct aggregation has been changed to a more
>>>> robust version. To switch back to the plan generated by Spark 1.5's
>>>> planner, please set spark.sql.specializeSingleDistinctAggPlanning
>>>> to true (SPARK-12077
>>>> <https://issues.apache.org/jira/browse/SPARK-12077>).
>>>>
>>>>
>>>
>>
>
Re:Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Allen Zhang <al...@126.com>.
plus 1
On 2015-12-17 09:39:39, "Joseph Bradley" <jo...@databricks.com> wrote:
+1
On Wed, Dec 16, 2015 at 5:26 PM, Reynold Xin <rx...@databricks.com> wrote:
+1
On Wed, Dec 16, 2015 at 5:24 PM, Mark Hamstra <ma...@clearstorydata.com> wrote:
+1
On Wed, Dec 16, 2015 at 1:32 PM, Michael Armbrust <mi...@databricks.com> wrote:
Please vote on releasing the following candidate as Apache Spark version 1.6.0!
The vote is open until Saturday, December 19, 2015 at 18:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast.
[ ] +1 Release this package as Apache Spark 1.6.0
[ ] -1 Do not release this package because ...
To learn more about Apache Spark, please see http://spark.apache.org/
The tag to be voted on is v1.6.0-rc3 (168c89e07c51fa24b0bb88582c739cec0acb44d7)
The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc
The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1174/
The test repository (versioned as v1.6.0-rc3) for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1173/
The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
=======================================
== How can I help test this release? ==
=======================================
If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions.
================================================
== What justifies a -1 vote for this release? ==
================================================
This vote is happening towards the end of the 1.6 QA period, so -1 votes should only occur for significant regressions from 1.5. Bugs already present in 1.5, minor regressions, or bugs related to new features will not block this release.
===============================================================
== What should happen to JIRA tickets still targeting 1.6.0? ==
===============================================================
1. It is OK for documentation patches to target 1.6.0 and still go into branch-1.6, since documentations will be published separately from the release.
2. New features for non-alpha-modules should target 1.7+.
3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target version.
==================================================
== Major changes to help you focus your testing ==
==================================================
Notable changes since 1.6 RC2
- SPARK_VERSION has been set correctly
- SPARK-12199 ML Docs are publishing correctly
- SPARK-12345 Mesos cluster mode has been fixed
Notable changes since 1.6 RC1
Spark Streaming
SPARK-2629 trackStateByKey has been renamed to mapWithState
Spark SQL
SPARK-12165 SPARK-12189 Fix bugs in eviction of storage memory by execution.
SPARK-12258 correct passing null into ScalaUDF
Notable Features Since 1.5
Spark SQL
SPARK-11787 Parquet Performance - Improve Parquet scan performance when using flat schemas.
SPARK-10810 Session Management - Isolated default database (i.e. USE mydb) even on shared clusters.
SPARK-9999 Dataset API - A type-safe API (similar to RDDs) that performs many operations on serialized binary data and uses code generation (i.e. Project Tungsten).
SPARK-10000 Unified Memory Management - Shared memory for execution and caching instead of exclusive division of the regions.
SPARK-11197 SQL Queries on Files - Concise syntax for running SQL queries over files of any supported format without registering a table.
SPARK-11745 Reading non-standard JSON files - Added options to read non-standard JSON files (e.g. single-quotes, unquoted attributes)
SPARK-10412 Per-operator Metrics for SQL Execution - Display statistics on a per-operator basis for memory usage and spilled data size.
SPARK-11329 Star (*) expansion for StructTypes - Makes it easier to nest and unnest arbitrary numbers of columns
SPARK-10917, SPARK-11149 In-memory Columnar Cache Performance - Significant (up to 14x) speed up when caching data that contains complex types in DataFrames or SQL.
SPARK-11111 Fast null-safe joins - Joins using null-safe equality (<=>) will now execute using SortMergeJoin instead of computing a cartesian product.
SPARK-11389 SQL Execution Using Off-Heap Memory - Support for configuring query execution to occur using off-heap memory to avoid GC overhead
SPARK-10978 Datasource API Avoid Double Filter - When implementing a datasource with filter pushdown, developers can now tell Spark SQL to avoid double evaluating a pushed-down filter.
SPARK-4849 Advanced Layout of Cached Data - Stores partitioning and ordering schemes for in-memory table scans, and adds distributeBy and localSort to the DataFrame API
SPARK-9858 Adaptive query execution - Initial support for automatically selecting the number of reducers for joins and aggregations.
SPARK-9241 Improved query planner for queries having distinct aggregations - Query plans of distinct aggregations are more robust when distinct columns have high cardinality.
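A plain-Python sketch of the null-safe equality semantics behind the SPARK-11111 join change (illustrative only, not Spark's implementation; `None` stands in for SQL NULL):

```python
def null_safe_eq(a, b):
    """Mimic SQL's null-safe equality operator <=> (None stands in for NULL)."""
    if a is None and b is None:
        return True          # NULL <=> NULL is true
    if a is None or b is None:
        return False         # NULL <=> x is false, never NULL
    return a == b            # ordinary equality otherwise
```

Because <=> always yields a plain boolean rather than NULL, rows with NULL keys can be matched deterministically, which is what lets the planner use SortMergeJoin instead of a cartesian product.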
Spark Streaming
API Updates
SPARK-2629 New improved state management - mapWithState - a DStream transformation for stateful stream processing, supersedes updateStateByKey in functionality and performance.
SPARK-11198 Kinesis record deaggregation - Kinesis streams have been upgraded to use KCL 1.4.0 and supports transparent deaggregation of KPL-aggregated records.
SPARK-10891 Kinesis message handler function - Allows an arbitrary function to be applied to a Kinesis record in the Kinesis receiver to customize what data is stored in memory.
SPARK-6328 Python Streaming Listener API - Get streaming statistics (scheduling delays, batch processing times, etc.) in streaming.
UI Improvements
Made failures visible in the streaming tab, in the timelines, batch list, and batch details page.
Made output operations visible in the streaming tab as progress bars.
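The mapWithState model described above can be approximated outside Spark as a per-key state table updated one micro-batch at a time (a hypothetical plain-Python analogue; the real DStream API additionally supports timeouts, initial state, and partitioned state stores):

```python
def map_with_state(batches, update):
    """Per-key stateful processing, loosely analogous to mapWithState.

    update(key, value, state) returns (emitted_record, new_state).
    State is carried across batches instead of being rebuilt every
    batch, which is the performance win over updateStateByKey.
    """
    states = {}      # key -> state, persists across batches
    emitted = []
    for batch in batches:
        for key, value in batch:
            record, states[key] = update(key, value, states.get(key))
            emitted.append(record)
    return emitted, states

# Running word counts over two micro-batches:
batches = [[("a", 1), ("b", 1)], [("a", 1)]]
out, states = map_with_state(
    batches, lambda k, v, s: ((k, (s or 0) + v), (s or 0) + v))
```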
MLlib
New algorithms/models
SPARK-8518 Survival analysis - Log-linear model for survival analysis
SPARK-9834 Normal equation for least squares - Normal equation solver, providing R-like model summary statistics
SPARK-3147 Online hypothesis testing - A/B testing in the Spark Streaming framework
SPARK-9930 New feature transformers - ChiSqSelector, QuantileDiscretizer, SQL transformer
SPARK-6517 Bisecting K-Means clustering - Fast top-down clustering variant of K-Means
API improvements
ML Pipelines
SPARK-6725 Pipeline persistence - Save/load for ML Pipelines, with partial coverage of spark.ml algorithms
SPARK-5565 LDA in ML Pipelines - API for Latent Dirichlet Allocation in ML Pipelines
R API
SPARK-9836 R-like statistics for GLMs - (Partial) R-like stats for ordinary least squares via summary(model)
SPARK-9681 Feature interactions in R formula - Interaction operator ":" in R formula
Python API - Many improvements to Python API to approach feature parity
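For the ":" operator from SPARK-9681, a numeric-numeric interaction is simply the product of the crossed columns (a minimal sketch; the actual R-formula implementation also expands factor levels and handles higher-order crossings):

```python
def interaction(row, a, b):
    """Numeric interaction term a:b from an R formula: the product of the columns."""
    return row[a] * row[b]

# In a formula like y ~ x1 + x2 + x1:x2, the interaction adds one
# derived feature column holding x1 * x2 for every row.
rows = [{"x1": 2.0, "x2": 3.0}, {"x1": 0.5, "x2": 4.0}]
crossed = [interaction(r, "x1", "x2") for r in rows]
```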
Misc improvements
SPARK-7685 , SPARK-9642 Instance weights for GLMs - Logistic and Linear Regression can take instance weights
SPARK-10384, SPARK-10385 Univariate and bivariate statistics in DataFrames - Variance, stddev, correlations, etc.
SPARK-10117 LIBSVM data source - LIBSVM as a SQL data source
Documentation improvements
SPARK-7751 @since versions - Documentation includes initial version when classes and methods were added
SPARK-11337 Testable example code - Automated testing for code in user guide examples
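Instance weighting (SPARK-7685 / SPARK-9642) scales each row's contribution to the loss. For ordinary least squares in one dimension the closed form can be sketched in plain Python (illustrative only, not Spark's solver):

```python
def weighted_ols(xs, ys, ws):
    """Fit y = a + b*x minimizing sum_i w_i * (y_i - a - b*x_i)**2."""
    wsum = sum(ws)
    xbar = sum(w * x for w, x in zip(ws, xs)) / wsum   # weighted means
    ybar = sum(w * y for w, y in zip(ws, ys)) / wsum
    b = (sum(w * (x - xbar) * (y - ybar) for w, x, y in zip(ws, xs, ys))
         / sum(w * (x - xbar) ** 2 for w, x in zip(ws, xs)))
    a = ybar - b * xbar
    return a, b

# Giving a row weight 2.0 is equivalent to duplicating it in the input:
a, b = weighted_ols([0.0, 1.0, 2.0], [1.0, 3.0, 5.0], [1.0, 2.0, 1.0])
```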
Deprecations
In spark.mllib.clustering.KMeans, the "runs" parameter has been deprecated.
In spark.ml.classification.LogisticRegressionModel and spark.ml.regression.LinearRegressionModel, the "weights" field has been deprecated, in favor of the new name "coefficients." This helps disambiguate from instance (row) weights given to algorithms.
Changes of behavior
spark.mllib.tree.GradientBoostedTrees validationTol has changed semantics in 1.6. Previously, it was a threshold for absolute change in error. Now, it resembles the behavior of GradientDescent convergenceTol: For large errors, it uses relative error (relative to the previous error); for small errors (< 0.01), it uses absolute error.
spark.ml.feature.RegexTokenizer: Previously, it did not convert strings to lowercase before tokenizing. Now, it converts to lowercase by default, with an option not to. This matches the behavior of the simpler Tokenizer transformer.
Spark SQL's partition discovery has been changed to only discover partition directories that are children of the given path. (i.e. if path="/my/data/x=1" then x=1 will no longer be considered a partition but only children of x=1.) This behavior can be overridden by manually specifying the basePath that partitioning discovery should start with (SPARK-11678).
When casting a value of an integral type to timestamp (e.g. casting a long value to timestamp), the value is treated as being in seconds instead of milliseconds (SPARK-11724).
With the improved query planner for queries having distinct aggregations (SPARK-9241), the plan of a query having a single distinct aggregation has been changed to a more robust version. To switch back to the plan generated by Spark 1.5's planner, please set spark.sql.specializeSingleDistinctAggPlanning to true (SPARK-12077).
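The new validationTol rule can be written out directly (a hypothetical sketch of the check described above; the function name, signature, and return convention are illustrative, not the GradientBoostedTrees internals):

```python
def keeps_improving(prev_err, curr_err, tol):
    """Convergence test in the style of GradientDescent's convergenceTol.

    Uses relative improvement when the previous error is large,
    absolute improvement when it is small (< 0.01). Returns True
    while the validation error is still improving by more than tol.
    """
    improvement = prev_err - curr_err
    if prev_err >= 0.01:                  # large errors: relative change
        return improvement / prev_err > tol
    return improvement > tol              # small errors: absolute change

# Early stopping: halt boosting once keeps_improving(...) returns False.
```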
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Joseph Bradley <jo...@databricks.com>.
+1
On Wed, Dec 16, 2015 at 5:26 PM, Reynold Xin <rx...@databricks.com> wrote:
> +1
>
>
> On Wed, Dec 16, 2015 at 5:24 PM, Mark Hamstra <ma...@clearstorydata.com>
> wrote:
>
>> +1
>>
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Reynold Xin <rx...@databricks.com>.
+1
On Wed, Dec 16, 2015 at 5:24 PM, Mark Hamstra <ma...@clearstorydata.com>
wrote:
> +1
>
> On Wed, Dec 16, 2015 at 1:32 PM, Michael Armbrust <mi...@databricks.com>
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.6.0!
>>
>> The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.6.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is *v1.6.0-rc3
>> (168c89e07c51fa24b0bb88582c739cec0acb44d7)
>> <https://github.com/apache/spark/tree/v1.6.0-rc3>*
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1174/
>>
>> The test repository (versioned as v1.6.0-rc3) for this release can be
>> found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1173/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
>>
>> =======================================
>> == How can I help test this release? ==
>> =======================================
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> ================================================
>> == What justifies a -1 vote for this release? ==
>> ================================================
>> This vote is happening towards the end of the 1.6 QA period, so -1 votes
>> should only occur for significant regressions from 1.5. Bugs already
>> present in 1.5, minor regressions, or bugs related to new features will not
>> block this release.
>>
>> ===============================================================
>> == What should happen to JIRA tickets still targeting 1.6.0? ==
>> ===============================================================
>> 1. It is OK for documentation patches to target 1.6.0 and still go into
>> branch-1.6, since documentations will be published separately from the
>> release.
>> 2. New features for non-alpha-modules should target 1.7+.
>> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
>> version.
>>
>>
>> ==================================================
>> == Major changes to help you focus your testing ==
>> ==================================================
>>
>> Notable changes since 1.6 RC2
>> - SPARK_VERSION has been set correctly
>> - SPARK-12199 ML Docs are publishing correctly
>> - SPARK-12345 Mesos cluster mode has been fixed
>>
>> Notable changes since 1.6 RC1
>> Spark Streaming
>>
>> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
>> trackStateByKey has been renamed to mapWithState
>>
>> Spark SQL
>>
>> - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
>> SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix
>> bugs in eviction of storage memory by execution.
>> - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258> correct
>> passing null into ScalaUDF
>>
>> Notable Features Since 1.5
>> Spark SQL
>>
>> - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> Parquet
>> Performance - Improve Parquet scan performance when using flat
>> schemas.
>> - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
>> Session Management - Isolated default database (i.e. USE mydb) even on
>> shared clusters.
>> - SPARK-9999 <https://issues.apache.org/jira/browse/SPARK-9999> Dataset
>> API - A type-safe API (similar to RDDs) that performs many operations
>> on serialized binary data and code generation (i.e. Project Tungsten).
>> - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000> Unified
>> Memory Management - Shared memory for execution and caching instead
>> of exclusive division of the regions.
>> - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
>> Queries on Files - Concise syntax for running SQL queries over files
>> of any supported format without registering a table.
>> - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745> Reading
>> non-standard JSON files - Added options to read non-standard JSON
>> files (e.g. single-quotes, unquoted attributes)
>> - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412> Per-operator
>> Metrics for SQL Execution - Display statistics on a per-operator basis
>> for memory usage and spilled data size.
>> - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
>> (*) expansion for StructTypes - Makes it easier to nest and unnest
>> arbitrary numbers of columns
>> - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
>> SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149> In-memory
>> Columnar Cache Performance - Significant (up to 14x) speed up when
>> caching data that contains complex types in DataFrames or SQL.
>> - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
>> null-safe joins - Joins using null-safe equality (<=>) will now
>> execute using SortMergeJoin instead of computing a cartesian product.
>> - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
>> Execution Using Off-Heap Memory - Support for configuring query
>> execution to occur using off-heap memory to avoid GC overhead
>> - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978> Datasource
>> API Avoid Double Filter - When implementing a datasource with filter
>> pushdown, developers can now tell Spark SQL to avoid double evaluating a
>> pushed-down filter.
>> - SPARK-4849 <https://issues.apache.org/jira/browse/SPARK-4849> Advanced
>> Layout of Cached Data - Stores partitioning and ordering schemes for
>> in-memory table scans, and adds distributeBy and localSort to the DF API
>> - SPARK-9858 <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive
>> query execution - Initial support for automatically selecting the
>> number of reducers for joins and aggregations.
>> - SPARK-9241 <https://issues.apache.org/jira/browse/SPARK-9241> Improved
>> query planner for queries having distinct aggregations - Query plans
>> of distinct aggregations are more robust when distinct columns have high
>> cardinality.
>>
>> Spark Streaming
>>
>> - API Updates
>> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629> New
>> improved state management - mapWithState - a DStream
>> transformation for stateful stream processing, supersedes
>> updateStateByKey in functionality and performance.
>> - SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198> Kinesis
>> record deaggregation - Kinesis streams have been upgraded to use
>> KCL 1.4.0 and supports transparent deaggregation of KPL-aggregated records.
>> - SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891> Kinesis
>> message handler function - Allows arbitraray function to be
>> applied to a Kinesis record in the Kinesis receiver before to customize
>> what data is to be stored in memory.
>> - SPARK-6328 <https://issues.apache.org/jira/browse/SPARK-6328> Python
>> Streamng Listener API - Get streaming statistics (scheduling
>> delays, batch processing times, etc.) in streaming.
>>
>>
>> - UI Improvements
>> - Made failures visible in the streaming tab, in the timelines,
>> batch list, and batch details page.
>> - Made output operations visible in the streaming tab as progress
>> bars.
>>
>> MLlibNew algorithms/models
>>
>> - SPARK-8518 <https://issues.apache.org/jira/browse/SPARK-8518> Survival
>> analysis - Log-linear model for survival analysis
>> - SPARK-9834 <https://issues.apache.org/jira/browse/SPARK-9834> Normal
>> equation for least squares - Normal equation solver, providing R-like
>> model summary statistics
>> - SPARK-3147 <https://issues.apache.org/jira/browse/SPARK-3147> Online
>> hypothesis testing - A/B testing in the Spark Streaming framework
>> - SPARK-9930 <https://issues.apache.org/jira/browse/SPARK-9930> New
>> feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
>> transformer
>> - SPARK-6517 <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting
>> K-Means clustering - Fast top-down clustering variant of K-Means
>>
>> API improvements
>>
>> - ML Pipelines
>> - SPARK-6725 <https://issues.apache.org/jira/browse/SPARK-6725> Pipeline
>> persistence - Save/load for ML Pipelines, with partial coverage of
>> spark.mlalgorithms
>> - SPARK-5565 <https://issues.apache.org/jira/browse/SPARK-5565> LDA
>> in ML Pipelines - API for Latent Dirichlet Allocation in ML
>> Pipelines
>> - R API
>> - SPARK-9836 <https://issues.apache.org/jira/browse/SPARK-9836> R-like
>> statistics for GLMs - (Partial) R-like stats for ordinary least
>> squares via summary(model)
>> - SPARK-9681 <https://issues.apache.org/jira/browse/SPARK-9681> Feature
>> interactions in R formula - Interaction operator ":" in R formula
>> - Python API - Many improvements to Python API to approach feature
>> parity
>>
>> Misc improvements
>>
>> - SPARK-7685 <https://issues.apache.org/jira/browse/SPARK-7685>,
>> SPARK-9642 <https://issues.apache.org/jira/browse/SPARK-9642> Instance
>> weights for GLMs - Logistic and Linear Regression can take instance
>> weights
>> - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
>> SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate
>> and bivariate statistics in DataFrames - Variance, stddev,
>> correlations, etc.
>> - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM
>> data source - LIBSVM as a SQL data sourceDocumentation improvements
>> - SPARK-7751 <https://issues.apache.org/jira/browse/SPARK-7751> @since
>> versions - Documentation includes initial version when classes and
>> methods were added
>> - SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable
>> example code - Automated testing for code in user guide examples
>>
>> Deprecations
>>
>> - In spark.mllib.clustering.KMeans, the "runs" parameter has been
>> deprecated.
>> - In spark.ml.classification.LogisticRegressionModel and
>> spark.ml.regression.LinearRegressionModel, the "weights" field has been
>> deprecated, in favor of the new name "coefficients." This helps
>> disambiguate from instance (row) weights given to algorithms.
>>
>> Changes of behavior
>>
>> - spark.mllib.tree.GradientBoostedTrees validationTol has changed
>> semantics in 1.6. Previously, it was a threshold for absolute change in
>> error. Now, it resembles the behavior of GradientDescent convergenceTol:
>> For large errors, it uses relative error (relative to the previous error);
>> for small errors (< 0.01), it uses absolute error.
>> - spark.ml.feature.RegexTokenizer: Previously, it did not convert
>> strings to lowercase before tokenizing. Now, it converts to lowercase by
>> default, with an option not to. This matches the behavior of the simpler
>> Tokenizer transformer.
>> - Spark SQL's partition discovery has been changed to only discover
>> partition directories that are children of the given path. (i.e. if
>> path="/my/data/x=1" then x=1 will no longer be considered a partition
>> but only children of x=1.) This behavior can be overridden by
>> manually specifying the basePath that partitioning discovery should
>> start with (SPARK-11678
>> <https://issues.apache.org/jira/browse/SPARK-11678>).
>> - When casting a value of an integral type to timestamp (e.g. casting
>> a long value to timestamp), the value is treated as being in seconds
>> instead of milliseconds (SPARK-11724
>> <https://issues.apache.org/jira/browse/SPARK-11724>).
>> - With the improved query planner for queries having distinct
>> aggregations (SPARK-9241
>> <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
>> query having a single distinct aggregation has been changed to a more
>> robust version. To switch back to the plan generated by Spark 1.5's
>> planner, please set spark.sql.specializeSingleDistinctAggPlanning to
>> true (SPARK-12077 <https://issues.apache.org/jira/browse/SPARK-12077>
>> ).
>>
>>
>
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Mark Hamstra <ma...@clearstorydata.com>.
+1
On Wed, Dec 16, 2015 at 1:32 PM, Michael Armbrust <mi...@databricks.com>
wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 1.6.0!
>
> The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is *v1.6.0-rc3
> (168c89e07c51fa24b0bb88582c739cec0acb44d7)
> <https://github.com/apache/spark/tree/v1.6.0-rc3>*
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1174/
>
> The test repository (versioned as v1.6.0-rc3) for this release can be
> found at:
> https://repository.apache.org/content/repositories/orgapachespark-1173/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
>
> =======================================
> == How can I help test this release? ==
> =======================================
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> ================================================
> == What justifies a -1 vote for this release? ==
> ================================================
> This vote is happening towards the end of the 1.6 QA period, so -1 votes
> should only occur for significant regressions from 1.5. Bugs already
> present in 1.5, minor regressions, or bugs related to new features will not
> block this release.
>
> ===============================================================
> == What should happen to JIRA tickets still targeting 1.6.0? ==
> ===============================================================
> 1. It is OK for documentation patches to target 1.6.0 and still go into
> branch-1.6, since documentations will be published separately from the
> release.
> 2. New features for non-alpha-modules should target 1.7+.
> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
> version.
>
>
> ==================================================
> == Major changes to help you focus your testing ==
> ==================================================
>
> Notable changes since 1.6 RC2
> - SPARK_VERSION has been set correctly
> - SPARK-12199 ML Docs are publishing correctly
> - SPARK-12345 Mesos cluster mode has been fixed
>
> Notable changes since 1.6 RC1
> Spark Streaming
>
> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
> trackStateByKey has been renamed to mapWithState
>
> Spark SQL
>
> - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
> SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix
> bugs in eviction of storage memory by execution.
> - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258> correct
> passing null into ScalaUDF
>
> Notable Features Since 1.5
>
> Spark SQL
>
> - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> Parquet
> Performance - Improve Parquet scan performance when using flat schemas.
> - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
> Session Management - Isolated default database (i.e. USE mydb) even on
> shared clusters.
> - SPARK-9999 <https://issues.apache.org/jira/browse/SPARK-9999> Dataset
> API - A type-safe API (similar to RDDs) that performs many operations
> on serialized binary data and uses code generation (i.e. Project Tungsten).
> - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000> Unified
> Memory Management - Shared memory for execution and caching instead of
> exclusive division of the regions.
> - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
> Queries on Files - Concise syntax for running SQL queries over files
> of any supported format without registering a table.
> - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745> Reading
> non-standard JSON files - Added options to read non-standard JSON
> files (e.g. single-quotes, unquoted attributes)
> - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412> Per-operator
> Metrics for SQL Execution - Display statistics on a per-operator basis
> for memory usage and spilled data size.
> - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
> (*) expansion for StructTypes - Makes it easier to nest and unnest
> arbitrary numbers of columns
> - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
> SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149> In-memory
> Columnar Cache Performance - Significant (up to 14x) speed up when
> caching data that contains complex types in DataFrames or SQL.
> - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
> null-safe joins - Joins using null-safe equality (<=>) will now
> execute using SortMergeJoin instead of computing a cartesian product.
> - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
> Execution Using Off-Heap Memory - Support for configuring query
> execution to occur using off-heap memory to avoid GC overhead
> - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978> Datasource
> API Avoid Double Filter - When implementing a datasource with filter
> pushdown, developers can now tell Spark SQL to avoid double evaluating a
> pushed-down filter.
> - SPARK-4849 <https://issues.apache.org/jira/browse/SPARK-4849> Advanced
> Layout of Cached Data - storing partitioning and ordering schemes in
> In-memory table scan, and adding distributeBy and localSort to DF API
> - SPARK-9858 <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive
> query execution - Initial support for automatically selecting the
> number of reducers for joins and aggregations.
> - SPARK-9241 <https://issues.apache.org/jira/browse/SPARK-9241> Improved
> query planner for queries having distinct aggregations - Query plans
> of distinct aggregations are more robust when distinct columns have high
> cardinality.
>
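The null-safe join improvement above (SPARK-11111) works because <=> is an ordinary deterministic equality, so rows can be sorted and merged on the key instead of compared pairwise. A plain-Python sketch of the operator's truth table, with None standing in for SQL NULL (illustrative only, not Spark code):

```python
# Truth table of SQL null-safe equality ("<=>"), sketched in plain Python.
# Unlike "=", it never yields NULL: NULL <=> NULL is true, NULL <=> x is false.
def null_safe_eq(a, b):
    if a is None and b is None:
        return True           # NULL <=> NULL -> true
    if a is None or b is None:
        return False          # NULL <=> x   -> false, never NULL
    return a == b             # otherwise ordinary equality

print(null_safe_eq(None, None))  # True
print(null_safe_eq(None, 1))     # False
print(null_safe_eq(3, 3))        # True
```

Because every pair of values (including NULLs) produces a definite true/false, a sort-merge strategy can treat NULL as just another key value.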
> Spark Streaming
>
> - API Updates
> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629> New
> improved state management - mapWithState - a DStream transformation
> for stateful stream processing, supersedes updateStateByKey in
> functionality and performance.
> - SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198> Kinesis
> record deaggregation - Kinesis streams have been upgraded to use
> KCL 1.4.0 and supports transparent deaggregation of KPL-aggregated records.
> - SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891> Kinesis
> message handler function - Allows an arbitrary function to be applied
> to a Kinesis record in the Kinesis receiver to customize what data
> is to be stored in memory.
> - SPARK-6328 <https://issues.apache.org/jira/browse/SPARK-6328> Python
> Streaming Listener API - Get streaming statistics (scheduling
> delays, batch processing times, etc.) in streaming.
>
>
> - UI Improvements
> - Made failures visible in the streaming tab, in the timelines,
> batch list, and batch details page.
> - Made output operations visible in the streaming tab as progress
> bars.
>
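The mapWithState model above boils down to a per-key state store that a user function updates as each batch arrives. A minimal pure-Python sketch of that idea (no Spark; update_count and run_batches are illustrative names, not the real API):

```python
# Per-key stateful fold over batches of (key, value) records,
# mimicking the shape of a mapWithState update function.
def update_count(value, state):
    # user function: new state = old state (or 0) + incoming value
    return (state or 0) + value

def run_batches(batches, update_fn):
    state = {}
    for batch in batches:
        for key, value in batch:
            state[key] = update_fn(value, state.get(key))
    return state

batches = [[("a", 1), ("b", 2)], [("a", 3)]]
print(run_batches(batches, update_count))  # {'a': 4, 'b': 2}
```

The real API additionally supports timeouts, initial state, and emitting mapped records per input, which this sketch omits.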
> MLlib
>
> New algorithms/models
>
> - SPARK-8518 <https://issues.apache.org/jira/browse/SPARK-8518> Survival
> analysis - Log-linear model for survival analysis
> - SPARK-9834 <https://issues.apache.org/jira/browse/SPARK-9834> Normal
> equation for least squares - Normal equation solver, providing R-like
> model summary statistics
> - SPARK-3147 <https://issues.apache.org/jira/browse/SPARK-3147> Online
> hypothesis testing - A/B testing in the Spark Streaming framework
> - SPARK-9930 <https://issues.apache.org/jira/browse/SPARK-9930> New
> feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
> transformer
> - SPARK-6517 <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting
> K-Means clustering - Fast top-down clustering variant of K-Means
>
> API improvements
>
> - ML Pipelines
> - SPARK-6725 <https://issues.apache.org/jira/browse/SPARK-6725> Pipeline
> persistence - Save/load for ML Pipelines, with partial coverage of
> spark.ml algorithms
> - SPARK-5565 <https://issues.apache.org/jira/browse/SPARK-5565> LDA
> in ML Pipelines - API for Latent Dirichlet Allocation in ML
> Pipelines
> - R API
> - SPARK-9836 <https://issues.apache.org/jira/browse/SPARK-9836> R-like
> statistics for GLMs - (Partial) R-like stats for ordinary least
> squares via summary(model)
> - SPARK-9681 <https://issues.apache.org/jira/browse/SPARK-9681> Feature
> interactions in R formula - Interaction operator ":" in R formula
> - Python API - Many improvements to Python API to approach feature
> parity
>
> Misc improvements
>
> - SPARK-7685 <https://issues.apache.org/jira/browse/SPARK-7685>,
> SPARK-9642 <https://issues.apache.org/jira/browse/SPARK-9642> Instance
> weights for GLMs - Logistic and Linear Regression can take instance
> weights
> - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
> SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate
> and bivariate statistics in DataFrames - Variance, stddev,
> correlations, etc.
> - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM
> data source - LIBSVM as a SQL data source
>
> Documentation improvements
> - SPARK-7751 <https://issues.apache.org/jira/browse/SPARK-7751> @since
> versions - Documentation includes initial version when classes and
> methods were added
> - SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable
> example code - Automated testing for code in user guide examples
>
> Deprecations
>
> - In spark.mllib.clustering.KMeans, the "runs" parameter has been
> deprecated.
> - In spark.ml.classification.LogisticRegressionModel and
> spark.ml.regression.LinearRegressionModel, the "weights" field has been
> deprecated, in favor of the new name "coefficients." This helps
> disambiguate from instance (row) weights given to algorithms.
>
> Changes of behavior
>
> - spark.mllib.tree.GradientBoostedTrees validationTol has changed
> semantics in 1.6. Previously, it was a threshold for absolute change in
> error. Now, it resembles the behavior of GradientDescent convergenceTol:
> For large errors, it uses relative error (relative to the previous error);
> for small errors (< 0.01), it uses absolute error.
> - spark.ml.feature.RegexTokenizer: Previously, it did not convert
> strings to lowercase before tokenizing. Now, it converts to lowercase by
> default, with an option not to. This matches the behavior of the simpler
> Tokenizer transformer.
> - Spark SQL's partition discovery has been changed to only discover
> partition directories that are children of the given path. (i.e. if
> path="/my/data/x=1" then x=1 will no longer be considered a partition
> but only children of x=1.) This behavior can be overridden by manually
> specifying the basePath that partitioning discovery should start with (
> SPARK-11678 <https://issues.apache.org/jira/browse/SPARK-11678>).
> - When casting a value of an integral type to timestamp (e.g. casting
> a long value to timestamp), the value is treated as being in seconds
> instead of milliseconds (SPARK-11724
> <https://issues.apache.org/jira/browse/SPARK-11724>).
> - With the improved query planner for queries having distinct
> aggregations (SPARK-9241
> <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
> query having a single distinct aggregation has been changed to a more
> robust version. To switch back to the plan generated by Spark 1.5's
> planner, please set spark.sql.specializeSingleDistinctAggPlanning to
> true (SPARK-12077 <https://issues.apache.org/jira/browse/SPARK-12077>).
>
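One way to read the new validationTol semantics described above, as a small sketch (not the MLlib source; the 0.01 cutoff and "relative to the previous error" rule are taken from the text):

```python
# Convergence check in the style of the new validationTol:
# relative change for large errors, absolute change for small ones (< 0.01).
def converged(prev_err, curr_err, tol):
    delta = abs(prev_err - curr_err)
    if min(abs(prev_err), abs(curr_err)) < 0.01:
        return delta < tol                 # small errors: absolute change
    return delta / abs(prev_err) < tol     # large errors: relative to previous

print(converged(100.0, 99.9, 0.01))   # True  (0.1% relative change)
print(converged(100.0, 50.0, 0.01))   # False (50% relative change)
```

Under the old semantics the first call would have compared the absolute change 0.1 against tol and returned False, which is why code tuned for 1.5 may need a different tolerance.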
>
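To see why the seconds-versus-milliseconds change (SPARK-11724) matters: the same long value denotes instants decades apart depending on the unit. A plain-Python illustration (not Spark code; the sample value is arbitrary):

```python
# The same integral value interpreted as seconds vs. milliseconds
# since the Unix epoch yields wildly different timestamps.
from datetime import datetime, timezone

raw = 1450000000  # a long value being cast to timestamp

as_seconds = datetime.fromtimestamp(raw, tz=timezone.utc)        # 1.6 semantics
as_millis = datetime.fromtimestamp(raw / 1000, tz=timezone.utc)  # old reading

print(as_seconds.year)  # 2015
print(as_millis.year)   # 1970
```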