You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@spark.apache.org by Michael Armbrust <mi...@databricks.com> on 2015/12/16 22:32:14 UTC
[VOTE] Release Apache Spark 1.6.0 (RC3)
Please vote on releasing the following candidate as Apache Spark version
1.6.0!
The vote is open until Saturday, December 19, 2015 at 18:00 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.
[ ] +1 Release this package as Apache Spark 1.6.0
[ ] -1 Do not release this package because ...
To learn more about Apache Spark, please see http://spark.apache.org/
The tag to be voted on is *v1.6.0-rc3
(168c89e07c51fa24b0bb88582c739cec0acb44d7)
<https://github.com/apache/spark/tree/v1.6.0-rc3>*
The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc
The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1174/
The test repository (versioned as v1.6.0-rc3) for this release can be found
at:
https://repository.apache.org/content/repositories/orgapachespark-1173/
The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
=======================================
== How can I help test this release? ==
=======================================
If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions.
================================================
== What justifies a -1 vote for this release? ==
================================================
This vote is happening towards the end of the 1.6 QA period, so -1 votes
should only occur for significant regressions from 1.5. Bugs already
present in 1.5, minor regressions, or bugs related to new features will not
block this release.
===============================================================
== What should happen to JIRA tickets still targeting 1.6.0? ==
===============================================================
1. It is OK for documentation patches to target 1.6.0 and still go into
branch-1.6, since documentations will be published separately from the
release.
2. New features for non-alpha-modules should target 1.7+.
3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
version.
==================================================
== Major changes to help you focus your testing ==
==================================================
Notable changes since 1.6 RC2
- SPARK_VERSION has been set correctly
- SPARK-12199 ML Docs are publishing correctly
- SPARK-12345 Mesos cluster mode has been fixed
Notable changes since 1.6 RC1
Spark Streaming
- SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
trackStateByKey has been renamed to mapWithState
Spark SQL
- SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix bugs
in eviction of storage memory by execution.
- SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258> correct
passing null into ScalaUDF
Notable Features Since 1.5Spark SQL
- SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> Parquet
Performance - Improve Parquet scan performance when using flat schemas.
- SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
Session Management - Isolated devault database (i.e USE mydb) even on
shared clusters.
- SPARK-9999 <https://issues.apache.org/jira/browse/SPARK-9999> Dataset
API - A type-safe API (similar to RDDs) that performs many operations on
serialized binary data and code generation (i.e. Project Tungsten).
- SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000> Unified
Memory Management - Shared memory for execution and caching instead of
exclusive division of the regions.
- SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
Queries on Files - Concise syntax for running SQL queries over files of
any supported format without registering a table.
- SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745> Reading
non-standard JSON files - Added options to read non-standard JSON files
(e.g. single-quotes, unquoted attributes)
- SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412>
Per-operator
Metrics for SQL Execution - Display statistics on a peroperator basis
for memory usage and spilled data size.
- SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
(*) expansion for StructTypes - Makes it easier to nest and unest
arbitrary numbers of columns
- SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149> In-memory
Columnar Cache Performance - Significant (up to 14x) speed up when
caching data that contains complex types in DataFrames or SQL.
- SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
null-safe joins - Joins using null-safe equality (<=>) will now execute
using SortMergeJoin instead of computing a cartisian product.
- SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
Execution Using Off-Heap Memory - Support for configuring query
execution to occur using off-heap memory to avoid GC overhead
- SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978> Datasource
API Avoid Double Filter - When implemeting a datasource with filter
pushdown, developers can now tell Spark SQL to avoid double evaluating a
pushed-down filter.
- SPARK-4849 <https://issues.apache.org/jira/browse/SPARK-4849> Advanced
Layout of Cached Data - storing partitioning and ordering schemes in
In-memory table scan, and adding distributeBy and localSort to DF API
- SPARK-9858 <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive
query execution - Intial support for automatically selecting the number
of reducers for joins and aggregations.
- SPARK-9241 <https://issues.apache.org/jira/browse/SPARK-9241> Improved
query planner for queries having distinct aggregations - Query plans of
distinct aggregations are more robust when distinct columns have high
cardinality.
Spark Streaming
- API Updates
- SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629> New
improved state management - mapWithState - a DStream transformation
for stateful stream processing, supercedes updateStateByKey in
functionality and performance.
- SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198> Kinesis
record deaggregation - Kinesis streams have been upgraded to use KCL
1.4.0 and supports transparent deaggregation of KPL-aggregated records.
- SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891> Kinesis
message handler function - Allows arbitraray function to be applied
to a Kinesis record in the Kinesis receiver before to customize what data
is to be stored in memory.
- SPARK-6328 <https://issues.apache.org/jira/browse/SPARK-6328> Python
Streamng Listener API - Get streaming statistics (scheduling delays,
batch processing times, etc.) in streaming.
- UI Improvements
- Made failures visible in the streaming tab, in the timelines, batch
list, and batch details page.
- Made output operations visible in the streaming tab as progress
bars.
MLlibNew algorithms/models
- SPARK-8518 <https://issues.apache.org/jira/browse/SPARK-8518> Survival
analysis - Log-linear model for survival analysis
- SPARK-9834 <https://issues.apache.org/jira/browse/SPARK-9834> Normal
equation for least squares - Normal equation solver, providing R-like
model summary statistics
- SPARK-3147 <https://issues.apache.org/jira/browse/SPARK-3147> Online
hypothesis testing - A/B testing in the Spark Streaming framework
- SPARK-9930 <https://issues.apache.org/jira/browse/SPARK-9930> New
feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
transformer
- SPARK-6517 <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting
K-Means clustering - Fast top-down clustering variant of K-Means
API improvements
- ML Pipelines
- SPARK-6725 <https://issues.apache.org/jira/browse/SPARK-6725> Pipeline
persistence - Save/load for ML Pipelines, with partial coverage of
spark.mlalgorithms
- SPARK-5565 <https://issues.apache.org/jira/browse/SPARK-5565> LDA
in ML Pipelines - API for Latent Dirichlet Allocation in ML Pipelines
- R API
- SPARK-9836 <https://issues.apache.org/jira/browse/SPARK-9836> R-like
statistics for GLMs - (Partial) R-like stats for ordinary least
squares via summary(model)
- SPARK-9681 <https://issues.apache.org/jira/browse/SPARK-9681> Feature
interactions in R formula - Interaction operator ":" in R formula
- Python API - Many improvements to Python API to approach feature parity
Misc improvements
- SPARK-7685 <https://issues.apache.org/jira/browse/SPARK-7685>,
SPARK-9642 <https://issues.apache.org/jira/browse/SPARK-9642> Instance
weights for GLMs - Logistic and Linear Regression can take instance
weights
- SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate
and bivariate statistics in DataFrames - Variance, stddev, correlations,
etc.
- SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM
data source - LIBSVM as a SQL data sourceDocumentation improvements
- SPARK-7751 <https://issues.apache.org/jira/browse/SPARK-7751> @since
versions - Documentation includes initial version when classes and
methods were added
- SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable
example code - Automated testing for code in user guide examples
Deprecations
- In spark.mllib.clustering.KMeans, the "runs" parameter has been
deprecated.
- In spark.ml.classification.LogisticRegressionModel and
spark.ml.regression.LinearRegressionModel, the "weights" field has been
deprecated, in favor of the new name "coefficients." This helps
disambiguate from instance (row) weights given to algorithms.
Changes of behavior
- spark.mllib.tree.GradientBoostedTrees validationTol has changed
semantics in 1.6. Previously, it was a threshold for absolute change in
error. Now, it resembles the behavior of GradientDescent convergenceTol:
For large errors, it uses relative error (relative to the previous error);
for small errors (< 0.01), it uses absolute error.
- spark.ml.feature.RegexTokenizer: Previously, it did not convert
strings to lowercase before tokenizing. Now, it converts to lowercase by
default, with an option not to. This matches the behavior of the simpler
Tokenizer transformer.
- Spark SQL's partition discovery has been changed to only discover
partition directories that are children of the given path. (i.e. if
path="/my/data/x=1" then x=1 will no longer be considered a partition
but only children of x=1.) This behavior can be overridden by manually
specifying the basePath that partitioning discovery should start with (
SPARK-11678 <https://issues.apache.org/jira/browse/SPARK-11678>).
- When casting a value of an integral type to timestamp (e.g. casting a
long value to timestamp), the value is treated as being in seconds instead
of milliseconds (SPARK-11724
<https://issues.apache.org/jira/browse/SPARK-11724>).
- With the improved query planner for queries having distinct
aggregations (SPARK-9241
<https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a query
having a single distinct aggregation has been changed to a more robust
version. To switch back to the plan generated by Spark 1.5's planner,
please set spark.sql.specializeSingleDistinctAggPlanning to true (
SPARK-12077 <https://issues.apache.org/jira/browse/SPARK-12077>).
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Daniel Darabos <da...@lynxanalytics.com>.
+1 (non-binding)
It passes our tests after we registered 6 new classes with Kryo:
kryo.register(classOf[org.apache.spark.sql.catalyst.expressions.UnsafeRow])
kryo.register(classOf[Array[org.apache.spark.mllib.tree.model.Split]])
kryo.register(Class.forName("org.apache.spark.mllib.tree.model.Bin"))
kryo.register(Class.forName("[Lorg.apache.spark.mllib.tree.model.Bin;"))
kryo.register(Class.forName("org.apache.spark.mllib.tree.model.DummyLowSplit"))
kryo.register(Class.forName("org.apache.spark.mllib.tree.model.DummyHighSplit"))
It also spams "Managed memory leak detected; size = 15735058 bytes, TID =
847" for almost every task. I haven't yet figured out why.
On Fri, Dec 18, 2015 at 6:45 AM, Krishna Sankar <ks...@gmail.com> wrote:
> +1 (non-binding, of course)
>
> 1. Compiled OSX 10.10 (Yosemite) OK Total time: 29:32 min
> mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
> 2. Tested pyspark, mllib (iPython 4.0)
> 2.0 Spark version is 1.6.0
> 2.1. statistics (min,max,mean,Pearson,Spearman) OK
> 2.2. Linear/Ridge/Laso Regression OK
> 2.3. Decision Tree, Naive Bayes OK
> 2.4. KMeans OK
> Center And Scale OK
> 2.5. RDD operations OK
> State of the Union Texts - MapReduce, Filter,sortByKey (word count)
> 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
> Model evaluation/optimization (rank, numIter, lambda) with
> itertools OK
> 3. Scala - MLlib
> 3.1. statistics (min,max,mean,Pearson,Spearman) OK
> 3.2. LinearRegressionWithSGD OK
> 3.3. Decision Tree OK
> 3.4. KMeans OK
> 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
> 3.6. saveAsParquetFile OK
> 3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
> registerTempTable, sql OK
> 3.8. result = sqlContext.sql("SELECT
> OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
> JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
> 4.0. Spark SQL from Python OK
> 4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
> 5.0. Packages
> 5.1. com.databricks.spark.csv - read/write OK (--packages
> com.databricks:spark-csv_2.10:1.3.0)
> 6.0. DataFrames
> 6.1. cast,dtypes OK
> 6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
> 6.3. All joins,sql,set operations,udf OK
>
> Cheers & Good work guys
> <k/>
>
> On Wed, Dec 16, 2015 at 1:32 PM, Michael Armbrust <mi...@databricks.com>
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.6.0!
>>
>> The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.6.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is *v1.6.0-rc3
>> (168c89e07c51fa24b0bb88582c739cec0acb44d7)
>> <https://github.com/apache/spark/tree/v1.6.0-rc3>*
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1174/
>>
>> The test repository (versioned as v1.6.0-rc3) for this release can be
>> found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1173/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
>>
>> =======================================
>> == How can I help test this release? ==
>> =======================================
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> ================================================
>> == What justifies a -1 vote for this release? ==
>> ================================================
>> This vote is happening towards the end of the 1.6 QA period, so -1 votes
>> should only occur for significant regressions from 1.5. Bugs already
>> present in 1.5, minor regressions, or bugs related to new features will not
>> block this release.
>>
>> ===============================================================
>> == What should happen to JIRA tickets still targeting 1.6.0? ==
>> ===============================================================
>> 1. It is OK for documentation patches to target 1.6.0 and still go into
>> branch-1.6, since documentations will be published separately from the
>> release.
>> 2. New features for non-alpha-modules should target 1.7+.
>> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
>> version.
>>
>>
>> ==================================================
>> == Major changes to help you focus your testing ==
>> ==================================================
>>
>> Notable changes since 1.6 RC2
>> - SPARK_VERSION has been set correctly
>> - SPARK-12199 ML Docs are publishing correctly
>> - SPARK-12345 Mesos cluster mode has been fixed
>>
>> Notable changes since 1.6 RC1
>> Spark Streaming
>>
>> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
>> trackStateByKey has been renamed to mapWithState
>>
>> Spark SQL
>>
>> - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
>> SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix
>> bugs in eviction of storage memory by execution.
>> - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258> correct
>> passing null into ScalaUDF
>>
>> Notable Features Since 1.5Spark SQL
>>
>> - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> Parquet
>> Performance - Improve Parquet scan performance when using flat
>> schemas.
>> - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
>> Session Management - Isolated devault database (i.e USE mydb) even on
>> shared clusters.
>> - SPARK-9999 <https://issues.apache.org/jira/browse/SPARK-9999> Dataset
>> API - A type-safe API (similar to RDDs) that performs many operations
>> on serialized binary data and code generation (i.e. Project Tungsten).
>> - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000> Unified
>> Memory Management - Shared memory for execution and caching instead
>> of exclusive division of the regions.
>> - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
>> Queries on Files - Concise syntax for running SQL queries over files
>> of any supported format without registering a table.
>> - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745> Reading
>> non-standard JSON files - Added options to read non-standard JSON
>> files (e.g. single-quotes, unquoted attributes)
>> - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412> Per-operator
>> Metrics for SQL Execution - Display statistics on a peroperator basis
>> for memory usage and spilled data size.
>> - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
>> (*) expansion for StructTypes - Makes it easier to nest and unest
>> arbitrary numbers of columns
>> - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
>> SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149> In-memory
>> Columnar Cache Performance - Significant (up to 14x) speed up when
>> caching data that contains complex types in DataFrames or SQL.
>> - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
>> null-safe joins - Joins using null-safe equality (<=>) will now
>> execute using SortMergeJoin instead of computing a cartisian product.
>> - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
>> Execution Using Off-Heap Memory - Support for configuring query
>> execution to occur using off-heap memory to avoid GC overhead
>> - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978> Datasource
>> API Avoid Double Filter - When implemeting a datasource with filter
>> pushdown, developers can now tell Spark SQL to avoid double evaluating a
>> pushed-down filter.
>> - SPARK-4849 <https://issues.apache.org/jira/browse/SPARK-4849> Advanced
>> Layout of Cached Data - storing partitioning and ordering schemes in
>> In-memory table scan, and adding distributeBy and localSort to DF API
>> - SPARK-9858 <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive
>> query execution - Intial support for automatically selecting the
>> number of reducers for joins and aggregations.
>> - SPARK-9241 <https://issues.apache.org/jira/browse/SPARK-9241> Improved
>> query planner for queries having distinct aggregations - Query plans
>> of distinct aggregations are more robust when distinct columns have high
>> cardinality.
>>
>> Spark Streaming
>>
>> - API Updates
>> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629> New
>> improved state management - mapWithState - a DStream
>> transformation for stateful stream processing, supercedes
>> updateStateByKey in functionality and performance.
>> - SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198> Kinesis
>> record deaggregation - Kinesis streams have been upgraded to use
>> KCL 1.4.0 and supports transparent deaggregation of KPL-aggregated records.
>> - SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891> Kinesis
>> message handler function - Allows arbitraray function to be
>> applied to a Kinesis record in the Kinesis receiver before to customize
>> what data is to be stored in memory.
>> - SPARK-6328 <https://issues.apache.org/jira/browse/SPARK-6328> Python
>> Streamng Listener API - Get streaming statistics (scheduling
>> delays, batch processing times, etc.) in streaming.
>>
>>
>> - UI Improvements
>> - Made failures visible in the streaming tab, in the timelines,
>> batch list, and batch details page.
>> - Made output operations visible in the streaming tab as progress
>> bars.
>>
>> MLlibNew algorithms/models
>>
>> - SPARK-8518 <https://issues.apache.org/jira/browse/SPARK-8518> Survival
>> analysis - Log-linear model for survival analysis
>> - SPARK-9834 <https://issues.apache.org/jira/browse/SPARK-9834> Normal
>> equation for least squares - Normal equation solver, providing R-like
>> model summary statistics
>> - SPARK-3147 <https://issues.apache.org/jira/browse/SPARK-3147> Online
>> hypothesis testing - A/B testing in the Spark Streaming framework
>> - SPARK-9930 <https://issues.apache.org/jira/browse/SPARK-9930> New
>> feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
>> transformer
>> - SPARK-6517 <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting
>> K-Means clustering - Fast top-down clustering variant of K-Means
>>
>> API improvements
>>
>> - ML Pipelines
>> - SPARK-6725 <https://issues.apache.org/jira/browse/SPARK-6725> Pipeline
>> persistence - Save/load for ML Pipelines, with partial coverage of
>> spark.mlalgorithms
>> - SPARK-5565 <https://issues.apache.org/jira/browse/SPARK-5565> LDA
>> in ML Pipelines - API for Latent Dirichlet Allocation in ML
>> Pipelines
>> - R API
>> - SPARK-9836 <https://issues.apache.org/jira/browse/SPARK-9836> R-like
>> statistics for GLMs - (Partial) R-like stats for ordinary least
>> squares via summary(model)
>> - SPARK-9681 <https://issues.apache.org/jira/browse/SPARK-9681> Feature
>> interactions in R formula - Interaction operator ":" in R formula
>> - Python API - Many improvements to Python API to approach feature
>> parity
>>
>> Misc improvements
>>
>> - SPARK-7685 <https://issues.apache.org/jira/browse/SPARK-7685>,
>> SPARK-9642 <https://issues.apache.org/jira/browse/SPARK-9642> Instance
>> weights for GLMs - Logistic and Linear Regression can take instance
>> weights
>> - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
>> SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate
>> and bivariate statistics in DataFrames - Variance, stddev,
>> correlations, etc.
>> - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM
>> data source - LIBSVM as a SQL data sourceDocumentation improvements
>> - SPARK-7751 <https://issues.apache.org/jira/browse/SPARK-7751> @since
>> versions - Documentation includes initial version when classes and
>> methods were added
>> - SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable
>> example code - Automated testing for code in user guide examples
>>
>> Deprecations
>>
>> - In spark.mllib.clustering.KMeans, the "runs" parameter has been
>> deprecated.
>> - In spark.ml.classification.LogisticRegressionModel and
>> spark.ml.regression.LinearRegressionModel, the "weights" field has been
>> deprecated, in favor of the new name "coefficients." This helps
>> disambiguate from instance (row) weights given to algorithms.
>>
>> Changes of behavior
>>
>> - spark.mllib.tree.GradientBoostedTrees validationTol has changed
>> semantics in 1.6. Previously, it was a threshold for absolute change in
>> error. Now, it resembles the behavior of GradientDescent convergenceTol:
>> For large errors, it uses relative error (relative to the previous error);
>> for small errors (< 0.01), it uses absolute error.
>> - spark.ml.feature.RegexTokenizer: Previously, it did not convert
>> strings to lowercase before tokenizing. Now, it converts to lowercase by
>> default, with an option not to. This matches the behavior of the simpler
>> Tokenizer transformer.
>> - Spark SQL's partition discovery has been changed to only discover
>> partition directories that are children of the given path. (i.e. if
>> path="/my/data/x=1" then x=1 will no longer be considered a partition
>> but only children of x=1.) This behavior can be overridden by
>> manually specifying the basePath that partitioning discovery should
>> start with (SPARK-11678
>> <https://issues.apache.org/jira/browse/SPARK-11678>).
>> - When casting a value of an integral type to timestamp (e.g. casting
>> a long value to timestamp), the value is treated as being in seconds
>> instead of milliseconds (SPARK-11724
>> <https://issues.apache.org/jira/browse/SPARK-11724>).
>> - With the improved query planner for queries having distinct
>> aggregations (SPARK-9241
>> <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
>> query having a single distinct aggregation has been changed to a more
>> robust version. To switch back to the plan generated by Spark 1.5's
>> planner, please set spark.sql.specializeSingleDistinctAggPlanning to
>> true (SPARK-12077 <https://issues.apache.org/jira/browse/SPARK-12077>
>> ).
>>
>>
>
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Krishna Sankar <ks...@gmail.com>.
+1 (non-binding, of course)
1. Compiled OSX 10.10 (Yosemite) OK Total time: 29:32 min
mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
2. Tested pyspark, mllib (iPython 4.0)
2.0 Spark version is 1.6.0
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Laso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
Center And Scale OK
2.5. RDD operations OK
State of the Union Texts - MapReduce, Filter,sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
Model evaluation/optimization (rank, numIter, lambda) with itertools
OK
3. Scala - MLlib
3.1. statistics (min,max,mean,Pearson,Spearman) OK
3.2. LinearRegressionWithSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
3.6. saveAsParquetFile OK
3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
registerTempTable, sql OK
3.8. result = sqlContext.sql("SELECT
OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
4.0. Spark SQL from Python OK
4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
5.0. Packages
5.1. com.databricks.spark.csv - read/write OK (--packages
com.databricks:spark-csv_2.10:1.3.0)
6.0. DataFrames
6.1. cast,dtypes OK
6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
6.3. All joins,sql,set operations,udf OK
Cheers & Good work guys
<k/>
On Wed, Dec 16, 2015 at 1:32 PM, Michael Armbrust <mi...@databricks.com>
wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 1.6.0!
>
> The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is *v1.6.0-rc3
> (168c89e07c51fa24b0bb88582c739cec0acb44d7)
> <https://github.com/apache/spark/tree/v1.6.0-rc3>*
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1174/
>
> The test repository (versioned as v1.6.0-rc3) for this release can be
> found at:
> https://repository.apache.org/content/repositories/orgapachespark-1173/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
>
> =======================================
> == How can I help test this release? ==
> =======================================
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> ================================================
> == What justifies a -1 vote for this release? ==
> ================================================
> This vote is happening towards the end of the 1.6 QA period, so -1 votes
> should only occur for significant regressions from 1.5. Bugs already
> present in 1.5, minor regressions, or bugs related to new features will not
> block this release.
>
> ===============================================================
> == What should happen to JIRA tickets still targeting 1.6.0? ==
> ===============================================================
> 1. It is OK for documentation patches to target 1.6.0 and still go into
> branch-1.6, since documentations will be published separately from the
> release.
> 2. New features for non-alpha-modules should target 1.7+.
> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
> version.
>
>
> ==================================================
> == Major changes to help you focus your testing ==
> ==================================================
>
> Notable changes since 1.6 RC2
> - SPARK_VERSION has been set correctly
> - SPARK-12199 ML Docs are publishing correctly
> - SPARK-12345 Mesos cluster mode has been fixed
>
> Notable changes since 1.6 RC1
> Spark Streaming
>
> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
> trackStateByKey has been renamed to mapWithState
>
> Spark SQL
>
> - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
> SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix
> bugs in eviction of storage memory by execution.
> - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258> correct
> passing null into ScalaUDF
>
> Notable Features Since 1.5Spark SQL
>
> - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> Parquet
> Performance - Improve Parquet scan performance when using flat schemas.
> - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
> Session Management - Isolated devault database (i.e USE mydb) even on
> shared clusters.
> - SPARK-9999 <https://issues.apache.org/jira/browse/SPARK-9999> Dataset
> API - A type-safe API (similar to RDDs) that performs many operations
> on serialized binary data and code generation (i.e. Project Tungsten).
> - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000> Unified
> Memory Management - Shared memory for execution and caching instead of
> exclusive division of the regions.
> - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
> Queries on Files - Concise syntax for running SQL queries over files
> of any supported format without registering a table.
> - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745> Reading
> non-standard JSON files - Added options to read non-standard JSON
> files (e.g. single-quotes, unquoted attributes)
> - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412> Per-operator
> Metrics for SQL Execution - Display statistics on a peroperator basis
> for memory usage and spilled data size.
> - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
> (*) expansion for StructTypes - Makes it easier to nest and unest
> arbitrary numbers of columns
> - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
> SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149> In-memory
> Columnar Cache Performance - Significant (up to 14x) speed up when
> caching data that contains complex types in DataFrames or SQL.
> - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
> null-safe joins - Joins using null-safe equality (<=>) will now
> execute using SortMergeJoin instead of computing a cartisian product.
> - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
> Execution Using Off-Heap Memory - Support for configuring query
> execution to occur using off-heap memory to avoid GC overhead
> - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978> Datasource
> API Avoid Double Filter - When implemeting a datasource with filter
> pushdown, developers can now tell Spark SQL to avoid double evaluating a
> pushed-down filter.
> - SPARK-4849 <https://issues.apache.org/jira/browse/SPARK-4849> Advanced
> Layout of Cached Data - storing partitioning and ordering schemes in
> In-memory table scan, and adding distributeBy and localSort to DF API
> - SPARK-9858 <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive
> query execution - Intial support for automatically selecting the
> number of reducers for joins and aggregations.
> - SPARK-9241 <https://issues.apache.org/jira/browse/SPARK-9241> Improved
> query planner for queries having distinct aggregations - Query plans
> of distinct aggregations are more robust when distinct columns have high
> cardinality.
>
> Spark Streaming
>
> - API Updates
> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629> New
> improved state management - mapWithState - a DStream transformation
> for stateful stream processing, supercedes updateStateByKey in
> functionality and performance.
> - SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198> Kinesis
> record deaggregation - Kinesis streams have been upgraded to use
> KCL 1.4.0 and supports transparent deaggregation of KPL-aggregated records.
> - SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891> Kinesis
> message handler function - Allows arbitraray function to be applied
> to a Kinesis record in the Kinesis receiver before to customize what data
> is to be stored in memory.
> - SPARK-6328 <https://issues.apache.org/jira/browse/SPARK-6328> Python
> Streamng Listener API - Get streaming statistics (scheduling
> delays, batch processing times, etc.) in streaming.
>
>
> - UI Improvements
> - Made failures visible in the streaming tab, in the timelines,
> batch list, and batch details page.
> - Made output operations visible in the streaming tab as progress
> bars.
>
> MLlibNew algorithms/models
>
> - SPARK-8518 <https://issues.apache.org/jira/browse/SPARK-8518> Survival
> analysis - Log-linear model for survival analysis
> - SPARK-9834 <https://issues.apache.org/jira/browse/SPARK-9834> Normal
> equation for least squares - Normal equation solver, providing R-like
> model summary statistics
> - SPARK-3147 <https://issues.apache.org/jira/browse/SPARK-3147> Online
> hypothesis testing - A/B testing in the Spark Streaming framework
> - SPARK-9930 <https://issues.apache.org/jira/browse/SPARK-9930> New
> feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
> transformer
> - SPARK-6517 <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting
> K-Means clustering - Fast top-down clustering variant of K-Means
>
> API improvements
>
> - ML Pipelines
> - SPARK-6725 <https://issues.apache.org/jira/browse/SPARK-6725> Pipeline
> persistence - Save/load for ML Pipelines, with partial coverage of
> spark.mlalgorithms
> - SPARK-5565 <https://issues.apache.org/jira/browse/SPARK-5565> LDA
> in ML Pipelines - API for Latent Dirichlet Allocation in ML
> Pipelines
> - R API
> - SPARK-9836 <https://issues.apache.org/jira/browse/SPARK-9836> R-like
> statistics for GLMs - (Partial) R-like stats for ordinary least
> squares via summary(model)
> - SPARK-9681 <https://issues.apache.org/jira/browse/SPARK-9681> Feature
> interactions in R formula - Interaction operator ":" in R formula
> - Python API - Many improvements to Python API to approach feature
> parity
>
> Misc improvements
>
> - SPARK-7685 <https://issues.apache.org/jira/browse/SPARK-7685>,
> SPARK-9642 <https://issues.apache.org/jira/browse/SPARK-9642> Instance
> weights for GLMs - Logistic and Linear Regression can take instance
> weights
> - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
> SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate
> and bivariate statistics in DataFrames - Variance, stddev,
> correlations, etc.
> - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM
> data source - LIBSVM as a SQL data sourceDocumentation improvements
> - SPARK-7751 <https://issues.apache.org/jira/browse/SPARK-7751> @since
> versions - Documentation includes initial version when classes and
> methods were added
> - SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable
> example code - Automated testing for code in user guide examples
>
> Deprecations
>
> - In spark.mllib.clustering.KMeans, the "runs" parameter has been
> deprecated.
> - In spark.ml.classification.LogisticRegressionModel and
> spark.ml.regression.LinearRegressionModel, the "weights" field has been
> deprecated, in favor of the new name "coefficients." This helps
> disambiguate from instance (row) weights given to algorithms.
>
> Changes of behavior
>
> - spark.mllib.tree.GradientBoostedTrees validationTol has changed
> semantics in 1.6. Previously, it was a threshold for absolute change in
> error. Now, it resembles the behavior of GradientDescent convergenceTol:
> For large errors, it uses relative error (relative to the previous error);
> for small errors (< 0.01), it uses absolute error.
> - spark.ml.feature.RegexTokenizer: Previously, it did not convert
> strings to lowercase before tokenizing. Now, it converts to lowercase by
> default, with an option not to. This matches the behavior of the simpler
> Tokenizer transformer.
> - Spark SQL's partition discovery has been changed to only discover
> partition directories that are children of the given path. (i.e. if
> path="/my/data/x=1" then x=1 will no longer be considered a partition
> but only children of x=1.) This behavior can be overridden by manually
> specifying the basePath that partitioning discovery should start with (
> SPARK-11678 <https://issues.apache.org/jira/browse/SPARK-11678>).
> - When casting a value of an integral type to timestamp (e.g. casting
> a long value to timestamp), the value is treated as being in seconds
> instead of milliseconds (SPARK-11724
> <https://issues.apache.org/jira/browse/SPARK-11724>).
> - With the improved query planner for queries having distinct
> aggregations (SPARK-9241
> <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
> query having a single distinct aggregation has been changed to a more
> robust version. To switch back to the plan generated by Spark 1.5's
> planner, please set spark.sql.specializeSingleDistinctAggPlanning to
> true (SPARK-12077 <https://issues.apache.org/jira/browse/SPARK-12077>).
>
>
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Denny Lee <de...@gmail.com>.
+1 (non-binding)
Tested a number of tests surrounding DataFrames, Datasets, and ML.
On Wed, Dec 16, 2015 at 1:32 PM Michael Armbrust <mi...@databricks.com>
wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 1.6.0!
>
> The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is *v1.6.0-rc3
> (168c89e07c51fa24b0bb88582c739cec0acb44d7)
> <https://github.com/apache/spark/tree/v1.6.0-rc3>*
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1174/
>
> The test repository (versioned as v1.6.0-rc3) for this release can be
> found at:
> https://repository.apache.org/content/repositories/orgapachespark-1173/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
>
> =======================================
> == How can I help test this release? ==
> =======================================
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> ================================================
> == What justifies a -1 vote for this release? ==
> ================================================
> This vote is happening towards the end of the 1.6 QA period, so -1 votes
> should only occur for significant regressions from 1.5. Bugs already
> present in 1.5, minor regressions, or bugs related to new features will not
> block this release.
>
> ===============================================================
> == What should happen to JIRA tickets still targeting 1.6.0? ==
> ===============================================================
> 1. It is OK for documentation patches to target 1.6.0 and still go into
> branch-1.6, since documentations will be published separately from the
> release.
> 2. New features for non-alpha-modules should target 1.7+.
> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
> version.
>
>
> ==================================================
> == Major changes to help you focus your testing ==
> ==================================================
>
> Notable changes since 1.6 RC2
> - SPARK_VERSION has been set correctly
> - SPARK-12199 ML Docs are publishing correctly
> - SPARK-12345 Mesos cluster mode has been fixed
>
> Notable changes since 1.6 RC1
> Spark Streaming
>
> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
> trackStateByKey has been renamed to mapWithState
>
> Spark SQL
>
> - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
> SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix
> bugs in eviction of storage memory by execution.
> - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258> correct
> passing null into ScalaUDF
>
> Notable Features Since 1.5Spark SQL
>
> - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> Parquet
> Performance - Improve Parquet scan performance when using flat schemas.
> - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
> Session Management - Isolated devault database (i.e USE mydb) even on
> shared clusters.
> - SPARK-9999 <https://issues.apache.org/jira/browse/SPARK-9999> Dataset
> API - A type-safe API (similar to RDDs) that performs many operations
> on serialized binary data and code generation (i.e. Project Tungsten).
> - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000> Unified
> Memory Management - Shared memory for execution and caching instead of
> exclusive division of the regions.
> - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
> Queries on Files - Concise syntax for running SQL queries over files
> of any supported format without registering a table.
> - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745> Reading
> non-standard JSON files - Added options to read non-standard JSON
> files (e.g. single-quotes, unquoted attributes)
> - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412> Per-operator
> Metrics for SQL Execution - Display statistics on a peroperator basis
> for memory usage and spilled data size.
> - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
> (*) expansion for StructTypes - Makes it easier to nest and unest
> arbitrary numbers of columns
> - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
> SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149> In-memory
> Columnar Cache Performance - Significant (up to 14x) speed up when
> caching data that contains complex types in DataFrames or SQL.
> - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
> null-safe joins - Joins using null-safe equality (<=>) will now
> execute using SortMergeJoin instead of computing a cartisian product.
> - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
> Execution Using Off-Heap Memory - Support for configuring query
> execution to occur using off-heap memory to avoid GC overhead
> - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978> Datasource
> API Avoid Double Filter - When implemeting a datasource with filter
> pushdown, developers can now tell Spark SQL to avoid double evaluating a
> pushed-down filter.
> - SPARK-4849 <https://issues.apache.org/jira/browse/SPARK-4849> Advanced
> Layout of Cached Data - storing partitioning and ordering schemes in
> In-memory table scan, and adding distributeBy and localSort to DF API
> - SPARK-9858 <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive
> query execution - Intial support for automatically selecting the
> number of reducers for joins and aggregations.
> - SPARK-9241 <https://issues.apache.org/jira/browse/SPARK-9241> Improved
> query planner for queries having distinct aggregations - Query plans
> of distinct aggregations are more robust when distinct columns have high
> cardinality.
>
> Spark Streaming
>
> - API Updates
> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629> New
> improved state management - mapWithState - a DStream transformation
> for stateful stream processing, supercedes updateStateByKey in
> functionality and performance.
> - SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198> Kinesis
> record deaggregation - Kinesis streams have been upgraded to use
> KCL 1.4.0 and supports transparent deaggregation of KPL-aggregated records.
> - SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891> Kinesis
> message handler function - Allows arbitraray function to be applied
> to a Kinesis record in the Kinesis receiver before to customize what data
> is to be stored in memory.
> - SPARK-6328 <https://issues.apache.org/jira/browse/SPARK-6328> Python
> Streamng Listener API - Get streaming statistics (scheduling
> delays, batch processing times, etc.) in streaming.
>
>
> - UI Improvements
> - Made failures visible in the streaming tab, in the timelines,
> batch list, and batch details page.
> - Made output operations visible in the streaming tab as progress
> bars.
>
> MLlibNew algorithms/models
>
> - SPARK-8518 <https://issues.apache.org/jira/browse/SPARK-8518> Survival
> analysis - Log-linear model for survival analysis
> - SPARK-9834 <https://issues.apache.org/jira/browse/SPARK-9834> Normal
> equation for least squares - Normal equation solver, providing R-like
> model summary statistics
> - SPARK-3147 <https://issues.apache.org/jira/browse/SPARK-3147> Online
> hypothesis testing - A/B testing in the Spark Streaming framework
> - SPARK-9930 <https://issues.apache.org/jira/browse/SPARK-9930> New
> feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
> transformer
> - SPARK-6517 <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting
> K-Means clustering - Fast top-down clustering variant of K-Means
>
> API improvements
>
> - ML Pipelines
> - SPARK-6725 <https://issues.apache.org/jira/browse/SPARK-6725> Pipeline
> persistence - Save/load for ML Pipelines, with partial coverage of
> spark.mlalgorithms
> - SPARK-5565 <https://issues.apache.org/jira/browse/SPARK-5565> LDA
> in ML Pipelines - API for Latent Dirichlet Allocation in ML
> Pipelines
> - R API
> - SPARK-9836 <https://issues.apache.org/jira/browse/SPARK-9836> R-like
> statistics for GLMs - (Partial) R-like stats for ordinary least
> squares via summary(model)
> - SPARK-9681 <https://issues.apache.org/jira/browse/SPARK-9681> Feature
> interactions in R formula - Interaction operator ":" in R formula
> - Python API - Many improvements to Python API to approach feature
> parity
>
> Misc improvements
>
> - SPARK-7685 <https://issues.apache.org/jira/browse/SPARK-7685>,
> SPARK-9642 <https://issues.apache.org/jira/browse/SPARK-9642> Instance
> weights for GLMs - Logistic and Linear Regression can take instance
> weights
> - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
> SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate
> and bivariate statistics in DataFrames - Variance, stddev,
> correlations, etc.
> - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM
> data source - LIBSVM as a SQL data sourceDocumentation improvements
> - SPARK-7751 <https://issues.apache.org/jira/browse/SPARK-7751> @since
> versions - Documentation includes initial version when classes and
> methods were added
> - SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable
> example code - Automated testing for code in user guide examples
>
> Deprecations
>
> - In spark.mllib.clustering.KMeans, the "runs" parameter has been
> deprecated.
> - In spark.ml.classification.LogisticRegressionModel and
> spark.ml.regression.LinearRegressionModel, the "weights" field has been
> deprecated, in favor of the new name "coefficients." This helps
> disambiguate from instance (row) weights given to algorithms.
>
> Changes of behavior
>
> - spark.mllib.tree.GradientBoostedTrees validationTol has changed
> semantics in 1.6. Previously, it was a threshold for absolute change in
> error. Now, it resembles the behavior of GradientDescent convergenceTol:
> For large errors, it uses relative error (relative to the previous error);
> for small errors (< 0.01), it uses absolute error.
> - spark.ml.feature.RegexTokenizer: Previously, it did not convert
> strings to lowercase before tokenizing. Now, it converts to lowercase by
> default, with an option not to. This matches the behavior of the simpler
> Tokenizer transformer.
> - Spark SQL's partition discovery has been changed to only discover
> partition directories that are children of the given path. (i.e. if
> path="/my/data/x=1" then x=1 will no longer be considered a partition
> but only children of x=1.) This behavior can be overridden by manually
> specifying the basePath that partitioning discovery should start with (
> SPARK-11678 <https://issues.apache.org/jira/browse/SPARK-11678>).
> - When casting a value of an integral type to timestamp (e.g. casting
> a long value to timestamp), the value is treated as being in seconds
> instead of milliseconds (SPARK-11724
> <https://issues.apache.org/jira/browse/SPARK-11724>).
> - With the improved query planner for queries having distinct
> aggregations (SPARK-9241
> <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
> query having a single distinct aggregation has been changed to a more
> robust version. To switch back to the plan generated by Spark 1.5's
> planner, please set spark.sql.specializeSingleDistinctAggPlanning to
> true (SPARK-12077 <https://issues.apache.org/jira/browse/SPARK-12077>).
>
>
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
+1 (non binding)
Tested in standalone and yarn with different samples.
Regards
JB
On 12/16/2015 10:32 PM, Michael Armbrust wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 1.6.0!
>
> The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is _v1.6.0-rc3
> (168c89e07c51fa24b0bb88582c739cec0acb44d7)
> <https://github.com/apache/spark/tree/v1.6.0-rc3>_
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1174/
>
> The test repository (versioned as v1.6.0-rc3) for this release can be
> found at:
> https://repository.apache.org/content/repositories/orgapachespark-1173/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
>
> =======================================
> == How can I help test this release? ==
> =======================================
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> ================================================
> == What justifies a -1 vote for this release? ==
> ================================================
> This vote is happening towards the end of the 1.6 QA period, so -1 votes
> should only occur for significant regressions from 1.5. Bugs already
> present in 1.5, minor regressions, or bugs related to new features will
> not block this release.
>
> ===============================================================
> == What should happen to JIRA tickets still targeting 1.6.0? ==
> ===============================================================
> 1. It is OK for documentation patches to target 1.6.0 and still go into
> branch-1.6, since documentations will be published separately from the
> release.
> 2. New features for non-alpha-modules should target 1.7+.
> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the
> target version.
>
>
> ==================================================
> == Major changes to help you focus your testing ==
> ==================================================
>
>
> Notable changes since 1.6 RC2
>
>
> - SPARK_VERSION has been set correctly
> - SPARK-12199 ML Docs are publishing correctly
> - SPARK-12345 Mesos cluster mode has been fixed
>
>
> Notable changes since 1.6 RC1
>
>
> Spark Streaming
>
> * SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
> |trackStateByKey| has been renamed to |mapWithState|
>
>
> Spark SQL
>
> * SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
> SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix
> bugs in eviction of storage memory by execution.
> * SPARK-12258
> <https://issues.apache.org/jira/browse/SPARK-12258> correct passing
> null into ScalaUDF
>
>
> Notable Features Since 1.5
>
>
> Spark SQL
>
> * SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787>
> Parquet Performance - Improve Parquet scan performance when using
> flat schemas.
> * SPARK-10810
> <https://issues.apache.org/jira/browse/SPARK-10810>Session
> Management - Isolated devault database (i.e |USE mydb|) even on
> shared clusters.
> * SPARK-9999 <https://issues.apache.org/jira/browse/SPARK-9999>
> Dataset API - A type-safe API (similar to RDDs) that performs many
> operations on serialized binary data and code generation (i.e.
> Project Tungsten).
> * SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000>
> Unified Memory Management - Shared memory for execution and caching
> instead of exclusive division of the regions.
> * SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
> Queries on Files - Concise syntax for running SQL queries over files
> of any supported format without registering a table.
> * SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745>
> Reading non-standard JSON files - Added options to read non-standard
> JSON files (e.g. single-quotes, unquoted attributes)
> * SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412>
> Per-operator Metrics for SQL Execution - Display statistics on a
> peroperator basis for memory usage and spilled data size.
> * SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
> (*) expansion for StructTypes - Makes it easier to nest and unest
> arbitrary numbers of columns
> * SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
> SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149>
> In-memory Columnar Cache Performance - Significant (up to 14x) speed
> up when caching data that contains complex types in DataFrames or SQL.
> * SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
> null-safe joins - Joins using null-safe equality (|<=>|) will now
> execute using SortMergeJoin instead of computing a cartisian product.
> * SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
> Execution Using Off-Heap Memory - Support for configuring query
> execution to occur using off-heap memory to avoid GC overhead
> * SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978>
> Datasource API Avoid Double Filter - When implemeting a datasource
> with filter pushdown, developers can now tell Spark SQL to avoid
> double evaluating a pushed-down filter.
> * SPARK-4849 <https://issues.apache.org/jira/browse/SPARK-4849>
> Advanced Layout of Cached Data - storing partitioning and ordering
> schemes in In-memory table scan, and adding distributeBy and
> localSort to DF API
> * SPARK-9858 <https://issues.apache.org/jira/browse/SPARK-9858>
> Adaptive query execution - Intial support for automatically
> selecting the number of reducers for joins and aggregations.
> * SPARK-9241 <https://issues.apache.org/jira/browse/SPARK-9241>
> Improved query planner for queries having distinct aggregations -
> Query plans of distinct aggregations are more robust when distinct
> columns have high cardinality.
>
>
> Spark Streaming
>
> * API Updates
> o SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
> New improved state management - |mapWithState| - a DStream
> transformation for stateful stream processing, supercedes
> |updateStateByKey| in functionality and performance.
> o SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198>
> Kinesis record deaggregation - Kinesis streams have been
> upgraded to use KCL 1.4.0 and supports transparent deaggregation
> of KPL-aggregated records.
> o SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891>
> Kinesis message handler function - Allows arbitraray function to
> be applied to a Kinesis record in the Kinesis receiver before to
> customize what data is to be stored in memory.
> o SPARK-6328 <https://issues.apache.org/jira/browse/SPARK-6328>
> Python Streamng Listener API - Get streaming statistics
> (scheduling delays, batch processing times, etc.) in streaming.
>
> * UI Improvements
> o Made failures visible in the streaming tab, in the timelines,
> batch list, and batch details page.
> o Made output operations visible in the streaming tab as progress
> bars.
>
>
> MLlib
>
>
> New algorithms/models
>
> * SPARK-8518 <https://issues.apache.org/jira/browse/SPARK-8518>
> Survival analysis - Log-linear model for survival analysis
> * SPARK-9834 <https://issues.apache.org/jira/browse/SPARK-9834> Normal
> equation for least squares - Normal equation solver, providing
> R-like model summary statistics
> * SPARK-3147 <https://issues.apache.org/jira/browse/SPARK-3147> Online
> hypothesis testing - A/B testing in the Spark Streaming framework
> * SPARK-9930 <https://issues.apache.org/jira/browse/SPARK-9930> New
> feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
> transformer
> * SPARK-6517 <https://issues.apache.org/jira/browse/SPARK-6517>
> Bisecting K-Means clustering - Fast top-down clustering variant of
> K-Means
>
>
> API improvements
>
> * ML Pipelines
> o SPARK-6725 <https://issues.apache.org/jira/browse/SPARK-6725>
> Pipeline persistence - Save/load for ML Pipelines, with partial
> coverage of spark.ml <http://spark.ml/>algorithms
> o SPARK-5565 <https://issues.apache.org/jira/browse/SPARK-5565>
> LDA in ML Pipelines - API for Latent Dirichlet Allocation in ML
> Pipelines
> * R API
> o SPARK-9836 <https://issues.apache.org/jira/browse/SPARK-9836>
> R-like statistics for GLMs - (Partial) R-like stats for ordinary
> least squares via summary(model)
> o SPARK-9681 <https://issues.apache.org/jira/browse/SPARK-9681>
> Feature interactions in R formula - Interaction operator ":" in
> R formula
> * Python API - Many improvements to Python API to approach feature parity
>
>
> Misc improvements
>
> * SPARK-7685 <https://issues.apache.org/jira/browse/SPARK-7685>,
> SPARK-9642 <https://issues.apache.org/jira/browse/SPARK-9642>
> Instance weights for GLMs - Logistic and Linear Regression can take
> instance weights
> * SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
> SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385>
> Univariate and bivariate statistics in DataFrames - Variance,
> stddev, correlations, etc.
> * SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117>
> LIBSVM data source - LIBSVM as a SQL data source
>
>
> Documentation improvements
>
> * SPARK-7751 <https://issues.apache.org/jira/browse/SPARK-7751> @since
> versions - Documentation includes initial version when classes and
> methods were added
> * SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337>
> Testable example code - Automated testing for code in user guide
> examples
>
>
> Deprecations
>
> * In spark.mllib.clustering.KMeans, the "runs" parameter has been
> deprecated.
> * In spark.ml.classification.LogisticRegressionModel and
> spark.ml.regression.LinearRegressionModel, the "weights" field has
> been deprecated, in favor of the new name "coefficients." This helps
> disambiguate from instance (row) weights given to algorithms.
>
>
> Changes of behavior
>
> * spark.mllib.tree.GradientBoostedTrees validationTol has changed
> semantics in 1.6. Previously, it was a threshold for absolute change
> in error. Now, it resembles the behavior of GradientDescent
> convergenceTol: For large errors, it uses relative error (relative
> to the previous error); for small errors (< 0.01), it uses absolute
> error.
> * spark.ml.feature.RegexTokenizer: Previously, it did not convert
> strings to lowercase before tokenizing. Now, it converts to
> lowercase by default, with an option not to. This matches the
> behavior of the simpler Tokenizer transformer.
> * Spark SQL's partition discovery has been changed to only discover
> partition directories that are children of the given path. (i.e. if
> |path="/my/data/x=1"| then |x=1| will no longer be considered a
> partition but only children of |x=1|.) This behavior can be
> overridden by manually specifying the |basePath| that partitioning
> discovery should start with (SPARK-11678
> <https://issues.apache.org/jira/browse/SPARK-11678>).
> * When casting a value of an integral type to timestamp (e.g. casting
> a long value to timestamp), the value is treated as being in seconds
> instead of milliseconds (SPARK-11724
> <https://issues.apache.org/jira/browse/SPARK-11724>).
> * With the improved query planner for queries having distinct
> aggregations (SPARK-9241
> <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
> query having a single distinct aggregation has been changed to a
> more robust version. To switch back to the plan generated by Spark
> 1.5's planner, please set
> |spark.sql.specializeSingleDistinctAggPlanning| to
> |true| (SPARK-12077
> <https://issues.apache.org/jira/browse/SPARK-12077>).
>
--
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Michael Armbrust <mi...@databricks.com>.
It's come to my attention that there have been several bug fixes merged
since RC3:
- SPARK-12404 - Fix serialization error for Datasets with
Timestamps/Arrays/Decimal
- SPARK-12218 - Fix incorrect pushdown of filters to parquet
- SPARK-12395 - Fix join columns of outer join for DataFrame using
- SPARK-12413 - Fix mesos HA
Normally, these would probably not be sufficient to hold the release,
however with the holidays going on in the US this week, we don't have the
resources to finalize 1.6 until next Monday. Given this delay anyway, I
propose that we cut one final RC with the above fixes and plan for the
actual release first thing next week.
I'll post RC4 shortly and cancel this vote if there are no objections.
Since this vote nearly passed with no major issues, I don't anticipate any
problems with RC4.
Michael
On Sat, Dec 19, 2015 at 11:44 PM, Jeff Zhang <zj...@gmail.com> wrote:
> +1 (non-binding)
>
> All the test passed, and run it on HDP 2.3.2 sandbox successfully.
>
> On Sun, Dec 20, 2015 at 10:43 AM, Luciano Resende <lu...@gmail.com>
> wrote:
>
>> +1 (non-binding)
>>
>> Tested Standalone mode, SparkR and couple Stream Apps, all seem ok.
>>
>> On Wed, Dec 16, 2015 at 1:32 PM, Michael Armbrust <michael@databricks.com
>> > wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 1.6.0!
>>>
>>> The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
>>> passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 1.6.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is *v1.6.0-rc3
>>> (168c89e07c51fa24b0bb88582c739cec0acb44d7)
>>> <https://github.com/apache/spark/tree/v1.6.0-rc3>*
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1174/
>>>
>>> The test repository (versioned as v1.6.0-rc3) for this release can be
>>> found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1173/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
>>>
>>> =======================================
>>> == How can I help test this release? ==
>>> =======================================
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> ================================================
>>> == What justifies a -1 vote for this release? ==
>>> ================================================
>>> This vote is happening towards the end of the 1.6 QA period, so -1 votes
>>> should only occur for significant regressions from 1.5. Bugs already
>>> present in 1.5, minor regressions, or bugs related to new features will not
>>> block this release.
>>>
>>> ===============================================================
>>> == What should happen to JIRA tickets still targeting 1.6.0? ==
>>> ===============================================================
>>> 1. It is OK for documentation patches to target 1.6.0 and still go into
>>> branch-1.6, since documentations will be published separately from the
>>> release.
>>> 2. New features for non-alpha-modules should target 1.7+.
>>> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the
>>> target version.
>>>
>>>
>>> ==================================================
>>> == Major changes to help you focus your testing ==
>>> ==================================================
>>>
>>> Notable changes since 1.6 RC2
>>> - SPARK_VERSION has been set correctly
>>> - SPARK-12199 ML Docs are publishing correctly
>>> - SPARK-12345 Mesos cluster mode has been fixed
>>>
>>> Notable changes since 1.6 RC1
>>> Spark Streaming
>>>
>>> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
>>> trackStateByKey has been renamed to mapWithState
>>>
>>> Spark SQL
>>>
>>> - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
>>> SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix
>>> bugs in eviction of storage memory by execution.
>>> - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258> correct
>>> passing null into ScalaUDF
>>>
>>> Notable Features Since 1.5Spark SQL
>>>
>>> - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> Parquet
>>> Performance - Improve Parquet scan performance when using flat
>>> schemas.
>>> - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
>>> Session Management - Isolated devault database (i.e USE mydb) even
>>> on shared clusters.
>>> - SPARK-9999 <https://issues.apache.org/jira/browse/SPARK-9999> Dataset
>>> API - A type-safe API (similar to RDDs) that performs many
>>> operations on serialized binary data and code generation (i.e. Project
>>> Tungsten).
>>> - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000> Unified
>>> Memory Management - Shared memory for execution and caching instead
>>> of exclusive division of the regions.
>>> - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
>>> Queries on Files - Concise syntax for running SQL queries over files
>>> of any supported format without registering a table.
>>> - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745> Reading
>>> non-standard JSON files - Added options to read non-standard JSON
>>> files (e.g. single-quotes, unquoted attributes)
>>> - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412> Per-operator
>>> Metrics for SQL Execution - Display statistics on a peroperator
>>> basis for memory usage and spilled data size.
>>> - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
>>> (*) expansion for StructTypes - Makes it easier to nest and unest
>>> arbitrary numbers of columns
>>> - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
>>> SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149> In-memory
>>> Columnar Cache Performance - Significant (up to 14x) speed up when
>>> caching data that contains complex types in DataFrames or SQL.
>>> - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
>>> null-safe joins - Joins using null-safe equality (<=>) will now
>>> execute using SortMergeJoin instead of computing a cartisian product.
>>> - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
>>> Execution Using Off-Heap Memory - Support for configuring query
>>> execution to occur using off-heap memory to avoid GC overhead
>>> - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978> Datasource
>>> API Avoid Double Filter - When implemeting a datasource with filter
>>> pushdown, developers can now tell Spark SQL to avoid double evaluating a
>>> pushed-down filter.
>>> - SPARK-4849 <https://issues.apache.org/jira/browse/SPARK-4849> Advanced
>>> Layout of Cached Data - storing partitioning and ordering schemes in
>>> In-memory table scan, and adding distributeBy and localSort to DF API
>>> - SPARK-9858 <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive
>>> query execution - Intial support for automatically selecting the
>>> number of reducers for joins and aggregations.
>>> - SPARK-9241 <https://issues.apache.org/jira/browse/SPARK-9241> Improved
>>> query planner for queries having distinct aggregations - Query plans
>>> of distinct aggregations are more robust when distinct columns have high
>>> cardinality.
>>>
>>> Spark Streaming
>>>
>>> - API Updates
>>> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629> New
>>> improved state management - mapWithState - a DStream
>>> transformation for stateful stream processing, supercedes
>>> updateStateByKey in functionality and performance.
>>> - SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198> Kinesis
>>> record deaggregation - Kinesis streams have been upgraded to use
>>> KCL 1.4.0 and supports transparent deaggregation of KPL-aggregated records.
>>> - SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891> Kinesis
>>> message handler function - Allows arbitraray function to be
>>> applied to a Kinesis record in the Kinesis receiver before to customize
>>> what data is to be stored in memory.
>>> - SPARK-6328 <https://issues.apache.org/jira/browse/SPARK-6328> Python
>>> Streamng Listener API - Get streaming statistics (scheduling
>>> delays, batch processing times, etc.) in streaming.
>>>
>>>
>>> - UI Improvements
>>> - Made failures visible in the streaming tab, in the timelines,
>>> batch list, and batch details page.
>>> - Made output operations visible in the streaming tab as progress
>>> bars.
>>>
>>> MLlibNew algorithms/models
>>>
>>> - SPARK-8518 <https://issues.apache.org/jira/browse/SPARK-8518> Survival
>>> analysis - Log-linear model for survival analysis
>>> - SPARK-9834 <https://issues.apache.org/jira/browse/SPARK-9834> Normal
>>> equation for least squares - Normal equation solver, providing
>>> R-like model summary statistics
>>> - SPARK-3147 <https://issues.apache.org/jira/browse/SPARK-3147> Online
>>> hypothesis testing - A/B testing in the Spark Streaming framework
>>> - SPARK-9930 <https://issues.apache.org/jira/browse/SPARK-9930> New
>>> feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
>>> transformer
>>> - SPARK-6517 <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting
>>> K-Means clustering - Fast top-down clustering variant of K-Means
>>>
>>> API improvements
>>>
>>> - ML Pipelines
>>> - SPARK-6725 <https://issues.apache.org/jira/browse/SPARK-6725> Pipeline
>>> persistence - Save/load for ML Pipelines, with partial coverage
>>> of spark.mlalgorithms
>>> - SPARK-5565 <https://issues.apache.org/jira/browse/SPARK-5565> LDA
>>> in ML Pipelines - API for Latent Dirichlet Allocation in ML
>>> Pipelines
>>> - R API
>>> - SPARK-9836 <https://issues.apache.org/jira/browse/SPARK-9836> R-like
>>> statistics for GLMs - (Partial) R-like stats for ordinary least
>>> squares via summary(model)
>>> - SPARK-9681 <https://issues.apache.org/jira/browse/SPARK-9681> Feature
>>> interactions in R formula - Interaction operator ":" in R formula
>>> - Python API - Many improvements to Python API to approach feature
>>> parity
>>>
>>> Misc improvements
>>>
>>> - SPARK-7685 <https://issues.apache.org/jira/browse/SPARK-7685>,
>>> SPARK-9642 <https://issues.apache.org/jira/browse/SPARK-9642> Instance
>>> weights for GLMs - Logistic and Linear Regression can take instance
>>> weights
>>> - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
>>> SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate
>>> and bivariate statistics in DataFrames - Variance, stddev,
>>> correlations, etc.
>>> - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM
>>> data source - LIBSVM as a SQL data sourceDocumentation improvements
>>> - SPARK-7751 <https://issues.apache.org/jira/browse/SPARK-7751> @since
>>> versions - Documentation includes initial version when classes and
>>> methods were added
>>> - SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable
>>> example code - Automated testing for code in user guide examples
>>>
>>> Deprecations
>>>
>>> - In spark.mllib.clustering.KMeans, the "runs" parameter has been
>>> deprecated.
>>> - In spark.ml.classification.LogisticRegressionModel and
>>> spark.ml.regression.LinearRegressionModel, the "weights" field has been
>>> deprecated, in favor of the new name "coefficients." This helps
>>> disambiguate from instance (row) weights given to algorithms.
>>>
>>> Changes of behavior
>>>
>>> - spark.mllib.tree.GradientBoostedTrees validationTol has changed
>>> semantics in 1.6. Previously, it was a threshold for absolute change in
>>> error. Now, it resembles the behavior of GradientDescent convergenceTol:
>>> For large errors, it uses relative error (relative to the previous error);
>>> for small errors (< 0.01), it uses absolute error.
>>> - spark.ml.feature.RegexTokenizer: Previously, it did not convert
>>> strings to lowercase before tokenizing. Now, it converts to lowercase by
>>> default, with an option not to. This matches the behavior of the simpler
>>> Tokenizer transformer.
>>> - Spark SQL's partition discovery has been changed to only discover
>>> partition directories that are children of the given path. (i.e. if
>>> path="/my/data/x=1" then x=1 will no longer be considered a
>>> partition but only children of x=1.) This behavior can be overridden
>>> by manually specifying the basePath that partitioning discovery
>>> should start with (SPARK-11678
>>> <https://issues.apache.org/jira/browse/SPARK-11678>).
>>> - When casting a value of an integral type to timestamp (e.g.
>>> casting a long value to timestamp), the value is treated as being in
>>> seconds instead of milliseconds (SPARK-11724
>>> <https://issues.apache.org/jira/browse/SPARK-11724>).
>>> - With the improved query planner for queries having distinct
>>> aggregations (SPARK-9241
>>> <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
>>> query having a single distinct aggregation has been changed to a more
>>> robust version. To switch back to the plan generated by Spark 1.5's
>>> planner, please set spark.sql.specializeSingleDistinctAggPlanning to
>>> true (SPARK-12077 <https://issues.apache.org/jira/browse/SPARK-12077>
>>> ).
>>>
>>>
>>
>>
>> --
>> Luciano Resende
>> http://people.apache.org/~lresende
>> http://twitter.com/lresende1975
>> http://lresende.blogspot.com/
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>
回复: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Ricky <49...@qq.com>.
SizeBasedRollingPolicy print too many log when spark.executor.logs.rolling.strategy is size , shouldRollover use logInfo method:
def shouldRollover(bytesToBeWritten: Long): Boolean = {
logInfo(s"$bytesToBeWritten + $bytesWrittenSinceRollover > $rolloverSizeBytes")
bytesToBeWritten + bytesWrittenSinceRollover > rolloverSizeBytes
}
logs.rolling:
spark.executor.logs.rolling.strategy size
spark.executor.logs.rolling.maxSize 134217728
spark.executor.logs.rolling.maxRetainedFiles 8
Can use logdebug instead of loginfo ?
------------------
Best Regards
Ricky Yang
------------------ 原始邮件 ------------------
发件人: "Jeff Zhang";<zj...@gmail.com>;
发送时间: 2015年12月20日(星期天) 下午3:44
收件人: "Luciano Resende"<lu...@gmail.com>;
抄送: "Michael Armbrust"<mi...@databricks.com>; "dev@spark.apache.org"<de...@spark.apache.org>;
主题: Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
+1 (non-binding)
All the test passed, and run it on HDP 2.3.2 sandbox successfully.
On Sun, Dec 20, 2015 at 10:43 AM, Luciano Resende < luckbr1975@gmail.com > wrote:
+1 (non-binding)
Tested Standalone mode, SparkR and couple Stream Apps, all seem ok.
On Wed, Dec 16, 2015 at 1:32 PM, Michael Armbrust < michael@databricks.com > wrote:
Please vote on releasing the following candidate as Apache Spark version 1.6.0!
The vote is open until Saturday, December 19, 2015 at 18 :00 UTC and passes if a majority of at least 3 +1 PMC votes are cast.
[ ] +1 Release this package as Apache Spark 1.6.0
[ ] -1 Do not release this package because ...
To learn more about Apache Spark, please see http://spark.apache.org/
The tag to be voted on is v1.6.0-rc3 (168c89e07c51fa24b0bb88582c739cec0acb44d7)
The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc
The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1174/
The test repository (versioned as v1.6.0-rc3) for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1173/
The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
=======================================
== How can I help test this release? ==
=======================================
If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions.
================================================
== What justifies a -1 vote for this release? ==
================================================
This vote is happening towards the end of the 1.6 QA period, so -1 votes should only occur for significant regressions from 1.5. Bugs already present in 1.5, minor regressions, or bugs related to new features will not block this release.
===============================================================
== What should happen to JIRA tickets still targeting 1.6.0? ==
===============================================================
1. It is OK for documentation patches to target 1.6.0 and still go into branch-1.6, since documentations will be published separately from the release.
2. New features for non-alpha-modules should target 1.7+.
3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target version.
==================================================
== Major changes to help you focus your testing ==
==================================================
Notable changes since 1.6 RC2
- SPARK_VERSION has been set correctly
- SPARK-12199 ML Docs are publishing correctly
- SPARK-12345 Mesos cluster mode has been fixed
Notable changes since 1.6 RC1
Spark Streaming
SPARK-2629 trackStateByKey has been renamed to mapWithState
Spark SQL
SPARK-12165 SPARK-12189 Fix bugs in eviction of storage memory by execution.
SPARK-12258 correct passing null into ScalaUDF
Notable Features Since 1.5
Spark SQL
SPARK-11787 Parquet Performance - Improve Parquet scan performance when using flat schemas.
SPARK-10810 Session Management - Isolated devault database (i.e USE mydb ) even on shared clusters.
SPARK-9999 Dataset API - A type-safe API (similar to RDDs) that performs many operations on serialized binary data and code generation (i.e. Project Tungsten).
SPARK-10000 Unified Memory Management - Shared memory for execution and caching instead of exclusive division of the regions.
SPARK-11197 SQL Queries on Files - Concise syntax for running SQL queries over files of any supported format without registering a table.
SPARK-11745 Reading non-standard JSON files - Added options to read non-standard JSON files (e.g. single-quotes, unquoted attributes)
SPARK-10412 Per-operator Metrics for SQL Execution - Display statistics on a peroperator basis for memory usage and spilled data size.
SPARK-11329 Star (*) expansion for StructTypes - Makes it easier to nest and unest arbitrary numbers of columns
SPARK-10917 , SPARK-11149 In-memory Columnar Cache Performance - Significant (up to 14x) speed up when caching data that contains complex types in DataFrames or SQL.
SPARK-11111 Fast null-safe joins - Joins using null-safe equality ( <=> ) will now execute using SortMergeJoin instead of computing a cartisian product.
SPARK-11389 SQL Execution Using Off-Heap Memory - Support for configuring query execution to occur using off-heap memory to avoid GC overhead
SPARK-10978 Datasource API Avoid Double Filter - When implemeting a datasource with filter pushdown, developers can now tell Spark SQL to avoid double evaluating a pushed-down filter.
SPARK-4849 Advanced Layout of Cached Data - storing partitioning and ordering schemes in In-memory table scan, and adding distributeBy and localSort to DF API
SPARK-9858 Adaptive query execution - Intial support for automatically selecting the number of reducers for joins and aggregations.
SPARK-9241 Improved query planner for queries having distinct aggregations - Query plans of distinct aggregations are more robust when distinct columns have high cardinality.
Spark Streaming
API Updates
SPARK-2629 New improved state management - mapWithState - a DStream transformation for stateful stream processing, supercedes updateStateByKey in functionality and performance.
SPARK-11198 Kinesis record deaggregation - Kinesis streams have been upgraded to use KCL 1.4.0 and supports transparent deaggregation of KPL-aggregated records.
SPARK-10891 Kinesis message handler function - Allows arbitraray function to be applied to a Kinesis record in the Kinesis receiver before to customize what data is to be stored in memory.
SPARK-6328 Python Streamng Listener API - Get streaming statistics (scheduling delays, batch processing times, etc.) in streaming.
UI Improvements
Made failures visible in the streaming tab, in the timelines, batch list, and batch details page.
Made output operations visible in the streaming tab as progress bars.
MLlib
New algorithms/models
SPARK-8518 Survival analysis - Log-linear model for survival analysis
SPARK-9834 Normal equation for least squares - Normal equation solver, providing R-like model summary statistics
SPARK-3147 Online hypothesis testing - A/B testing in the Spark Streaming framework
SPARK-9930 New feature transformers - ChiSqSelector, QuantileDiscretizer, SQL transformer
SPARK-6517 Bisecting K-Means clustering - Fast top-down clustering variant of K-Means
API improvements
ML Pipelines
SPARK-6725 Pipeline persistence - Save/load for ML Pipelines, with partial coverage of spark.ml algorithms
SPARK-5565 LDA in ML Pipelines - API for Latent Dirichlet Allocation in ML Pipelines
R API
SPARK-9836 R-like statistics for GLMs - (Partial) R-like stats for ordinary least squares via summary(model)
SPARK-9681 Feature interactions in R formula - Interaction operator ":" in R formula
Python API - Many improvements to Python API to approach feature parity
Misc improvements
SPARK-7685 , SPARK-9642 Instance weights for GLMs - Logistic and Linear Regression can take instance weights
SPARK-10384 , SPARK-10385 Univariate and bivariate statistics in DataFrames - Variance, stddev, correlations, etc.
SPARK-10117 LIBSVM data source - LIBSVM as a SQL data source
Documentation improvements
SPARK-7751 @since versions - Documentation includes initial version when classes and methods were added
SPARK-11337 Testable example code - Automated testing for code in user guide examples
Deprecations
In spark.mllib.clustering.KMeans, the "runs" parameter has been deprecated.
In spark.ml.classification.LogisticRegressionModel and spark.ml.regression.LinearRegressionModel, the "weights" field has been deprecated, in favor of the new name "coefficients." This helps disambiguate from instance (row) weights given to algorithms.
Changes of behavior
spark.mllib.tree.GradientBoostedTrees validationTol has changed semantics in 1.6. Previously, it was a threshold for absolute change in error. Now, it resembles the behavior of GradientDescent convergenceTol: For large errors, it uses relative error (relative to the previous error); for small errors (< 0.01), it uses absolute error.
spark.ml.feature.RegexTokenizer: Previously, it did not convert strings to lowercase before tokenizing. Now, it converts to lowercase by default, with an option not to. This matches the behavior of the simpler Tokenizer transformer.
Spark SQL's partition discovery has been changed to only discover partition directories that are children of the given path. (i.e. if path="/my/data/x=1" then x=1 will no longer be considered a partition but only children of x=1 .) This behavior can be overridden by manually specifying the basePath that partitioning discovery should start with ( SPARK-11678 ).
When casting a value of an integral type to timestamp (e.g. casting a long value to timestamp), the value is treated as being in seconds instead of milliseconds ( SPARK-11724 ).
With the improved query planner for queries having distinct aggregations ( SPARK-9241 ), the plan of a query having a single distinct aggregation has been changed to a more robust version. To switch back to the plan generated by Spark 1.5's planner, please set spark.sql.specializeSingleDistinctAggPlanning to true ( SPARK-12077 ).
--
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/
--
Best Regards
Jeff Zhang
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Jeff Zhang <zj...@gmail.com>.
+1 (non-binding)
All the test passed, and run it on HDP 2.3.2 sandbox successfully.
On Sun, Dec 20, 2015 at 10:43 AM, Luciano Resende <lu...@gmail.com>
wrote:
> +1 (non-binding)
>
> Tested Standalone mode, SparkR and couple Stream Apps, all seem ok.
>
> On Wed, Dec 16, 2015 at 1:32 PM, Michael Armbrust <mi...@databricks.com>
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.6.0!
>>
>> The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.6.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is *v1.6.0-rc3
>> (168c89e07c51fa24b0bb88582c739cec0acb44d7)
>> <https://github.com/apache/spark/tree/v1.6.0-rc3>*
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1174/
>>
>> The test repository (versioned as v1.6.0-rc3) for this release can be
>> found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1173/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
>>
>> =======================================
>> == How can I help test this release? ==
>> =======================================
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> ================================================
>> == What justifies a -1 vote for this release? ==
>> ================================================
>> This vote is happening towards the end of the 1.6 QA period, so -1 votes
>> should only occur for significant regressions from 1.5. Bugs already
>> present in 1.5, minor regressions, or bugs related to new features will not
>> block this release.
>>
>> ===============================================================
>> == What should happen to JIRA tickets still targeting 1.6.0? ==
>> ===============================================================
>> 1. It is OK for documentation patches to target 1.6.0 and still go into
>> branch-1.6, since documentations will be published separately from the
>> release.
>> 2. New features for non-alpha-modules should target 1.7+.
>> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
>> version.
>>
>>
>> ==================================================
>> == Major changes to help you focus your testing ==
>> ==================================================
>>
>> Notable changes since 1.6 RC2
>> - SPARK_VERSION has been set correctly
>> - SPARK-12199 ML Docs are publishing correctly
>> - SPARK-12345 Mesos cluster mode has been fixed
>>
>> Notable changes since 1.6 RC1
>> Spark Streaming
>>
>> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
>> trackStateByKey has been renamed to mapWithState
>>
>> Spark SQL
>>
>> - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
>> SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix
>> bugs in eviction of storage memory by execution.
>> - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258> correct
>> passing null into ScalaUDF
>>
>> Notable Features Since 1.5Spark SQL
>>
>> - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> Parquet
>> Performance - Improve Parquet scan performance when using flat
>> schemas.
>> - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
>> Session Management - Isolated devault database (i.e USE mydb) even on
>> shared clusters.
>> - SPARK-9999 <https://issues.apache.org/jira/browse/SPARK-9999> Dataset
>> API - A type-safe API (similar to RDDs) that performs many operations
>> on serialized binary data and code generation (i.e. Project Tungsten).
>> - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000> Unified
>> Memory Management - Shared memory for execution and caching instead
>> of exclusive division of the regions.
>> - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
>> Queries on Files - Concise syntax for running SQL queries over files
>> of any supported format without registering a table.
>> - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745> Reading
>> non-standard JSON files - Added options to read non-standard JSON
>> files (e.g. single-quotes, unquoted attributes)
>> - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412> Per-operator
>> Metrics for SQL Execution - Display statistics on a peroperator basis
>> for memory usage and spilled data size.
>> - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
>> (*) expansion for StructTypes - Makes it easier to nest and unest
>> arbitrary numbers of columns
>> - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
>> SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149> In-memory
>> Columnar Cache Performance - Significant (up to 14x) speed up when
>> caching data that contains complex types in DataFrames or SQL.
>> - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
>> null-safe joins - Joins using null-safe equality (<=>) will now
>> execute using SortMergeJoin instead of computing a cartisian product.
>> - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
>> Execution Using Off-Heap Memory - Support for configuring query
>> execution to occur using off-heap memory to avoid GC overhead
>> - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978> Datasource
>> API Avoid Double Filter - When implemeting a datasource with filter
>> pushdown, developers can now tell Spark SQL to avoid double evaluating a
>> pushed-down filter.
>> - SPARK-4849 <https://issues.apache.org/jira/browse/SPARK-4849> Advanced
>> Layout of Cached Data - storing partitioning and ordering schemes in
>> In-memory table scan, and adding distributeBy and localSort to DF API
>> - SPARK-9858 <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive
>> query execution - Intial support for automatically selecting the
>> number of reducers for joins and aggregations.
>> - SPARK-9241 <https://issues.apache.org/jira/browse/SPARK-9241> Improved
>> query planner for queries having distinct aggregations - Query plans
>> of distinct aggregations are more robust when distinct columns have high
>> cardinality.
>>
>> Spark Streaming
>>
>> - API Updates
>> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629> New
>> improved state management - mapWithState - a DStream
>> transformation for stateful stream processing, supercedes
>> updateStateByKey in functionality and performance.
>> - SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198> Kinesis
>> record deaggregation - Kinesis streams have been upgraded to use
>> KCL 1.4.0 and supports transparent deaggregation of KPL-aggregated records.
>> - SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891> Kinesis
>> message handler function - Allows arbitraray function to be
>> applied to a Kinesis record in the Kinesis receiver before to customize
>> what data is to be stored in memory.
>> - SPARK-6328 <https://issues.apache.org/jira/browse/SPARK-6328> Python
>> Streamng Listener API - Get streaming statistics (scheduling
>> delays, batch processing times, etc.) in streaming.
>>
>>
>> - UI Improvements
>> - Made failures visible in the streaming tab, in the timelines,
>> batch list, and batch details page.
>> - Made output operations visible in the streaming tab as progress
>> bars.
>>
>> MLlibNew algorithms/models
>>
>> - SPARK-8518 <https://issues.apache.org/jira/browse/SPARK-8518> Survival
>> analysis - Log-linear model for survival analysis
>> - SPARK-9834 <https://issues.apache.org/jira/browse/SPARK-9834> Normal
>> equation for least squares - Normal equation solver, providing R-like
>> model summary statistics
>> - SPARK-3147 <https://issues.apache.org/jira/browse/SPARK-3147> Online
>> hypothesis testing - A/B testing in the Spark Streaming framework
>> - SPARK-9930 <https://issues.apache.org/jira/browse/SPARK-9930> New
>> feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
>> transformer
>> - SPARK-6517 <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting
>> K-Means clustering - Fast top-down clustering variant of K-Means
>>
>> API improvements
>>
>> - ML Pipelines
>> - SPARK-6725 <https://issues.apache.org/jira/browse/SPARK-6725> Pipeline
>> persistence - Save/load for ML Pipelines, with partial coverage of
>> spark.mlalgorithms
>> - SPARK-5565 <https://issues.apache.org/jira/browse/SPARK-5565> LDA
>> in ML Pipelines - API for Latent Dirichlet Allocation in ML
>> Pipelines
>> - R API
>> - SPARK-9836 <https://issues.apache.org/jira/browse/SPARK-9836> R-like
>> statistics for GLMs - (Partial) R-like stats for ordinary least
>> squares via summary(model)
>> - SPARK-9681 <https://issues.apache.org/jira/browse/SPARK-9681> Feature
>> interactions in R formula - Interaction operator ":" in R formula
>> - Python API - Many improvements to Python API to approach feature
>> parity
>>
>> Misc improvements
>>
>> - SPARK-7685 <https://issues.apache.org/jira/browse/SPARK-7685>,
>> SPARK-9642 <https://issues.apache.org/jira/browse/SPARK-9642> Instance
>> weights for GLMs - Logistic and Linear Regression can take instance
>> weights
>> - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
>> SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate
>> and bivariate statistics in DataFrames - Variance, stddev,
>> correlations, etc.
>> - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM
>> data source - LIBSVM as a SQL data sourceDocumentation improvements
>> - SPARK-7751 <https://issues.apache.org/jira/browse/SPARK-7751> @since
>> versions - Documentation includes initial version when classes and
>> methods were added
>> - SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable
>> example code - Automated testing for code in user guide examples
>>
>> Deprecations
>>
>> - In spark.mllib.clustering.KMeans, the "runs" parameter has been
>> deprecated.
>> - In spark.ml.classification.LogisticRegressionModel and
>> spark.ml.regression.LinearRegressionModel, the "weights" field has been
>> deprecated, in favor of the new name "coefficients." This helps
>> disambiguate from instance (row) weights given to algorithms.
>>
>> Changes of behavior
>>
>> - spark.mllib.tree.GradientBoostedTrees validationTol has changed
>> semantics in 1.6. Previously, it was a threshold for absolute change in
>> error. Now, it resembles the behavior of GradientDescent convergenceTol:
>> For large errors, it uses relative error (relative to the previous error);
>> for small errors (< 0.01), it uses absolute error.
>> - spark.ml.feature.RegexTokenizer: Previously, it did not convert
>> strings to lowercase before tokenizing. Now, it converts to lowercase by
>> default, with an option not to. This matches the behavior of the simpler
>> Tokenizer transformer.
>> - Spark SQL's partition discovery has been changed to only discover
>> partition directories that are children of the given path. (i.e. if
>> path="/my/data/x=1" then x=1 will no longer be considered a partition
>> but only children of x=1.) This behavior can be overridden by
>> manually specifying the basePath that partitioning discovery should
>> start with (SPARK-11678
>> <https://issues.apache.org/jira/browse/SPARK-11678>).
>> - When casting a value of an integral type to timestamp (e.g. casting
>> a long value to timestamp), the value is treated as being in seconds
>> instead of milliseconds (SPARK-11724
>> <https://issues.apache.org/jira/browse/SPARK-11724>).
>> - With the improved query planner for queries having distinct
>> aggregations (SPARK-9241
>> <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
>> query having a single distinct aggregation has been changed to a more
>> robust version. To switch back to the plan generated by Spark 1.5's
>> planner, please set spark.sql.specializeSingleDistinctAggPlanning to
>> true (SPARK-12077 <https://issues.apache.org/jira/browse/SPARK-12077>
>> ).
>>
>>
>
>
> --
> Luciano Resende
> http://people.apache.org/~lresende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/
>
--
Best Regards
Jeff Zhang
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Luciano Resende <lu...@gmail.com>.
+1 (non-binding)
Tested Standalone mode, SparkR and couple Stream Apps, all seem ok.
On Wed, Dec 16, 2015 at 1:32 PM, Michael Armbrust <mi...@databricks.com>
wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 1.6.0!
>
> The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is *v1.6.0-rc3
> (168c89e07c51fa24b0bb88582c739cec0acb44d7)
> <https://github.com/apache/spark/tree/v1.6.0-rc3>*
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1174/
>
> The test repository (versioned as v1.6.0-rc3) for this release can be
> found at:
> https://repository.apache.org/content/repositories/orgapachespark-1173/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
>
> =======================================
> == How can I help test this release? ==
> =======================================
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> ================================================
> == What justifies a -1 vote for this release? ==
> ================================================
> This vote is happening towards the end of the 1.6 QA period, so -1 votes
> should only occur for significant regressions from 1.5. Bugs already
> present in 1.5, minor regressions, or bugs related to new features will not
> block this release.
>
> ===============================================================
> == What should happen to JIRA tickets still targeting 1.6.0? ==
> ===============================================================
> 1. It is OK for documentation patches to target 1.6.0 and still go into
> branch-1.6, since documentations will be published separately from the
> release.
> 2. New features for non-alpha-modules should target 1.7+.
> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
> version.
>
>
> ==================================================
> == Major changes to help you focus your testing ==
> ==================================================
>
> Notable changes since 1.6 RC2
> - SPARK_VERSION has been set correctly
> - SPARK-12199 ML Docs are publishing correctly
> - SPARK-12345 Mesos cluster mode has been fixed
>
> Notable changes since 1.6 RC1
> Spark Streaming
>
> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
> trackStateByKey has been renamed to mapWithState
>
> Spark SQL
>
> - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
> SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix
> bugs in eviction of storage memory by execution.
> - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258> correct
> passing null into ScalaUDF
>
> Notable Features Since 1.5Spark SQL
>
> - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> Parquet
> Performance - Improve Parquet scan performance when using flat schemas.
> - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
> Session Management - Isolated devault database (i.e USE mydb) even on
> shared clusters.
> - SPARK-9999 <https://issues.apache.org/jira/browse/SPARK-9999> Dataset
> API - A type-safe API (similar to RDDs) that performs many operations
> on serialized binary data and code generation (i.e. Project Tungsten).
> - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000> Unified
> Memory Management - Shared memory for execution and caching instead of
> exclusive division of the regions.
> - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
> Queries on Files - Concise syntax for running SQL queries over files
> of any supported format without registering a table.
> - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745> Reading
> non-standard JSON files - Added options to read non-standard JSON
> files (e.g. single-quotes, unquoted attributes)
> - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412> Per-operator
> Metrics for SQL Execution - Display statistics on a peroperator basis
> for memory usage and spilled data size.
> - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
> (*) expansion for StructTypes - Makes it easier to nest and unest
> arbitrary numbers of columns
> - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
> SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149> In-memory
> Columnar Cache Performance - Significant (up to 14x) speed up when
> caching data that contains complex types in DataFrames or SQL.
> - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
> null-safe joins - Joins using null-safe equality (<=>) will now
> execute using SortMergeJoin instead of computing a cartisian product.
> - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
> Execution Using Off-Heap Memory - Support for configuring query
> execution to occur using off-heap memory to avoid GC overhead
> - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978> Datasource
> API Avoid Double Filter - When implemeting a datasource with filter
> pushdown, developers can now tell Spark SQL to avoid double evaluating a
> pushed-down filter.
> - SPARK-4849 <https://issues.apache.org/jira/browse/SPARK-4849> Advanced
> Layout of Cached Data - storing partitioning and ordering schemes in
> In-memory table scan, and adding distributeBy and localSort to DF API
> - SPARK-9858 <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive
> query execution - Intial support for automatically selecting the
> number of reducers for joins and aggregations.
> - SPARK-9241 <https://issues.apache.org/jira/browse/SPARK-9241> Improved
> query planner for queries having distinct aggregations - Query plans
> of distinct aggregations are more robust when distinct columns have high
> cardinality.
>
> Spark Streaming
>
> - API Updates
> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629> New
> improved state management - mapWithState - a DStream transformation
> for stateful stream processing, supercedes updateStateByKey in
> functionality and performance.
> - SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198> Kinesis
> record deaggregation - Kinesis streams have been upgraded to use
> KCL 1.4.0 and supports transparent deaggregation of KPL-aggregated records.
> - SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891> Kinesis
> message handler function - Allows arbitraray function to be applied
> to a Kinesis record in the Kinesis receiver before to customize what data
> is to be stored in memory.
> - SPARK-6328 <https://issues.apache.org/jira/browse/SPARK-6328> Python
> Streamng Listener API - Get streaming statistics (scheduling
> delays, batch processing times, etc.) in streaming.
>
>
> - UI Improvements
> - Made failures visible in the streaming tab, in the timelines,
> batch list, and batch details page.
> - Made output operations visible in the streaming tab as progress
> bars.
>
> MLlibNew algorithms/models
>
> - SPARK-8518 <https://issues.apache.org/jira/browse/SPARK-8518> Survival
> analysis - Log-linear model for survival analysis
> - SPARK-9834 <https://issues.apache.org/jira/browse/SPARK-9834> Normal
> equation for least squares - Normal equation solver, providing R-like
> model summary statistics
> - SPARK-3147 <https://issues.apache.org/jira/browse/SPARK-3147> Online
> hypothesis testing - A/B testing in the Spark Streaming framework
> - SPARK-9930 <https://issues.apache.org/jira/browse/SPARK-9930> New
> feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
> transformer
> - SPARK-6517 <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting
> K-Means clustering - Fast top-down clustering variant of K-Means
>
> API improvements
>
> - ML Pipelines
> - SPARK-6725 <https://issues.apache.org/jira/browse/SPARK-6725> Pipeline
> persistence - Save/load for ML Pipelines, with partial coverage of
> spark.mlalgorithms
> - SPARK-5565 <https://issues.apache.org/jira/browse/SPARK-5565> LDA
> in ML Pipelines - API for Latent Dirichlet Allocation in ML
> Pipelines
> - R API
> - SPARK-9836 <https://issues.apache.org/jira/browse/SPARK-9836> R-like
> statistics for GLMs - (Partial) R-like stats for ordinary least
> squares via summary(model)
> - SPARK-9681 <https://issues.apache.org/jira/browse/SPARK-9681> Feature
> interactions in R formula - Interaction operator ":" in R formula
> - Python API - Many improvements to Python API to approach feature
> parity
>
> Misc improvements
>
> - SPARK-7685 <https://issues.apache.org/jira/browse/SPARK-7685>,
> SPARK-9642 <https://issues.apache.org/jira/browse/SPARK-9642> Instance
> weights for GLMs - Logistic and Linear Regression can take instance
> weights
> - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
> SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate
> and bivariate statistics in DataFrames - Variance, stddev,
> correlations, etc.
> - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM
> data source - LIBSVM as a SQL data sourceDocumentation improvements
> - SPARK-7751 <https://issues.apache.org/jira/browse/SPARK-7751> @since
> versions - Documentation includes initial version when classes and
> methods were added
> - SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable
> example code - Automated testing for code in user guide examples
>
> Deprecations
>
> - In spark.mllib.clustering.KMeans, the "runs" parameter has been
> deprecated.
> - In spark.ml.classification.LogisticRegressionModel and
> spark.ml.regression.LinearRegressionModel, the "weights" field has been
> deprecated, in favor of the new name "coefficients." This helps
> disambiguate from instance (row) weights given to algorithms.
>
> Changes of behavior
>
> - spark.mllib.tree.GradientBoostedTrees validationTol has changed
> semantics in 1.6. Previously, it was a threshold for absolute change in
> error. Now, it resembles the behavior of GradientDescent convergenceTol:
> For large errors, it uses relative error (relative to the previous error);
> for small errors (< 0.01), it uses absolute error.
> - spark.ml.feature.RegexTokenizer: Previously, it did not convert
> strings to lowercase before tokenizing. Now, it converts to lowercase by
> default, with an option not to. This matches the behavior of the simpler
> Tokenizer transformer.
> - Spark SQL's partition discovery has been changed to only discover
> partition directories that are children of the given path. (i.e. if
> path="/my/data/x=1" then x=1 will no longer be considered a partition
> but only children of x=1.) This behavior can be overridden by manually
> specifying the basePath that partitioning discovery should start with (
> SPARK-11678 <https://issues.apache.org/jira/browse/SPARK-11678>).
> - When casting a value of an integral type to timestamp (e.g. casting
> a long value to timestamp), the value is treated as being in seconds
> instead of milliseconds (SPARK-11724
> <https://issues.apache.org/jira/browse/SPARK-11724>).
> - With the improved query planner for queries having distinct
> aggregations (SPARK-9241
> <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
> query having a single distinct aggregation has been changed to a more
> robust version. To switch back to the plan generated by Spark 1.5's
> planner, please set spark.sql.specializeSingleDistinctAggPlanning to
> true (SPARK-12077 <https://issues.apache.org/jira/browse/SPARK-12077>).
>
>
--
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Michael Armbrust <mi...@databricks.com>.
+1
On Wed, Dec 16, 2015 at 4:37 PM, Andrew Or <an...@databricks.com> wrote:
> +1
>
> Mesos cluster mode regression in RC2 is now fixed (SPARK-12345
> <https://issues.apache.org/jira/browse/SPARK-12345> / PR10332
> <https://github.com/apache/spark/pull/10332>).
>
> Also tested on standalone client and cluster mode. No problems.
>
> 2015-12-16 15:16 GMT-08:00 Rad Gruchalski <ra...@gruchalski.com>:
>
>> I also noticed that spark.replClassServer.host and
>> spark.replClassServer.port aren’t used anymore. The transport now happens
>> over the main RpcEnv.
>>
>> Kind regards,
>> Radek Gruchalski
>> radek@gruchalski.com <ra...@gruchalski.com>
>> de.linkedin.com/in/radgruchalski/
>>
>>
>> *Confidentiality:*This communication is intended for the above-named
>> person and may be confidential and/or legally privileged.
>> If it has come to you in error you must take no action based on it, nor
>> must you copy or show it to anyone; please delete/destroy and inform the
>> sender immediately.
>>
>> On Wednesday, 16 December 2015 at 23:43, Marcelo Vanzin wrote:
>>
>> I was going to say that spark.executor.port is not used anymore in
>> 1.6, but damn, there's still that akka backend hanging around there
>> even when netty is being used... we should fix this, should be a
>> simple one-liner.
>>
>> On Wed, Dec 16, 2015 at 2:35 PM, singinpirate <th...@gmail.com>
>> wrote:
>>
>> -0 (non-binding)
>>
>> I have observed that when we set spark.executor.port in 1.6, we get
>> thrown a
>> NPE in SparkEnv$.create(SparkEnv.scala:259). It used to work in 1.5.2. Is
>> anyone else seeing this?
>>
>>
>> --
>> Marcelo
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> For additional commands, e-mail: dev-help@spark.apache.org
>>
>>
>>
>
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Yin Huai <yh...@databricks.com>.
+1
On Wed, Dec 16, 2015 at 7:19 PM, Patrick Wendell <pw...@gmail.com> wrote:
> +1
>
> On Wed, Dec 16, 2015 at 6:15 PM, Ted Yu <yu...@gmail.com> wrote:
>
>> Ran test suite (minus docker-integration-tests)
>> All passed
>>
>> +1
>>
>> [INFO] Spark Project External ZeroMQ ...................... SUCCESS [
>> 13.647 s]
>> [INFO] Spark Project External Kafka ....................... SUCCESS [
>> 45.424 s]
>> [INFO] Spark Project Examples ............................. SUCCESS
>> [02:06 min]
>> [INFO] Spark Project External Kafka Assembly .............. SUCCESS [
>> 11.280 s]
>> [INFO]
>> ------------------------------------------------------------------------
>> [INFO] BUILD SUCCESS
>> [INFO]
>> ------------------------------------------------------------------------
>> [INFO] Total time: 01:49 h
>> [INFO] Finished at: 2015-12-16T17:06:58-08:00
>>
>> On Wed, Dec 16, 2015 at 4:37 PM, Andrew Or <an...@databricks.com> wrote:
>>
>>> +1
>>>
>>> Mesos cluster mode regression in RC2 is now fixed (SPARK-12345
>>> <https://issues.apache.org/jira/browse/SPARK-12345> / PR10332
>>> <https://github.com/apache/spark/pull/10332>).
>>>
>>> Also tested on standalone client and cluster mode. No problems.
>>>
>>> 2015-12-16 15:16 GMT-08:00 Rad Gruchalski <ra...@gruchalski.com>:
>>>
>>>> I also noticed that spark.replClassServer.host and
>>>> spark.replClassServer.port aren’t used anymore. The transport now happens
>>>> over the main RpcEnv.
>>>>
>>>> Kind regards,
>>>> Radek Gruchalski
>>>> radek@gruchalski.com <ra...@gruchalski.com>
>>>> de.linkedin.com/in/radgruchalski/
>>>>
>>>>
>>>> *Confidentiality:*This communication is intended for the above-named
>>>> person and may be confidential and/or legally privileged.
>>>> If it has come to you in error you must take no action based on it, nor
>>>> must you copy or show it to anyone; please delete/destroy and inform the
>>>> sender immediately.
>>>>
>>>> On Wednesday, 16 December 2015 at 23:43, Marcelo Vanzin wrote:
>>>>
>>>> I was going to say that spark.executor.port is not used anymore in
>>>> 1.6, but damn, there's still that akka backend hanging around there
>>>> even when netty is being used... we should fix this, should be a
>>>> simple one-liner.
>>>>
>>>> On Wed, Dec 16, 2015 at 2:35 PM, singinpirate <
>>>> thesinginpirate@gmail.com> wrote:
>>>>
>>>> -0 (non-binding)
>>>>
>>>> I have observed that when we set spark.executor.port in 1.6, we get
>>>> thrown a
>>>> NPE in SparkEnv$.create(SparkEnv.scala:259). It used to work in 1.5.2.
>>>> Is
>>>> anyone else seeing this?
>>>>
>>>>
>>>> --
>>>> Marcelo
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>>
>>>>
>>>>
>>>
>>
>
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Patrick Wendell <pw...@gmail.com>.
+1
On Wed, Dec 16, 2015 at 6:15 PM, Ted Yu <yu...@gmail.com> wrote:
> Ran test suite (minus docker-integration-tests)
> All passed
>
> +1
>
> [INFO] Spark Project External ZeroMQ ...................... SUCCESS [
> 13.647 s]
> [INFO] Spark Project External Kafka ....................... SUCCESS [
> 45.424 s]
> [INFO] Spark Project Examples ............................. SUCCESS [02:06
> min]
> [INFO] Spark Project External Kafka Assembly .............. SUCCESS [
> 11.280 s]
> [INFO]
> ------------------------------------------------------------------------
> [INFO] BUILD SUCCESS
> [INFO]
> ------------------------------------------------------------------------
> [INFO] Total time: 01:49 h
> [INFO] Finished at: 2015-12-16T17:06:58-08:00
>
> On Wed, Dec 16, 2015 at 4:37 PM, Andrew Or <an...@databricks.com> wrote:
>
>> +1
>>
>> Mesos cluster mode regression in RC2 is now fixed (SPARK-12345
>> <https://issues.apache.org/jira/browse/SPARK-12345> / PR10332
>> <https://github.com/apache/spark/pull/10332>).
>>
>> Also tested on standalone client and cluster mode. No problems.
>>
>> 2015-12-16 15:16 GMT-08:00 Rad Gruchalski <ra...@gruchalski.com>:
>>
>>> I also noticed that spark.replClassServer.host and
>>> spark.replClassServer.port aren’t used anymore. The transport now happens
>>> over the main RpcEnv.
>>>
>>> Kind regards,
>>> Radek Gruchalski
>>> radek@gruchalski.com <ra...@gruchalski.com>
>>> de.linkedin.com/in/radgruchalski/
>>>
>>>
>>> *Confidentiality:*This communication is intended for the above-named
>>> person and may be confidential and/or legally privileged.
>>> If it has come to you in error you must take no action based on it, nor
>>> must you copy or show it to anyone; please delete/destroy and inform the
>>> sender immediately.
>>>
>>> On Wednesday, 16 December 2015 at 23:43, Marcelo Vanzin wrote:
>>>
>>> I was going to say that spark.executor.port is not used anymore in
>>> 1.6, but damn, there's still that akka backend hanging around there
>>> even when netty is being used... we should fix this, should be a
>>> simple one-liner.
>>>
>>> On Wed, Dec 16, 2015 at 2:35 PM, singinpirate <th...@gmail.com>
>>> wrote:
>>>
>>> -0 (non-binding)
>>>
>>> I have observed that when we set spark.executor.port in 1.6, we get
>>> thrown a
>>> NPE in SparkEnv$.create(SparkEnv.scala:259). It used to work in 1.5.2. Is
>>> anyone else seeing this?
>>>
>>>
>>> --
>>> Marcelo
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>
>>>
>>>
>>
>
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Ted Yu <yu...@gmail.com>.
Ran test suite (minus docker-integration-tests)
All passed
+1
[INFO] Spark Project External ZeroMQ ...................... SUCCESS [
13.647 s]
[INFO] Spark Project External Kafka ....................... SUCCESS [
45.424 s]
[INFO] Spark Project Examples ............................. SUCCESS [02:06
min]
[INFO] Spark Project External Kafka Assembly .............. SUCCESS [
11.280 s]
[INFO]
------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO]
------------------------------------------------------------------------
[INFO] Total time: 01:49 h
[INFO] Finished at: 2015-12-16T17:06:58-08:00
On Wed, Dec 16, 2015 at 4:37 PM, Andrew Or <an...@databricks.com> wrote:
> +1
>
> Mesos cluster mode regression in RC2 is now fixed (SPARK-12345
> <https://issues.apache.org/jira/browse/SPARK-12345> / PR10332
> <https://github.com/apache/spark/pull/10332>).
>
> Also tested on standalone client and cluster mode. No problems.
>
> 2015-12-16 15:16 GMT-08:00 Rad Gruchalski <ra...@gruchalski.com>:
>
>> I also noticed that spark.replClassServer.host and
>> spark.replClassServer.port aren’t used anymore. The transport now happens
>> over the main RpcEnv.
>>
>> Kind regards,
>> Radek Gruchalski
>> radek@gruchalski.com <ra...@gruchalski.com>
>> de.linkedin.com/in/radgruchalski/
>>
>>
>> *Confidentiality:*This communication is intended for the above-named
>> person and may be confidential and/or legally privileged.
>> If it has come to you in error you must take no action based on it, nor
>> must you copy or show it to anyone; please delete/destroy and inform the
>> sender immediately.
>>
>> On Wednesday, 16 December 2015 at 23:43, Marcelo Vanzin wrote:
>>
>> I was going to say that spark.executor.port is not used anymore in
>> 1.6, but damn, there's still that akka backend hanging around there
>> even when netty is being used... we should fix this, should be a
>> simple one-liner.
>>
>> On Wed, Dec 16, 2015 at 2:35 PM, singinpirate <th...@gmail.com>
>> wrote:
>>
>> -0 (non-binding)
>>
>> I have observed that when we set spark.executor.port in 1.6, we get
>> thrown a
>> NPE in SparkEnv$.create(SparkEnv.scala:259). It used to work in 1.5.2. Is
>> anyone else seeing this?
>>
>>
>> --
>> Marcelo
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> For additional commands, e-mail: dev-help@spark.apache.org
>>
>>
>>
>
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Iulian Dragoș <iu...@typesafe.com>.
-0 (non-binding)
Unfortunately the Mesos cluster regression is still there (see my comment
<https://github.com/apache/spark/pull/10332/files#r47902198> for
explanations). I'm not voting to delay the release any longer though.
We tested (and passed) Mesos in:
- client mode
- fine/coarse-grained
- with/without roles
iulian
On Thu, Dec 17, 2015 at 1:37 AM, Andrew Or <an...@databricks.com> wrote:
> +1
>
> Mesos cluster mode regression in RC2 is now fixed (SPARK-12345
> <https://issues.apache.org/jira/browse/SPARK-12345> / PR10332
> <https://github.com/apache/spark/pull/10332>).
>
> Also tested on standalone client and cluster mode. No problems.
>
> 2015-12-16 15:16 GMT-08:00 Rad Gruchalski <ra...@gruchalski.com>:
>
>> I also noticed that spark.replClassServer.host and
>> spark.replClassServer.port aren’t used anymore. The transport now happens
>> over the main RpcEnv.
>>
>> Kind regards,
>> Radek Gruchalski
>> radek@gruchalski.com <ra...@gruchalski.com>
>> de.linkedin.com/in/radgruchalski/
>>
>>
>> *Confidentiality:*This communication is intended for the above-named
>> person and may be confidential and/or legally privileged.
>> If it has come to you in error you must take no action based on it, nor
>> must you copy or show it to anyone; please delete/destroy and inform the
>> sender immediately.
>>
>> On Wednesday, 16 December 2015 at 23:43, Marcelo Vanzin wrote:
>>
>> I was going to say that spark.executor.port is not used anymore in
>> 1.6, but damn, there's still that akka backend hanging around there
>> even when netty is being used... we should fix this, should be a
>> simple one-liner.
>>
>> On Wed, Dec 16, 2015 at 2:35 PM, singinpirate <th...@gmail.com>
>> wrote:
>>
>> -0 (non-binding)
>>
>> I have observed that when we set spark.executor.port in 1.6, we get
>> thrown a
>> NPE in SparkEnv$.create(SparkEnv.scala:259). It used to work in 1.5.2. Is
>> anyone else seeing this?
>>
>>
>> --
>> Marcelo
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> For additional commands, e-mail: dev-help@spark.apache.org
>>
>>
>>
>
--
--
Iulian Dragos
------
Reactive Apps on the JVM
www.typesafe.com
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Andrew Or <an...@databricks.com>.
+1
Mesos cluster mode regression in RC2 is now fixed (SPARK-12345
<https://issues.apache.org/jira/browse/SPARK-12345> / PR10332
<https://github.com/apache/spark/pull/10332>).
Also tested on standalone client and cluster mode. No problems.
2015-12-16 15:16 GMT-08:00 Rad Gruchalski <ra...@gruchalski.com>:
> I also noticed that spark.replClassServer.host and
> spark.replClassServer.port aren’t used anymore. The transport now happens
> over the main RpcEnv.
>
> Kind regards,
> Radek Gruchalski
> radek@gruchalski.com <ra...@gruchalski.com>
> de.linkedin.com/in/radgruchalski/
>
>
> *Confidentiality:*This communication is intended for the above-named
> person and may be confidential and/or legally privileged.
> If it has come to you in error you must take no action based on it, nor
> must you copy or show it to anyone; please delete/destroy and inform the
> sender immediately.
>
> On Wednesday, 16 December 2015 at 23:43, Marcelo Vanzin wrote:
>
> I was going to say that spark.executor.port is not used anymore in
> 1.6, but damn, there's still that akka backend hanging around there
> even when netty is being used... we should fix this, should be a
> simple one-liner.
>
> On Wed, Dec 16, 2015 at 2:35 PM, singinpirate <th...@gmail.com>
> wrote:
>
> -0 (non-binding)
>
> I have observed that when we set spark.executor.port in 1.6, we get thrown
> a
> NPE in SparkEnv$.create(SparkEnv.scala:259). It used to work in 1.5.2. Is
> anyone else seeing this?
>
>
> --
> Marcelo
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
>
>
>
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Rad Gruchalski <ra...@gruchalski.com>.
I also noticed that spark.replClassServer.host and spark.replClassServer.port aren’t used anymore. The transport now happens over the main RpcEnv.
Kind regards,
Radek Gruchalski
radek@gruchalski.com (mailto:radek@gruchalski.com)
(mailto:radek@gruchalski.com)
de.linkedin.com/in/radgruchalski/ (http://de.linkedin.com/in/radgruchalski/)
Confidentiality:
This communication is intended for the above-named person and may be confidential and/or legally privileged.
If it has come to you in error you must take no action based on it, nor must you copy or show it to anyone; please delete/destroy and inform the sender immediately.
On Wednesday, 16 December 2015 at 23:43, Marcelo Vanzin wrote:
> I was going to say that spark.executor.port is not used anymore in
> 1.6, but damn, there's still that akka backend hanging around there
> even when netty is being used... we should fix this, should be a
> simple one-liner.
>
> On Wed, Dec 16, 2015 at 2:35 PM, singinpirate <thesinginpirate@gmail.com (mailto:thesinginpirate@gmail.com)> wrote:
> > -0 (non-binding)
> >
> > I have observed that when we set spark.executor.port in 1.6, we get thrown a
> > NPE in SparkEnv$.create(SparkEnv.scala:259). It used to work in 1.5.2. Is
> > anyone else seeing this?
> >
>
>
> --
> Marcelo
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org (mailto:dev-unsubscribe@spark.apache.org)
> For additional commands, e-mail: dev-help@spark.apache.org (mailto:dev-help@spark.apache.org)
>
>
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Marcelo Vanzin <va...@cloudera.com>.
I was going to say that spark.executor.port is not used anymore in
1.6, but damn, there's still that akka backend hanging around there
even when netty is being used... we should fix this, should be a
simple one-liner.
On Wed, Dec 16, 2015 at 2:35 PM, singinpirate <th...@gmail.com> wrote:
> -0 (non-binding)
>
> I have observed that when we set spark.executor.port in 1.6, we get thrown a
> NPE in SparkEnv$.create(SparkEnv.scala:259). It used to work in 1.5.2. Is
> anyone else seeing this?
--
Marcelo
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by singinpirate <th...@gmail.com>.
-0 (non-binding)
I have observed that when we set spark.executor.port in 1.6, we get thrown
a NPE in SparkEnv$.create(SparkEnv.scala:259). It used to work in 1.5.2. Is
anyone else seeing this?
On Wed, Dec 16, 2015 at 2:26 PM Jiří Syrový <sy...@gmail.com> wrote:
> +1 Tested in standalone mode and so far seems to be fairly stable.
>
> 2015-12-16 22:32 GMT+01:00 Michael Armbrust <mi...@databricks.com>:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.6.0!
>>
>> The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.6.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is *v1.6.0-rc3
>> (168c89e07c51fa24b0bb88582c739cec0acb44d7)
>> <https://github.com/apache/spark/tree/v1.6.0-rc3>*
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1174/
>>
>> The test repository (versioned as v1.6.0-rc3) for this release can be
>> found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1173/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
>>
>> =======================================
>> == How can I help test this release? ==
>> =======================================
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> ================================================
>> == What justifies a -1 vote for this release? ==
>> ================================================
>> This vote is happening towards the end of the 1.6 QA period, so -1 votes
>> should only occur for significant regressions from 1.5. Bugs already
>> present in 1.5, minor regressions, or bugs related to new features will not
>> block this release.
>>
>> ===============================================================
>> == What should happen to JIRA tickets still targeting 1.6.0? ==
>> ===============================================================
>> 1. It is OK for documentation patches to target 1.6.0 and still go into
>> branch-1.6, since documentations will be published separately from the
>> release.
>> 2. New features for non-alpha-modules should target 1.7+.
>> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
>> version.
>>
>>
>> ==================================================
>> == Major changes to help you focus your testing ==
>> ==================================================
>>
>> Notable changes since 1.6 RC2
>> - SPARK_VERSION has been set correctly
>> - SPARK-12199 ML Docs are publishing correctly
>> - SPARK-12345 Mesos cluster mode has been fixed
>>
>> Notable changes since 1.6 RC1
>> Spark Streaming
>>
>> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
>> trackStateByKey has been renamed to mapWithState
>>
>> Spark SQL
>>
>> - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
>> SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix
>> bugs in eviction of storage memory by execution.
>> - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258> correct
>> passing null into ScalaUDF
>>
>> Notable Features Since 1.5Spark SQL
>>
>> - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> Parquet
>> Performance - Improve Parquet scan performance when using flat
>> schemas.
>> - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
>> Session Management - Isolated devault database (i.e USE mydb) even on
>> shared clusters.
>> - SPARK-9999 <https://issues.apache.org/jira/browse/SPARK-9999> Dataset
>> API - A type-safe API (similar to RDDs) that performs many operations
>> on serialized binary data and code generation (i.e. Project Tungsten).
>> - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000> Unified
>> Memory Management - Shared memory for execution and caching instead
>> of exclusive division of the regions.
>> - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
>> Queries on Files - Concise syntax for running SQL queries over files
>> of any supported format without registering a table.
>> - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745> Reading
>> non-standard JSON files - Added options to read non-standard JSON
>> files (e.g. single-quotes, unquoted attributes)
>> - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412> Per-operator
>> Metrics for SQL Execution - Display statistics on a peroperator basis
>> for memory usage and spilled data size.
>> - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
>> (*) expansion for StructTypes - Makes it easier to nest and unest
>> arbitrary numbers of columns
>> - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
>> SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149> In-memory
>> Columnar Cache Performance - Significant (up to 14x) speed up when
>> caching data that contains complex types in DataFrames or SQL.
>> - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
>> null-safe joins - Joins using null-safe equality (<=>) will now
>> execute using SortMergeJoin instead of computing a cartisian product.
>> - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
>> Execution Using Off-Heap Memory - Support for configuring query
>> execution to occur using off-heap memory to avoid GC overhead
>> - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978> Datasource
>> API Avoid Double Filter - When implemeting a datasource with filter
>> pushdown, developers can now tell Spark SQL to avoid double evaluating a
>> pushed-down filter.
>> - SPARK-4849 <https://issues.apache.org/jira/browse/SPARK-4849> Advanced
>> Layout of Cached Data - storing partitioning and ordering schemes in
>> In-memory table scan, and adding distributeBy and localSort to DF API
>> - SPARK-9858 <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive
>> query execution - Intial support for automatically selecting the
>> number of reducers for joins and aggregations.
>> - SPARK-9241 <https://issues.apache.org/jira/browse/SPARK-9241> Improved
>> query planner for queries having distinct aggregations - Query plans
>> of distinct aggregations are more robust when distinct columns have high
>> cardinality.
>>
>> Spark Streaming
>>
>> - API Updates
>> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629> New
>> improved state management - mapWithState - a DStream
>> transformation for stateful stream processing, supercedes
>> updateStateByKey in functionality and performance.
>> - SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198> Kinesis
>> record deaggregation - Kinesis streams have been upgraded to use
>> KCL 1.4.0 and supports transparent deaggregation of KPL-aggregated records.
>> - SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891> Kinesis
>> message handler function - Allows arbitraray function to be
>> applied to a Kinesis record in the Kinesis receiver before to customize
>> what data is to be stored in memory.
>> - SPARK-6328 <https://issues.apache.org/jira/browse/SPARK-6328> Python
>> Streamng Listener API - Get streaming statistics (scheduling
>> delays, batch processing times, etc.) in streaming.
>>
>>
>> - UI Improvements
>> - Made failures visible in the streaming tab, in the timelines,
>> batch list, and batch details page.
>> - Made output operations visible in the streaming tab as progress
>> bars.
>>
>> MLlibNew algorithms/models
>>
>> - SPARK-8518 <https://issues.apache.org/jira/browse/SPARK-8518> Survival
>> analysis - Log-linear model for survival analysis
>> - SPARK-9834 <https://issues.apache.org/jira/browse/SPARK-9834> Normal
>> equation for least squares - Normal equation solver, providing R-like
>> model summary statistics
>> - SPARK-3147 <https://issues.apache.org/jira/browse/SPARK-3147> Online
>> hypothesis testing - A/B testing in the Spark Streaming framework
>> - SPARK-9930 <https://issues.apache.org/jira/browse/SPARK-9930> New
>> feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
>> transformer
>> - SPARK-6517 <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting
>> K-Means clustering - Fast top-down clustering variant of K-Means
>>
>> API improvements
>>
>> - ML Pipelines
>> - SPARK-6725 <https://issues.apache.org/jira/browse/SPARK-6725> Pipeline
>> persistence - Save/load for ML Pipelines, with partial coverage of
>> spark.mlalgorithms
>> - SPARK-5565 <https://issues.apache.org/jira/browse/SPARK-5565> LDA
>> in ML Pipelines - API for Latent Dirichlet Allocation in ML
>> Pipelines
>> - R API
>> - SPARK-9836 <https://issues.apache.org/jira/browse/SPARK-9836> R-like
>> statistics for GLMs - (Partial) R-like stats for ordinary least
>> squares via summary(model)
>> - SPARK-9681 <https://issues.apache.org/jira/browse/SPARK-9681> Feature
>> interactions in R formula - Interaction operator ":" in R formula
>> - Python API - Many improvements to Python API to approach feature
>> parity
>>
>> Misc improvements
>>
>> - SPARK-7685 <https://issues.apache.org/jira/browse/SPARK-7685>,
>> SPARK-9642 <https://issues.apache.org/jira/browse/SPARK-9642> Instance
>> weights for GLMs - Logistic and Linear Regression can take instance
>> weights
>> - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
>> SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate
>> and bivariate statistics in DataFrames - Variance, stddev,
>> correlations, etc.
>> - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM
>> data source - LIBSVM as a SQL data sourceDocumentation improvements
>> - SPARK-7751 <https://issues.apache.org/jira/browse/SPARK-7751> @since
>> versions - Documentation includes initial version when classes and
>> methods were added
>> - SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable
>> example code - Automated testing for code in user guide examples
>>
>> Deprecations
>>
>> - In spark.mllib.clustering.KMeans, the "runs" parameter has been
>> deprecated.
>> - In spark.ml.classification.LogisticRegressionModel and
>> spark.ml.regression.LinearRegressionModel, the "weights" field has been
>> deprecated, in favor of the new name "coefficients." This helps
>> disambiguate from instance (row) weights given to algorithms.
>>
>> Changes of behavior
>>
>> - spark.mllib.tree.GradientBoostedTrees validationTol has changed
>> semantics in 1.6. Previously, it was a threshold for absolute change in
>> error. Now, it resembles the behavior of GradientDescent convergenceTol:
>> For large errors, it uses relative error (relative to the previous error);
>> for small errors (< 0.01), it uses absolute error.
>> - spark.ml.feature.RegexTokenizer: Previously, it did not convert
>> strings to lowercase before tokenizing. Now, it converts to lowercase by
>> default, with an option not to. This matches the behavior of the simpler
>> Tokenizer transformer.
>> - Spark SQL's partition discovery has been changed to only discover
>> partition directories that are children of the given path. (i.e. if
>> path="/my/data/x=1" then x=1 will no longer be considered a partition
>> but only children of x=1.) This behavior can be overridden by
>> manually specifying the basePath that partitioning discovery should
>> start with (SPARK-11678
>> <https://issues.apache.org/jira/browse/SPARK-11678>).
>> - When casting a value of an integral type to timestamp (e.g. casting
>> a long value to timestamp), the value is treated as being in seconds
>> instead of milliseconds (SPARK-11724
>> <https://issues.apache.org/jira/browse/SPARK-11724>).
>> - With the improved query planner for queries having distinct
>> aggregations (SPARK-9241
>> <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
>> query having a single distinct aggregation has been changed to a more
>> robust version. To switch back to the plan generated by Spark 1.5's
>> planner, please set spark.sql.specializeSingleDistinctAggPlanning to
>> true (SPARK-12077 <https://issues.apache.org/jira/browse/SPARK-12077>
>> ).
>>
>>
>
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Jiří Syrový <sy...@gmail.com>.
+1 Tested in standalone mode and so far seems to be fairly stable.
2015-12-16 22:32 GMT+01:00 Michael Armbrust <mi...@databricks.com>:
> Please vote on releasing the following candidate as Apache Spark version
> 1.6.0!
>
> The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is *v1.6.0-rc3
> (168c89e07c51fa24b0bb88582c739cec0acb44d7)
> <https://github.com/apache/spark/tree/v1.6.0-rc3>*
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1174/
>
> The test repository (versioned as v1.6.0-rc3) for this release can be
> found at:
> https://repository.apache.org/content/repositories/orgapachespark-1173/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
>
> =======================================
> == How can I help test this release? ==
> =======================================
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> ================================================
> == What justifies a -1 vote for this release? ==
> ================================================
> This vote is happening towards the end of the 1.6 QA period, so -1 votes
> should only occur for significant regressions from 1.5. Bugs already
> present in 1.5, minor regressions, or bugs related to new features will not
> block this release.
>
> ===============================================================
> == What should happen to JIRA tickets still targeting 1.6.0? ==
> ===============================================================
> 1. It is OK for documentation patches to target 1.6.0 and still go into
> branch-1.6, since documentations will be published separately from the
> release.
> 2. New features for non-alpha-modules should target 1.7+.
> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
> version.
>
>
> ==================================================
> == Major changes to help you focus your testing ==
> ==================================================
>
> Notable changes since 1.6 RC2
> - SPARK_VERSION has been set correctly
> - SPARK-12199 ML Docs are publishing correctly
> - SPARK-12345 Mesos cluster mode has been fixed
>
> Notable changes since 1.6 RC1
> Spark Streaming
>
> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
> trackStateByKey has been renamed to mapWithState
>
> Spark SQL
>
> - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
> SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix
> bugs in eviction of storage memory by execution.
> - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258> correct
> passing null into ScalaUDF
>
> Notable Features Since 1.5Spark SQL
>
> - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> Parquet
> Performance - Improve Parquet scan performance when using flat schemas.
> - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
> Session Management - Isolated devault database (i.e USE mydb) even on
> shared clusters.
> - SPARK-9999 <https://issues.apache.org/jira/browse/SPARK-9999> Dataset
> API - A type-safe API (similar to RDDs) that performs many operations
> on serialized binary data and code generation (i.e. Project Tungsten).
> - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000> Unified
> Memory Management - Shared memory for execution and caching instead of
> exclusive division of the regions.
> - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
> Queries on Files - Concise syntax for running SQL queries over files
> of any supported format without registering a table.
> - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745> Reading
> non-standard JSON files - Added options to read non-standard JSON
> files (e.g. single-quotes, unquoted attributes)
> - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412> Per-operator
> Metrics for SQL Execution - Display statistics on a peroperator basis
> for memory usage and spilled data size.
> - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
> (*) expansion for StructTypes - Makes it easier to nest and unest
> arbitrary numbers of columns
> - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
> SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149> In-memory
> Columnar Cache Performance - Significant (up to 14x) speed up when
> caching data that contains complex types in DataFrames or SQL.
> - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
> null-safe joins - Joins using null-safe equality (<=>) will now
> execute using SortMergeJoin instead of computing a cartisian product.
> - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
> Execution Using Off-Heap Memory - Support for configuring query
> execution to occur using off-heap memory to avoid GC overhead
> - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978> Datasource
> API Avoid Double Filter - When implemeting a datasource with filter
> pushdown, developers can now tell Spark SQL to avoid double evaluating a
> pushed-down filter.
> - SPARK-4849 <https://issues.apache.org/jira/browse/SPARK-4849> Advanced
> Layout of Cached Data - storing partitioning and ordering schemes in
> In-memory table scan, and adding distributeBy and localSort to DF API
> - SPARK-9858 <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive
> query execution - Intial support for automatically selecting the
> number of reducers for joins and aggregations.
> - SPARK-9241 <https://issues.apache.org/jira/browse/SPARK-9241> Improved
> query planner for queries having distinct aggregations - Query plans
> of distinct aggregations are more robust when distinct columns have high
> cardinality.
>
> Spark Streaming
>
> - API Updates
> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629> New
> improved state management - mapWithState - a DStream transformation
> for stateful stream processing, supercedes updateStateByKey in
> functionality and performance.
> - SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198> Kinesis
> record deaggregation - Kinesis streams have been upgraded to use
> KCL 1.4.0 and supports transparent deaggregation of KPL-aggregated records.
> - SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891> Kinesis
> message handler function - Allows arbitraray function to be applied
> to a Kinesis record in the Kinesis receiver before to customize what data
> is to be stored in memory.
> - SPARK-6328 <https://issues.apache.org/jira/browse/SPARK-6328> Python
> Streamng Listener API - Get streaming statistics (scheduling
> delays, batch processing times, etc.) in streaming.
>
>
> - UI Improvements
> - Made failures visible in the streaming tab, in the timelines,
> batch list, and batch details page.
> - Made output operations visible in the streaming tab as progress
> bars.
>
> MLlibNew algorithms/models
>
> - SPARK-8518 <https://issues.apache.org/jira/browse/SPARK-8518> Survival
> analysis - Log-linear model for survival analysis
> - SPARK-9834 <https://issues.apache.org/jira/browse/SPARK-9834> Normal
> equation for least squares - Normal equation solver, providing R-like
> model summary statistics
> - SPARK-3147 <https://issues.apache.org/jira/browse/SPARK-3147> Online
> hypothesis testing - A/B testing in the Spark Streaming framework
> - SPARK-9930 <https://issues.apache.org/jira/browse/SPARK-9930> New
> feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
> transformer
> - SPARK-6517 <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting
> K-Means clustering - Fast top-down clustering variant of K-Means
>
> API improvements
>
> - ML Pipelines
> - SPARK-6725 <https://issues.apache.org/jira/browse/SPARK-6725> Pipeline
> persistence - Save/load for ML Pipelines, with partial coverage of
> spark.mlalgorithms
> - SPARK-5565 <https://issues.apache.org/jira/browse/SPARK-5565> LDA
> in ML Pipelines - API for Latent Dirichlet Allocation in ML
> Pipelines
> - R API
> - SPARK-9836 <https://issues.apache.org/jira/browse/SPARK-9836> R-like
> statistics for GLMs - (Partial) R-like stats for ordinary least
> squares via summary(model)
> - SPARK-9681 <https://issues.apache.org/jira/browse/SPARK-9681> Feature
> interactions in R formula - Interaction operator ":" in R formula
> - Python API - Many improvements to Python API to approach feature
> parity
>
> Misc improvements
>
> - SPARK-7685 <https://issues.apache.org/jira/browse/SPARK-7685>,
> SPARK-9642 <https://issues.apache.org/jira/browse/SPARK-9642> Instance
> weights for GLMs - Logistic and Linear Regression can take instance
> weights
> - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
> SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate
> and bivariate statistics in DataFrames - Variance, stddev,
> correlations, etc.
> - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM
> data source - LIBSVM as a SQL data sourceDocumentation improvements
> - SPARK-7751 <https://issues.apache.org/jira/browse/SPARK-7751> @since
> versions - Documentation includes initial version when classes and
> methods were added
> - SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable
> example code - Automated testing for code in user guide examples
>
> Deprecations
>
> - In spark.mllib.clustering.KMeans, the "runs" parameter has been
> deprecated.
> - In spark.ml.classification.LogisticRegressionModel and
> spark.ml.regression.LinearRegressionModel, the "weights" field has been
> deprecated, in favor of the new name "coefficients." This helps
> disambiguate from instance (row) weights given to algorithms.
>
> Changes of behavior
>
> - spark.mllib.tree.GradientBoostedTrees validationTol has changed
> semantics in 1.6. Previously, it was a threshold for absolute change in
> error. Now, it resembles the behavior of GradientDescent convergenceTol:
> For large errors, it uses relative error (relative to the previous error);
> for small errors (< 0.01), it uses absolute error.
> - spark.ml.feature.RegexTokenizer: Previously, it did not convert
> strings to lowercase before tokenizing. Now, it converts to lowercase by
> default, with an option not to. This matches the behavior of the simpler
> Tokenizer transformer.
> - Spark SQL's partition discovery has been changed to only discover
> partition directories that are children of the given path. (i.e. if
> path="/my/data/x=1" then x=1 will no longer be considered a partition
> but only children of x=1.) This behavior can be overridden by manually
> specifying the basePath that partitioning discovery should start with (
> SPARK-11678 <https://issues.apache.org/jira/browse/SPARK-11678>).
> - When casting a value of an integral type to timestamp (e.g. casting
> a long value to timestamp), the value is treated as being in seconds
> instead of milliseconds (SPARK-11724
> <https://issues.apache.org/jira/browse/SPARK-11724>).
> - With the improved query planner for queries having distinct
> aggregations (SPARK-9241
> <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
> query having a single distinct aggregation has been changed to a more
> robust version. To switch back to the plan generated by Spark 1.5's
> planner, please set spark.sql.specializeSingleDistinctAggPlanning to
> true (SPARK-12077 <https://issues.apache.org/jira/browse/SPARK-12077>).
>
>
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Timothy O <to...@yahoo-inc.com.INVALID>.
+1
On Thursday, December 17, 2015 8:22 AM, Kousuke Saruta <sa...@oss.nttdata.co.jp> wrote:
+1
On 2015/12/17 6:32, Michael Armbrust wrote:
Please vote on releasing the following candidate as Apache Spark version 1.6.0!
The vote is open until Saturday, December 19, 2015 at 18:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast.
[ ] +1 Release this package as Apache Spark 1.6.0 [ ] -1 Do not release this package because ...
To learn more about Apache Spark, please see http://spark.apache.org/
The tag to be voted on is v1.6.0-rc3 (168c89e07c51fa24b0bb88582c739cec0acb44d7)
The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc
The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1174/
The test repository (versioned as v1.6.0-rc3) for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1173/
The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
======================================= == How can I help test this release? == ======================================= If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions.
================================================ == What justifies a -1 vote for this release? == ================================================ This vote is happening towards the end of the 1.6 QA period, so -1 votes should only occur for significant regressions from 1.5. Bugs already present in 1.5, minor regressions, or bugs related to new features will not block this release.
=============================================================== == What should happen to JIRA tickets still targeting 1.6.0? == =============================================================== 1. It is OK for documentation patches to target 1.6.0 and still go into branch-1.6, since documentations will be published separately from the release. 2. New features for non-alpha-modules should target 1.7+. 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target version.
================================================== == Major changes to help you focus your testing == ==================================================
Notable changes since 1.6 RC2
- SPARK_VERSION has been set correctly
- SPARK-12199 ML Docs are publishing correctly
- SPARK-12345 Mesos cluster mode has been fixed
Notable changes since 1.6 RC1
Spark Streaming
- SPARK-2629 trackStateByKey has been renamed to mapWithState
Spark SQL
- SPARK-12165 SPARK-12189 Fix bugs in eviction of storage memory by execution.
- SPARK-12258 correct passing null into ScalaUDF
Notable Features Since 1.5
Spark SQL
- SPARK-11787 Parquet Performance - Improve Parquet scan performance when using flat schemas.
- SPARK-10810 Session Management - Isolated devault database (i.e USE mydb) even on shared clusters.
- SPARK-9999 Dataset API - A type-safe API (similar to RDDs) that performs many operations on serialized binary data and code generation (i.e. Project Tungsten).
- SPARK-10000 Unified Memory Management - Shared memory for execution and caching instead of exclusive division of the regions.
- SPARK-11197 SQL Queries on Files - Concise syntax for running SQL queries over files of any supported format without registering a table.
- SPARK-11745 Reading non-standard JSON files - Added options to read non-standard JSON files (e.g. single-quotes, unquoted attributes)
- SPARK-10412 Per-operator Metrics for SQL Execution - Display statistics on a peroperator basis for memory usage and spilled data size.
- SPARK-11329 Star (*) expansion for StructTypes - Makes it easier to nest and unest arbitrary numbers of columns
- SPARK-10917, SPARK-11149 In-memory Columnar Cache Performance - Significant (up to 14x) speed up when caching data that contains complex types in DataFrames or SQL.
- SPARK-11111 Fast null-safe joins - Joins using null-safe equality (<=>) will now execute using SortMergeJoin instead of computing a cartisian product.
- SPARK-11389 SQL Execution Using Off-Heap Memory - Support for configuring query execution to occur using off-heap memory to avoid GC overhead
- SPARK-10978 Datasource API Avoid Double Filter - When implemeting a datasource with filter pushdown, developers can now tell Spark SQL to avoid double evaluating a pushed-down filter.
- SPARK-4849 Advanced Layout of Cached Data - storing partitioning and ordering schemes in In-memory table scan, and adding distributeBy and localSort to DF API
- SPARK-9858 Adaptive query execution - Intial support for automatically selecting the number of reducers for joins and aggregations.
- SPARK-9241 Improved query planner for queries having distinct aggregations - Query plans of distinct aggregations are more robust when distinct columns have high cardinality.
Spark Streaming
- API Updates
- SPARK-2629 New improved state management - mapWithState - a DStream transformation for stateful stream processing, supercedes updateStateByKey in functionality and performance.
- SPARK-11198 Kinesis record deaggregation - Kinesis streams have been upgraded to use KCL 1.4.0 and supports transparent deaggregation of KPL-aggregated records.
- SPARK-10891 Kinesis message handler function - Allows arbitraray function to be applied to a Kinesis record in the Kinesis receiver before to customize what data is to be stored in memory.
- SPARK-6328 Python Streamng Listener API - Get streaming statistics (scheduling delays, batch processing times, etc.) in streaming.
- UI Improvements
- Made failures visible in the streaming tab, in the timelines, batch list, and batch details page.
- Made output operations visible in the streaming tab as progress bars.
MLlib
New algorithms/models
- SPARK-8518 Survival analysis - Log-linear model for survival analysis
- SPARK-9834 Normal equation for least squares - Normal equation solver, providing R-like model summary statistics
- SPARK-3147 Online hypothesis testing - A/B testing in the Spark Streaming framework
- SPARK-9930 New feature transformers - ChiSqSelector, QuantileDiscretizer, SQL transformer
- SPARK-6517 Bisecting K-Means clustering - Fast top-down clustering variant of K-Means
API improvements
- ML Pipelines
- SPARK-6725 Pipeline persistence - Save/load for ML Pipelines, with partial coverage of spark.mlalgorithms
- SPARK-5565 LDA in ML Pipelines - API for Latent Dirichlet Allocation in ML Pipelines
- R API
- SPARK-9836 R-like statistics for GLMs - (Partial) R-like stats for ordinary least squares via summary(model)
- SPARK-9681 Feature interactions in R formula - Interaction operator ":" in R formula
- Python API - Many improvements to Python API to approach feature parity
Misc improvements
- SPARK-7685 , SPARK-9642 Instance weights for GLMs - Logistic and Linear Regression can take instance weights
- SPARK-10384, SPARK-10385 Univariate and bivariate statistics in DataFrames - Variance, stddev, correlations, etc.
- SPARK-10117 LIBSVM data source - LIBSVM as a SQL data source
Documentation improvements
- SPARK-7751 @since versions - Documentation includes initial version when classes and methods were added
- SPARK-11337 Testable example code - Automated testing for code in user guide examples
Deprecations
- In spark.mllib.clustering.KMeans, the "runs" parameter has been deprecated.
- In spark.ml.classification.LogisticRegressionModel and spark.ml.regression.LinearRegressionModel, the "weights" field has been deprecated, in favor of the new name "coefficients." This helps disambiguate from instance (row) weights given to algorithms.
Changes of behavior
- spark.mllib.tree.GradientBoostedTrees validationTol has changed semantics in 1.6. Previously, it was a threshold for absolute change in error. Now, it resembles the behavior of GradientDescent convergenceTol: For large errors, it uses relative error (relative to the previous error); for small errors (< 0.01), it uses absolute error.
- spark.ml.feature.RegexTokenizer: Previously, it did not convert strings to lowercase before tokenizing. Now, it converts to lowercase by default, with an option not to. This matches the behavior of the simpler Tokenizer transformer.
- Spark SQL's partition discovery has been changed to only discover partition directories that are children of the given path. (i.e. if path="/my/data/x=1" then x=1 will no longer be considered a partition but only children of x=1.) This behavior can be overridden by manually specifying the basePath that partitioning discovery should start with (SPARK-11678).
- When casting a value of an integral type to timestamp (e.g. casting a long value to timestamp), the value is treated as being in seconds instead of milliseconds (SPARK-11724).
- With the improved query planner for queries having distinct aggregations (SPARK-9241), the plan of a query having a single distinct aggregation has been changed to a more robust version. To switch back to the plan generated by Spark 1.5's planner, please set spark.sql.specializeSingleDistinctAggPlanning to true (SPARK-12077).
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Michael Gummelt <mg...@mesosphere.io>.
The fix for the Mesos cluster regression has introduced another Mesos
cluster bug. Namely, the MesosClusterDispatcher crashes when trying to
write to ZK: https://issues.apache.org/jira/browse/SPARK-12413
I have a tentative fix here: https://github.com/apache/spark/pull/10366
On Thu, Dec 17, 2015 at 2:07 PM, Andrew Or <an...@databricks.com> wrote:
> That seems like an HDP-specific issue. I did a quick search on "spark bad
> substitution" and all the results have to do with people failing to run
> YARN cluster in HDP. Here is a workaround
> <https://mail-archives.apache.org/mod_mbox/spark-user/201509.mbox/%3CCABoKCLHrRLj6m3w+4Z2OQcOBr-aMZetut8AceNZgQLCs_OG_aw@mail.gmail.com%3E>
> that seems to have worked for multiple people.
>
> I would not block the release on this particular issue. First, this
> doesn't seem like a Spark issue and second, even if it is, this only
> affects a small number of users and there is a workaround for it. In my own
> testing the `extraJavaOptions` are propagated correctly in both YARN client
> and cluster modes.
>
> 2015-12-17 12:36 GMT-08:00 Sebastian YEPES FERNANDEZ <sy...@gmail.com>:
>
>> @Andrew
>> Thanks for the reply, did you run this in a Hortonworks or Cloudera
>> cluster?
>> I suspect the issue is coming from the extraJavaOptions as these are
>> necessary in HDP, the strange thing is that with exactly the same settings
>> 1.5 works.
>>
>> # jar -tf spark-assembly-1.6.0-SNAPSHOT-hadoop2.7.1.jar | grep
>> ApplicationMaster.class
>> org/apache/spark/deploy/yarn/ApplicationMaster.class
>>
>> ----
>> Exit code: 1
>> Exception message:
>> /hadoop/hdfs/disk02/hadoop/yarn/local/usercache/syepes/appcache/application_1445706872927_1593/container_e44_1445706872927_1593_02_000001/launch_container.sh:
>> line 24:
>> /usr/hdp/current/hadoop-client/lib/hadoop-lzo-0.6.0.2.3.2.0-2950.jar:$PWD:$PWD/__spark_conf__:$PWD/__spark__.jar:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure:
>> bad substitution
>> -----
>>
>> Regards,
>> Sebastian
>>
>> On Thu, Dec 17, 2015 at 9:14 PM, Andrew Or <an...@databricks.com> wrote:
>>
>>> @syepes
>>>
>>> I just run Spark 1.6 (881f254) on YARN with Hadoop 2.4.0. I was able to
>>> run a simple application in cluster mode successfully.
>>>
>>> Can you verify whether the org.apache.spark.yarn.ApplicationMaster class
>>> exists in your assembly jar?
>>>
>>> jar -tf assembly.jar | grep ApplicationMaster
>>>
>>> -Andrew
>>>
>>>
>>> 2015-12-17 7:44 GMT-08:00 syepes <sy...@gmail.com>:
>>>
>>>> -1 (YARN Cluster deployment mode not working)
>>>>
>>>> I have just tested 1.6 (d509194b) on our HDP 2.3 platform and the
>>>> cluster
>>>> mode does not seem work. It looks like some parameter are not being
>>>> passed
>>>> correctly.
>>>> This example works correctly with 1.5.
>>>>
>>>> # spark-submit --master yarn --deploy-mode cluster --num-executors 1
>>>> --properties-file $PWD/spark-props.conf --class
>>>> org.apache.spark.examples.SparkPi
>>>> /opt/spark/lib/spark-examples-1.6.0-SNAPSHOT-hadoop2.7.1.jar
>>>>
>>>> Error: Could not find or load main class
>>>> org.apache.spark.deploy.yarn.ApplicationMaster
>>>>
>>>> spark-props.conf
>>>> -----------------------------
>>>> spark.driver.
>>>>
>>>> extraJavaOptions -Dhdp.version=2.3.2.0-2950
>>>> spark.driver.extraLibraryPath
>>>>
>>>> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
>>>> spark.executor.extraJavaOptions -Dhdp.version=2.3.2.0-2950
>>>> spark.executor.extraLibraryPath
>>>>
>>>> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
>>>> -----------------------------
>>>>
>>>> I will try to do some more debugging on this issue.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-6-0-RC3-tp15660p15692.html
>>>> Sent from the Apache Spark Developers List mailing list archive at
>>>> Nabble.com.
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>>
>>>>
>>>
>>
>
--
Michael Gummelt
Software Engineer
Mesosphere
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Andrew Or <an...@databricks.com>.
That seems like an HDP-specific issue. I did a quick search on "spark bad
substitution" and all the results have to do with people failing to run
YARN cluster in HDP. Here is a workaround
<https://mail-archives.apache.org/mod_mbox/spark-user/201509.mbox/%3CCABoKCLHrRLj6m3w+4Z2OQcOBr-aMZetut8AceNZgQLCs_OG_aw@mail.gmail.com%3E>
that seems to have worked for multiple people.
I would not block the release on this particular issue. First, this doesn't
seem like a Spark issue and second, even if it is, this only affects a
small number of users and there is a workaround for it. In my own testing
the `extraJavaOptions` are propagated correctly in both YARN client and
cluster modes.
2015-12-17 12:36 GMT-08:00 Sebastian YEPES FERNANDEZ <sy...@gmail.com>:
> @Andrew
> Thanks for the reply, did you run this in a Hortonworks or Cloudera
> cluster?
> I suspect the issue is coming from the extraJavaOptions as these are
> necessary in HDP, the strange thing is that with exactly the same settings
> 1.5 works.
>
> # jar -tf spark-assembly-1.6.0-SNAPSHOT-hadoop2.7.1.jar | grep
> ApplicationMaster.class
> org/apache/spark/deploy/yarn/ApplicationMaster.class
>
> ----
> Exit code: 1
> Exception message:
> /hadoop/hdfs/disk02/hadoop/yarn/local/usercache/syepes/appcache/application_1445706872927_1593/container_e44_1445706872927_1593_02_000001/launch_container.sh:
> line 24:
> /usr/hdp/current/hadoop-client/lib/hadoop-lzo-0.6.0.2.3.2.0-2950.jar:$PWD:$PWD/__spark_conf__:$PWD/__spark__.jar:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure:
> bad substitution
> -----
>
> Regards,
> Sebastian
>
> On Thu, Dec 17, 2015 at 9:14 PM, Andrew Or <an...@databricks.com> wrote:
>
>> @syepes
>>
>> I just run Spark 1.6 (881f254) on YARN with Hadoop 2.4.0. I was able to
>> run a simple application in cluster mode successfully.
>>
>> Can you verify whether the org.apache.spark.yarn.ApplicationMaster class
>> exists in your assembly jar?
>>
>> jar -tf assembly.jar | grep ApplicationMaster
>>
>> -Andrew
>>
>>
>> 2015-12-17 7:44 GMT-08:00 syepes <sy...@gmail.com>:
>>
>>> -1 (YARN Cluster deployment mode not working)
>>>
>>> I have just tested 1.6 (d509194b) on our HDP 2.3 platform and the cluster
>>> mode does not seem work. It looks like some parameter are not being
>>> passed
>>> correctly.
>>> This example works correctly with 1.5.
>>>
>>> # spark-submit --master yarn --deploy-mode cluster --num-executors 1
>>> --properties-file $PWD/spark-props.conf --class
>>> org.apache.spark.examples.SparkPi
>>> /opt/spark/lib/spark-examples-1.6.0-SNAPSHOT-hadoop2.7.1.jar
>>>
>>> Error: Could not find or load main class
>>> org.apache.spark.deploy.yarn.ApplicationMaster
>>>
>>> spark-props.conf
>>> -----------------------------
>>> spark.driver.
>>>
>>> extraJavaOptions -Dhdp.version=2.3.2.0-2950
>>> spark.driver.extraLibraryPath
>>>
>>> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
>>> spark.executor.extraJavaOptions -Dhdp.version=2.3.2.0-2950
>>> spark.executor.extraLibraryPath
>>>
>>> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
>>> -----------------------------
>>>
>>> I will try to do some more debugging on this issue.
>>>
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-6-0-RC3-tp15660p15692.html
>>> Sent from the Apache Spark Developers List mailing list archive at
>>> Nabble.com.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>
>>>
>>
>
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Sebastian YEPES FERNANDEZ <sy...@gmail.com>.
@Andrew
Thanks for the reply, did you run this in a Hortonworks or Cloudera cluster?
I suspect the issue is coming from the extraJavaOptions as these are
necessary in HDP, the strange thing is that with exactly the same settings
1.5 works.
# jar -tf spark-assembly-1.6.0-SNAPSHOT-hadoop2.7.1.jar | grep
ApplicationMaster.class
org/apache/spark/deploy/yarn/ApplicationMaster.class
----
Exit code: 1
Exception message:
/hadoop/hdfs/disk02/hadoop/yarn/local/usercache/syepes/appcache/application_1445706872927_1593/container_e44_1445706872927_1593_02_000001/launch_container.sh:
line 24:
/usr/hdp/current/hadoop-client/lib/hadoop-lzo-0.6.0.2.3.2.0-2950.jar:$PWD:$PWD/__spark_conf__:$PWD/__spark__.jar:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure:
bad substitution
-----
Regards,
Sebastian
On Thu, Dec 17, 2015 at 9:14 PM, Andrew Or <an...@databricks.com> wrote:
> @syepes
>
> I just run Spark 1.6 (881f254) on YARN with Hadoop 2.4.0. I was able to
> run a simple application in cluster mode successfully.
>
> Can you verify whether the org.apache.spark.yarn.ApplicationMaster class
> exists in your assembly jar?
>
> jar -tf assembly.jar | grep ApplicationMaster
>
> -Andrew
>
>
> 2015-12-17 7:44 GMT-08:00 syepes <sy...@gmail.com>:
>
>> -1 (YARN Cluster deployment mode not working)
>>
>> I have just tested 1.6 (d509194b) on our HDP 2.3 platform and the cluster
>> mode does not seem work. It looks like some parameter are not being passed
>> correctly.
>> This example works correctly with 1.5.
>>
>> # spark-submit --master yarn --deploy-mode cluster --num-executors 1
>> --properties-file $PWD/spark-props.conf --class
>> org.apache.spark.examples.SparkPi
>> /opt/spark/lib/spark-examples-1.6.0-SNAPSHOT-hadoop2.7.1.jar
>>
>> Error: Could not find or load main class
>> org.apache.spark.deploy.yarn.ApplicationMaster
>>
>> spark-props.conf
>> -----------------------------
>> spark.driver.
>>
>> extraJavaOptions -Dhdp.version=2.3.2.0-2950
>> spark.driver.extraLibraryPath
>>
>> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
>> spark.executor.extraJavaOptions -Dhdp.version=2.3.2.0-2950
>> spark.executor.extraLibraryPath
>>
>> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
>> -----------------------------
>>
>> I will try to do some more debugging on this issue.
>>
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-6-0-RC3-tp15660p15692.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> For additional commands, e-mail: dev-help@spark.apache.org
>>
>>
>
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Andrew Or <an...@databricks.com>.
@syepes
I just run Spark 1.6 (881f254) on YARN with Hadoop 2.4.0. I was able to run
a simple application in cluster mode successfully.
Can you verify whether the org.apache.spark.yarn.ApplicationMaster class
exists in your assembly jar?
jar -tf assembly.jar | grep ApplicationMaster
-Andrew
2015-12-17 7:44 GMT-08:00 syepes <sy...@gmail.com>:
> -1 (YARN Cluster deployment mode not working)
>
> I have just tested 1.6 (d509194b) on our HDP 2.3 platform and the cluster
> mode does not seem work. It looks like some parameter are not being passed
> correctly.
> This example works correctly with 1.5.
>
> # spark-submit --master yarn --deploy-mode cluster --num-executors 1
> --properties-file $PWD/spark-props.conf --class
> org.apache.spark.examples.SparkPi
> /opt/spark/lib/spark-examples-1.6.0-SNAPSHOT-hadoop2.7.1.jar
>
> Error: Could not find or load main class
> org.apache.spark.deploy.yarn.ApplicationMaster
>
> spark-props.conf
> -----------------------------
> spark.driver.extraJavaOptions -Dhdp.version=2.3.2.0-2950
> spark.driver.extraLibraryPath
>
> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
> spark.executor.extraJavaOptions -Dhdp.version=2.3.2.0-2950
> spark.executor.extraLibraryPath
>
> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
> -----------------------------
>
> I will try to do some more debugging on this issue.
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-6-0-RC3-tp15660p15692.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
>
>
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Vinay Shukla <vi...@gmail.com>.
One correction, the better way is to just create a file called java-opts in
.../spark/conf with the following config value in it -Dhdp.version=<version
of HDP>.
One way to get the HDP version is to run the below one lines on a node of
your HDP cluster.
hdp-select status hadoop-client | sed 's/hadoop-client - \(.*\)/\1/'
You can also specify the same value using SPARK_JAVA_OPTS, i.e. export
SPARK_JAVA_OPTS="-Dhdp.version=2.2.5.0-2644"
add the following options to spark-defaults.conf:
spark.driver.extraJavaOptions -Dhdp.version=2.2.5.0-2644
spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.5.0-2644
--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-6-0-RC3-tp15660p15701.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Vinay Shukla <vi...@gmail.com>.
Agree with Andrew, we shouldn't block the release for this.
This issue won't be there in Spark distribution from Hortonworks since we
set the HDP version.
If you want to use the Apache Spark with HDP you can modify mapred-site.xml
to replace the hdp.version property with the right value for your cluster.
You can find the right value by invoking the hdp-select script on a node
that has HDP installed. On my system running it returns the following:
hdp-select status hadoop-client
hadoop-client - 2.2.5.0-2644
Here is a one line script to get the version:
export HDP_VER=`hdp-select status hadoop-client | sed 's/hadoop-client -
\(.*\)/\1/'`
CAUTION - if you modify mapred-site.xml on a node on the cluster, this will
break rolling upgrades in certain scenarios where a program like oozie
submitting a job from that node will use the hardcoded version instead of
the version specified by the client.
So what does the Hortonworks distribution do under the covers to support
hdp.version?
create a file called java-opts with the following config value in it
-Dhdp.version=2.2.5.0-2644. You can also specify the same value using
SPARK_JAVA_OPTS, i.e. export SPARK_JAVA_OPTS="-Dhdp.version=2.2.5.0-2644"
add the following options to spark-defaults.conf:
spark.driver.extraJavaOptions -Dhdp.version=2.2.5.0-2644
spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.5.0-2644
--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-6-0-RC3-tp15660p15699.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by syepes <sy...@gmail.com>.
-1 (YARN Cluster deployment mode not working)
I have just tested 1.6 (d509194b) on our HDP 2.3 platform and the cluster
mode does not seem work. It looks like some parameter are not being passed
correctly.
This example works correctly with 1.5.
# spark-submit --master yarn --deploy-mode cluster --num-executors 1
--properties-file $PWD/spark-props.conf --class
org.apache.spark.examples.SparkPi
/opt/spark/lib/spark-examples-1.6.0-SNAPSHOT-hadoop2.7.1.jar
Error: Could not find or load main class
org.apache.spark.deploy.yarn.ApplicationMaster
spark-props.conf
-----------------------------
spark.driver.extraJavaOptions -Dhdp.version=2.3.2.0-2950
spark.driver.extraLibraryPath
/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.executor.extraJavaOptions -Dhdp.version=2.3.2.0-2950
spark.executor.extraLibraryPath
/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
-----------------------------
I will try to do some more debugging on this issue.
--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-6-0-RC3-tp15660p15692.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Kousuke Saruta <sa...@oss.nttdata.co.jp>.
+1
On 2015/12/17 6:32, Michael Armbrust wrote:
> Please vote on releasing the following candidate as Apache Spark
> version 1.6.0!
>
> The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is _v1.6.0-rc3
> (168c89e07c51fa24b0bb88582c739cec0acb44d7)
> <https://github.com/apache/spark/tree/v1.6.0-rc3>_
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
> <http://people.apache.org/%7Epwendell/spark-releases/spark-1.6.0-rc3-bin/>
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1174/
>
> The test repository (versioned as v1.6.0-rc3) for this release can be
> found at:
> https://repository.apache.org/content/repositories/orgapachespark-1173/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/ <http://people.apache.org/%7Epwendell/spark-releases/spark-1.6.0-rc3-docs/>
>
> =======================================
> == How can I help test this release? ==
> =======================================
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> ================================================
> == What justifies a -1 vote for this release? ==
> ================================================
> This vote is happening towards the end of the 1.6 QA period, so -1
> votes should only occur for significant regressions from 1.5. Bugs
> already present in 1.5, minor regressions, or bugs related to new
> features will not block this release.
>
> ===============================================================
> == What should happen to JIRA tickets still targeting 1.6.0? ==
> ===============================================================
> 1. It is OK for documentation patches to target 1.6.0 and still go
> into branch-1.6, since documentations will be published separately
> from the release.
> 2. New features for non-alpha-modules should target 1.7+.
> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the
> target version.
>
>
> ==================================================
> == Major changes to help you focus your testing ==
> ==================================================
>
>
> Notable changes since 1.6 RC2
>
>
> - SPARK_VERSION has been set correctly
> - SPARK-12199 ML Docs are publishing correctly
> - SPARK-12345 Mesos cluster mode has been fixed
>
>
> Notable changes since 1.6 RC1
>
>
> Spark Streaming
>
> * SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
> |trackStateByKey| has been renamed to |mapWithState|
>
>
> Spark SQL
>
> * SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
> SPARK-12189
> <https://issues.apache.org/jira/browse/SPARK-12189> Fix bugs in
> eviction of storage memory by execution.
> * SPARK-12258
> <https://issues.apache.org/jira/browse/SPARK-12258> correct
> passing null into ScalaUDF
>
>
> Notable Features Since 1.5
>
>
> Spark SQL
>
> * SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787>
> Parquet Performance - Improve Parquet scan performance when using
> flat schemas.
> * SPARK-10810
> <https://issues.apache.org/jira/browse/SPARK-10810>Session
> Management - Isolated devault database (i.e |USE mydb|) even on
> shared clusters.
> * SPARK-9999 <https://issues.apache.org/jira/browse/SPARK-9999>
> Dataset API - A type-safe API (similar to RDDs) that performs many
> operations on serialized binary data and code generation (i.e.
> Project Tungsten).
> * SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000>
> Unified Memory Management - Shared memory for execution and
> caching instead of exclusive division of the regions.
> * SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197>
> SQL Queries on Files - Concise syntax for running SQL queries over
> files of any supported format without registering a table.
> * SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745>
> Reading non-standard JSON files - Added options to read
> non-standard JSON files (e.g. single-quotes, unquoted attributes)
> * SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412>
> Per-operator Metrics for SQL Execution - Display statistics on a
> peroperator basis for memory usage and spilled data size.
> * SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329>
> Star (*) expansion for StructTypes - Makes it easier to nest and
> unest arbitrary numbers of columns
> * SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
> SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149>
> In-memory Columnar Cache Performance - Significant (up to 14x)
> speed up when caching data that contains complex types in
> DataFrames or SQL.
> * SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111>
> Fast null-safe joins - Joins using null-safe equality (|<=>|) will
> now execute using SortMergeJoin instead of computing a cartisian
> product.
> * SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389>
> SQL Execution Using Off-Heap Memory - Support for configuring
> query execution to occur using off-heap memory to avoid GC overhead
> * SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978>
> Datasource API Avoid Double Filter - When implemeting a datasource
> with filter pushdown, developers can now tell Spark SQL to avoid
> double evaluating a pushed-down filter.
> * SPARK-4849 <https://issues.apache.org/jira/browse/SPARK-4849>
> Advanced Layout of Cached Data - storing partitioning and ordering
> schemes in In-memory table scan, and adding distributeBy and
> localSort to DF API
> * SPARK-9858 <https://issues.apache.org/jira/browse/SPARK-9858>
> Adaptive query execution - Intial support for automatically
> selecting the number of reducers for joins and aggregations.
> * SPARK-9241 <https://issues.apache.org/jira/browse/SPARK-9241>
> Improved query planner for queries having distinct aggregations -
> Query plans of distinct aggregations are more robust when distinct
> columns have high cardinality.
>
>
> Spark Streaming
>
> * API Updates
> o SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
> New improved state management - |mapWithState| - a DStream
> transformation for stateful stream processing, supercedes
> |updateStateByKey| in functionality and performance.
> o SPARK-11198
> <https://issues.apache.org/jira/browse/SPARK-11198> Kinesis
> record deaggregation - Kinesis streams have been upgraded to
> use KCL 1.4.0 and supports transparent deaggregation of
> KPL-aggregated records.
> o SPARK-10891
> <https://issues.apache.org/jira/browse/SPARK-10891> Kinesis
> message handler function - Allows arbitraray function to be
> applied to a Kinesis record in the Kinesis receiver before to
> customize what data is to be stored in memory.
> o SPARK-6328 <https://issues.apache.org/jira/browse/SPARK-6328>
> Python Streamng Listener API - Get streaming statistics
> (scheduling delays, batch processing times, etc.) in streaming.
>
> * UI Improvements
> o Made failures visible in the streaming tab, in the timelines,
> batch list, and batch details page.
> o Made output operations visible in the streaming tab as
> progress bars.
>
>
> MLlib
>
>
> New algorithms/models
>
> * SPARK-8518 <https://issues.apache.org/jira/browse/SPARK-8518>
> Survival analysis - Log-linear model for survival analysis
> * SPARK-9834 <https://issues.apache.org/jira/browse/SPARK-9834>
> Normal equation for least squares - Normal equation solver,
> providing R-like model summary statistics
> * SPARK-3147 <https://issues.apache.org/jira/browse/SPARK-3147>
> Online hypothesis testing - A/B testing in the Spark Streaming
> framework
> * SPARK-9930 <https://issues.apache.org/jira/browse/SPARK-9930> New
> feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
> transformer
> * SPARK-6517 <https://issues.apache.org/jira/browse/SPARK-6517>
> Bisecting K-Means clustering - Fast top-down clustering variant of
> K-Means
>
>
> API improvements
>
> * ML Pipelines
> o SPARK-6725 <https://issues.apache.org/jira/browse/SPARK-6725>
> Pipeline persistence - Save/load for ML Pipelines, with
> partial coverage of spark.ml <http://spark.ml/>algorithms
> o SPARK-5565 <https://issues.apache.org/jira/browse/SPARK-5565>
> LDA in ML Pipelines - API for Latent Dirichlet Allocation in
> ML Pipelines
> * R API
> o SPARK-9836 <https://issues.apache.org/jira/browse/SPARK-9836>
> R-like statistics for GLMs - (Partial) R-like stats for
> ordinary least squares via summary(model)
> o SPARK-9681 <https://issues.apache.org/jira/browse/SPARK-9681>
> Feature interactions in R formula - Interaction operator ":"
> in R formula
> * Python API - Many improvements to Python API to approach feature
> parity
>
>
> Misc improvements
>
> * SPARK-7685 <https://issues.apache.org/jira/browse/SPARK-7685>,
> SPARK-9642 <https://issues.apache.org/jira/browse/SPARK-9642>
> Instance weights for GLMs - Logistic and Linear Regression can
> take instance weights
> * SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
> SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385>
> Univariate and bivariate statistics in DataFrames - Variance,
> stddev, correlations, etc.
> * SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117>
> LIBSVM data source - LIBSVM as a SQL data source
>
>
> Documentation improvements
>
> * SPARK-7751 <https://issues.apache.org/jira/browse/SPARK-7751>
> @since versions - Documentation includes initial version when
> classes and methods were added
> * SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337>
> Testable example code - Automated testing for code in user guide
> examples
>
>
> Deprecations
>
> * In spark.mllib.clustering.KMeans, the "runs" parameter has been
> deprecated.
> * In spark.ml.classification.LogisticRegressionModel and
> spark.ml.regression.LinearRegressionModel, the "weights" field has
> been deprecated, in favor of the new name "coefficients." This
> helps disambiguate from instance (row) weights given to algorithms.
>
>
> Changes of behavior
>
> * spark.mllib.tree.GradientBoostedTrees validationTol has changed
> semantics in 1.6. Previously, it was a threshold for absolute
> change in error. Now, it resembles the behavior of GradientDescent
> convergenceTol: For large errors, it uses relative error (relative
> to the previous error); for small errors (< 0.01), it uses
> absolute error.
> * spark.ml.feature.RegexTokenizer: Previously, it did not convert
> strings to lowercase before tokenizing. Now, it converts to
> lowercase by default, with an option not to. This matches the
> behavior of the simpler Tokenizer transformer.
> * Spark SQL's partition discovery has been changed to only discover
> partition directories that are children of the given path. (i.e.
> if |path="/my/data/x=1"| then |x=1| will no longer be considered a
> partition but only children of |x=1|.) This behavior can be
> overridden by manually specifying the |basePath| that partitioning
> discovery should start with (SPARK-11678
> <https://issues.apache.org/jira/browse/SPARK-11678>).
> * When casting a value of an integral type to timestamp (e.g.
> casting a long value to timestamp), the value is treated as being
> in seconds instead of milliseconds (SPARK-11724
> <https://issues.apache.org/jira/browse/SPARK-11724>).
> * With the improved query planner for queries having distinct
> aggregations (SPARK-9241
> <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
> query having a single distinct aggregation has been changed to a
> more robust version. To switch back to the plan generated by Spark
> 1.5's planner, please set
> |spark.sql.specializeSingleDistinctAggPlanning| to
> |true| (SPARK-12077
> <https://issues.apache.org/jira/browse/SPARK-12077>).
>
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Tom Graves <tg...@yahoo.com.INVALID>.
+1. Ran some regression tests on Spark on Yarn (hadoop 2.6 and 2.7).
Tom
On Wednesday, December 16, 2015 3:32 PM, Michael Armbrust <mi...@databricks.com> wrote:
Please vote on releasing the following candidate as Apache Spark version 1.6.0!
The vote is open until Saturday, December 19, 2015 at 18:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast.
[ ] +1 Release this package as Apache Spark 1.6.0[ ] -1 Do not release this package because ...
To learn more about Apache Spark, please see http://spark.apache.org/
The tag to be voted on is v1.6.0-rc3 (168c89e07c51fa24b0bb88582c739cec0acb44d7)
The release files, including signatures, digests, etc. can be found at:http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
Release artifacts are signed with the following key:https://people.apache.org/keys/committer/pwendell.asc
The staging repository for this release can be found at:https://repository.apache.org/content/repositories/orgapachespark-1174/
The test repository (versioned as v1.6.0-rc3) for this release can be found at:https://repository.apache.org/content/repositories/orgapachespark-1173/
The documentation corresponding to this release can be found at:http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
========================================= How can I help test this release? =========================================If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions.
================================================== What justifies a -1 vote for this release? ==================================================This vote is happening towards the end of the 1.6 QA period, so -1 votes should only occur for significant regressions from 1.5. Bugs already present in 1.5, minor regressions, or bugs related to new features will not block this release.
================================================================= What should happen to JIRA tickets still targeting 1.6.0? =================================================================1. It is OK for documentation patches to target 1.6.0 and still go into branch-1.6, since documentations will be published separately from the release.2. New features for non-alpha-modules should target 1.7+.3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target version.
==================================================== Major changes to help you focus your testing ====================================================
Notable changes since 1.6 RC2
- SPARK_VERSION has been set correctly
- SPARK-12199 ML Docs are publishing correctly
- SPARK-12345 Mesos cluster mode has been fixed
Notable changes since 1.6 RC1
Spark Streaming
- SPARK-2629 trackStateByKey has been renamed to mapWithState
Spark SQL
- SPARK-12165 SPARK-12189 Fix bugs in eviction of storage memory by execution.
- SPARK-12258 correct passing null into ScalaUDF
Notable Features Since 1.5
Spark SQL
- SPARK-11787 Parquet Performance - Improve Parquet scan performance when using flat schemas.
- SPARK-10810 Session Management - Isolated devault database (i.e USE mydb) even on shared clusters.
- SPARK-9999 Dataset API - A type-safe API (similar to RDDs) that performs many operations on serialized binary data and code generation (i.e. Project Tungsten).
- SPARK-10000 Unified Memory Management - Shared memory for execution and caching instead of exclusive division of the regions.
- SPARK-11197 SQL Queries on Files - Concise syntax for running SQL queries over files of any supported format without registering a table.
- SPARK-11745 Reading non-standard JSON files - Added options to read non-standard JSON files (e.g. single-quotes, unquoted attributes)
- SPARK-10412 Per-operator Metrics for SQL Execution - Display statistics on a peroperator basis for memory usage and spilled data size.
- SPARK-11329 Star (*) expansion for StructTypes - Makes it easier to nest and unest arbitrary numbers of columns
- SPARK-10917, SPARK-11149 In-memory Columnar Cache Performance - Significant (up to 14x) speed up when caching data that contains complex types in DataFrames or SQL.
- SPARK-11111 Fast null-safe joins - Joins using null-safe equality (<=>) will now execute using SortMergeJoin instead of computing a cartisian product.
- SPARK-11389 SQL Execution Using Off-Heap Memory - Support for configuring query execution to occur using off-heap memory to avoid GC overhead
- SPARK-10978 Datasource API Avoid Double Filter - When implemeting a datasource with filter pushdown, developers can now tell Spark SQL to avoid double evaluating a pushed-down filter.
- SPARK-4849 Advanced Layout of Cached Data - storing partitioning and ordering schemes in In-memory table scan, and adding distributeBy and localSort to DF API
- SPARK-9858 Adaptive query execution - Intial support for automatically selecting the number of reducers for joins and aggregations.
- SPARK-9241 Improved query planner for queries having distinct aggregations - Query plans of distinct aggregations are more robust when distinct columns have high cardinality.
Spark Streaming
- API Updates
- SPARK-2629 New improved state management - mapWithState - a DStream transformation for stateful stream processing, supercedes updateStateByKey in functionality and performance.
- SPARK-11198 Kinesis record deaggregation - Kinesis streams have been upgraded to use KCL 1.4.0 and supports transparent deaggregation of KPL-aggregated records.
- SPARK-10891 Kinesis message handler function - Allows arbitraray function to be applied to a Kinesis record in the Kinesis receiver before to customize what data is to be stored in memory.
- SPARK-6328 Python Streamng Listener API - Get streaming statistics (scheduling delays, batch processing times, etc.) in streaming.
- UI Improvements
- Made failures visible in the streaming tab, in the timelines, batch list, and batch details page.
- Made output operations visible in the streaming tab as progress bars.
MLlib
New algorithms/models
- SPARK-8518 Survival analysis - Log-linear model for survival analysis
- SPARK-9834 Normal equation for least squares - Normal equation solver, providing R-like model summary statistics
- SPARK-3147 Online hypothesis testing - A/B testing in the Spark Streaming framework
- SPARK-9930 New feature transformers - ChiSqSelector, QuantileDiscretizer, SQL transformer
- SPARK-6517 Bisecting K-Means clustering - Fast top-down clustering variant of K-Means
API improvements
- ML Pipelines
- SPARK-6725 Pipeline persistence - Save/load for ML Pipelines, with partial coverage of spark.mlalgorithms
- SPARK-5565 LDA in ML Pipelines - API for Latent Dirichlet Allocation in ML Pipelines
- R API
- SPARK-9836 R-like statistics for GLMs - (Partial) R-like stats for ordinary least squares via summary(model)
- SPARK-9681 Feature interactions in R formula - Interaction operator ":" in R formula
- Python API - Many improvements to Python API to approach feature parity
Misc improvements
- SPARK-7685 , SPARK-9642 Instance weights for GLMs - Logistic and Linear Regression can take instance weights
- SPARK-10384, SPARK-10385 Univariate and bivariate statistics in DataFrames - Variance, stddev, correlations, etc.
- SPARK-10117 LIBSVM data source - LIBSVM as a SQL data source
Documentation improvements
- SPARK-7751 @since versions - Documentation includes initial version when classes and methods were added
- SPARK-11337 Testable example code - Automated testing for code in user guide examples
Deprecations
- In spark.mllib.clustering.KMeans, the "runs" parameter has been deprecated.
- In spark.ml.classification.LogisticRegressionModel and spark.ml.regression.LinearRegressionModel, the "weights" field has been deprecated, in favor of the new name "coefficients." This helps disambiguate from instance (row) weights given to algorithms.
Changes of behavior
- spark.mllib.tree.GradientBoostedTrees validationTol has changed semantics in 1.6. Previously, it was a threshold for absolute change in error. Now, it resembles the behavior of GradientDescent convergenceTol: For large errors, it uses relative error (relative to the previous error); for small errors (< 0.01), it uses absolute error.
- spark.ml.feature.RegexTokenizer: Previously, it did not convert strings to lowercase before tokenizing. Now, it converts to lowercase by default, with an option not to. This matches the behavior of the simpler Tokenizer transformer.
- Spark SQL's partition discovery has been changed to only discover partition directories that are children of the given path. (i.e. if path="/my/data/x=1" then x=1 will no longer be considered a partition but only children of x=1.) This behavior can be overridden by manually specifying the basePath that partitioning discovery should start with (SPARK-11678).
- When casting a value of an integral type to timestamp (e.g. casting a long value to timestamp), the value is treated as being in seconds instead of milliseconds (SPARK-11724).
- With the improved query planner for queries having distinct aggregations (SPARK-9241), the plan of a query having a single distinct aggregation has been changed to a more robust version. To switch back to the plan generated by Spark 1.5's planner, please set spark.sql.specializeSingleDistinctAggPlanning to true (SPARK-12077).
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Mark Grover <ma...@apache.org>.
Thanks Sean for sending me the logs offline.
Turns out the tests are failing again, for reasons unrelated to Spark. I
have filed https://issues.apache.org/jira/browse/SPARK-12426 for that with
some details. In the meanwhile, I agree with Sean, these tests should be
disabled. And, again, I don't think this failures warrants blocking the
release.
Mark
On Fri, Dec 18, 2015 at 9:32 AM, Sean Owen <so...@cloudera.com> wrote:
> Yes that's what I mean. If they're not quite working, let's disable
> them, but first, we have to rule out that I'm not just missing some
> requirement.
>
> Functionally, it's not worth blocking the release. It seems like bad
> form to release with tests that always fail for a non-trivial number
> of users, but we have to establish that. If it's something with an
> easy fix (or needs disabling) and another RC needs to be baked, might
> be worth including.
>
> Logs coming offline
>
> On Fri, Dec 18, 2015 at 5:30 PM, Mark Grover <ma...@apache.org> wrote:
> > Sean,
> > Are you referring to docker integration tests? If so, they were disabled
> for
> > majority of the release and I recently worked on it (SPARK-11796) and
> once
> > it got committed, the tests were re-enabled in Spark builds. I am not
> sure
> > what OSs the test builds use, but it should be passing there too.
> >
> > During my work, I tested on Ubuntu Precise and they worked. If you could
> > share the logs with me offline, I could take a look. Alternatively, I can
> > try to see if I can get Ubuntu 15 instance. However, given the history of
> > these tests, I personally don't think it makes sense to block the release
> > based on them not running on Ubuntu 15.
> >
> > On Fri, Dec 18, 2015 at 9:22 AM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> For me, mostly the same as before: tests are mostly passing, but I can
> >> never get the docker tests to pass. If anyone knows a special profile
> >> or package that needs to be enabled, I can try that and/or
> >> fix/document it. Just wondering if it's me.
> >>
> >> I'm on Java 7 + Ubuntu 15.10, with -Pyarn -Phive -Phive-thriftserver
> >> -Phadoop-2.6
> >>
> >> On Wed, Dec 16, 2015 at 9:32 PM, Michael Armbrust
> >> <mi...@databricks.com> wrote:
> >> > Please vote on releasing the following candidate as Apache Spark
> version
> >> > 1.6.0!
> >> >
> >> > The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
> >> > passes
> >> > if a majority of at least 3 +1 PMC votes are cast.
> >> >
> >> > [ ] +1 Release this package as Apache Spark 1.6.0
> >> > [ ] -1 Do not release this package because ...
> >> >
> >> > To learn more about Apache Spark, please see http://spark.apache.org/
> >> >
> >> > The tag to be voted on is v1.6.0-rc3
> >> > (168c89e07c51fa24b0bb88582c739cec0acb44d7)
> >> >
> >> > The release files, including signatures, digests, etc. can be found
> at:
> >> >
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
> >> >
> >> > Release artifacts are signed with the following key:
> >> > https://people.apache.org/keys/committer/pwendell.asc
> >> >
> >> > The staging repository for this release can be found at:
> >> >
> https://repository.apache.org/content/repositories/orgapachespark-1174/
> >> >
> >> > The test repository (versioned as v1.6.0-rc3) for this release can be
> >> > found
> >> > at:
> >> >
> https://repository.apache.org/content/repositories/orgapachespark-1173/
> >> >
> >> > The documentation corresponding to this release can be found at:
> >> >
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
> >> >
> >> > =======================================
> >> > == How can I help test this release? ==
> >> > =======================================
> >> > If you are a Spark user, you can help us test this release by taking
> an
> >> > existing Spark workload and running on this release candidate, then
> >> > reporting any regressions.
> >> >
> >> > ================================================
> >> > == What justifies a -1 vote for this release? ==
> >> > ================================================
> >> > This vote is happening towards the end of the 1.6 QA period, so -1
> votes
> >> > should only occur for significant regressions from 1.5. Bugs already
> >> > present
> >> > in 1.5, minor regressions, or bugs related to new features will not
> >> > block
> >> > this release.
> >> >
> >> > ===============================================================
> >> > == What should happen to JIRA tickets still targeting 1.6.0? ==
> >> > ===============================================================
> >> > 1. It is OK for documentation patches to target 1.6.0 and still go
> into
> >> > branch-1.6, since documentations will be published separately from the
> >> > release.
> >> > 2. New features for non-alpha-modules should target 1.7+.
> >> > 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the
> >> > target
> >> > version.
> >> >
> >> >
> >> > ==================================================
> >> > == Major changes to help you focus your testing ==
> >> > ==================================================
> >> >
> >> > Notable changes since 1.6 RC2
> >> >
> >> >
> >> > - SPARK_VERSION has been set correctly
> >> > - SPARK-12199 ML Docs are publishing correctly
> >> > - SPARK-12345 Mesos cluster mode has been fixed
> >> >
> >> > Notable changes since 1.6 RC1
> >> >
> >> > Spark Streaming
> >> >
> >> > SPARK-2629 trackStateByKey has been renamed to mapWithState
> >> >
> >> > Spark SQL
> >> >
> >> > SPARK-12165 SPARK-12189 Fix bugs in eviction of storage memory by
> >> > execution.
> >> > SPARK-12258 correct passing null into ScalaUDF
> >> >
> >> > Notable Features Since 1.5
> >> >
> >> > Spark SQL
> >> >
> >> > SPARK-11787 Parquet Performance - Improve Parquet scan performance
> when
> >> > using flat schemas.
> >> > SPARK-10810 Session Management - Isolated devault database (i.e USE
> >> > mydb)
> >> > even on shared clusters.
> >> > SPARK-9999 Dataset API - A type-safe API (similar to RDDs) that
> >> > performs
> >> > many operations on serialized binary data and code generation (i.e.
> >> > Project
> >> > Tungsten).
> >> > SPARK-10000 Unified Memory Management - Shared memory for execution
> and
> >> > caching instead of exclusive division of the regions.
> >> > SPARK-11197 SQL Queries on Files - Concise syntax for running SQL
> >> > queries
> >> > over files of any supported format without registering a table.
> >> > SPARK-11745 Reading non-standard JSON files - Added options to read
> >> > non-standard JSON files (e.g. single-quotes, unquoted attributes)
> >> > SPARK-10412 Per-operator Metrics for SQL Execution - Display
> statistics
> >> > on a
> >> > peroperator basis for memory usage and spilled data size.
> >> > SPARK-11329 Star (*) expansion for StructTypes - Makes it easier to
> nest
> >> > and
> >> > unest arbitrary numbers of columns
> >> > SPARK-10917, SPARK-11149 In-memory Columnar Cache Performance -
> >> > Significant
> >> > (up to 14x) speed up when caching data that contains complex types in
> >> > DataFrames or SQL.
> >> > SPARK-11111 Fast null-safe joins - Joins using null-safe equality
> (<=>)
> >> > will
> >> > now execute using SortMergeJoin instead of computing a cartisian
> >> > product.
> >> > SPARK-11389 SQL Execution Using Off-Heap Memory - Support for
> >> > configuring
> >> > query execution to occur using off-heap memory to avoid GC overhead
> >> > SPARK-10978 Datasource API Avoid Double Filter - When implemeting a
> >> > datasource with filter pushdown, developers can now tell Spark SQL to
> >> > avoid
> >> > double evaluating a pushed-down filter.
> >> > SPARK-4849 Advanced Layout of Cached Data - storing partitioning and
> >> > ordering schemes in In-memory table scan, and adding distributeBy and
> >> > localSort to DF API
> >> > SPARK-9858 Adaptive query execution - Intial support for
> automatically
> >> > selecting the number of reducers for joins and aggregations.
> >> > SPARK-9241 Improved query planner for queries having distinct
> >> > aggregations
> >> > - Query plans of distinct aggregations are more robust when distinct
> >> > columns
> >> > have high cardinality.
> >> >
> >> > Spark Streaming
> >> >
> >> > API Updates
> >> >
> >> > SPARK-2629 New improved state management - mapWithState - a DStream
> >> > transformation for stateful stream processing, supercedes
> >> > updateStateByKey
> >> > in functionality and performance.
> >> > SPARK-11198 Kinesis record deaggregation - Kinesis streams have been
> >> > upgraded to use KCL 1.4.0 and supports transparent deaggregation of
> >> > KPL-aggregated records.
> >> > SPARK-10891 Kinesis message handler function - Allows arbitraray
> >> > function to
> >> > be applied to a Kinesis record in the Kinesis receiver before to
> >> > customize
> >> > what data is to be stored in memory.
> >> > SPARK-6328 Python Streamng Listener API - Get streaming statistics
> >> > (scheduling delays, batch processing times, etc.) in streaming.
> >> >
> >> > UI Improvements
> >> >
> >> > Made failures visible in the streaming tab, in the timelines, batch
> >> > list,
> >> > and batch details page.
> >> > Made output operations visible in the streaming tab as progress bars.
> >> >
> >> > MLlib
> >> >
> >> > New algorithms/models
> >> >
> >> > SPARK-8518 Survival analysis - Log-linear model for survival analysis
> >> > SPARK-9834 Normal equation for least squares - Normal equation
> solver,
> >> > providing R-like model summary statistics
> >> > SPARK-3147 Online hypothesis testing - A/B testing in the Spark
> >> > Streaming
> >> > framework
> >> > SPARK-9930 New feature transformers - ChiSqSelector,
> >> > QuantileDiscretizer,
> >> > SQL transformer
> >> > SPARK-6517 Bisecting K-Means clustering - Fast top-down clustering
> >> > variant
> >> > of K-Means
> >> >
> >> > API improvements
> >> >
> >> > ML Pipelines
> >> >
> >> > SPARK-6725 Pipeline persistence - Save/load for ML Pipelines, with
> >> > partial
> >> > coverage of spark.mlalgorithms
> >> > SPARK-5565 LDA in ML Pipelines - API for Latent Dirichlet Allocation
> in
> >> > ML
> >> > Pipelines
> >> >
> >> > R API
> >> >
> >> > SPARK-9836 R-like statistics for GLMs - (Partial) R-like stats for
> >> > ordinary
> >> > least squares via summary(model)
> >> > SPARK-9681 Feature interactions in R formula - Interaction operator
> ":"
> >> > in
> >> > R formula
> >> >
> >> > Python API - Many improvements to Python API to approach feature
> parity
> >> >
> >> > Misc improvements
> >> >
> >> > SPARK-7685 , SPARK-9642 Instance weights for GLMs - Logistic and
> Linear
> >> > Regression can take instance weights
> >> > SPARK-10384, SPARK-10385 Univariate and bivariate statistics in
> >> > DataFrames -
> >> > Variance, stddev, correlations, etc.
> >> > SPARK-10117 LIBSVM data source - LIBSVM as a SQL data source
> >> >
> >> > Documentation improvements
> >> >
> >> > SPARK-7751 @since versions - Documentation includes initial version
> >> > when
> >> > classes and methods were added
> >> > SPARK-11337 Testable example code - Automated testing for code in user
> >> > guide
> >> > examples
> >> >
> >> > Deprecations
> >> >
> >> > In spark.mllib.clustering.KMeans, the "runs" parameter has been
> >> > deprecated.
> >> > In spark.ml.classification.LogisticRegressionModel and
> >> > spark.ml.regression.LinearRegressionModel, the "weights" field has
> been
> >> > deprecated, in favor of the new name "coefficients." This helps
> >> > disambiguate
> >> > from instance (row) weights given to algorithms.
> >> >
> >> > Changes of behavior
> >> >
> >> > spark.mllib.tree.GradientBoostedTrees validationTol has changed
> >> > semantics in
> >> > 1.6. Previously, it was a threshold for absolute change in error. Now,
> >> > it
> >> > resembles the behavior of GradientDescent convergenceTol: For large
> >> > errors,
> >> > it uses relative error (relative to the previous error); for small
> >> > errors (<
> >> > 0.01), it uses absolute error.
> >> > spark.ml.feature.RegexTokenizer: Previously, it did not convert
> strings
> >> > to
> >> > lowercase before tokenizing. Now, it converts to lowercase by default,
> >> > with
> >> > an option not to. This matches the behavior of the simpler Tokenizer
> >> > transformer.
> >> > Spark SQL's partition discovery has been changed to only discover
> >> > partition
> >> > directories that are children of the given path. (i.e. if
> >> > path="/my/data/x=1" then x=1 will no longer be considered a partition
> >> > but
> >> > only children of x=1.) This behavior can be overridden by manually
> >> > specifying the basePath that partitioning discovery should start with
> >> > (SPARK-11678).
> >> > When casting a value of an integral type to timestamp (e.g. casting a
> >> > long
> >> > value to timestamp), the value is treated as being in seconds instead
> of
> >> > milliseconds (SPARK-11724).
> >> > With the improved query planner for queries having distinct
> aggregations
> >> > (SPARK-9241), the plan of a query having a single distinct aggregation
> >> > has
> >> > been changed to a more robust version. To switch back to the plan
> >> > generated
> >> > by Spark 1.5's planner, please set
> >> > spark.sql.specializeSingleDistinctAggPlanning to true (SPARK-12077).
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> >> For additional commands, e-mail: dev-help@spark.apache.org
> >>
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
>
>
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Sean Owen <so...@cloudera.com>.
Yes that's what I mean. If they're not quite working, let's disable
them, but first, we have to rule out that I'm not just missing some
requirement.
Functionally, it's not worth blocking the release. It seems like bad
form to release with tests that always fail for a non-trivial number
of users, but we have to establish that. If it's something with an
easy fix (or needs disabling) and another RC needs to be baked, might
be worth including.
Logs coming offline
On Fri, Dec 18, 2015 at 5:30 PM, Mark Grover <ma...@apache.org> wrote:
> Sean,
> Are you referring to docker integration tests? If so, they were disabled for
> majority of the release and I recently worked on it (SPARK-11796) and once
> it got committed, the tests were re-enabled in Spark builds. I am not sure
> what OSs the test builds use, but it should be passing there too.
>
> During my work, I tested on Ubuntu Precise and they worked. If you could
> share the logs with me offline, I could take a look. Alternatively, I can
> try to see if I can get Ubuntu 15 instance. However, given the history of
> these tests, I personally don't think it makes sense to block the release
> based on them not running on Ubuntu 15.
>
> On Fri, Dec 18, 2015 at 9:22 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>> For me, mostly the same as before: tests are mostly passing, but I can
>> never get the docker tests to pass. If anyone knows a special profile
>> or package that needs to be enabled, I can try that and/or
>> fix/document it. Just wondering if it's me.
>>
>> I'm on Java 7 + Ubuntu 15.10, with -Pyarn -Phive -Phive-thriftserver
>> -Phadoop-2.6
>>
>> On Wed, Dec 16, 2015 at 9:32 PM, Michael Armbrust
>> <mi...@databricks.com> wrote:
>> > Please vote on releasing the following candidate as Apache Spark version
>> > 1.6.0!
>> >
>> > The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
>> > passes
>> > if a majority of at least 3 +1 PMC votes are cast.
>> >
>> > [ ] +1 Release this package as Apache Spark 1.6.0
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see http://spark.apache.org/
>> >
>> > The tag to be voted on is v1.6.0-rc3
>> > (168c89e07c51fa24b0bb88582c739cec0acb44d7)
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
>> >
>> > Release artifacts are signed with the following key:
>> > https://people.apache.org/keys/committer/pwendell.asc
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1174/
>> >
>> > The test repository (versioned as v1.6.0-rc3) for this release can be
>> > found
>> > at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1173/
>> >
>> > The documentation corresponding to this release can be found at:
>> > http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
>> >
>> > =======================================
>> > == How can I help test this release? ==
>> > =======================================
>> > If you are a Spark user, you can help us test this release by taking an
>> > existing Spark workload and running on this release candidate, then
>> > reporting any regressions.
>> >
>> > ================================================
>> > == What justifies a -1 vote for this release? ==
>> > ================================================
>> > This vote is happening towards the end of the 1.6 QA period, so -1 votes
>> > should only occur for significant regressions from 1.5. Bugs already
>> > present
>> > in 1.5, minor regressions, or bugs related to new features will not
>> > block
>> > this release.
>> >
>> > ===============================================================
>> > == What should happen to JIRA tickets still targeting 1.6.0? ==
>> > ===============================================================
>> > 1. It is OK for documentation patches to target 1.6.0 and still go into
>> > branch-1.6, since documentations will be published separately from the
>> > release.
>> > 2. New features for non-alpha-modules should target 1.7+.
>> > 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the
>> > target
>> > version.
>> >
>> >
>> > ==================================================
>> > == Major changes to help you focus your testing ==
>> > ==================================================
>> >
>> > Notable changes since 1.6 RC2
>> >
>> >
>> > - SPARK_VERSION has been set correctly
>> > - SPARK-12199 ML Docs are publishing correctly
>> > - SPARK-12345 Mesos cluster mode has been fixed
>> >
>> > Notable changes since 1.6 RC1
>> >
>> > Spark Streaming
>> >
>> > SPARK-2629 trackStateByKey has been renamed to mapWithState
>> >
>> > Spark SQL
>> >
>> > SPARK-12165 SPARK-12189 Fix bugs in eviction of storage memory by
>> > execution.
>> > SPARK-12258 correct passing null into ScalaUDF
>> >
>> > Notable Features Since 1.5
>> >
>> > Spark SQL
>> >
>> > SPARK-11787 Parquet Performance - Improve Parquet scan performance when
>> > using flat schemas.
>> > SPARK-10810 Session Management - Isolated devault database (i.e USE
>> > mydb)
>> > even on shared clusters.
>> > SPARK-9999 Dataset API - A type-safe API (similar to RDDs) that
>> > performs
>> > many operations on serialized binary data and code generation (i.e.
>> > Project
>> > Tungsten).
>> > SPARK-10000 Unified Memory Management - Shared memory for execution and
>> > caching instead of exclusive division of the regions.
>> > SPARK-11197 SQL Queries on Files - Concise syntax for running SQL
>> > queries
>> > over files of any supported format without registering a table.
>> > SPARK-11745 Reading non-standard JSON files - Added options to read
>> > non-standard JSON files (e.g. single-quotes, unquoted attributes)
>> > SPARK-10412 Per-operator Metrics for SQL Execution - Display statistics
>> > on a
>> > peroperator basis for memory usage and spilled data size.
>> > SPARK-11329 Star (*) expansion for StructTypes - Makes it easier to nest
>> > and
>> > unest arbitrary numbers of columns
>> > SPARK-10917, SPARK-11149 In-memory Columnar Cache Performance -
>> > Significant
>> > (up to 14x) speed up when caching data that contains complex types in
>> > DataFrames or SQL.
>> > SPARK-11111 Fast null-safe joins - Joins using null-safe equality (<=>)
>> > will
>> > now execute using SortMergeJoin instead of computing a cartisian
>> > product.
>> > SPARK-11389 SQL Execution Using Off-Heap Memory - Support for
>> > configuring
>> > query execution to occur using off-heap memory to avoid GC overhead
>> > SPARK-10978 Datasource API Avoid Double Filter - When implemeting a
>> > datasource with filter pushdown, developers can now tell Spark SQL to
>> > avoid
>> > double evaluating a pushed-down filter.
>> > SPARK-4849 Advanced Layout of Cached Data - storing partitioning and
>> > ordering schemes in In-memory table scan, and adding distributeBy and
>> > localSort to DF API
>> > SPARK-9858 Adaptive query execution - Intial support for automatically
>> > selecting the number of reducers for joins and aggregations.
>> > SPARK-9241 Improved query planner for queries having distinct
>> > aggregations
>> > - Query plans of distinct aggregations are more robust when distinct
>> > columns
>> > have high cardinality.
>> >
>> > Spark Streaming
>> >
>> > API Updates
>> >
>> > SPARK-2629 New improved state management - mapWithState - a DStream
>> > transformation for stateful stream processing, supercedes
>> > updateStateByKey
>> > in functionality and performance.
>> > SPARK-11198 Kinesis record deaggregation - Kinesis streams have been
>> > upgraded to use KCL 1.4.0 and supports transparent deaggregation of
>> > KPL-aggregated records.
>> > SPARK-10891 Kinesis message handler function - Allows arbitraray
>> > function to
>> > be applied to a Kinesis record in the Kinesis receiver before to
>> > customize
>> > what data is to be stored in memory.
>> > SPARK-6328 Python Streamng Listener API - Get streaming statistics
>> > (scheduling delays, batch processing times, etc.) in streaming.
>> >
>> > UI Improvements
>> >
>> > Made failures visible in the streaming tab, in the timelines, batch
>> > list,
>> > and batch details page.
>> > Made output operations visible in the streaming tab as progress bars.
>> >
>> > MLlib
>> >
>> > New algorithms/models
>> >
>> > SPARK-8518 Survival analysis - Log-linear model for survival analysis
>> > SPARK-9834 Normal equation for least squares - Normal equation solver,
>> > providing R-like model summary statistics
>> > SPARK-3147 Online hypothesis testing - A/B testing in the Spark
>> > Streaming
>> > framework
>> > SPARK-9930 New feature transformers - ChiSqSelector,
>> > QuantileDiscretizer,
>> > SQL transformer
>> > SPARK-6517 Bisecting K-Means clustering - Fast top-down clustering
>> > variant
>> > of K-Means
>> >
>> > API improvements
>> >
>> > ML Pipelines
>> >
>> > SPARK-6725 Pipeline persistence - Save/load for ML Pipelines, with
>> > partial
>> > coverage of spark.mlalgorithms
>> > SPARK-5565 LDA in ML Pipelines - API for Latent Dirichlet Allocation in
>> > ML
>> > Pipelines
>> >
>> > R API
>> >
>> > SPARK-9836 R-like statistics for GLMs - (Partial) R-like stats for
>> > ordinary
>> > least squares via summary(model)
>> > SPARK-9681 Feature interactions in R formula - Interaction operator ":"
>> > in
>> > R formula
>> >
>> > Python API - Many improvements to Python API to approach feature parity
>> >
>> > Misc improvements
>> >
>> > SPARK-7685 , SPARK-9642 Instance weights for GLMs - Logistic and Linear
>> > Regression can take instance weights
>> > SPARK-10384, SPARK-10385 Univariate and bivariate statistics in
>> > DataFrames -
>> > Variance, stddev, correlations, etc.
>> > SPARK-10117 LIBSVM data source - LIBSVM as a SQL data source
>> >
>> > Documentation improvements
>> >
>> > SPARK-7751 @since versions - Documentation includes initial version
>> > when
>> > classes and methods were added
>> > SPARK-11337 Testable example code - Automated testing for code in user
>> > guide
>> > examples
>> >
>> > Deprecations
>> >
>> > In spark.mllib.clustering.KMeans, the "runs" parameter has been
>> > deprecated.
>> > In spark.ml.classification.LogisticRegressionModel and
>> > spark.ml.regression.LinearRegressionModel, the "weights" field has been
>> > deprecated, in favor of the new name "coefficients." This helps
>> > disambiguate
>> > from instance (row) weights given to algorithms.
>> >
>> > Changes of behavior
>> >
>> > spark.mllib.tree.GradientBoostedTrees validationTol has changed
>> > semantics in
>> > 1.6. Previously, it was a threshold for absolute change in error. Now,
>> > it
>> > resembles the behavior of GradientDescent convergenceTol: For large
>> > errors,
>> > it uses relative error (relative to the previous error); for small
>> > errors (<
>> > 0.01), it uses absolute error.
>> > spark.ml.feature.RegexTokenizer: Previously, it did not convert strings
>> > to
>> > lowercase before tokenizing. Now, it converts to lowercase by default,
>> > with
>> > an option not to. This matches the behavior of the simpler Tokenizer
>> > transformer.
>> > Spark SQL's partition discovery has been changed to only discover
>> > partition
>> > directories that are children of the given path. (i.e. if
>> > path="/my/data/x=1" then x=1 will no longer be considered a partition
>> > but
>> > only children of x=1.) This behavior can be overridden by manually
>> > specifying the basePath that partitioning discovery should start with
>> > (SPARK-11678).
>> > When casting a value of an integral type to timestamp (e.g. casting a
>> > long
>> > value to timestamp), the value is treated as being in seconds instead of
>> > milliseconds (SPARK-11724).
>> > With the improved query planner for queries having distinct aggregations
>> > (SPARK-9241), the plan of a query having a single distinct aggregation
>> > has
>> > been changed to a more robust version. To switch back to the plan
>> > generated
>> > by Spark 1.5's planner, please set
>> > spark.sql.specializeSingleDistinctAggPlanning to true (SPARK-12077).
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> For additional commands, e-mail: dev-help@spark.apache.org
>>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Mark Grover <ma...@apache.org>.
Sean,
Are you referring to docker integration tests? If so, they were disabled
for majority of the release and I recently worked on it (SPARK-11796) and
once it got committed, the tests were re-enabled in Spark builds. I am not
sure what OSs the test builds use, but it should be passing there too.
During my work, I tested on Ubuntu Precise and they worked. If you could
share the logs with me offline, I could take a look. Alternatively, I can
try to see if I can get Ubuntu 15 instance. However, given the history of
these tests, I personally don't think it makes sense to block the release
based on them not running on Ubuntu 15.
On Fri, Dec 18, 2015 at 9:22 AM, Sean Owen <so...@cloudera.com> wrote:
> For me, mostly the same as before: tests are mostly passing, but I can
> never get the docker tests to pass. If anyone knows a special profile
> or package that needs to be enabled, I can try that and/or
> fix/document it. Just wondering if it's me.
>
> I'm on Java 7 + Ubuntu 15.10, with -Pyarn -Phive -Phive-thriftserver
> -Phadoop-2.6
>
> On Wed, Dec 16, 2015 at 9:32 PM, Michael Armbrust
> <mi...@databricks.com> wrote:
> > Please vote on releasing the following candidate as Apache Spark version
> > 1.6.0!
> >
> > The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
> passes
> > if a majority of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 1.6.0
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > The tag to be voted on is v1.6.0-rc3
> > (168c89e07c51fa24b0bb88582c739cec0acb44d7)
> >
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1174/
> >
> > The test repository (versioned as v1.6.0-rc3) for this release can be
> found
> > at:
> > https://repository.apache.org/content/repositories/orgapachespark-1173/
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
> >
> > =======================================
> > == How can I help test this release? ==
> > =======================================
> > If you are a Spark user, you can help us test this release by taking an
> > existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > ================================================
> > == What justifies a -1 vote for this release? ==
> > ================================================
> > This vote is happening towards the end of the 1.6 QA period, so -1 votes
> > should only occur for significant regressions from 1.5. Bugs already
> present
> > in 1.5, minor regressions, or bugs related to new features will not block
> > this release.
> >
> > ===============================================================
> > == What should happen to JIRA tickets still targeting 1.6.0? ==
> > ===============================================================
> > 1. It is OK for documentation patches to target 1.6.0 and still go into
> > branch-1.6, since documentations will be published separately from the
> > release.
> > 2. New features for non-alpha-modules should target 1.7+.
> > 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
> > version.
> >
> >
> > ==================================================
> > == Major changes to help you focus your testing ==
> > ==================================================
> >
> > Notable changes since 1.6 RC2
> >
> >
> > - SPARK_VERSION has been set correctly
> > - SPARK-12199 ML Docs are publishing correctly
> > - SPARK-12345 Mesos cluster mode has been fixed
> >
> > Notable changes since 1.6 RC1
> >
> > Spark Streaming
> >
> > SPARK-2629 trackStateByKey has been renamed to mapWithState
> >
> > Spark SQL
> >
> > SPARK-12165 SPARK-12189 Fix bugs in eviction of storage memory by
> execution.
> > SPARK-12258 correct passing null into ScalaUDF
> >
> > Notable Features Since 1.5
> >
> > Spark SQL
> >
> > SPARK-11787 Parquet Performance - Improve Parquet scan performance when
> > using flat schemas.
> > SPARK-10810 Session Management - Isolated devault database (i.e USE mydb)
> > even on shared clusters.
> > SPARK-9999 Dataset API - A type-safe API (similar to RDDs) that performs
> > many operations on serialized binary data and code generation (i.e.
> Project
> > Tungsten).
> > SPARK-10000 Unified Memory Management - Shared memory for execution and
> > caching instead of exclusive division of the regions.
> > SPARK-11197 SQL Queries on Files - Concise syntax for running SQL queries
> > over files of any supported format without registering a table.
> > SPARK-11745 Reading non-standard JSON files - Added options to read
> > non-standard JSON files (e.g. single-quotes, unquoted attributes)
> > SPARK-10412 Per-operator Metrics for SQL Execution - Display statistics
> on a
> > peroperator basis for memory usage and spilled data size.
> > SPARK-11329 Star (*) expansion for StructTypes - Makes it easier to nest
> and
> > unest arbitrary numbers of columns
> > SPARK-10917, SPARK-11149 In-memory Columnar Cache Performance -
> Significant
> > (up to 14x) speed up when caching data that contains complex types in
> > DataFrames or SQL.
> > SPARK-11111 Fast null-safe joins - Joins using null-safe equality (<=>)
> will
> > now execute using SortMergeJoin instead of computing a cartisian product.
> > SPARK-11389 SQL Execution Using Off-Heap Memory - Support for configuring
> > query execution to occur using off-heap memory to avoid GC overhead
> > SPARK-10978 Datasource API Avoid Double Filter - When implemeting a
> > datasource with filter pushdown, developers can now tell Spark SQL to
> avoid
> > double evaluating a pushed-down filter.
> > SPARK-4849 Advanced Layout of Cached Data - storing partitioning and
> > ordering schemes in In-memory table scan, and adding distributeBy and
> > localSort to DF API
> > SPARK-9858 Adaptive query execution - Intial support for automatically
> > selecting the number of reducers for joins and aggregations.
> > SPARK-9241 Improved query planner for queries having distinct
> aggregations
> > - Query plans of distinct aggregations are more robust when distinct
> columns
> > have high cardinality.
> >
> > Spark Streaming
> >
> > API Updates
> >
> > SPARK-2629 New improved state management - mapWithState - a DStream
> > transformation for stateful stream processing, supercedes
> updateStateByKey
> > in functionality and performance.
> > SPARK-11198 Kinesis record deaggregation - Kinesis streams have been
> > upgraded to use KCL 1.4.0 and supports transparent deaggregation of
> > KPL-aggregated records.
> > SPARK-10891 Kinesis message handler function - Allows arbitraray
> function to
> > be applied to a Kinesis record in the Kinesis receiver before to
> customize
> > what data is to be stored in memory.
> > SPARK-6328 Python Streamng Listener API - Get streaming statistics
> > (scheduling delays, batch processing times, etc.) in streaming.
> >
> > UI Improvements
> >
> > Made failures visible in the streaming tab, in the timelines, batch list,
> > and batch details page.
> > Made output operations visible in the streaming tab as progress bars.
> >
> > MLlib
> >
> > New algorithms/models
> >
> > SPARK-8518 Survival analysis - Log-linear model for survival analysis
> > SPARK-9834 Normal equation for least squares - Normal equation solver,
> > providing R-like model summary statistics
> > SPARK-3147 Online hypothesis testing - A/B testing in the Spark
> Streaming
> > framework
> > SPARK-9930 New feature transformers - ChiSqSelector,
> QuantileDiscretizer,
> > SQL transformer
> > SPARK-6517 Bisecting K-Means clustering - Fast top-down clustering
> variant
> > of K-Means
> >
> > API improvements
> >
> > ML Pipelines
> >
> > SPARK-6725 Pipeline persistence - Save/load for ML Pipelines, with
> partial
> > coverage of spark.mlalgorithms
> > SPARK-5565 LDA in ML Pipelines - API for Latent Dirichlet Allocation in
> ML
> > Pipelines
> >
> > R API
> >
> > SPARK-9836 R-like statistics for GLMs - (Partial) R-like stats for
> ordinary
> > least squares via summary(model)
> > SPARK-9681 Feature interactions in R formula - Interaction operator ":"
> in
> > R formula
> >
> > Python API - Many improvements to Python API to approach feature parity
> >
> > Misc improvements
> >
> > SPARK-7685 , SPARK-9642 Instance weights for GLMs - Logistic and Linear
> > Regression can take instance weights
> > SPARK-10384, SPARK-10385 Univariate and bivariate statistics in
> DataFrames -
> > Variance, stddev, correlations, etc.
> > SPARK-10117 LIBSVM data source - LIBSVM as a SQL data source
> >
> > Documentation improvements
> >
> > SPARK-7751 @since versions - Documentation includes initial version when
> > classes and methods were added
> > SPARK-11337 Testable example code - Automated testing for code in user
> guide
> > examples
> >
> > Deprecations
> >
> > In spark.mllib.clustering.KMeans, the "runs" parameter has been
> deprecated.
> > In spark.ml.classification.LogisticRegressionModel and
> > spark.ml.regression.LinearRegressionModel, the "weights" field has been
> > deprecated, in favor of the new name "coefficients." This helps
> disambiguate
> > from instance (row) weights given to algorithms.
> >
> > Changes of behavior
> >
> > spark.mllib.tree.GradientBoostedTrees validationTol has changed
> semantics in
> > 1.6. Previously, it was a threshold for absolute change in error. Now, it
> > resembles the behavior of GradientDescent convergenceTol: For large
> errors,
> > it uses relative error (relative to the previous error); for small
> errors (<
> > 0.01), it uses absolute error.
> > spark.ml.feature.RegexTokenizer: Previously, it did not convert strings
> to
> > lowercase before tokenizing. Now, it converts to lowercase by default,
> with
> > an option not to. This matches the behavior of the simpler Tokenizer
> > transformer.
> > Spark SQL's partition discovery has been changed to only discover
> partition
> > directories that are children of the given path. (i.e. if
> > path="/my/data/x=1" then x=1 will no longer be considered a partition but
> > only children of x=1.) This behavior can be overridden by manually
> > specifying the basePath that partitioning discovery should start with
> > (SPARK-11678).
> > When casting a value of an integral type to timestamp (e.g. casting a
> long
> > value to timestamp), the value is treated as being in seconds instead of
> > milliseconds (SPARK-11724).
> > With the improved query planner for queries having distinct aggregations
> > (SPARK-9241), the plan of a query having a single distinct aggregation
> has
> > been changed to a more robust version. To switch back to the plan
> generated
> > by Spark 1.5's planner, please set
> > spark.sql.specializeSingleDistinctAggPlanning to true (SPARK-12077).
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
>
>
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Sean Owen <so...@cloudera.com>.
For me, mostly the same as before: tests are mostly passing, but I can
never get the docker tests to pass. If anyone knows a special profile
or package that needs to be enabled, I can try that and/or
fix/document it. Just wondering if it's me.
I'm on Java 7 + Ubuntu 15.10, with -Pyarn -Phive -Phive-thriftserver
-Phadoop-2.6
On Wed, Dec 16, 2015 at 9:32 PM, Michael Armbrust
<mi...@databricks.com> wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 1.6.0!
>
> The vote is open until Saturday, December 19, 2015 at 18:00 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v1.6.0-rc3
> (168c89e07c51fa24b0bb88582c739cec0acb44d7)
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1174/
>
> The test repository (versioned as v1.6.0-rc3) for this release can be found
> at:
> https://repository.apache.org/content/repositories/orgapachespark-1173/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
>
> =======================================
> == How can I help test this release? ==
> =======================================
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> ================================================
> == What justifies a -1 vote for this release? ==
> ================================================
> This vote is happening towards the end of the 1.6 QA period, so -1 votes
> should only occur for significant regressions from 1.5. Bugs already present
> in 1.5, minor regressions, or bugs related to new features will not block
> this release.
>
> ===============================================================
> == What should happen to JIRA tickets still targeting 1.6.0? ==
> ===============================================================
> 1. It is OK for documentation patches to target 1.6.0 and still go into
> branch-1.6, since documentations will be published separately from the
> release.
> 2. New features for non-alpha-modules should target 1.7+.
> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
> version.
>
>
> ==================================================
> == Major changes to help you focus your testing ==
> ==================================================
>
> Notable changes since 1.6 RC2
>
>
> - SPARK_VERSION has been set correctly
> - SPARK-12199 ML Docs are publishing correctly
> - SPARK-12345 Mesos cluster mode has been fixed
>
> Notable changes since 1.6 RC1
>
> Spark Streaming
>
> SPARK-2629 trackStateByKey has been renamed to mapWithState
>
> Spark SQL
>
> SPARK-12165 SPARK-12189 Fix bugs in eviction of storage memory by execution.
> SPARK-12258 correct passing null into ScalaUDF
>
> Notable Features Since 1.5
>
> Spark SQL
>
> SPARK-11787 Parquet Performance - Improve Parquet scan performance when
> using flat schemas.
> SPARK-10810 Session Management - Isolated devault database (i.e USE mydb)
> even on shared clusters.
> SPARK-9999 Dataset API - A type-safe API (similar to RDDs) that performs
> many operations on serialized binary data and code generation (i.e. Project
> Tungsten).
> SPARK-10000 Unified Memory Management - Shared memory for execution and
> caching instead of exclusive division of the regions.
> SPARK-11197 SQL Queries on Files - Concise syntax for running SQL queries
> over files of any supported format without registering a table.
> SPARK-11745 Reading non-standard JSON files - Added options to read
> non-standard JSON files (e.g. single-quotes, unquoted attributes)
> SPARK-10412 Per-operator Metrics for SQL Execution - Display statistics on a
> peroperator basis for memory usage and spilled data size.
> SPARK-11329 Star (*) expansion for StructTypes - Makes it easier to nest and
> unest arbitrary numbers of columns
> SPARK-10917, SPARK-11149 In-memory Columnar Cache Performance - Significant
> (up to 14x) speed up when caching data that contains complex types in
> DataFrames or SQL.
> SPARK-11111 Fast null-safe joins - Joins using null-safe equality (<=>) will
> now execute using SortMergeJoin instead of computing a cartisian product.
> SPARK-11389 SQL Execution Using Off-Heap Memory - Support for configuring
> query execution to occur using off-heap memory to avoid GC overhead
> SPARK-10978 Datasource API Avoid Double Filter - When implemeting a
> datasource with filter pushdown, developers can now tell Spark SQL to avoid
> double evaluating a pushed-down filter.
> SPARK-4849 Advanced Layout of Cached Data - storing partitioning and
> ordering schemes in In-memory table scan, and adding distributeBy and
> localSort to DF API
> SPARK-9858 Adaptive query execution - Intial support for automatically
> selecting the number of reducers for joins and aggregations.
> SPARK-9241 Improved query planner for queries having distinct aggregations
> - Query plans of distinct aggregations are more robust when distinct columns
> have high cardinality.
>
> Spark Streaming
>
> API Updates
>
> SPARK-2629 New improved state management - mapWithState - a DStream
> transformation for stateful stream processing, supercedes updateStateByKey
> in functionality and performance.
> SPARK-11198 Kinesis record deaggregation - Kinesis streams have been
> upgraded to use KCL 1.4.0 and supports transparent deaggregation of
> KPL-aggregated records.
> SPARK-10891 Kinesis message handler function - Allows arbitraray function to
> be applied to a Kinesis record in the Kinesis receiver before to customize
> what data is to be stored in memory.
> SPARK-6328 Python Streamng Listener API - Get streaming statistics
> (scheduling delays, batch processing times, etc.) in streaming.
>
> UI Improvements
>
> Made failures visible in the streaming tab, in the timelines, batch list,
> and batch details page.
> Made output operations visible in the streaming tab as progress bars.
>
> MLlib
>
> New algorithms/models
>
> SPARK-8518 Survival analysis - Log-linear model for survival analysis
> SPARK-9834 Normal equation for least squares - Normal equation solver,
> providing R-like model summary statistics
> SPARK-3147 Online hypothesis testing - A/B testing in the Spark Streaming
> framework
> SPARK-9930 New feature transformers - ChiSqSelector, QuantileDiscretizer,
> SQL transformer
> SPARK-6517 Bisecting K-Means clustering - Fast top-down clustering variant
> of K-Means
>
> API improvements
>
> ML Pipelines
>
> SPARK-6725 Pipeline persistence - Save/load for ML Pipelines, with partial
> coverage of spark.mlalgorithms
> SPARK-5565 LDA in ML Pipelines - API for Latent Dirichlet Allocation in ML
> Pipelines
>
> R API
>
> SPARK-9836 R-like statistics for GLMs - (Partial) R-like stats for ordinary
> least squares via summary(model)
> SPARK-9681 Feature interactions in R formula - Interaction operator ":" in
> R formula
>
> Python API - Many improvements to Python API to approach feature parity
>
> Misc improvements
>
> SPARK-7685 , SPARK-9642 Instance weights for GLMs - Logistic and Linear
> Regression can take instance weights
> SPARK-10384, SPARK-10385 Univariate and bivariate statistics in DataFrames -
> Variance, stddev, correlations, etc.
> SPARK-10117 LIBSVM data source - LIBSVM as a SQL data source
>
> Documentation improvements
>
> SPARK-7751 @since versions - Documentation includes initial version when
> classes and methods were added
> SPARK-11337 Testable example code - Automated testing for code in user guide
> examples
>
> Deprecations
>
> In spark.mllib.clustering.KMeans, the "runs" parameter has been deprecated.
> In spark.ml.classification.LogisticRegressionModel and
> spark.ml.regression.LinearRegressionModel, the "weights" field has been
> deprecated, in favor of the new name "coefficients." This helps disambiguate
> from instance (row) weights given to algorithms.
>
> Changes of behavior
>
> spark.mllib.tree.GradientBoostedTrees validationTol has changed semantics in
> 1.6. Previously, it was a threshold for absolute change in error. Now, it
> resembles the behavior of GradientDescent convergenceTol: For large errors,
> it uses relative error (relative to the previous error); for small errors (<
> 0.01), it uses absolute error.
> spark.ml.feature.RegexTokenizer: Previously, it did not convert strings to
> lowercase before tokenizing. Now, it converts to lowercase by default, with
> an option not to. This matches the behavior of the simpler Tokenizer
> transformer.
> Spark SQL's partition discovery has been changed to only discover partition
> directories that are children of the given path. (i.e. if
> path="/my/data/x=1" then x=1 will no longer be considered a partition but
> only children of x=1.) This behavior can be overridden by manually
> specifying the basePath that partitioning discovery should start with
> (SPARK-11678).
> When casting a value of an integral type to timestamp (e.g. casting a long
> value to timestamp), the value is treated as being in seconds instead of
> milliseconds (SPARK-11724).
> With the improved query planner for queries having distinct aggregations
> (SPARK-9241), the plan of a query having a single distinct aggregation has
> been changed to a more robust version. To switch back to the plan generated
> by Spark 1.5's planner, please set
> spark.sql.specializeSingleDistinctAggPlanning to true (SPARK-12077).
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Zsolt Tóth <to...@gmail.com>.
+1 (non-binding)
Testing environment:
-CDH5.5 single node docker
-Prebuilt spark-1.6.0-hadoop2.6.tgz
-Yarn-cluster mode
Comparing outputs of Spark 1.5.x and 1.6.0-RC3:
Pyspark
OK?: K-Means (ml) - Note: our tests show a numerical diff here compared to
the 1.5.2 output. Since K-Means has a random factor, this can be expected
behaviour - is it because of SPARK-10779? If so, I think it should be
listed in the Mllib/ml docs.
OK: Logistic Regression (ml), Linear Regression (mllib)
OK: Nested Spark SQL query
SparkR
OK: Logistic Regression
OK: Nested Spark SQL query
Machine learning - Java:
OK: Decision Tree (mllib and ml): Gini, Entropy
OK. Random Forest (ml): Gini, Entropy
OK: Linear, Lasso, Ridge Regression (mllib)
OK: Logistc Regression (mllib): SGD, L-BFGS
OK: SVM (mllib)
I/O:
OK: Reading/Writing Parquet to/from DataFrame
OK: Reading/Writing Textfile to/from RDD
2015-12-19 2:09 GMT+01:00 Marcelo Vanzin <va...@cloudera.com>:
> +1 (non-binding)
>
> Tests the without-hadoop binaries (so didn't run Hive-related tests)
> with a test batch including standalone / client, yarn / client and
> cluster, including core, mllib and streaming (flume and kafka).
>
> On Wed, Dec 16, 2015 at 1:32 PM, Michael Armbrust
> <mi...@databricks.com> wrote:
> > Please vote on releasing the following candidate as Apache Spark version
> > 1.6.0!
> >
> > The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
> passes
> > if a majority of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 1.6.0
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > The tag to be voted on is v1.6.0-rc3
> > (168c89e07c51fa24b0bb88582c739cec0acb44d7)
> >
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1174/
> >
> > The test repository (versioned as v1.6.0-rc3) for this release can be
> found
> > at:
> > https://repository.apache.org/content/repositories/orgapachespark-1173/
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
> >
> > =======================================
> > == How can I help test this release? ==
> > =======================================
> > If you are a Spark user, you can help us test this release by taking an
> > existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > ================================================
> > == What justifies a -1 vote for this release? ==
> > ================================================
> > This vote is happening towards the end of the 1.6 QA period, so -1 votes
> > should only occur for significant regressions from 1.5. Bugs already
> present
> > in 1.5, minor regressions, or bugs related to new features will not block
> > this release.
> >
> > ===============================================================
> > == What should happen to JIRA tickets still targeting 1.6.0? ==
> > ===============================================================
> > 1. It is OK for documentation patches to target 1.6.0 and still go into
> > branch-1.6, since documentations will be published separately from the
> > release.
> > 2. New features for non-alpha-modules should target 1.7+.
> > 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
> > version.
> >
> >
> > ==================================================
> > == Major changes to help you focus your testing ==
> > ==================================================
> >
> > Notable changes since 1.6 RC2
> >
> >
> > - SPARK_VERSION has been set correctly
> > - SPARK-12199 ML Docs are publishing correctly
> > - SPARK-12345 Mesos cluster mode has been fixed
> >
> > Notable changes since 1.6 RC1
> >
> > Spark Streaming
> >
> > SPARK-2629 trackStateByKey has been renamed to mapWithState
> >
> > Spark SQL
> >
> > SPARK-12165 SPARK-12189 Fix bugs in eviction of storage memory by
> execution.
> > SPARK-12258 correct passing null into ScalaUDF
> >
> > Notable Features Since 1.5
> >
> > Spark SQL
> >
> > SPARK-11787 Parquet Performance - Improve Parquet scan performance when
> > using flat schemas.
> > SPARK-10810 Session Management - Isolated devault database (i.e USE mydb)
> > even on shared clusters.
> > SPARK-9999 Dataset API - A type-safe API (similar to RDDs) that performs
> > many operations on serialized binary data and code generation (i.e.
> Project
> > Tungsten).
> > SPARK-10000 Unified Memory Management - Shared memory for execution and
> > caching instead of exclusive division of the regions.
> > SPARK-11197 SQL Queries on Files - Concise syntax for running SQL queries
> > over files of any supported format without registering a table.
> > SPARK-11745 Reading non-standard JSON files - Added options to read
> > non-standard JSON files (e.g. single-quotes, unquoted attributes)
> > SPARK-10412 Per-operator Metrics for SQL Execution - Display statistics
> on a
> > peroperator basis for memory usage and spilled data size.
> > SPARK-11329 Star (*) expansion for StructTypes - Makes it easier to nest
> and
> > unest arbitrary numbers of columns
> > SPARK-10917, SPARK-11149 In-memory Columnar Cache Performance -
> Significant
> > (up to 14x) speed up when caching data that contains complex types in
> > DataFrames or SQL.
> > SPARK-11111 Fast null-safe joins - Joins using null-safe equality (<=>)
> will
> > now execute using SortMergeJoin instead of computing a cartisian product.
> > SPARK-11389 SQL Execution Using Off-Heap Memory - Support for configuring
> > query execution to occur using off-heap memory to avoid GC overhead
> > SPARK-10978 Datasource API Avoid Double Filter - When implemeting a
> > datasource with filter pushdown, developers can now tell Spark SQL to
> avoid
> > double evaluating a pushed-down filter.
> > SPARK-4849 Advanced Layout of Cached Data - storing partitioning and
> > ordering schemes in In-memory table scan, and adding distributeBy and
> > localSort to DF API
> > SPARK-9858 Adaptive query execution - Intial support for automatically
> > selecting the number of reducers for joins and aggregations.
> > SPARK-9241 Improved query planner for queries having distinct
> aggregations
> > - Query plans of distinct aggregations are more robust when distinct
> columns
> > have high cardinality.
> >
> > Spark Streaming
> >
> > API Updates
> >
> > SPARK-2629 New improved state management - mapWithState - a DStream
> > transformation for stateful stream processing, supercedes
> updateStateByKey
> > in functionality and performance.
> > SPARK-11198 Kinesis record deaggregation - Kinesis streams have been
> > upgraded to use KCL 1.4.0 and supports transparent deaggregation of
> > KPL-aggregated records.
> > SPARK-10891 Kinesis message handler function - Allows arbitraray
> function to
> > be applied to a Kinesis record in the Kinesis receiver before to
> customize
> > what data is to be stored in memory.
> > SPARK-6328 Python Streamng Listener API - Get streaming statistics
> > (scheduling delays, batch processing times, etc.) in streaming.
> >
> > UI Improvements
> >
> > Made failures visible in the streaming tab, in the timelines, batch list,
> > and batch details page.
> > Made output operations visible in the streaming tab as progress bars.
> >
> > MLlib
> >
> > New algorithms/models
> >
> > SPARK-8518 Survival analysis - Log-linear model for survival analysis
> > SPARK-9834 Normal equation for least squares - Normal equation solver,
> > providing R-like model summary statistics
> > SPARK-3147 Online hypothesis testing - A/B testing in the Spark
> Streaming
> > framework
> > SPARK-9930 New feature transformers - ChiSqSelector,
> QuantileDiscretizer,
> > SQL transformer
> > SPARK-6517 Bisecting K-Means clustering - Fast top-down clustering
> variant
> > of K-Means
> >
> > API improvements
> >
> > ML Pipelines
> >
> > SPARK-6725 Pipeline persistence - Save/load for ML Pipelines, with
> partial
> > coverage of spark.mlalgorithms
> > SPARK-5565 LDA in ML Pipelines - API for Latent Dirichlet Allocation in
> ML
> > Pipelines
> >
> > R API
> >
> > SPARK-9836 R-like statistics for GLMs - (Partial) R-like stats for
> ordinary
> > least squares via summary(model)
> > SPARK-9681 Feature interactions in R formula - Interaction operator ":"
> in
> > R formula
> >
> > Python API - Many improvements to Python API to approach feature parity
> >
> > Misc improvements
> >
> > SPARK-7685 , SPARK-9642 Instance weights for GLMs - Logistic and Linear
> > Regression can take instance weights
> > SPARK-10384, SPARK-10385 Univariate and bivariate statistics in
> DataFrames -
> > Variance, stddev, correlations, etc.
> > SPARK-10117 LIBSVM data source - LIBSVM as a SQL data source
> >
> > Documentation improvements
> >
> > SPARK-7751 @since versions - Documentation includes initial version when
> > classes and methods were added
> > SPARK-11337 Testable example code - Automated testing for code in user
> guide
> > examples
> >
> > Deprecations
> >
> > In spark.mllib.clustering.KMeans, the "runs" parameter has been
> deprecated.
> > In spark.ml.classification.LogisticRegressionModel and
> > spark.ml.regression.LinearRegressionModel, the "weights" field has been
> > deprecated, in favor of the new name "coefficients." This helps
> disambiguate
> > from instance (row) weights given to algorithms.
> >
> > Changes of behavior
> >
> > spark.mllib.tree.GradientBoostedTrees validationTol has changed
> semantics in
> > 1.6. Previously, it was a threshold for absolute change in error. Now, it
> > resembles the behavior of GradientDescent convergenceTol: For large
> errors,
> > it uses relative error (relative to the previous error); for small
> errors (<
> > 0.01), it uses absolute error.
> > spark.ml.feature.RegexTokenizer: Previously, it did not convert strings
> to
> > lowercase before tokenizing. Now, it converts to lowercase by default,
> with
> > an option not to. This matches the behavior of the simpler Tokenizer
> > transformer.
> > Spark SQL's partition discovery has been changed to only discover
> partition
> > directories that are children of the given path. (i.e. if
> > path="/my/data/x=1" then x=1 will no longer be considered a partition but
> > only children of x=1.) This behavior can be overridden by manually
> > specifying the basePath that partitioning discovery should start with
> > (SPARK-11678).
> > When casting a value of an integral type to timestamp (e.g. casting a
> long
> > value to timestamp), the value is treated as being in seconds instead of
> > milliseconds (SPARK-11724).
> > With the improved query planner for queries having distinct aggregations
> > (SPARK-9241), the plan of a query having a single distinct aggregation
> has
> > been changed to a more robust version. To switch back to the plan
> generated
> > by Spark 1.5's planner, please set
> > spark.sql.specializeSingleDistinctAggPlanning to true (SPARK-12077).
>
>
>
> --
> Marcelo
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
>
>
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Marcelo Vanzin <va...@cloudera.com>.
+1 (non-binding)
Tests the without-hadoop binaries (so didn't run Hive-related tests)
with a test batch including standalone / client, yarn / client and
cluster, including core, mllib and streaming (flume and kafka).
On Wed, Dec 16, 2015 at 1:32 PM, Michael Armbrust
<mi...@databricks.com> wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 1.6.0!
>
> The vote is open until Saturday, December 19, 2015 at 18:00 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v1.6.0-rc3
> (168c89e07c51fa24b0bb88582c739cec0acb44d7)
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1174/
>
> The test repository (versioned as v1.6.0-rc3) for this release can be found
> at:
> https://repository.apache.org/content/repositories/orgapachespark-1173/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
>
> =======================================
> == How can I help test this release? ==
> =======================================
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> ================================================
> == What justifies a -1 vote for this release? ==
> ================================================
> This vote is happening towards the end of the 1.6 QA period, so -1 votes
> should only occur for significant regressions from 1.5. Bugs already present
> in 1.5, minor regressions, or bugs related to new features will not block
> this release.
>
> ===============================================================
> == What should happen to JIRA tickets still targeting 1.6.0? ==
> ===============================================================
> 1. It is OK for documentation patches to target 1.6.0 and still go into
> branch-1.6, since documentations will be published separately from the
> release.
> 2. New features for non-alpha-modules should target 1.7+.
> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
> version.
>
>
> ==================================================
> == Major changes to help you focus your testing ==
> ==================================================
>
> Notable changes since 1.6 RC2
>
>
> - SPARK_VERSION has been set correctly
> - SPARK-12199 ML Docs are publishing correctly
> - SPARK-12345 Mesos cluster mode has been fixed
>
> Notable changes since 1.6 RC1
>
> Spark Streaming
>
> SPARK-2629 trackStateByKey has been renamed to mapWithState
>
> Spark SQL
>
> SPARK-12165 SPARK-12189 Fix bugs in eviction of storage memory by execution.
> SPARK-12258 correct passing null into ScalaUDF
>
> Notable Features Since 1.5
>
> Spark SQL
>
> SPARK-11787 Parquet Performance - Improve Parquet scan performance when
> using flat schemas.
> SPARK-10810 Session Management - Isolated devault database (i.e USE mydb)
> even on shared clusters.
> SPARK-9999 Dataset API - A type-safe API (similar to RDDs) that performs
> many operations on serialized binary data and code generation (i.e. Project
> Tungsten).
> SPARK-10000 Unified Memory Management - Shared memory for execution and
> caching instead of exclusive division of the regions.
> SPARK-11197 SQL Queries on Files - Concise syntax for running SQL queries
> over files of any supported format without registering a table.
> SPARK-11745 Reading non-standard JSON files - Added options to read
> non-standard JSON files (e.g. single-quotes, unquoted attributes)
> SPARK-10412 Per-operator Metrics for SQL Execution - Display statistics on a
> peroperator basis for memory usage and spilled data size.
> SPARK-11329 Star (*) expansion for StructTypes - Makes it easier to nest and
> unest arbitrary numbers of columns
> SPARK-10917, SPARK-11149 In-memory Columnar Cache Performance - Significant
> (up to 14x) speed up when caching data that contains complex types in
> DataFrames or SQL.
> SPARK-11111 Fast null-safe joins - Joins using null-safe equality (<=>) will
> now execute using SortMergeJoin instead of computing a cartisian product.
> SPARK-11389 SQL Execution Using Off-Heap Memory - Support for configuring
> query execution to occur using off-heap memory to avoid GC overhead
> SPARK-10978 Datasource API Avoid Double Filter - When implemeting a
> datasource with filter pushdown, developers can now tell Spark SQL to avoid
> double evaluating a pushed-down filter.
> SPARK-4849 Advanced Layout of Cached Data - storing partitioning and
> ordering schemes in In-memory table scan, and adding distributeBy and
> localSort to DF API
> SPARK-9858 Adaptive query execution - Intial support for automatically
> selecting the number of reducers for joins and aggregations.
> SPARK-9241 Improved query planner for queries having distinct aggregations
> - Query plans of distinct aggregations are more robust when distinct columns
> have high cardinality.
>
> Spark Streaming
>
> API Updates
>
> SPARK-2629 New improved state management - mapWithState - a DStream
> transformation for stateful stream processing, supercedes updateStateByKey
> in functionality and performance.
> SPARK-11198 Kinesis record deaggregation - Kinesis streams have been
> upgraded to use KCL 1.4.0 and supports transparent deaggregation of
> KPL-aggregated records.
> SPARK-10891 Kinesis message handler function - Allows arbitraray function to
> be applied to a Kinesis record in the Kinesis receiver before to customize
> what data is to be stored in memory.
> SPARK-6328 Python Streamng Listener API - Get streaming statistics
> (scheduling delays, batch processing times, etc.) in streaming.
>
> UI Improvements
>
> Made failures visible in the streaming tab, in the timelines, batch list,
> and batch details page.
> Made output operations visible in the streaming tab as progress bars.
>
> MLlib
>
> New algorithms/models
>
> SPARK-8518 Survival analysis - Log-linear model for survival analysis
> SPARK-9834 Normal equation for least squares - Normal equation solver,
> providing R-like model summary statistics
> SPARK-3147 Online hypothesis testing - A/B testing in the Spark Streaming
> framework
> SPARK-9930 New feature transformers - ChiSqSelector, QuantileDiscretizer,
> SQL transformer
> SPARK-6517 Bisecting K-Means clustering - Fast top-down clustering variant
> of K-Means
>
> API improvements
>
> ML Pipelines
>
> SPARK-6725 Pipeline persistence - Save/load for ML Pipelines, with partial
> coverage of spark.mlalgorithms
> SPARK-5565 LDA in ML Pipelines - API for Latent Dirichlet Allocation in ML
> Pipelines
>
> R API
>
> SPARK-9836 R-like statistics for GLMs - (Partial) R-like stats for ordinary
> least squares via summary(model)
> SPARK-9681 Feature interactions in R formula - Interaction operator ":" in
> R formula
>
> Python API - Many improvements to Python API to approach feature parity
>
> Misc improvements
>
> SPARK-7685 , SPARK-9642 Instance weights for GLMs - Logistic and Linear
> Regression can take instance weights
> SPARK-10384, SPARK-10385 Univariate and bivariate statistics in DataFrames -
> Variance, stddev, correlations, etc.
> SPARK-10117 LIBSVM data source - LIBSVM as a SQL data source
>
> Documentation improvements
>
> SPARK-7751 @since versions - Documentation includes initial version when
> classes and methods were added
> SPARK-11337 Testable example code - Automated testing for code in user guide
> examples
>
> Deprecations
>
> In spark.mllib.clustering.KMeans, the "runs" parameter has been deprecated.
> In spark.ml.classification.LogisticRegressionModel and
> spark.ml.regression.LinearRegressionModel, the "weights" field has been
> deprecated, in favor of the new name "coefficients." This helps disambiguate
> from instance (row) weights given to algorithms.
>
> Changes of behavior
>
> spark.mllib.tree.GradientBoostedTrees validationTol has changed semantics in
> 1.6. Previously, it was a threshold for absolute change in error. Now, it
> resembles the behavior of GradientDescent convergenceTol: For large errors,
> it uses relative error (relative to the previous error); for small errors (<
> 0.01), it uses absolute error.
> spark.ml.feature.RegexTokenizer: Previously, it did not convert strings to
> lowercase before tokenizing. Now, it converts to lowercase by default, with
> an option not to. This matches the behavior of the simpler Tokenizer
> transformer.
> Spark SQL's partition discovery has been changed to only discover partition
> directories that are children of the given path. (i.e. if
> path="/my/data/x=1" then x=1 will no longer be considered a partition but
> only children of x=1.) This behavior can be overridden by manually
> specifying the basePath that partitioning discovery should start with
> (SPARK-11678).
> When casting a value of an integral type to timestamp (e.g. casting a long
> value to timestamp), the value is treated as being in seconds instead of
> milliseconds (SPARK-11724).
> With the improved query planner for queries having distinct aggregations
> (SPARK-9241), the plan of a query having a single distinct aggregation has
> been changed to a more robust version. To switch back to the plan generated
> by Spark 1.5's planner, please set
> spark.sql.specializeSingleDistinctAggPlanning to true (SPARK-12077).
--
Marcelo
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org
Re: Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Saisai Shao <sa...@gmail.com>.
+1 (non-binding) after SPARK-12345 is merged.
On Thu, Dec 17, 2015 at 9:55 AM, Allen Zhang <al...@126.com> wrote:
> plus 1
>
>
>
>
>
>
> 在 2015-12-17 09:39:39,"Joseph Bradley" <jo...@databricks.com> 写道:
>
> +1
>
> On Wed, Dec 16, 2015 at 5:26 PM, Reynold Xin <rx...@databricks.com> wrote:
>
>> +1
>>
>>
>> On Wed, Dec 16, 2015 at 5:24 PM, Mark Hamstra <ma...@clearstorydata.com>
>> wrote:
>>
>>> +1
>>>
>>> On Wed, Dec 16, 2015 at 1:32 PM, Michael Armbrust <
>>> michael@databricks.com> wrote:
>>>
>>>> Please vote on releasing the following candidate as Apache Spark
>>>> version 1.6.0!
>>>>
>>>> The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
>>>> passes if a majority of at least 3 +1 PMC votes are cast.
>>>>
>>>> [ ] +1 Release this package as Apache Spark 1.6.0
>>>> [ ] -1 Do not release this package because ...
>>>>
>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>
>>>> The tag to be voted on is *v1.6.0-rc3
>>>> (168c89e07c51fa24b0bb88582c739cec0acb44d7)
>>>> <https://github.com/apache/spark/tree/v1.6.0-rc3>*
>>>>
>>>> The release files, including signatures, digests, etc. can be found at:
>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
>>>>
>>>> Release artifacts are signed with the following key:
>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>
>>>> The staging repository for this release can be found at:
>>>> https://repository.apache.org/content/repositories/orgapachespark-1174/
>>>>
>>>> The test repository (versioned as v1.6.0-rc3) for this release can be
>>>> found at:
>>>> https://repository.apache.org/content/repositories/orgapachespark-1173/
>>>>
>>>> The documentation corresponding to this release can be found at:
>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
>>>>
>>>> =======================================
>>>> == How can I help test this release? ==
>>>> =======================================
>>>> If you are a Spark user, you can help us test this release by taking an
>>>> existing Spark workload and running on this release candidate, then
>>>> reporting any regressions.
>>>>
>>>> ================================================
>>>> == What justifies a -1 vote for this release? ==
>>>> ================================================
>>>> This vote is happening towards the end of the 1.6 QA period, so -1
>>>> votes should only occur for significant regressions from 1.5. Bugs already
>>>> present in 1.5, minor regressions, or bugs related to new features will not
>>>> block this release.
>>>>
>>>> ===============================================================
>>>> == What should happen to JIRA tickets still targeting 1.6.0? ==
>>>> ===============================================================
>>>> 1. It is OK for documentation patches to target 1.6.0 and still go into
>>>> branch-1.6, since documentations will be published separately from the
>>>> release.
>>>> 2. New features for non-alpha-modules should target 1.7+.
>>>> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the
>>>> target version.
>>>>
>>>>
>>>> ==================================================
>>>> == Major changes to help you focus your testing ==
>>>> ==================================================
>>>>
>>>> Notable changes since 1.6 RC2
>>>> - SPARK_VERSION has been set correctly
>>>> - SPARK-12199 ML Docs are publishing correctly
>>>> - SPARK-12345 Mesos cluster mode has been fixed
>>>>
>>>> Notable changes since 1.6 RC1
>>>> Spark Streaming
>>>>
>>>> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
>>>> trackStateByKey has been renamed to mapWithState
>>>>
>>>> Spark SQL
>>>>
>>>> - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
>>>> SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix
>>>> bugs in eviction of storage memory by execution.
>>>> - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258> correct
>>>> passing null into ScalaUDF
>>>>
>>>> Notable Features Since 1.5Spark SQL
>>>>
>>>> - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> Parquet
>>>> Performance - Improve Parquet scan performance when using flat
>>>> schemas.
>>>> - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
>>>> Session Management - Isolated devault database (i.e USE mydb) even
>>>> on shared clusters.
>>>> - SPARK-9999 <https://issues.apache.org/jira/browse/SPARK-9999> Dataset
>>>> API - A type-safe API (similar to RDDs) that performs many
>>>> operations on serialized binary data and code generation (i.e. Project
>>>> Tungsten).
>>>> - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000> Unified
>>>> Memory Management - Shared memory for execution and caching instead
>>>> of exclusive division of the regions.
>>>> - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
>>>> Queries on Files - Concise syntax for running SQL queries over
>>>> files of any supported format without registering a table.
>>>> - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745> Reading
>>>> non-standard JSON files - Added options to read non-standard JSON
>>>> files (e.g. single-quotes, unquoted attributes)
>>>> - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412> Per-operator
>>>> Metrics for SQL Execution - Display statistics on a peroperator
>>>> basis for memory usage and spilled data size.
>>>> - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
>>>> (*) expansion for StructTypes - Makes it easier to nest and unest
>>>> arbitrary numbers of columns
>>>> - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
>>>> SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149> In-memory
>>>> Columnar Cache Performance - Significant (up to 14x) speed up when
>>>> caching data that contains complex types in DataFrames or SQL.
>>>> - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
>>>> null-safe joins - Joins using null-safe equality (<=>) will now
>>>> execute using SortMergeJoin instead of computing a cartisian product.
>>>> - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
>>>> Execution Using Off-Heap Memory - Support for configuring query
>>>> execution to occur using off-heap memory to avoid GC overhead
>>>> - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978> Datasource
>>>> API Avoid Double Filter - When implemeting a datasource with filter
>>>> pushdown, developers can now tell Spark SQL to avoid double evaluating a
>>>> pushed-down filter.
>>>> - SPARK-4849 <https://issues.apache.org/jira/browse/SPARK-4849> Advanced
>>>> Layout of Cached Data - storing partitioning and ordering schemes
>>>> in In-memory table scan, and adding distributeBy and localSort to DF API
>>>> - SPARK-9858 <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive
>>>> query execution - Intial support for automatically selecting the
>>>> number of reducers for joins and aggregations.
>>>> - SPARK-9241 <https://issues.apache.org/jira/browse/SPARK-9241> Improved
>>>> query planner for queries having distinct aggregations - Query
>>>> plans of distinct aggregations are more robust when distinct columns have
>>>> high cardinality.
>>>>
>>>> Spark Streaming
>>>>
>>>> - API Updates
>>>> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629> New
>>>> improved state management - mapWithState - a DStream
>>>> transformation for stateful stream processing, supercedes
>>>> updateStateByKey in functionality and performance.
>>>> - SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198>
>>>> Kinesis record deaggregation - Kinesis streams have been
>>>> upgraded to use KCL 1.4.0 and supports transparent deaggregation of
>>>> KPL-aggregated records.
>>>> - SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891>
>>>> Kinesis message handler function - Allows arbitraray function
>>>> to be applied to a Kinesis record in the Kinesis receiver before to
>>>> customize what data is to be stored in memory.
>>>> - SPARK-6328 <https://issues.apache.org/jira/browse/SPARK-6328> Python
>>>> Streamng Listener API - Get streaming statistics (scheduling
>>>> delays, batch processing times, etc.) in streaming.
>>>>
>>>>
>>>> - UI Improvements
>>>> - Made failures visible in the streaming tab, in the timelines,
>>>> batch list, and batch details page.
>>>> - Made output operations visible in the streaming tab as
>>>> progress bars.
>>>>
>>>> MLlibNew algorithms/models
>>>>
>>>> - SPARK-8518 <https://issues.apache.org/jira/browse/SPARK-8518> Survival
>>>> analysis - Log-linear model for survival analysis
>>>> - SPARK-9834 <https://issues.apache.org/jira/browse/SPARK-9834> Normal
>>>> equation for least squares - Normal equation solver, providing
>>>> R-like model summary statistics
>>>> - SPARK-3147 <https://issues.apache.org/jira/browse/SPARK-3147> Online
>>>> hypothesis testing - A/B testing in the Spark Streaming framework
>>>> - SPARK-9930 <https://issues.apache.org/jira/browse/SPARK-9930> New
>>>> feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
>>>> transformer
>>>> - SPARK-6517 <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting
>>>> K-Means clustering - Fast top-down clustering variant of K-Means
>>>>
>>>> API improvements
>>>>
>>>> - ML Pipelines
>>>> - SPARK-6725 <https://issues.apache.org/jira/browse/SPARK-6725> Pipeline
>>>> persistence - Save/load for ML Pipelines, with partial coverage
>>>> of spark.mlalgorithms
>>>> - SPARK-5565 <https://issues.apache.org/jira/browse/SPARK-5565> LDA
>>>> in ML Pipelines - API for Latent Dirichlet Allocation in ML
>>>> Pipelines
>>>> - R API
>>>> - SPARK-9836 <https://issues.apache.org/jira/browse/SPARK-9836> R-like
>>>> statistics for GLMs - (Partial) R-like stats for ordinary least
>>>> squares via summary(model)
>>>> - SPARK-9681 <https://issues.apache.org/jira/browse/SPARK-9681> Feature
>>>> interactions in R formula - Interaction operator ":" in R formula
>>>> - Python API - Many improvements to Python API to approach feature
>>>> parity
>>>>
>>>> Misc improvements
>>>>
>>>> - SPARK-7685 <https://issues.apache.org/jira/browse/SPARK-7685>,
>>>> SPARK-9642 <https://issues.apache.org/jira/browse/SPARK-9642> Instance
>>>> weights for GLMs - Logistic and Linear Regression can take instance
>>>> weights
>>>> - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
>>>> SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate
>>>> and bivariate statistics in DataFrames - Variance, stddev,
>>>> correlations, etc.
>>>> - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM
>>>> data source - LIBSVM as a SQL data sourceDocumentation improvements
>>>> - SPARK-7751 <https://issues.apache.org/jira/browse/SPARK-7751> @since
>>>> versions - Documentation includes initial version when classes and
>>>> methods were added
>>>> - SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable
>>>> example code - Automated testing for code in user guide examples
>>>>
>>>> Deprecations
>>>>
>>>> - In spark.mllib.clustering.KMeans, the "runs" parameter has been
>>>> deprecated.
>>>> - In spark.ml.classification.LogisticRegressionModel and
>>>> spark.ml.regression.LinearRegressionModel, the "weights" field has been
>>>> deprecated, in favor of the new name "coefficients." This helps
>>>> disambiguate from instance (row) weights given to algorithms.
>>>>
>>>> Changes of behavior
>>>>
>>>> - spark.mllib.tree.GradientBoostedTrees validationTol has changed
>>>> semantics in 1.6. Previously, it was a threshold for absolute change in
>>>> error. Now, it resembles the behavior of GradientDescent convergenceTol:
>>>> For large errors, it uses relative error (relative to the previous error);
>>>> for small errors (< 0.01), it uses absolute error.
>>>> - spark.ml.feature.RegexTokenizer: Previously, it did not convert
>>>> strings to lowercase before tokenizing. Now, it converts to lowercase by
>>>> default, with an option not to. This matches the behavior of the simpler
>>>> Tokenizer transformer.
>>>> - Spark SQL's partition discovery has been changed to only discover
>>>> partition directories that are children of the given path. (i.e. if
>>>> path="/my/data/x=1" then x=1 will no longer be considered a
>>>> partition but only children of x=1.) This behavior can be
>>>> overridden by manually specifying the basePath that partitioning
>>>> discovery should start with (SPARK-11678
>>>> <https://issues.apache.org/jira/browse/SPARK-11678>).
>>>> - When casting a value of an integral type to timestamp (e.g.
>>>> casting a long value to timestamp), the value is treated as being in
>>>> seconds instead of milliseconds (SPARK-11724
>>>> <https://issues.apache.org/jira/browse/SPARK-11724>).
>>>> - With the improved query planner for queries having distinct
>>>> aggregations (SPARK-9241
>>>> <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
>>>> query having a single distinct aggregation has been changed to a more
>>>> robust version. To switch back to the plan generated by Spark 1.5's
>>>> planner, please set spark.sql.specializeSingleDistinctAggPlanning
>>>> to true (SPARK-12077
>>>> <https://issues.apache.org/jira/browse/SPARK-12077>).
>>>>
>>>>
>>>
>>
>
Re:Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Allen Zhang <al...@126.com>.
plus 1
在 2015-12-17 09:39:39,"Joseph Bradley" <jo...@databricks.com> 写道:
+1
On Wed, Dec 16, 2015 at 5:26 PM, Reynold Xin <rx...@databricks.com> wrote:
+1
On Wed, Dec 16, 2015 at 5:24 PM, Mark Hamstra <ma...@clearstorydata.com> wrote:
+1
On Wed, Dec 16, 2015 at 1:32 PM, Michael Armbrust <mi...@databricks.com> wrote:
Please vote on releasing the following candidate as Apache Spark version 1.6.0!
The vote is open until Saturday, December 19, 2015 at 18:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast.
[ ] +1 Release this package as Apache Spark 1.6.0
[ ] -1 Do not release this package because ...
To learn more about Apache Spark, please see http://spark.apache.org/
The tag to be voted on is v1.6.0-rc3 (168c89e07c51fa24b0bb88582c739cec0acb44d7)
The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc
The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1174/
The test repository (versioned as v1.6.0-rc3) for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1173/
The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
=======================================
== How can I help test this release? ==
=======================================
If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions.
================================================
== What justifies a -1 vote for this release? ==
================================================
This vote is happening towards the end of the 1.6 QA period, so -1 votes should only occur for significant regressions from 1.5. Bugs already present in 1.5, minor regressions, or bugs related to new features will not block this release.
===============================================================
== What should happen to JIRA tickets still targeting 1.6.0? ==
===============================================================
1. It is OK for documentation patches to target 1.6.0 and still go into branch-1.6, since documentations will be published separately from the release.
2. New features for non-alpha-modules should target 1.7+.
3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target version.
==================================================
== Major changes to help you focus your testing ==
==================================================
Notable changes since 1.6 RC2
- SPARK_VERSION has been set correctly
- SPARK-12199 ML Docs are publishing correctly
- SPARK-12345 Mesos cluster mode has been fixed
Notable changes since 1.6 RC1
Spark Streaming
SPARK-2629 trackStateByKey has been renamed to mapWithState
Spark SQL
SPARK-12165 SPARK-12189 Fix bugs in eviction of storage memory by execution.
SPARK-12258 correct passing null into ScalaUDF
Notable Features Since 1.5
Spark SQL
SPARK-11787 Parquet Performance - Improve Parquet scan performance when using flat schemas.
SPARK-10810 Session Management - Isolated devault database (i.e USE mydb) even on shared clusters.
SPARK-9999 Dataset API - A type-safe API (similar to RDDs) that performs many operations on serialized binary data and code generation (i.e. Project Tungsten).
SPARK-10000 Unified Memory Management - Shared memory for execution and caching instead of exclusive division of the regions.
SPARK-11197 SQL Queries on Files - Concise syntax for running SQL queries over files of any supported format without registering a table.
SPARK-11745 Reading non-standard JSON files - Added options to read non-standard JSON files (e.g. single-quotes, unquoted attributes)
SPARK-10412 Per-operator Metrics for SQL Execution - Display statistics on a peroperator basis for memory usage and spilled data size.
SPARK-11329 Star (*) expansion for StructTypes - Makes it easier to nest and unest arbitrary numbers of columns
SPARK-10917, SPARK-11149 In-memory Columnar Cache Performance - Significant (up to 14x) speed up when caching data that contains complex types in DataFrames or SQL.
SPARK-11111 Fast null-safe joins - Joins using null-safe equality (<=>) will now execute using SortMergeJoin instead of computing a cartisian product.
SPARK-11389 SQL Execution Using Off-Heap Memory - Support for configuring query execution to occur using off-heap memory to avoid GC overhead
SPARK-10978 Datasource API Avoid Double Filter - When implemeting a datasource with filter pushdown, developers can now tell Spark SQL to avoid double evaluating a pushed-down filter.
SPARK-4849 Advanced Layout of Cached Data - storing partitioning and ordering schemes in In-memory table scan, and adding distributeBy and localSort to DF API
SPARK-9858 Adaptive query execution - Intial support for automatically selecting the number of reducers for joins and aggregations.
SPARK-9241 Improved query planner for queries having distinct aggregations - Query plans of distinct aggregations are more robust when distinct columns have high cardinality.
Spark Streaming
API Updates
SPARK-2629 New improved state management - mapWithState - a DStream transformation for stateful stream processing, supercedes updateStateByKey in functionality and performance.
SPARK-11198 Kinesis record deaggregation - Kinesis streams have been upgraded to use KCL 1.4.0 and supports transparent deaggregation of KPL-aggregated records.
SPARK-10891 Kinesis message handler function - Allows arbitraray function to be applied to a Kinesis record in the Kinesis receiver before to customize what data is to be stored in memory.
SPARK-6328 Python Streamng Listener API - Get streaming statistics (scheduling delays, batch processing times, etc.) in streaming.
UI Improvements
Made failures visible in the streaming tab, in the timelines, batch list, and batch details page.
Made output operations visible in the streaming tab as progress bars.
MLlib
New algorithms/models
SPARK-8518 Survival analysis - Log-linear model for survival analysis
SPARK-9834 Normal equation for least squares - Normal equation solver, providing R-like model summary statistics
SPARK-3147 Online hypothesis testing - A/B testing in the Spark Streaming framework
SPARK-9930 New feature transformers - ChiSqSelector, QuantileDiscretizer, SQL transformer
SPARK-6517 Bisecting K-Means clustering - Fast top-down clustering variant of K-Means
API improvements
ML Pipelines
SPARK-6725 Pipeline persistence - Save/load for ML Pipelines, with partial coverage of spark.mlalgorithms
SPARK-5565 LDA in ML Pipelines - API for Latent Dirichlet Allocation in ML Pipelines
R API
SPARK-9836 R-like statistics for GLMs - (Partial) R-like stats for ordinary least squares via summary(model)
SPARK-9681 Feature interactions in R formula - Interaction operator ":" in R formula
Python API - Many improvements to Python API to approach feature parity
Misc improvements
SPARK-7685 , SPARK-9642 Instance weights for GLMs - Logistic and Linear Regression can take instance weights
SPARK-10384, SPARK-10385 Univariate and bivariate statistics in DataFrames - Variance, stddev, correlations, etc.
SPARK-10117 LIBSVM data source - LIBSVM as a SQL data source
Documentation improvements
SPARK-7751 @since versions - Documentation includes initial version when classes and methods were added
SPARK-11337 Testable example code - Automated testing for code in user guide examples
Deprecations
In spark.mllib.clustering.KMeans, the "runs" parameter has been deprecated.
In spark.ml.classification.LogisticRegressionModel and spark.ml.regression.LinearRegressionModel, the "weights" field has been deprecated, in favor of the new name "coefficients." This helps disambiguate from instance (row) weights given to algorithms.
Changes of behavior
spark.mllib.tree.GradientBoostedTrees validationTol has changed semantics in 1.6. Previously, it was a threshold for absolute change in error. Now, it resembles the behavior of GradientDescent convergenceTol: For large errors, it uses relative error (relative to the previous error); for small errors (< 0.01), it uses absolute error.
spark.ml.feature.RegexTokenizer: Previously, it did not convert strings to lowercase before tokenizing. Now, it converts to lowercase by default, with an option not to. This matches the behavior of the simpler Tokenizer transformer.
Spark SQL's partition discovery has been changed to only discover partition directories that are children of the given path. (i.e. if path="/my/data/x=1" then x=1 will no longer be considered a partition but only children of x=1.) This behavior can be overridden by manually specifying the basePath that partitioning discovery should start with (SPARK-11678).
When casting a value of an integral type to timestamp (e.g. casting a long value to timestamp), the value is treated as being in seconds instead of milliseconds (SPARK-11724).
With the improved query planner for queries having distinct aggregations (SPARK-9241), the plan of a query having a single distinct aggregation has been changed to a more robust version. To switch back to the plan generated by Spark 1.5's planner, please set spark.sql.specializeSingleDistinctAggPlanning to true (SPARK-12077).
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Joseph Bradley <jo...@databricks.com>.
+1
On Wed, Dec 16, 2015 at 5:26 PM, Reynold Xin <rx...@databricks.com> wrote:
> +1
>
>
> On Wed, Dec 16, 2015 at 5:24 PM, Mark Hamstra <ma...@clearstorydata.com>
> wrote:
>
>> +1
>>
>> On Wed, Dec 16, 2015 at 1:32 PM, Michael Armbrust <michael@databricks.com
>> > wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 1.6.0!
>>>
>>> The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
>>> passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 1.6.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is *v1.6.0-rc3
>>> (168c89e07c51fa24b0bb88582c739cec0acb44d7)
>>> <https://github.com/apache/spark/tree/v1.6.0-rc3>*
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1174/
>>>
>>> The test repository (versioned as v1.6.0-rc3) for this release can be
>>> found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1173/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
>>>
>>> =======================================
>>> == How can I help test this release? ==
>>> =======================================
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> ================================================
>>> == What justifies a -1 vote for this release? ==
>>> ================================================
>>> This vote is happening towards the end of the 1.6 QA period, so -1 votes
>>> should only occur for significant regressions from 1.5. Bugs already
>>> present in 1.5, minor regressions, or bugs related to new features will not
>>> block this release.
>>>
>>> ===============================================================
>>> == What should happen to JIRA tickets still targeting 1.6.0? ==
>>> ===============================================================
>>> 1. It is OK for documentation patches to target 1.6.0 and still go into
>>> branch-1.6, since documentations will be published separately from the
>>> release.
>>> 2. New features for non-alpha-modules should target 1.7+.
>>> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the
>>> target version.
>>>
>>>
>>> ==================================================
>>> == Major changes to help you focus your testing ==
>>> ==================================================
>>>
>>> Notable changes since 1.6 RC2
>>> - SPARK_VERSION has been set correctly
>>> - SPARK-12199 ML Docs are publishing correctly
>>> - SPARK-12345 Mesos cluster mode has been fixed
>>>
>>> Notable changes since 1.6 RC1
>>> Spark Streaming
>>>
>>> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
>>> trackStateByKey has been renamed to mapWithState
>>>
>>> Spark SQL
>>>
>>> - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
>>> SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix
>>> bugs in eviction of storage memory by execution.
>>> - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258> correct
>>> passing null into ScalaUDF
>>>
>>> Notable Features Since 1.5Spark SQL
>>>
>>> - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> Parquet
>>> Performance - Improve Parquet scan performance when using flat
>>> schemas.
>>> - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
>>> Session Management - Isolated devault database (i.e USE mydb) even
>>> on shared clusters.
>>> - SPARK-9999 <https://issues.apache.org/jira/browse/SPARK-9999> Dataset
>>> API - A type-safe API (similar to RDDs) that performs many
>>> operations on serialized binary data and code generation (i.e. Project
>>> Tungsten).
>>> - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000> Unified
>>> Memory Management - Shared memory for execution and caching instead
>>> of exclusive division of the regions.
>>> - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
>>> Queries on Files - Concise syntax for running SQL queries over files
>>> of any supported format without registering a table.
>>> - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745> Reading
>>> non-standard JSON files - Added options to read non-standard JSON
>>> files (e.g. single-quotes, unquoted attributes)
>>> - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412> Per-operator
>>> Metrics for SQL Execution - Display statistics on a peroperator
>>> basis for memory usage and spilled data size.
>>> - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
>>> (*) expansion for StructTypes - Makes it easier to nest and unest
>>> arbitrary numbers of columns
>>> - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
>>> SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149> In-memory
>>> Columnar Cache Performance - Significant (up to 14x) speed up when
>>> caching data that contains complex types in DataFrames or SQL.
>>> - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
>>> null-safe joins - Joins using null-safe equality (<=>) will now
>>> execute using SortMergeJoin instead of computing a cartisian product.
>>> - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
>>> Execution Using Off-Heap Memory - Support for configuring query
>>> execution to occur using off-heap memory to avoid GC overhead
>>> - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978> Datasource
>>> API Avoid Double Filter - When implemeting a datasource with filter
>>> pushdown, developers can now tell Spark SQL to avoid double evaluating a
>>> pushed-down filter.
>>> - SPARK-4849 <https://issues.apache.org/jira/browse/SPARK-4849> Advanced
>>> Layout of Cached Data - storing partitioning and ordering schemes in
>>> In-memory table scan, and adding distributeBy and localSort to DF API
>>> - SPARK-9858 <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive
>>> query execution - Intial support for automatically selecting the
>>> number of reducers for joins and aggregations.
>>> - SPARK-9241 <https://issues.apache.org/jira/browse/SPARK-9241> Improved
>>> query planner for queries having distinct aggregations - Query plans
>>> of distinct aggregations are more robust when distinct columns have high
>>> cardinality.
>>>
>>> Spark Streaming
>>>
>>> - API Updates
>>> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629> New
>>> improved state management - mapWithState - a DStream
>>> transformation for stateful stream processing, supercedes
>>> updateStateByKey in functionality and performance.
>>> - SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198> Kinesis
>>> record deaggregation - Kinesis streams have been upgraded to use
>>> KCL 1.4.0 and supports transparent deaggregation of KPL-aggregated records.
>>> - SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891> Kinesis
>>> message handler function - Allows arbitraray function to be
>>> applied to a Kinesis record in the Kinesis receiver before to customize
>>> what data is to be stored in memory.
>>> - SPARK-6328 <https://issues.apache.org/jira/browse/SPARK-6328> Python
>>> Streamng Listener API - Get streaming statistics (scheduling
>>> delays, batch processing times, etc.) in streaming.
>>>
>>>
>>> - UI Improvements
>>> - Made failures visible in the streaming tab, in the timelines,
>>> batch list, and batch details page.
>>> - Made output operations visible in the streaming tab as progress
>>> bars.
>>>
>>> MLlibNew algorithms/models
>>>
>>> - SPARK-8518 <https://issues.apache.org/jira/browse/SPARK-8518> Survival
>>> analysis - Log-linear model for survival analysis
>>> - SPARK-9834 <https://issues.apache.org/jira/browse/SPARK-9834> Normal
>>> equation for least squares - Normal equation solver, providing
>>> R-like model summary statistics
>>> - SPARK-3147 <https://issues.apache.org/jira/browse/SPARK-3147> Online
>>> hypothesis testing - A/B testing in the Spark Streaming framework
>>> - SPARK-9930 <https://issues.apache.org/jira/browse/SPARK-9930> New
>>> feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
>>> transformer
>>> - SPARK-6517 <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting
>>> K-Means clustering - Fast top-down clustering variant of K-Means
>>>
>>> API improvements
>>>
>>> - ML Pipelines
>>> - SPARK-6725 <https://issues.apache.org/jira/browse/SPARK-6725> Pipeline
>>> persistence - Save/load for ML Pipelines, with partial coverage
>>> of spark.mlalgorithms
>>> - SPARK-5565 <https://issues.apache.org/jira/browse/SPARK-5565> LDA
>>> in ML Pipelines - API for Latent Dirichlet Allocation in ML
>>> Pipelines
>>> - R API
>>> - SPARK-9836 <https://issues.apache.org/jira/browse/SPARK-9836> R-like
>>> statistics for GLMs - (Partial) R-like stats for ordinary least
>>> squares via summary(model)
>>> - SPARK-9681 <https://issues.apache.org/jira/browse/SPARK-9681> Feature
>>> interactions in R formula - Interaction operator ":" in R formula
>>> - Python API - Many improvements to Python API to approach feature
>>> parity
>>>
>>> Misc improvements
>>>
>>> - SPARK-7685 <https://issues.apache.org/jira/browse/SPARK-7685>,
>>> SPARK-9642 <https://issues.apache.org/jira/browse/SPARK-9642> Instance
>>> weights for GLMs - Logistic and Linear Regression can take instance
>>> weights
>>> - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
>>> SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate
>>> and bivariate statistics in DataFrames - Variance, stddev,
>>> correlations, etc.
>>> - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM
>>> data source - LIBSVM as a SQL data sourceDocumentation improvements
>>> - SPARK-7751 <https://issues.apache.org/jira/browse/SPARK-7751> @since
>>> versions - Documentation includes initial version when classes and
>>> methods were added
>>> - SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable
>>> example code - Automated testing for code in user guide examples
>>>
>>> Deprecations
>>>
>>> - In spark.mllib.clustering.KMeans, the "runs" parameter has been
>>> deprecated.
>>> - In spark.ml.classification.LogisticRegressionModel and
>>> spark.ml.regression.LinearRegressionModel, the "weights" field has been
>>> deprecated, in favor of the new name "coefficients." This helps
>>> disambiguate from instance (row) weights given to algorithms.
>>>
>>> Changes of behavior
>>>
>>> - spark.mllib.tree.GradientBoostedTrees validationTol has changed
>>> semantics in 1.6. Previously, it was a threshold for absolute change in
>>> error. Now, it resembles the behavior of GradientDescent convergenceTol:
>>> For large errors, it uses relative error (relative to the previous error);
>>> for small errors (< 0.01), it uses absolute error.
>>> - spark.ml.feature.RegexTokenizer: Previously, it did not convert
>>> strings to lowercase before tokenizing. Now, it converts to lowercase by
>>> default, with an option not to. This matches the behavior of the simpler
>>> Tokenizer transformer.
>>> - Spark SQL's partition discovery has been changed to only discover
>>> partition directories that are children of the given path. (i.e. if
>>> path="/my/data/x=1" then x=1 will no longer be considered a
>>> partition but only children of x=1.) This behavior can be overridden
>>> by manually specifying the basePath that partitioning discovery
>>> should start with (SPARK-11678
>>> <https://issues.apache.org/jira/browse/SPARK-11678>).
>>> - When casting a value of an integral type to timestamp (e.g.
>>> casting a long value to timestamp), the value is treated as being in
>>> seconds instead of milliseconds (SPARK-11724
>>> <https://issues.apache.org/jira/browse/SPARK-11724>).
>>> - With the improved query planner for queries having distinct
>>> aggregations (SPARK-9241
>>> <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
>>> query having a single distinct aggregation has been changed to a more
>>> robust version. To switch back to the plan generated by Spark 1.5's
>>> planner, please set spark.sql.specializeSingleDistinctAggPlanning to
>>> true (SPARK-12077 <https://issues.apache.org/jira/browse/SPARK-12077>
>>> ).
>>>
>>>
>>
>
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Reynold Xin <rx...@databricks.com>.
+1
On Wed, Dec 16, 2015 at 5:24 PM, Mark Hamstra <ma...@clearstorydata.com>
wrote:
> +1
>
> On Wed, Dec 16, 2015 at 1:32 PM, Michael Armbrust <mi...@databricks.com>
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.6.0!
>>
>> The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.6.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is *v1.6.0-rc3
>> (168c89e07c51fa24b0bb88582c739cec0acb44d7)
>> <https://github.com/apache/spark/tree/v1.6.0-rc3>*
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1174/
>>
>> The test repository (versioned as v1.6.0-rc3) for this release can be
>> found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1173/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
>>
>> =======================================
>> == How can I help test this release? ==
>> =======================================
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> ================================================
>> == What justifies a -1 vote for this release? ==
>> ================================================
>> This vote is happening towards the end of the 1.6 QA period, so -1 votes
>> should only occur for significant regressions from 1.5. Bugs already
>> present in 1.5, minor regressions, or bugs related to new features will not
>> block this release.
>>
>> ===============================================================
>> == What should happen to JIRA tickets still targeting 1.6.0? ==
>> ===============================================================
>> 1. It is OK for documentation patches to target 1.6.0 and still go into
>> branch-1.6, since documentations will be published separately from the
>> release.
>> 2. New features for non-alpha-modules should target 1.7+.
>> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
>> version.
>>
>>
>> ==================================================
>> == Major changes to help you focus your testing ==
>> ==================================================
>>
>> Notable changes since 1.6 RC2
>> - SPARK_VERSION has been set correctly
>> - SPARK-12199 ML Docs are publishing correctly
>> - SPARK-12345 Mesos cluster mode has been fixed
>>
>> Notable changes since 1.6 RC1
>> Spark Streaming
>>
>> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
>> trackStateByKey has been renamed to mapWithState
>>
>> Spark SQL
>>
>> - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
>> SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix
>> bugs in eviction of storage memory by execution.
>> - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258> correct
>> passing null into ScalaUDF
>>
>> Notable Features Since 1.5Spark SQL
>>
>> - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> Parquet
>> Performance - Improve Parquet scan performance when using flat
>> schemas.
>> - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
>> Session Management - Isolated devault database (i.e USE mydb) even on
>> shared clusters.
>> - SPARK-9999 <https://issues.apache.org/jira/browse/SPARK-9999> Dataset
>> API - A type-safe API (similar to RDDs) that performs many operations
>> on serialized binary data and code generation (i.e. Project Tungsten).
>> - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000> Unified
>> Memory Management - Shared memory for execution and caching instead
>> of exclusive division of the regions.
>> - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
>> Queries on Files - Concise syntax for running SQL queries over files
>> of any supported format without registering a table.
>> - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745> Reading
>> non-standard JSON files - Added options to read non-standard JSON
>> files (e.g. single-quotes, unquoted attributes)
>> - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412> Per-operator
>> Metrics for SQL Execution - Display statistics on a peroperator basis
>> for memory usage and spilled data size.
>> - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
>> (*) expansion for StructTypes - Makes it easier to nest and unest
>> arbitrary numbers of columns
>> - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
>> SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149> In-memory
>> Columnar Cache Performance - Significant (up to 14x) speed up when
>> caching data that contains complex types in DataFrames or SQL.
>> - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
>> null-safe joins - Joins using null-safe equality (<=>) will now
>> execute using SortMergeJoin instead of computing a cartisian product.
>> - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
>> Execution Using Off-Heap Memory - Support for configuring query
>> execution to occur using off-heap memory to avoid GC overhead
>> - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978> Datasource
>> API Avoid Double Filter - When implemeting a datasource with filter
>> pushdown, developers can now tell Spark SQL to avoid double evaluating a
>> pushed-down filter.
>> - SPARK-4849 <https://issues.apache.org/jira/browse/SPARK-4849> Advanced
>> Layout of Cached Data - storing partitioning and ordering schemes in
>> In-memory table scan, and adding distributeBy and localSort to DF API
>> - SPARK-9858 <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive
>> query execution - Intial support for automatically selecting the
>> number of reducers for joins and aggregations.
>> - SPARK-9241 <https://issues.apache.org/jira/browse/SPARK-9241> Improved
>> query planner for queries having distinct aggregations - Query plans
>> of distinct aggregations are more robust when distinct columns have high
>> cardinality.
>>
>> Spark Streaming
>>
>> - API Updates
>> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629> New
>> improved state management - mapWithState - a DStream
>> transformation for stateful stream processing, supercedes
>> updateStateByKey in functionality and performance.
>> - SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198> Kinesis
>> record deaggregation - Kinesis streams have been upgraded to use
>> KCL 1.4.0 and supports transparent deaggregation of KPL-aggregated records.
>> - SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891> Kinesis
>> message handler function - Allows arbitraray function to be
>> applied to a Kinesis record in the Kinesis receiver before to customize
>> what data is to be stored in memory.
>> - SPARK-6328 <https://issues.apache.org/jira/browse/SPARK-6328> Python
>> Streamng Listener API - Get streaming statistics (scheduling
>> delays, batch processing times, etc.) in streaming.
>>
>>
>> - UI Improvements
>> - Made failures visible in the streaming tab, in the timelines,
>> batch list, and batch details page.
>> - Made output operations visible in the streaming tab as progress
>> bars.
>>
>> MLlibNew algorithms/models
>>
>> - SPARK-8518 <https://issues.apache.org/jira/browse/SPARK-8518> Survival
>> analysis - Log-linear model for survival analysis
>> - SPARK-9834 <https://issues.apache.org/jira/browse/SPARK-9834> Normal
>> equation for least squares - Normal equation solver, providing R-like
>> model summary statistics
>> - SPARK-3147 <https://issues.apache.org/jira/browse/SPARK-3147> Online
>> hypothesis testing - A/B testing in the Spark Streaming framework
>> - SPARK-9930 <https://issues.apache.org/jira/browse/SPARK-9930> New
>> feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
>> transformer
>> - SPARK-6517 <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting
>> K-Means clustering - Fast top-down clustering variant of K-Means
>>
>> API improvements
>>
>> - ML Pipelines
>> - SPARK-6725 <https://issues.apache.org/jira/browse/SPARK-6725> Pipeline
>> persistence - Save/load for ML Pipelines, with partial coverage of
>> spark.mlalgorithms
>> - SPARK-5565 <https://issues.apache.org/jira/browse/SPARK-5565> LDA
>> in ML Pipelines - API for Latent Dirichlet Allocation in ML
>> Pipelines
>> - R API
>> - SPARK-9836 <https://issues.apache.org/jira/browse/SPARK-9836> R-like
>> statistics for GLMs - (Partial) R-like stats for ordinary least
>> squares via summary(model)
>> - SPARK-9681 <https://issues.apache.org/jira/browse/SPARK-9681> Feature
>> interactions in R formula - Interaction operator ":" in R formula
>> - Python API - Many improvements to Python API to approach feature
>> parity
>>
>> Misc improvements
>>
>> - SPARK-7685 <https://issues.apache.org/jira/browse/SPARK-7685>,
>> SPARK-9642 <https://issues.apache.org/jira/browse/SPARK-9642> Instance
>> weights for GLMs - Logistic and Linear Regression can take instance
>> weights
>> - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
>> SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate
>> and bivariate statistics in DataFrames - Variance, stddev,
>> correlations, etc.
>> - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM
>> data source - LIBSVM as a SQL data sourceDocumentation improvements
>> - SPARK-7751 <https://issues.apache.org/jira/browse/SPARK-7751> @since
>> versions - Documentation includes initial version when classes and
>> methods were added
>> - SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable
>> example code - Automated testing for code in user guide examples
>>
>> Deprecations
>>
>> - In spark.mllib.clustering.KMeans, the "runs" parameter has been
>> deprecated.
>> - In spark.ml.classification.LogisticRegressionModel and
>> spark.ml.regression.LinearRegressionModel, the "weights" field has been
>> deprecated, in favor of the new name "coefficients." This helps
>> disambiguate from instance (row) weights given to algorithms.
>>
>> Changes of behavior
>>
>> - spark.mllib.tree.GradientBoostedTrees validationTol has changed
>> semantics in 1.6. Previously, it was a threshold for absolute change in
>> error. Now, it resembles the behavior of GradientDescent convergenceTol:
>> For large errors, it uses relative error (relative to the previous error);
>> for small errors (< 0.01), it uses absolute error.
>> - spark.ml.feature.RegexTokenizer: Previously, it did not convert
>> strings to lowercase before tokenizing. Now, it converts to lowercase by
>> default, with an option not to. This matches the behavior of the simpler
>> Tokenizer transformer.
>> - Spark SQL's partition discovery has been changed to only discover
>> partition directories that are children of the given path. (i.e. if
>> path="/my/data/x=1" then x=1 will no longer be considered a partition
>> but only children of x=1.) This behavior can be overridden by
>> manually specifying the basePath that partitioning discovery should
>> start with (SPARK-11678
>> <https://issues.apache.org/jira/browse/SPARK-11678>).
>> - When casting a value of an integral type to timestamp (e.g. casting
>> a long value to timestamp), the value is treated as being in seconds
>> instead of milliseconds (SPARK-11724
>> <https://issues.apache.org/jira/browse/SPARK-11724>).
>> - With the improved query planner for queries having distinct
>> aggregations (SPARK-9241
>> <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
>> query having a single distinct aggregation has been changed to a more
>> robust version. To switch back to the plan generated by Spark 1.5's
>> planner, please set spark.sql.specializeSingleDistinctAggPlanning to
>> true (SPARK-12077 <https://issues.apache.org/jira/browse/SPARK-12077>
>> ).
>>
>>
>
Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Posted by Mark Hamstra <ma...@clearstorydata.com>.
+1
On Wed, Dec 16, 2015 at 1:32 PM, Michael Armbrust <mi...@databricks.com>
wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 1.6.0!
>
> The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is *v1.6.0-rc3
> (168c89e07c51fa24b0bb88582c739cec0acb44d7)
> <https://github.com/apache/spark/tree/v1.6.0-rc3>*
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1174/
>
> The test repository (versioned as v1.6.0-rc3) for this release can be
> found at:
> https://repository.apache.org/content/repositories/orgapachespark-1173/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
>
> =======================================
> == How can I help test this release? ==
> =======================================
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> ================================================
> == What justifies a -1 vote for this release? ==
> ================================================
> This vote is happening towards the end of the 1.6 QA period, so -1 votes
> should only occur for significant regressions from 1.5. Bugs already
> present in 1.5, minor regressions, or bugs related to new features will not
> block this release.
>
> ===============================================================
> == What should happen to JIRA tickets still targeting 1.6.0? ==
> ===============================================================
> 1. It is OK for documentation patches to target 1.6.0 and still go into
> branch-1.6, since documentations will be published separately from the
> release.
> 2. New features for non-alpha-modules should target 1.7+.
> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
> version.
>
>
> ==================================================
> == Major changes to help you focus your testing ==
> ==================================================
>
> Notable changes since 1.6 RC2
> - SPARK_VERSION has been set correctly
> - SPARK-12199 ML Docs are publishing correctly
> - SPARK-12345 Mesos cluster mode has been fixed
>
> Notable changes since 1.6 RC1
> Spark Streaming
>
> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
> trackStateByKey has been renamed to mapWithState
>
> Spark SQL
>
> - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
> SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix
> bugs in eviction of storage memory by execution.
> - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258> correct
> passing null into ScalaUDF
>
> Notable Features Since 1.5Spark SQL
>
> - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> Parquet
> Performance - Improve Parquet scan performance when using flat schemas.
> - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
> Session Management - Isolated devault database (i.e USE mydb) even on
> shared clusters.
> - SPARK-9999 <https://issues.apache.org/jira/browse/SPARK-9999> Dataset
> API - A type-safe API (similar to RDDs) that performs many operations
> on serialized binary data and code generation (i.e. Project Tungsten).
> - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000> Unified
> Memory Management - Shared memory for execution and caching instead of
> exclusive division of the regions.
> - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
> Queries on Files - Concise syntax for running SQL queries over files
> of any supported format without registering a table.
> - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745> Reading
> non-standard JSON files - Added options to read non-standard JSON
> files (e.g. single-quotes, unquoted attributes)
> - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412> Per-operator
> Metrics for SQL Execution - Display statistics on a peroperator basis
> for memory usage and spilled data size.
> - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
> (*) expansion for StructTypes - Makes it easier to nest and unest
> arbitrary numbers of columns
> - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
> SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149> In-memory
> Columnar Cache Performance - Significant (up to 14x) speed up when
> caching data that contains complex types in DataFrames or SQL.
> - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
> null-safe joins - Joins using null-safe equality (<=>) will now
> execute using SortMergeJoin instead of computing a cartisian product.
> - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
> Execution Using Off-Heap Memory - Support for configuring query
> execution to occur using off-heap memory to avoid GC overhead
> - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978> Datasource
> API Avoid Double Filter - When implemeting a datasource with filter
> pushdown, developers can now tell Spark SQL to avoid double evaluating a
> pushed-down filter.
> - SPARK-4849 <https://issues.apache.org/jira/browse/SPARK-4849> Advanced
> Layout of Cached Data - storing partitioning and ordering schemes in
> In-memory table scan, and adding distributeBy and localSort to DF API
> - SPARK-9858 <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive
> query execution - Intial support for automatically selecting the
> number of reducers for joins and aggregations.
> - SPARK-9241 <https://issues.apache.org/jira/browse/SPARK-9241> Improved
> query planner for queries having distinct aggregations - Query plans
> of distinct aggregations are more robust when distinct columns have high
> cardinality.
>
> Spark Streaming
>
> - API Updates
> - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629> New
> improved state management - mapWithState - a DStream transformation
> for stateful stream processing, supercedes updateStateByKey in
> functionality and performance.
> - SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198> Kinesis
> record deaggregation - Kinesis streams have been upgraded to use
> KCL 1.4.0 and supports transparent deaggregation of KPL-aggregated records.
> - SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891> Kinesis
> message handler function - Allows arbitraray function to be applied
> to a Kinesis record in the Kinesis receiver before to customize what data
> is to be stored in memory.
> - SPARK-6328 <https://issues.apache.org/jira/browse/SPARK-6328> Python
> Streamng Listener API - Get streaming statistics (scheduling
> delays, batch processing times, etc.) in streaming.
>
>
> - UI Improvements
> - Made failures visible in the streaming tab, in the timelines,
> batch list, and batch details page.
> - Made output operations visible in the streaming tab as progress
> bars.
>
> MLlibNew algorithms/models
>
> - SPARK-8518 <https://issues.apache.org/jira/browse/SPARK-8518> Survival
> analysis - Log-linear model for survival analysis
> - SPARK-9834 <https://issues.apache.org/jira/browse/SPARK-9834> Normal
> equation for least squares - Normal equation solver, providing R-like
> model summary statistics
> - SPARK-3147 <https://issues.apache.org/jira/browse/SPARK-3147> Online
> hypothesis testing - A/B testing in the Spark Streaming framework
> - SPARK-9930 <https://issues.apache.org/jira/browse/SPARK-9930> New
> feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
> transformer
> - SPARK-6517 <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting
> K-Means clustering - Fast top-down clustering variant of K-Means
>
> API improvements
>
> - ML Pipelines
> - SPARK-6725 <https://issues.apache.org/jira/browse/SPARK-6725> Pipeline
> persistence - Save/load for ML Pipelines, with partial coverage of
> spark.mlalgorithms
> - SPARK-5565 <https://issues.apache.org/jira/browse/SPARK-5565> LDA
> in ML Pipelines - API for Latent Dirichlet Allocation in ML
> Pipelines
> - R API
> - SPARK-9836 <https://issues.apache.org/jira/browse/SPARK-9836> R-like
> statistics for GLMs - (Partial) R-like stats for ordinary least
> squares via summary(model)
> - SPARK-9681 <https://issues.apache.org/jira/browse/SPARK-9681> Feature
> interactions in R formula - Interaction operator ":" in R formula
> - Python API - Many improvements to Python API to approach feature
> parity
>
> Misc improvements
>
> - SPARK-7685 <https://issues.apache.org/jira/browse/SPARK-7685>,
> SPARK-9642 <https://issues.apache.org/jira/browse/SPARK-9642> Instance
> weights for GLMs - Logistic and Linear Regression can take instance
> weights
> - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
> SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate
> and bivariate statistics in DataFrames - Variance, stddev,
> correlations, etc.
> - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM
> data source - LIBSVM as a SQL data sourceDocumentation improvements
> - SPARK-7751 <https://issues.apache.org/jira/browse/SPARK-7751> @since
> versions - Documentation includes initial version when classes and
> methods were added
> - SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable
> example code - Automated testing for code in user guide examples
>
> Deprecations
>
> - In spark.mllib.clustering.KMeans, the "runs" parameter has been
> deprecated.
> - In spark.ml.classification.LogisticRegressionModel and
> spark.ml.regression.LinearRegressionModel, the "weights" field has been
> deprecated, in favor of the new name "coefficients." This helps
> disambiguate from instance (row) weights given to algorithms.
>
> Changes of behavior
>
> - spark.mllib.tree.GradientBoostedTrees validationTol has changed
> semantics in 1.6. Previously, it was a threshold for absolute change in
> error. Now, it resembles the behavior of GradientDescent convergenceTol:
> For large errors, it uses relative error (relative to the previous error);
> for small errors (< 0.01), it uses absolute error.
> - spark.ml.feature.RegexTokenizer: Previously, it did not convert
> strings to lowercase before tokenizing. Now, it converts to lowercase by
> default, with an option not to. This matches the behavior of the simpler
> Tokenizer transformer.
> - Spark SQL's partition discovery has been changed to only discover
> partition directories that are children of the given path. (i.e. if
> path="/my/data/x=1" then x=1 will no longer be considered a partition
> but only children of x=1.) This behavior can be overridden by manually
> specifying the basePath that partitioning discovery should start with (
> SPARK-11678 <https://issues.apache.org/jira/browse/SPARK-11678>).
> - When casting a value of an integral type to timestamp (e.g. casting
> a long value to timestamp), the value is treated as being in seconds
> instead of milliseconds (SPARK-11724
> <https://issues.apache.org/jira/browse/SPARK-11724>).
> - With the improved query planner for queries having distinct
> aggregations (SPARK-9241
> <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
> query having a single distinct aggregation has been changed to a more
> robust version. To switch back to the plan generated by Spark 1.5's
> planner, please set spark.sql.specializeSingleDistinctAggPlanning to
> true (SPARK-12077 <https://issues.apache.org/jira/browse/SPARK-12077>).
>
>