Posted by Ricky on 2015/12/20 13:32:39 UTC

Re: [VOTE] Release Apache Spark 1.6.0 (RC3)

SizeBasedRollingPolicy prints too many log messages when spark.executor.logs.rolling.strategy is set to size, because shouldRollover uses the logInfo method:
  def shouldRollover(bytesToBeWritten: Long): Boolean = {
    logInfo(s"$bytesToBeWritten + $bytesWrittenSinceRollover > $rolloverSizeBytes")
    bytesToBeWritten + bytesWrittenSinceRollover > rolloverSizeBytes
  }

  spark.executor.logs.rolling.strategy size
  spark.executor.logs.rolling.maxSize 134217728
  spark.executor.logs.rolling.maxRetainedFiles 8

Could shouldRollover use logDebug instead of logInfo?
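A minimal sketch of the change being asked for, under stated assumptions: `SimpleLogging` is a hypothetical stand-in for a logging trait and `SketchSizePolicy` is a simplified, illustrative policy class, not the Spark source. The diagnostic message moves to debug level, so nothing is recorded per write unless debug logging is switched on.

```scala
// Hypothetical stand-in for a logging trait; not Spark's own Logging.
trait SimpleLogging {
  // When false, debug messages are dropped without being built.
  var debugEnabled: Boolean = false
  // Captured output, so the behaviour is easy to inspect.
  val messages = scala.collection.mutable.ArrayBuffer.empty[String]

  // By-name parameter: the string is only interpolated if debug is on.
  def logDebug(msg: => String): Unit = {
    if (debugEnabled) messages += s"DEBUG $msg"
  }

  def logInfo(msg: => String): Unit = {
    messages += s"INFO $msg"
  }
}

// Simplified size-based policy, modelled on the snippet quoted above.
class SketchSizePolicy(rolloverSizeBytes: Long) extends SimpleLogging {
  private var bytesWrittenSinceRollover = 0L

  def shouldRollover(bytesToBeWritten: Long): Boolean = {
    // Debug level instead of info: silent at the default log level.
    logDebug(s"$bytesToBeWritten + $bytesWrittenSinceRollover > $rolloverSizeBytes")
    bytesToBeWritten + bytesWrittenSinceRollover > rolloverSizeBytes
  }

  // Record bytes that were written since the last rollover.
  def bytesWritten(bytes: Long): Unit = {
    bytesWrittenSinceRollover += bytes
  }

  // Reset the counter after a rollover.
  def rolledOver(): Unit = {
    bytesWrittenSinceRollover = 0L
  }
}
```

With `debugEnabled` left at false, repeated calls to `shouldRollover` leave `messages` empty, which is the behaviour requested for the executor log.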

Best Regards
Ricky Yang
------------------ Original Message ------------------
From: "Jeff Zhang"
Sent: Sunday, December 20, 2015, 3:44 PM
To: "Luciano Resende"
Cc: "Michael Armbrust"
Subject: Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
+1 (non-binding)
All the tests passed, and I ran it on the HDP 2.3.2 sandbox successfully.
On Sun, Dec 20, 2015 at 10:43 AM, Luciano Resende wrote:
+1 (non-binding)
Tested Standalone mode, SparkR and a couple of Stream apps; all seem OK.
On Wed, Dec 16, 2015 at 1:32 PM, Michael Armbrust wrote:
Please vote on releasing the following candidate as Apache Spark version 1.6.0!
The vote is open until Saturday, December 19, 2015 at 18:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast.
[ ] +1 Release this package as Apache Spark 1.6.0
[ ] -1 Do not release this package because ...
To learn more about Apache Spark, please see
The tag to be voted on is v1.6.0-rc3 (168c89e07c51fa24b0bb88582c739cec0acb44d7)
The release files, including signatures, digests, etc. can be found at:
Release artifacts are signed with the following key:
The staging repository for this release can be found at:
The test repository (versioned as v1.6.0-rc3) for this release can be found at:
The documentation corresponding to this release can be found at:
== How can I help test this release? ==
If you are a Spark user, you can help us test this release by taking an existing Spark workload and running it on this release candidate, then reporting any regressions.
== What justifies a -1 vote for this release? ==
This vote is happening towards the end of the 1.6 QA period, so -1 votes should only occur for significant regressions from 1.5. Bugs already present in 1.5, minor regressions, or bugs related to new features will not block this release.
== What should happen to JIRA tickets still targeting 1.6.0? ==
1. It is OK for documentation patches to target 1.6.0 and still go into branch-1.6, since documentation will be published separately from the release.
2. New features for non-alpha-modules should target 1.7+.
3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target version.
== Major changes to help you focus your testing ==
Notable changes since 1.6 RC2
- SPARK_VERSION has been set correctly
- SPARK-12199 ML Docs are publishing correctly
- SPARK-12345 Mesos cluster mode has been fixed
Notable changes since 1.6 RC1
Spark Streaming
- SPARK-2629 trackStateByKey has been renamed to mapWithState
Spark SQL
- SPARK-12165 SPARK-12189 Fix bugs in eviction of storage memory by execution.
- SPARK-12258 Correct passing null into ScalaUDF
Notable Features Since 1.5
Spark SQL
- SPARK-11787 Parquet Performance - Improve Parquet scan performance when using flat schemas.
- SPARK-10810 Session Management - Isolated default database (i.e. USE mydb) even on shared clusters.
- SPARK-9999 Dataset API - A type-safe API (similar to RDDs) that performs many operations on serialized binary data and code generation (i.e. Project Tungsten).
- SPARK-10000 Unified Memory Management - Shared memory for execution and caching instead of exclusive division of the regions.
- SPARK-11197 SQL Queries on Files - Concise syntax for running SQL queries over files of any supported format without registering a table.
- SPARK-11745 Reading non-standard JSON files - Added options to read non-standard JSON files (e.g. single-quotes, unquoted attributes).
- SPARK-10412 Per-operator Metrics for SQL Execution - Display statistics on a per-operator basis for memory usage and spilled data size.
- SPARK-11329 Star (*) expansion for StructTypes - Makes it easier to nest and unnest arbitrary numbers of columns.
- SPARK-10917, SPARK-11149 In-memory Columnar Cache Performance - Significant (up to 14x) speed up when caching data that contains complex types in DataFrames or SQL.
- SPARK-11111 Fast null-safe joins - Joins using null-safe equality (<=>) will now execute using SortMergeJoin instead of computing a Cartesian product.
- SPARK-11389 SQL Execution Using Off-Heap Memory - Support for configuring query execution to occur using off-heap memory to avoid GC overhead.
- SPARK-10978 Datasource API Avoid Double Filter - When implementing a datasource with filter pushdown, developers can now tell Spark SQL to avoid double evaluating a pushed-down filter.
- SPARK-4849 Advanced Layout of Cached Data - Storing partitioning and ordering schemes in In-memory table scan, and adding distributeBy and localSort to DF API.
- SPARK-9858 Adaptive query execution - Initial support for automatically selecting the number of reducers for joins and aggregations.
- SPARK-9241 Improved query planner for queries having distinct aggregations - Query plans of distinct aggregations are more robust when distinct columns have high cardinality.
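
To make the null-safe equality operator (<=>) from SPARK-11111 concrete, here is a plain-Scala model of the two comparison semantics. It is only an illustration and not Spark code: SQL NULL is modelled as None, and the object and function names are hypothetical.

```scala
// SQL NULL is modelled as None; all names here are illustrative only.
object NullSafeEquality {
  // Ordinary SQL equality (=): the result is NULL (None) if either side is NULL.
  def sqlEq[A](a: Option[A], b: Option[A]): Option[Boolean] =
    for (x <- a; y <- b) yield x == y

  // Null-safe equality (<=>): always true or false, and NULL <=> NULL is true.
  def nullSafeEq[A](a: Option[A], b: Option[A]): Boolean = a == b
}
```

Under `=`, rows whose join key is NULL never match anything; under `<=>`, two NULL keys match each other, which is why such joins need their own execution strategy.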
  Spark Streaming
    API Updates
      SPARK-2629  New improved state management - mapWithState - a DStream transformation for stateful stream processing, supersedes updateStateByKey in functionality and performance.
      SPARK-11198  Kinesis record deaggregation - Kinesis streams have been upgraded to use KCL 1.4.0 and support transparent deaggregation of KPL-aggregated records.
      SPARK-10891  Kinesis message handler function - Allows an arbitrary function to be applied to a Kinesis record in the Kinesis receiver before it is stored, to customize what data is kept in memory.
      SPARK-6328  Python Streaming Listener API - Get streaming statistics (scheduling delays, batch processing times, etc.) in Python streaming applications.
    UI Improvements
      Made failures visible in the streaming tab, in the timelines, batch list, and batch details page.
      Made output operations visible in the streaming tab as progress bars.
  New algorithms/models
    SPARK-8518  Survival analysis - Log-linear model for survival analysis
    SPARK-9834  Normal equation for least squares - Normal equation solver, providing R-like model summary statistics
    SPARK-3147  Online hypothesis testing - A/B testing in the Spark Streaming framework
    SPARK-9930  New feature transformers - ChiSqSelector, QuantileDiscretizer, SQL transformer
    SPARK-6517  Bisecting K-Means clustering - Fast top-down clustering variant of K-Means
  API improvements
    ML Pipelines
      SPARK-6725  Pipeline persistence - Save/load for ML Pipelines, with partial coverage of algorithms
      SPARK-5565  LDA in ML Pipelines - API for Latent Dirichlet Allocation in ML Pipelines
    R API
      SPARK-9836  R-like statistics for GLMs - (Partial) R-like stats for ordinary least squares via summary(model)
      SPARK-9681  Feature interactions in R formula - Interaction operator ":" in R formula
    Python API - Many improvements to Python API to approach feature parity
  Misc improvements
    SPARK-7685, SPARK-9642  Instance weights for GLMs - Logistic and Linear Regression can take instance weights
    SPARK-10384, SPARK-10385  Univariate and bivariate statistics in DataFrames - Variance, stddev, correlations, etc.
    SPARK-10117  LIBSVM data source - LIBSVM as a SQL data source
  Documentation improvements
    SPARK-7751  @since versions - Documentation includes initial version when classes and methods were added
    SPARK-11337  Testable example code - Automated testing for code in user guide examples
    In spark.mllib.clustering.KMeans, the "runs" parameter has been deprecated.
    In and, the "weights" field has been deprecated, in favor of the new name "coefficients." This helps disambiguate from instance (row) weights given to algorithms.
  Changes of behavior
    spark.mllib.tree.GradientBoostedTrees validationTol has changed semantics in 1.6. Previously, it was a threshold for absolute change in error. Now, it resembles the behavior of GradientDescent convergenceTol: For large errors, it uses relative error (relative to the previous error); for small errors (< 0.01), it uses absolute error.
    Previously, it did not convert strings to lowercase before tokenizing. Now, it converts to lowercase by default, with an option not to. This matches the behavior of the simpler Tokenizer transformer.
    Spark SQL's partition discovery has been changed to only discover partition directories that are children of the given path (i.e. if path="/my/data/x=1", then x=1 will no longer be considered a partition; only children of x=1 will be). This behavior can be overridden by manually specifying the basePath that partition discovery should start with (SPARK-11678).
    When casting a value of an integral type to timestamp (e.g. casting a long value to timestamp), the value is treated as being in seconds instead of milliseconds (SPARK-11724).
    With the improved query planner for queries having distinct aggregations (SPARK-9241), the plan of a query having a single distinct aggregation has been changed to a more robust version. To switch back to the plan generated by Spark 1.5's planner, please set spark.sql.specializeSingleDistinctAggPlanning to true (SPARK-12077).
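The integral-to-timestamp change noted above (SPARK-11724) is easy to trip over when migrating, so here is a rough illustration of the difference between the two interpretations. This is plain Python, not Spark code; the helper names cast_as_seconds and cast_as_milliseconds are made up for this sketch and only mimic the arithmetic, under the assumption that timestamps are counted from the Unix epoch in UTC.

```python
from datetime import datetime, timedelta, timezone

# Unix epoch in UTC; both interpretations count from here.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def cast_as_seconds(value: int) -> datetime:
    """Hypothetical helper: treat an integral value as seconds since the epoch
    (the interpretation described for 1.6)."""
    return EPOCH + timedelta(seconds=value)


def cast_as_milliseconds(value: int) -> datetime:
    """Hypothetical helper: treat the same value as milliseconds since the epoch
    (the earlier interpretation)."""
    return EPOCH + timedelta(milliseconds=value)


raw = 1450618359  # a long value that was written as seconds

as_seconds = cast_as_seconds(raw)       # lands in December 2015
as_millis = cast_as_milliseconds(raw)   # lands in January 1970

print(as_seconds.isoformat())
print(as_millis.isoformat())

# A column that actually stores milliseconds would need to be divided by 1000
# before being cast under the seconds-based interpretation.
millis_value = raw * 1000
recovered = cast_as_seconds(millis_value // 1000)
```

If existing data stores epoch milliseconds in a long column, dividing by 1000 before the cast (as in the last two lines) is one way to keep the same instants under the new semantics; whether that is appropriate depends on how the column was originally written.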
                                                                                      Luciano Resende                                             
                                              Best Regards                         
                         Jeff Zhang