You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by jameszhouyi <gi...@git.apache.org> on 2014/09/11 09:51:11 UTC

[GitHub] spark pull request: Branch 1.1

GitHub user jameszhouyi opened a pull request:

    https://github.com/apache/spark/pull/2353

    Branch 1.1

    Symptom:
    Run ./dev/run-tests and dump outputs as following:
    SBT_MAVEN_PROFILES_ARGS="-Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Pkinesis-asl"
    [Warn] Java 8 tests will not run because JDK version is < 1.8.
    =========================================================================
    Running Apache RAT checks
    =========================================================================
    RAT checks passed.
    =========================================================================
    Running Scala style checks
    =========================================================================
    Scalastyle checks failed at following occurrences:
    [error] Expected ID character
    [error] Not a valid command: yarn-alpha
    [error] Expected project ID
    [error] Expected configuration
    [error] Expected ':' (if selecting a configuration)
    [error] Expected key
    [error] Not a valid key: yarn-alpha
    [error] yarn-alpha/scalastyle
    [error] ^
    
    Possible Cause:
    I checked the dev/scalastyle, found that there are 2 parameters 'yarn-alpha/scalastyle' and 'yarn/scalastyle' separately,like
    echo -e "q\n" | sbt/sbt -Pyarn -Phadoop-0.23 -Dhadoop.version=0.23.9 yarn-alpha/scalastyle \
    >> scalastyle.txt
    
    echo -e "q\n" | sbt/sbt -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 yarn/scalastyle \
    >> scalastyle.txt
    
    From above error message, sbt seems to complain them due to '/' separator. So it can be run through after I manually modified original ones to 'yarn-alpha:scalastyle' and 'yarn:scalastyle'..

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark branch-1.1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2353.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2353
    
----
commit ee7d2cc1a935da62de968799c0ecc6f98e43361a
Author: Cheng Lian <li...@gmail.com>
Date:   2014-08-14T00:37:55Z

    [SPARK-2650][SQL] More precise initial buffer size estimation for in-memory column buffer
    
    This is a follow up of #1880.
    
    Since the row number within a single batch is known, we can estimate a much more precise initial buffer size when building an in-memory column buffer.
    
    Author: Cheng Lian <li...@gmail.com>
    
    Closes #1901 from liancheng/precise-init-buffer-size and squashes the following commits:
    
    d5501fa [Cheng Lian] More precise initial buffer size estimation for in-memory column buffer
    
    (cherry picked from commit 376a82e196e102ef49b9722e8be0b01ac5890a8b)
    Signed-off-by: Michael Armbrust <mi...@databricks.com>

commit e8e7f17e1e6d84268421dbfa315850b07a8a4c15
Author: Michael Armbrust <mi...@databricks.com>
Date:   2014-08-14T00:40:59Z

    [SPARK-2935][SQL]Fix parquet predicate push down bug
    
    Author: Michael Armbrust <mi...@databricks.com>
    
    Closes #1863 from marmbrus/parquetPredicates and squashes the following commits:
    
    10ad202 [Michael Armbrust] left <=> right
    f249158 [Michael Armbrust] quiet parquet tests.
    802da5b [Michael Armbrust] Add test case.
    eab2eda [Michael Armbrust] Fix parquet predicate push down bug
    
    (cherry picked from commit 9fde1ff5fc114b5edb755ed40944607419b62184)
    Signed-off-by: Michael Armbrust <mi...@databricks.com>

commit b5b632c8cd02fd1e65ebd22216d20ec76715fc5d
Author: Kousuke Saruta <sa...@oss.nttdata.co.jp>
Date:   2014-08-14T00:42:38Z

    [SPARK-2970] [SQL] spark-sql script ends with IOException when EventLogging is enabled
    
    Author: Kousuke Saruta <sa...@oss.nttdata.co.jp>
    
    Closes #1891 from sarutak/SPARK-2970 and squashes the following commits:
    
    4a2d2fe [Kousuke Saruta] Modified comment style
    8bd833c [Kousuke Saruta] Modified style
    6c0997c [Kousuke Saruta] Modified the timing of shutdown hook execution. It should be executed before shutdown hook of o.a.h.f.FileSystem
    
    (cherry picked from commit 905dc4b405e679feb145f5e6b35e952db2442e0d)
    Signed-off-by: Michael Armbrust <mi...@databricks.com>

commit a8d2649719b3d8fdb1eed29ef179a6a896b3e37a
Author: guowei <gu...@upyoo.com>
Date:   2014-08-14T00:45:24Z

    [SPARK-2986] [SQL] fixed: setting properties does not effect
    
    it seems that set command does not run by SparkSQLDriver. it runs on hive api.
    user can not change reduce number by setting spark.sql.shuffle.partitions
    
    but i think setting hive properties seems just a role to spark sql.
    
    Author: guowei <gu...@upyoo.com>
    
    Closes #1904 from guowei2/temp-branch and squashes the following commits:
    
    7d47dde [guowei] fixed: setting properties like spark.sql.shuffle.partitions does not effective
    
    (cherry picked from commit 63d6777737ca8559d4344d1661500b8ad868bb47)
    Signed-off-by: Michael Armbrust <mi...@databricks.com>

commit c6cb55a784ba8f9e5c4e7aadcc3ec9dce24f49ee
Author: Patrick Wendell <pw...@gmail.com>
Date:   2014-08-14T01:08:38Z

    SPARK-3020: Print completed indices rather than tasks in web UI
    
    Author: Patrick Wendell <pw...@gmail.com>
    
    Closes #1933 from pwendell/speculation and squashes the following commits:
    
    33a3473 [Patrick Wendell] Use OpenHashSet
    8ce2ff0 [Patrick Wendell] SPARK-3020: Print completed indices rather than tasks in web UI
    
    (cherry picked from commit 0c7b452904fe6b5a966a66b956369123d8a9dd4b)
    Signed-off-by: Reynold Xin <rx...@apache.org>

commit dcd99c3e63f8a5154f904ae57e945e8caaade649
Author: Masayoshi TSUZUKI <ts...@oss.nttdata.co.jp>
Date:   2014-08-14T05:17:07Z

    [SPARK-3006] Failed to execute spark-shell in Windows OS
    
    Modified the order of the options and arguments in spark-shell.cmd
    
    Author: Masayoshi TSUZUKI <ts...@oss.nttdata.co.jp>
    
    Closes #1918 from tsudukim/feature/SPARK-3006 and squashes the following commits:
    
    8bba494 [Masayoshi TSUZUKI] [SPARK-3006] Failed to execute spark-shell in Windows OS
    1a32410 [Masayoshi TSUZUKI] [SPARK-3006] Failed to execute spark-shell in Windows OS
    
    (cherry picked from commit 9497b12d429cf9d075807896637e40e205175203)
    Signed-off-by: Andrew Or <an...@gmail.com>

commit bf7c6e198822d155c23cfaa7219c36e5db8d1eeb
Author: Andrew Or <an...@gmail.com>
Date:   2014-08-14T06:24:23Z

    [Docs] Add missing <code> tags (minor)
    
    These configs looked inconsistent from the rest.
    
    Author: Andrew Or <an...@gmail.com>
    
    Closes #1936 from andrewor14/docs-code and squashes the following commits:
    
    15f578a [Andrew Or] Add <code> tag
    
    (cherry picked from commit e4245656438d00714ebd59e89c4de3fdaae83494)
    Signed-off-by: Reynold Xin <rx...@apache.org>

commit 1baf06f4e6a2c4767ad6107559396c7680085235
Author: Xiangrui Meng <me...@databricks.com>
Date:   2014-08-14T06:53:44Z

    [SPARK-2995][MLLIB] add ALS.setIntermediateRDDStorageLevel
    
    As mentioned in SPARK-2465, using `MEMORY_AND_DISK_SER` for user/product in/out links together with `spark.rdd.compress=true` can help reduce the space requirement by a lot, at the cost of speed. It might be useful to add this option so people can run ALS on much bigger datasets.
    
    Another option for the method name is `setIntermediateRDDStorageLevel`.
    
    Author: Xiangrui Meng <me...@databricks.com>
    
    Closes #1913 from mengxr/als-storagelevel and squashes the following commits:
    
    d942017 [Xiangrui Meng] rename to setIntermediateRDDStorageLevel
    7550029 [Xiangrui Meng] add ALS.setIntermediateDataStorageLevel
    
    (cherry picked from commit 69a57a18ee35af1cc5a00b67a80837ea317cd330)
    Signed-off-by: Xiangrui Meng <me...@databricks.com>

commit 0cb2b82e0ef903dd99c589928bc17650037f25c5
Author: Aaron Davidson <aa...@databricks.com>
Date:   2014-08-14T08:37:38Z

    [SPARK-3029] Disable local execution of Spark jobs by default
    
    Currently, local execution of Spark jobs is only used by take(), and it can be problematic as it can load a significant amount of data onto the driver. The worst case scenarios occur if the RDD is cached (guaranteed to load whole partition), has very large elements, or the partition is just large and we apply a filter with high selectivity or computational overhead.
    
    Additionally, jobs that run locally in this manner do not show up in the web UI, and are thus harder to track or understand what is occurring.
    
    This PR adds a flag to disable local execution, which is turned OFF by default, with the intention of perhaps eventually removing this functionality altogether. Removing it now is a tougher proposition since it is part of the public runJob API. An alternative solution would be to limit the flag to take()/first() to avoid impacting any external users of this API, but such usage (or, at least, reliance upon the feature) is hopefully minimal.
    
    Author: Aaron Davidson <aa...@databricks.com>
    
    Closes #1321 from aarondav/allowlocal and squashes the following commits:
    
    136b253 [Aaron Davidson] Fix DAGSchedulerSuite
    5599d55 [Aaron Davidson] [RFC] Disable local execution of Spark jobs by default
    
    (cherry picked from commit d069c5d9d2f6ce06389ca2ddf0b3ae4db72c5797)
    Signed-off-by: Reynold Xin <rx...@apache.org>

commit af809de77b5f939320c20d98d6c6dd98fcfd55a7
Author: Graham Dennis <gr...@gmail.com>
Date:   2014-08-14T09:24:18Z

    SPARK-2893: Do not swallow Exceptions when running a custom kryo registrator
    
    The previous behaviour of swallowing ClassNotFound exceptions when running a custom Kryo registrator could lead to difficult to debug problems later on at serialisation / deserialisation time, see SPARK-2878.  Instead it is better to fail fast.
    
    Added test case.
    
    Author: Graham Dennis <gr...@gmail.com>
    
    Closes #1827 from GrahamDennis/feature/spark-2893 and squashes the following commits:
    
    fbe4cb6 [Graham Dennis] [SPARK-2878]: Update the test case to match the updated exception message
    65e53c5 [Graham Dennis] [SPARK-2893]: Improve message when a spark.kryo.registrator fails.
    f480d85 [Graham Dennis] [SPARK-2893] Fix typo.
    b59d2c2 [Graham Dennis] SPARK-2893: Do not swallow Exceptions when running a custom spark.kryo.registrator
    
    (cherry picked from commit 6b8de0e36c7548046c3b8a57f2c8e7e788dde8cc)
    Signed-off-by: Reynold Xin <rx...@apache.org>

commit 221c84e6ab631a137165e0e6b41d4d10b018d2b6
Author: Chia-Yung Su <ch...@appier.com>
Date:   2014-08-14T17:43:08Z

    [SPARK-3011][SQL] _temporary directory should be filtered out by sqlContext.parquetFile
    
    Author: Chia-Yung Su <ch...@appier.com>
    
    Closes #1924 from joesu/bugfix-spark3011 and squashes the following commits:
    
    c7e44f2 [Chia-Yung Su] match syntax
    f8fc32a [Chia-Yung Su] filter out tmp dir
    
    (cherry picked from commit 078f3fbda860e2f5de34153c55dfc3fecb4256e9)
    Signed-off-by: Michael Armbrust <mi...@databricks.com>

commit de501e169f24e4573747aec85b7651c98633c028
Author: Yin Huai <hu...@cse.ohio-state.edu>
Date:   2014-08-14T17:46:33Z

    [SPARK-2927][SQL] Add a conf to configure if we always read Binary columns stored in Parquet as String columns
    
    This PR adds a new conf flag `spark.sql.parquet.binaryAsString`. When it is `true`, if there is no parquet metadata file available to provide the schema of the data, we will always treat binary fields stored in parquet as string fields. This conf is used to provide a way to read string fields generated without UTF8 decoration.
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-2927
    
    Author: Yin Huai <hu...@cse.ohio-state.edu>
    
    Closes #1855 from yhuai/parquetBinaryAsString and squashes the following commits:
    
    689ffa9 [Yin Huai] Add missing "=".
    80827de [Yin Huai] Unit test.
    1765ca4 [Yin Huai] Use .toBoolean.
    9d3f199 [Yin Huai] Merge remote-tracking branch 'upstream/master' into parquetBinaryAsString
    5d436a1 [Yin Huai] The initial support of adding a conf to treat binary columns stored in Parquet as string columns.
    
    (cherry picked from commit add75d4831fdc35712bf8b737574ea0bc677c37c)
    Signed-off-by: Michael Armbrust <mi...@databricks.com>

commit 850abaa36043104e5f09bf2754d1ae3f9ce86e3d
Author: Ahir Reddy <ah...@gmail.com>
Date:   2014-08-14T17:48:52Z

    [SQL] Python JsonRDD UTF8 Encoding Fix
    
    Only encode unicode objects to UTF-8, and not strings
    
    Author: Ahir Reddy <ah...@gmail.com>
    
    Closes #1914 from ahirreddy/json-rdd-unicode-fix1 and squashes the following commits:
    
    ca4e9ba [Ahir Reddy] Encoding Fix
    
    (cherry picked from commit fde692b361773110c262abe219e7c8128bd76419)
    Signed-off-by: Michael Armbrust <mi...@databricks.com>

commit df25acdf447bfac9c41440f49bd3bbe1c5d34696
Author: wangfei <wa...@126.com>
Date:   2014-08-14T17:55:51Z

    [SPARK-2925] [sql]fix spark-sql and start-thriftserver shell bugs when set --driver-java-options
    
    https://issues.apache.org/jira/browse/SPARK-2925
    
    Run cmd like this will get the error
    bin/spark-sql --driver-java-options '-Xdebug -Xnoagent -Xrunjdwp:transport=dt_socket,address=8788,server=y,suspend=y'
    
    Error: Unrecognized option '-Xnoagent'.
    Run with --help for usage help or --verbose for debug output
    
    Author: wangfei <wa...@126.com>
    Author: wangfei <wa...@huawei.com>
    
    Closes #1851 from scwf/patch-2 and squashes the following commits:
    
    516554d [wangfei] quote variables to fix this issue
    8bd40f2 [wangfei] quote variables to fix this problem
    e6d79e3 [wangfei] fix start-thriftserver bug when set driver-java-options
    948395d [wangfei] fix spark-sql error when set --driver-java-options
    
    (cherry picked from commit 267fdffe2743bc2dc706c8ac8af0ae33a358a5d3)
    Signed-off-by: Michael Armbrust <mi...@databricks.com>

commit a3dc54fa11c5323ec191df52c06443d3f96956d4
Author: Reynold Xin <rx...@apache.org>
Date:   2014-08-14T18:22:41Z

    Minor cleanup of metrics.Source
    
    - Added override.
    - Marked some variables as private.
    
    Author: Reynold Xin <rx...@apache.org>
    
    Closes #1943 from rxin/metricsSource and squashes the following commits:
    
    fbfa943 [Reynold Xin] Minor cleanup of metrics.Source. - Added override. - Marked some variables as private.
    
    (cherry picked from commit eaeb0f76fa0f103c7db0f3975cb8562715410973)
    Signed-off-by: Reynold Xin <rx...@apache.org>

commit dc8ef9387247e191406d8ff2df7af27bba007f53
Author: DB Tsai <db...@alpinenow.com>
Date:   2014-08-14T18:56:13Z

    [SPARK-2979][MLlib] Improve the convergence rate by minimizing the condition number
    
    In theory, the scale of your inputs are irrelevant to logistic regression.
    You can "theoretically" multiply X1 by 1E6 and the estimate for β1 will
    adjust accordingly. It will be 1E-6 times smaller than the original β1, due
    to the invariance property of MLEs.
    
    However, during the optimization process, the convergence (rate)
    depends on the condition number of the training dataset. Scaling
    the variables often reduces this condition number, thus improving
    the convergence rate.
    
    Without reducing the condition number, some training datasets
    mixing the columns with different scales may not be able to converge.
    
    GLMNET and LIBSVM packages perform the scaling to reduce
    the condition number, and return the weights in the original scale.
    See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf
    
    Here, if useFeatureScaling is enabled, we will standardize the training
    features by dividing the variance of each column (without subtracting
    the mean to densify the sparse vector), and train the model in the
    scaled space. Then we transform the coefficients from the scaled space
    to the original scale as GLMNET and LIBSVM do.
    
    Currently, it's only enabled in LogisticRegressionWithLBFGS.
    
    Author: DB Tsai <db...@alpinenow.com>
    
    Closes #1897 from dbtsai/dbtsai-feature-scaling and squashes the following commits:
    
    f19fc02 [DB Tsai] Added more comments
    1d85289 [DB Tsai] Improve the convergence rate by minimize the condition number in LOR with LBFGS
    
    (cherry picked from commit 96221067572e5955af1a7710b0cca33a73db4bd5)
    Signed-off-by: Xiangrui Meng <me...@databricks.com>

commit c39a3f337cfed86b3c75d90f33319498ed9a3255
Author: Michael Armbrust <mi...@databricks.com>
Date:   2014-08-14T20:00:21Z

    Revert  [SPARK-3011][SQL] _temporary directory should be filtered out by sqlContext.parquetFile
    
    Reverts #1924 due to build failures with hadoop 0.23.
    
    Author: Michael Armbrust <mi...@databricks.com>
    
    Closes #1949 from marmbrus/revert1924 and squashes the following commits:
    
    6bff940 [Michael Armbrust] Revert "[SPARK-3011][SQL] _temporary directory should be filtered out by sqlContext.parquetFile"
    
    (cherry picked from commit a7f8a4f5ee757450ce8d4028021441435081cf53)
    Signed-off-by: Michael Armbrust <mi...@databricks.com>

commit f5d9176fba934fa1f440d14d1ac7cd6f149434c4
Author: Jacek Lewandowski <le...@gmail.com>
Date:   2014-08-14T22:01:39Z

    SPARK-3009: Reverted readObject method in ApplicationInfo so that Applic...
    
    ...ationInfo is initialized properly after deserialization
    
    Author: Jacek Lewandowski <le...@gmail.com>
    
    Closes #1947 from jacek-lewandowski/master and squashes the following commits:
    
    713b2f1 [Jacek Lewandowski] SPARK-3009: Reverted readObject method in ApplicationInfo so that ApplicationInfo is initialized properly after deserialization
    
    (cherry picked from commit a75bc7a21db07258913d038bf604c0a3c1e55b46)
    Signed-off-by: Andrew Or <an...@gmail.com>

commit 475a35ba4f3a641a775bb4a71481bf95e6dd3509
Author: Reynold Xin <rx...@apache.org>
Date:   2014-08-14T23:27:11Z

    Make dev/mima runnable on Mac OS X.
    
    Mac OS X's find is from the BSD variant that doesn't have -printf option.
    
    Author: Reynold Xin <rx...@apache.org>
    
    Closes #1953 from rxin/mima and squashes the following commits:
    
    e284afe [Reynold Xin] Make dev/mima runnable on Mac OS X.
    
    (cherry picked from commit fa5a08e67d1086045ac249c2090c5e4d0a17b828)
    Signed-off-by: Reynold Xin <rx...@apache.org>

commit f99e4fc80615a1e0861359ab1ebc2e8335c7a022
Author: Reynold Xin <rx...@apache.org>
Date:   2014-08-15T01:37:02Z

    [SPARK-3027] TaskContext: tighten visibility and provide Java friendly callback API
    
    Note this also passes the TaskContext itself to the TaskCompletionListener. In the future we can mark TaskContext with the exception object if exception occurs during task execution.
    
    Author: Reynold Xin <rx...@apache.org>
    
    Closes #1938 from rxin/TaskContext and squashes the following commits:
    
    145de43 [Reynold Xin] Added JavaTaskCompletionListenerImpl for Java API friendly guarantee.
    f435ea5 [Reynold Xin] Added license header for TaskCompletionListener.
    dc4ed27 [Reynold Xin] [SPARK-3027] TaskContext: tighten the visibility and provide Java friendly callback API
    
    (cherry picked from commit 655699f8b7156e8216431393436368e80626cdb2)
    Signed-off-by: Reynold Xin <rx...@apache.org>

commit 72e730e9828bb3d88c69a36a241c2e332fca5629
Author: Kan Zhang <kz...@apache.org>
Date:   2014-08-15T02:03:51Z

    [SPARK-2736] PySpark converter and example script for reading Avro files
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-2736
    
    This patch includes:
    1. An Avro converter that converts Avro data types to Python. It handles all 3 Avro data mappings (Generic, Specific and Reflect).
    2. An example Python script for reading Avro files using AvroKeyInputFormat and the converter.
    3. Fixing a classloading issue.
    
    cc @MLnick @JoshRosen @mateiz
    
    Author: Kan Zhang <kz...@apache.org>
    
    Closes #1916 from kanzhang/SPARK-2736 and squashes the following commits:
    
    02443f8 [Kan Zhang] [SPARK-2736] Adding .avsc files to .rat-excludes
    f74e9a9 [Kan Zhang] [SPARK-2736] nit: clazz -> className
    82cc505 [Kan Zhang] [SPARK-2736] Update data sample
    0be7761 [Kan Zhang] [SPARK-2736] Example pyspark script and data files
    c8e5881 [Kan Zhang] [SPARK-2736] Trying to work with all 3 Avro data models
    2271a5b [Kan Zhang] [SPARK-2736] Using the right class loader to find Avro classes
    536876b [Kan Zhang] [SPARK-2736] Adding Avro to Java converter
    
    (cherry picked from commit 9422a9b084e3fd5b2b9be2752013588adfb430d0)
    Signed-off-by: Matei Zaharia <ma...@databricks.com>

commit d3cce5821ebdbe1e6a91bf7fe1efc00c23e62b08
Author: Reynold Xin <rx...@apache.org>
Date:   2014-08-11T03:36:54Z

    [SPARK-2936] Migrate Netty network module from Java to Scala
    
    The Netty network module was originally written when Scala 2.9.x had a bug that prevents a pure Scala implementation, and a subset of the files were done in Java. We have since upgraded to Scala 2.10, and can migrate all Java files now to Scala.
    
    https://github.com/netty/netty/issues/781
    
    https://github.com/mesos/spark/pull/522
    
    Author: Reynold Xin <rx...@apache.org>
    
    Closes #1865 from rxin/netty and squashes the following commits:
    
    332422f [Reynold Xin] Code review feedback
    ca9eeee [Reynold Xin] Minor update.
    7f1434b [Reynold Xin] [SPARK-2936] Migrate Netty network module from Java to Scala
    
    (cherry picked from commit ba28a8fcbc3ba432e7ea4d6f0b535450a6ec96c6)
    Signed-off-by: Reynold Xin <rx...@apache.org>

commit 3f23d2a38c3b6559902bc2ab6975ff6b0bec875e
Author: Reynold Xin <rx...@apache.org>
Date:   2014-08-15T02:01:33Z

    [SPARK-2468] Netty based block server / client module
    
    This is a rewrite of the original Netty module that was added about 1.5 years ago. The old code was turned off by default and didn't really work because it lacked a frame decoder (only worked with very very small blocks).
    
    For this pull request, I tried to make the changes non-instrusive to the rest of Spark. I only added an init and shutdown to BlockManager/DiskBlockManager, and a bunch of comments to help me understand the existing code base.
    
    Compared with the old Netty module, this one features:
    - It appears to work :)
    - SPARK-2941: option to specicy nio vs oio vs epoll for channel/transport. By default nio is used. (Not using Epoll yet because I have found some bugs with its implementation)
    - SPARK-2943: options to specify send buf and receive buf for users who want to do hyper tuning
    - SPARK-2942: io errors are reported from server to client (the protocol uses negative length to indicate error)
    - SPARK-2940: fetching multiple blocks in a single request to reduce syscalls
    - SPARK-2959: clients share a single thread pool
    - SPARK-2990: use PooledByteBufAllocator to reduce GC (basically a Netty managed pool of buffers with jmalloc)
    - SPARK-2625: added fetchWaitTime metric and fixed thread-safety issue in metrics update.
    - SPARK-2367: bump Netty version to 4.0.21.Final to address an Epoll bug (https://groups.google.com/forum/#!topic/netty/O7m-HxCJpCA)
    
    Compared with the existing communication manager, this one features:
    - IMO it is substantially easier to understand
    - zero-copy send for the server for on-disk blocks
    - one-copy receive (due to a frame decoder)
    - don't quote me on this, but I think a lot less sys calls
    - SPARK-2990: use PooledByteBufAllocator to reduce GC (basically a Netty managed pool of buffers with jmalloc)
    - SPARK-2941: option to specicy nio vs oio vs epoll for channel/transport. By default nio is used. (Not using Epoll yet because I have found some bugs with its implementation)
    - SPARK-2943: options to specify send buf and receive buf for users who want to do hyper tuning
    
    TODOs before it can fully replace the existing ConnectionManager, if that ever happens (most of them should probably be done in separate PRs since this needs to be turned on explicitly)
    - [x] Basic test cases
    - [ ] More unit/integration tests for failures
    - [ ] Performance analysis
    - [ ] Support client connection reuse so we don't need to keep opening new connections (not sure how useful this would be)
    - [ ] Support putting blocks in addition to fetching blocks (i.e. two way transfer)
    - [x] Support serving non-disk blocks
    - [ ] Support SASL authentication
    
    For a more comprehensive list, see https://issues.apache.org/jira/browse/SPARK-2468
    
    Thanks to @coderplay for peer coding with me on a Sunday.
    
    Author: Reynold Xin <rx...@apache.org>
    
    Closes #1907 from rxin/netty and squashes the following commits:
    
    f921421 [Reynold Xin] Upgrade Netty to 4.0.22.Final to fix another Epoll bug.
    4b174ca [Reynold Xin] Shivaram's code review comment.
    4a3dfe7 [Reynold Xin] Switched to nio for default (instead of epoll on Linux).
    56bfb9d [Reynold Xin] Bump Netty version to 4.0.21.Final for some bug fixes.
    b443a4b [Reynold Xin] Added debug message to help debug Jenkins failures.
    57fc4d7 [Reynold Xin] Added test cases for BlockHeaderEncoder and BlockFetchingClientHandlerSuite.
    22623e9 [Reynold Xin] Added exception handling and test case for BlockServerHandler and BlockFetchingClientHandler.
    6550dd7 [Reynold Xin] Fixed block mgr init bug.
    60c2edf [Reynold Xin] Beefed up server/client integration tests.
    38d88d5 [Reynold Xin] Added missing test files.
    6ce3f3c [Reynold Xin] Added some basic test cases.
    47f7ce0 [Reynold Xin] Created server and client packages and moved files there.
    b16f412 [Reynold Xin] Added commit count.
    f13022d [Reynold Xin] Remove unused clone() in BlockFetcherIterator.
    c57d68c [Reynold Xin] Added back missing files.
    842dfa7 [Reynold Xin] Made everything work with proper reference counting.
    3fae001 [Reynold Xin] Connected the new netty network module with rest of Spark.
    1a8f6d4 [Reynold Xin] Completed protocol documentation.
    2951478 [Reynold Xin] New Netty implementation.
    cc7843d [Reynold Xin] Basic skeleton.
    
    (cherry picked from commit 3a8b68b7353fea50245686903b308fa9eb52cb51)
    Signed-off-by: Reynold Xin <rx...@apache.org>

commit debb3e3df601bc64c97701565d2c992855f6cce9
Author: Anand Avati <av...@redhat.com>
Date:   2014-08-15T15:53:52Z

    [SPARK-2924] remove default args to overloaded methods
    
    Not supported in Scala 2.11. Split them into separate methods instead.
    
    Author: Anand Avati <av...@redhat.com>
    
    Closes #1704 from avati/SPARK-1812-default-args and squashes the following commits:
    
    3e3924a [Anand Avati] SPARK-1812: Add Mima excludes for the broken ABI
    901dfc7 [Anand Avati] SPARK-1812: core - Fix overloaded methods with default arguments
    07f00af [Anand Avati] SPARK-1812: streaming - Fix overloaded methods with default arguments
    (cherry picked from commit 7589c39d39a8d0744fb689e5752ee8e0108a81eb)
    
    Signed-off-by: Patrick Wendell <pw...@gmail.com>

commit b066af4efb8dc544576f9f818d4974ac129c2ba7
Author: Patrick Wendell <pw...@gmail.com>
Date:   2014-08-15T16:01:35Z

    Revert "[SPARK-2468] Netty based block server / client module"
    
    This reverts commit 3f23d2a38c3b6559902bc2ab6975ff6b0bec875e.

commit 63376a0eeffa611ccfdf1e023bc0cf3393d70139
Author: Sandy Ryza <sa...@cloudera.com>
Date:   2014-08-15T18:35:08Z

    SPARK-3028. sparkEventToJson should support SparkListenerExecutorMetrics...
    
    ...Update
    
    Author: Sandy Ryza <sa...@cloudera.com>
    
    Closes #1961 from sryza/sandy-spark-3028 and squashes the following commits:
    
    dccdff5 [Sandy Ryza] Fix compile error
    f883ded [Sandy Ryza] SPARK-3028. sparkEventToJson should support SparkListenerExecutorMetricsUpdate
    (cherry picked from commit 0afe5cb65a195d2f14e8dfcefdbec5dac023651f)
    
    Signed-off-by: Patrick Wendell <pw...@gmail.com>

commit 407ea9fd6f68ff3237726841b80dec61cbc7f51c
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-15T21:50:10Z

    [SPARK-3022] [SPARK-3041] [mllib] Call findBins once per level + unordered feature bug fix
    
    DecisionTree improvements:
    (1) TreePoint representation to avoid binning multiple times
    (2) Bug fix: isSampleValid indexed bins incorrectly for unordered categorical features
    (3) Timing for DecisionTree internals
    
    Details:
    
    (1) TreePoint representation to avoid binning multiple times
    
    [https://issues.apache.org/jira/browse/SPARK-3022]
    
    Added private[tree] TreePoint class for representing binned feature values.
    
    The input RDD of LabeledPoint is converted to the TreePoint representation initially and then cached.  This avoids the previous problem of re-computing bins multiple times.
    
    (2) Bug fix: isSampleValid indexed bins incorrectly for unordered categorical features
    
    [https://issues.apache.org/jira/browse/SPARK-3041]
    
    isSampleValid used to treat unordered categorical features incorrectly: It treated the bins as if indexed by featured values, rather than by subsets of values/categories.
    * exhibited for unordered features (multi-class classification with categorical features of low arity)
    * Fix: Index bins correctly for unordered categorical features.
    
    (3) Timing for DecisionTree internals
    
    Added tree/impl/TimeTracker.scala class which is private[tree] for now, for timing key parts of DT code.
    Prints timing info via logDebug.
    
    CC: mengxr manishamde chouqin  Very similar update, with one bug fix.  Many apologies for the conflicting update, but I hope that a few more optimizations I have on the way (which depend on this update) will prove valuable to you: SPARK-3042 and SPARK-3043
    
    Author: Joseph K. Bradley <jo...@gmail.com>
    
    Closes #1950 from jkbradley/dt-opt1 and squashes the following commits:
    
    5f2dec2 [Joseph K. Bradley] Fixed scalastyle issue in TreePoint
    6b5651e [Joseph K. Bradley] Updates based on code review.  1 major change: persisting to memory + disk, not just memory.
    2d2aaaf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
    430d782 [Joseph K. Bradley] Added more debug info on binning error.  Added some docs.
    d036089 [Joseph K. Bradley] Print timing info to logDebug.
    e66f1b1 [Joseph K. Bradley] TreePoint * Updated doc * Made some methods private
    8464a6e [Joseph K. Bradley] Moved TimeTracker to tree/impl/ in its own file, and cleaned it up.  Removed debugging println calls from DecisionTree.  Made TreePoint extend Serialiable
    a87e08f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
    0f676e2 [Joseph K. Bradley] Optimizations + Bug fix for DecisionTree
    3211f02 [Joseph K. Bradley] Optimizing DecisionTree * Added TreePoint representation to avoid calling findBin multiple times. * (not working yet, but debugging)
    f61e9d2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
    bcf874a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
    511ec85 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
    a95bc22 [Joseph K. Bradley] timing for DecisionTree internals
    
    (cherry picked from commit c7032290a3f0f5545aa4f0a9a144c62571344dc8)
    Signed-off-by: Xiangrui Meng <me...@databricks.com>

commit 077213bae09737ccb904f07b2766d43bb0734477
Author: Reynold Xin <rx...@apache.org>
Date:   2014-08-16T00:04:15Z

    [SPARK-3046] use executor's class loader as the default serializer classloader
    
    The serializer is not always used in an executor thread (e.g. connection manager, broadcast), in which case the classloader might not have the user jar set, leading to corruption in deserialization.
    
    https://issues.apache.org/jira/browse/SPARK-3046
    
    https://issues.apache.org/jira/browse/SPARK-2878
    
    Author: Reynold Xin <rx...@apache.org>
    
    Closes #1972 from rxin/kryoBug and squashes the following commits:
    
    c1c7bf0 [Reynold Xin] Made change to JavaSerializer.
    7204c33 [Reynold Xin] Added imports back.
    d879e67 [Reynold Xin] [SPARK-3046] use executor's class loader as the default serializer class loader.
    
    (cherry picked from commit cc3648774e9a744850107bb187f2828d447e0a48)
    Signed-off-by: Reynold Xin <rx...@apache.org>

commit c085011cac4df1bf4cbaef00a8b921ace6e3123b
Author: Xiangrui Meng <me...@databricks.com>
Date:   2014-08-16T04:04:29Z

    [SPARK-3078][MLLIB] Make LRWithLBFGS API consistent with others
    
    Should ask users to set parameters through the optimizer. dbtsai
    
    Author: Xiangrui Meng <me...@databricks.com>
    
    Closes #1973 from mengxr/lr-lbfgs and squashes the following commits:
    
    e3efbb1 [Xiangrui Meng] fix tests
    21b3579 [Xiangrui Meng] fix method name
    641eea4 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into lr-lbfgs
    456ab7c [Xiangrui Meng] update LRWithLBFGS
    
    (cherry picked from commit 5d25c0b74f6397d78164b96afb8b8cbb1b15cfbd)
    Signed-off-by: Xiangrui Meng <me...@databricks.com>

commit ce06d7f45bc551f6121c382b0833e01b8a83f636
Author: Xiangrui Meng <me...@databricks.com>
Date:   2014-08-16T04:07:55Z

    [SPARK-3001][MLLIB] Improve Spearman's correlation
    
    The current implementation requires sorting individual columns, which could be done with a global sort.
    
    result on a 32-node cluster:
    
    m | n | prev | this
    ---|---|-------|-----
    1000000 | 50 | 55s | 9s
    10000000 | 50 | 97s | 76s
    1000000 | 100  | 119s | 15s
    
    Author: Xiangrui Meng <me...@databricks.com>
    
    Closes #1917 from mengxr/spearman and squashes the following commits:
    
    4d5d262 [Xiangrui Meng] remove unused import
    85c48de [Xiangrui Meng] minor updates
    a048d0c [Xiangrui Meng] remove cache and set a limit to cachedIds
    b98bb18 [Xiangrui Meng] add comments
    0846e07 [Xiangrui Meng] first version
    
    (cherry picked from commit 2e069ca6560bf7ab07bd019f9530b42f4fe45014)
    Signed-off-by: Xiangrui Meng <me...@databricks.com>

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org