Posted to reviews@spark.apache.org by chesterxgchen <gi...@git.apache.org> on 2014/08/21 22:55:40 UTC

[GitHub] spark pull request: Spark 3175: Branch-1.1 SBT build failed for Ya...

GitHub user chesterxgchen opened a pull request:

    https://github.com/apache/spark/pull/2085

    Spark 3175: Branch-1.1 SBT build failed for Yarn-Alpha

        The issue is that yarn/alpha/pom.xml uses version 1.1.0 instead of 1.1.1-SNAPSHOT.
        Update the pom.xml to 1.1.1-SNAPSHOT (same as yarn/stable/pom.xml).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/AlpineNow/spark SPARK-3175

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2085.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2085
    
----
commit e22110879cd149e94c9a5ca7466f787033572b15
Author: Andrew Or <an...@gmail.com>
Date:   2014-08-02T19:11:50Z

    [HOTFIX] Do not throw NPE if spark.test.home is not set
    
    `spark.test.home` was introduced in #1734. This is fine for SBT but is failing maven tests. Either way it shouldn't throw an NPE.
    
    Author: Andrew Or <an...@gmail.com>
    
    Closes #1739 from andrewor14/fix-spark-test-home and squashes the following commits:
    
    ce2624c [Andrew Or] Do not throw NPE if spark.test.home is not set

commit 8d6ac2b95ab48d9fffe82ef04cef3b22c2c139e0
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-02T20:07:17Z

    [SPARK-2478] [mllib] DecisionTree Python API
    
    Added experimental Python API for Decision Trees.
    
    API:
    * class DecisionTreeModel
    ** predict() for single examples and RDDs, taking both feature vectors and LabeledPoints
    ** numNodes()
    ** depth()
    ** __str__()
    * class DecisionTree
    ** trainClassifier()
    ** trainRegressor()
    ** train()
    
    Examples and testing:
    * Added example testing classification and regression with batch prediction: examples/src/main/python/mllib/tree.py
    * Have also tested example usage in doc of python/pyspark/mllib/tree.py which tests single-example prediction with dense and sparse vectors
    
    Also: Small bug fix in python/pyspark/mllib/_common.py: In _linear_predictor_typecheck, changed check for RDD to use isinstance() instead of type() in order to catch RDD subclasses.
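
The difference between the two checks can be seen in a minimal sketch. The classes `RDD` and `SchemaRDD` below are hypothetical stand-ins, not the real PySpark classes, and the helper names are made up for illustration.

```python
class RDD:
    """Hypothetical stand-in for an RDD base class."""


class SchemaRDD(RDD):
    """Hypothetical stand-in for an RDD subclass."""


def is_rdd_by_type(x):
    # Exact type comparison: True only for RDD itself, not for subclasses.
    return type(x) == RDD


def is_rdd_by_isinstance(x):
    # isinstance() follows the inheritance chain, so subclasses match too.
    return isinstance(x, RDD)
```

With these definitions, `is_rdd_by_type(SchemaRDD())` is false while `is_rdd_by_isinstance(SchemaRDD())` is true, which is the case the fix is meant to catch.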
    
    CC mengxr manishamde
    
    Author: Joseph K. Bradley <jo...@gmail.com>
    
    Closes #1727 from jkbradley/decisiontree-python-new and squashes the following commits:
    
    3744488 [Joseph K. Bradley] Renamed test tree.py to decision_tree_runner.py Small updates based on github review.
    6b86a9d [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
    affceb9 [Joseph K. Bradley] * Fixed bug in doc tests in pyspark/mllib/util.py caused by change in loadLibSVMFile behavior.  (It used to threshold labels at 0 to make them 0/1, but it now leaves them as they are.) * Fixed small bug in loadLibSVMFile: If a data file had no features, then loadLibSVMFile would create a single all-zero feature.
    67a29bc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
    cf46ad7 [Joseph K. Bradley] Python DecisionTreeModel * predict(empty RDD) returns an empty RDD instead of an error. * Removed support for calling predict() on LabeledPoint and RDD[LabeledPoint] * predict() does not cache serialized RDD any more.
    aa29873 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
    bf21be4 [Joseph K. Bradley] removed old run() func from DecisionTree
    fa10ea7 [Joseph K. Bradley] Small style update
    7968692 [Joseph K. Bradley] small braces typo fix
    e34c263 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
    4801b40 [Joseph K. Bradley] Small style update to DecisionTreeSuite
    db0eab2 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix2' into decisiontree-python-new
    6873fa9 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
    225822f [Joseph K. Bradley] Bug: In DecisionTree, the method sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is the bound for unordered categorical features, not ordered ones. The upper bound should be the arity (i.e., max value) of the feature.
    93953f1 [Joseph K. Bradley] Likely done with Python API.
    6df89a9 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
    4562c08 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
    665ba78 [Joseph K. Bradley] Small updates towards Python DecisionTree API
    188cb0d [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
    6622247 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
    b8fac57 [Joseph K. Bradley] Finished Python DecisionTree API and example but need to test a bit more.
    2b20c61 [Joseph K. Bradley] Small doc and style updates
    1b29c13 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
    584449a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
    dab0b67 [Joseph K. Bradley] Added documentation for DecisionTree internals
    8bb8aa0 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
    978cfcf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
    6eed482 [Joseph K. Bradley] In DecisionTree: Changed from using procedural syntax for functions returning Unit to explicitly writing Unit return type.
    376dca2 [Joseph K. Bradley] Updated meaning of maxDepth by 1 to fit scikit-learn and rpart. * In code, replaced usages of maxDepth <-- maxDepth + 1 * In params, replace settings of maxDepth <-- maxDepth - 1
    e06e423 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
    bab3f19 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
    59750f8 [Joseph K. Bradley] * Updated Strategy to check numClassesForClassification only if algo=Classification. * Updates based on comments: ** DecisionTreeRunner *** Made dataFormat arg default to libsvm ** Small cleanups ** tree.Node: Made recursive helper methods private, and renamed them.
    52e17c5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
    f5a036c [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
    da50db7 [Joseph K. Bradley] Added one more test to DecisionTreeSuite: stump with 2 continuous variables for binary classification.  Caused problems in past, but fixed now.
    8e227ea [Joseph K. Bradley] Changed Strategy so it only requires numClassesForClassification >= 2 for classification
    cd1d933 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
    8ea8750 [Joseph K. Bradley] Bug fix: Off-by-1 when finding thresholds for splits for continuous features.
    8a758db [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
    5fe44ed [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
    2283df8 [Joseph K. Bradley] 2 bug fixes.
    73fbea2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
    5f920a1 [Joseph K. Bradley] Demonstration of bug before submitting fix: Updated DecisionTreeSuite so that 3 tests fail.  Will describe bug in next commit.
    f825352 [Joseph K. Bradley] Wrote Python API and example for DecisionTree.  Also added toString, depth, and numNodes methods to DecisionTreeModel.
    
    (cherry picked from commit 3f67382e7c9c3f6a8f6ce124ab3fcb1a9c1a264f)
    Signed-off-by: Xiangrui Meng <me...@databricks.com>

commit 91de0dc1654d609dc1ff8fa9a07ba18043ad61c6
Author: Yin Huai <hu...@cse.ohio-state.edu>
Date:   2014-08-02T20:16:41Z

    [SQL] Set outputPartitioning of BroadcastHashJoin correctly.
    
    I think we will not generate the plan triggering this bug at this moment. But, let me explain it...
    
    Right now, we are using `left.outputPartitioning` as the `outputPartitioning` of a `BroadcastHashJoin`. We may have a wrong physical plan for cases like...
    ```sql
    SELECT l.key, count(*)
    FROM (SELECT key, count(*) as cnt
          FROM src
          GROUP BY key) l -- This is buildPlan
    JOIN r -- This is the streamedPlan
    ON (l.cnt = r.value)
    GROUP BY l.key
    ```
    Let's say we have a `BroadcastHashJoin` on `l` and `r`. For this case, we will pick `l`'s `outputPartitioning` for the `outputPartitioning` of the `BroadcastHashJoin` on `l` and `r`. Also, because the last `GROUP BY` is using `l.key` as the key, we will not introduce an `Exchange` for this aggregation. However, `r`'s `outputPartitioning` may not match the required distribution of the last `GROUP BY`, and we fail to group data correctly.
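
The failure mode can be imitated with a toy model in plain Python. This is only a sketch under stated assumptions: partitions are modeled as lists, the data and all function names are made up, and none of it is Spark code.

```python
from collections import Counter

# Build side `l` after its inner GROUP BY: (key, cnt) rows, broadcast everywhere.
build_rows = [("x", 2), ("y", 2)]

# Streamed side `r`: its `value` column, already split across two partitions.
streamed_partitions = [[2], [2]]


def broadcast_hash_join(build, partitions):
    """Join every streamed partition against the broadcast build rows.

    The output keeps the streamed side's partition layout.
    """
    keys_by_cnt = {}
    for key, cnt in build:
        keys_by_cnt.setdefault(cnt, []).append(key)
    return [
        [key for value in part for key in keys_by_cnt.get(value, [])]
        for part in partitions
    ]


def group_without_exchange(partitions):
    """Aggregate inside each partition only, as if no Exchange were planned."""
    result = []
    for part in partitions:
        result.extend(sorted(Counter(part).items()))
    return result


def group_with_exchange(partitions):
    """Bring all rows for a key together first, then aggregate."""
    return sorted(Counter(key for part in partitions for key in part).items())


joined = broadcast_hash_join(build_rows, streamed_partitions)
```

In this toy run, grouping without an exchange reports each key twice with a count of 1, while grouping after an exchange reports each key once with a count of 2.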
    
    JIRA is being reindexed. I will create a JIRA ticket once it is back online.
    
    Author: Yin Huai <hu...@cse.ohio-state.edu>
    
    Closes #1735 from yhuai/BroadcastHashJoin and squashes the following commits:
    
    96d9cb3 [Yin Huai] Set outputPartitioning correctly.
    
    (cherry picked from commit 67bd8e3c217a80c3117a6e3853aa60fe13d08c91)
    Signed-off-by: Michael Armbrust <mi...@databricks.com>

commit bb0ac6d7c91c491a99c252e6cb4aea40efe9b190
Author: Chris Fregly <ch...@fregly.com>
Date:   2014-08-02T20:35:35Z

    [SPARK-1981] Add AWS Kinesis streaming support
    
    Author: Chris Fregly <ch...@fregly.com>
    
    Closes #1434 from cfregly/master and squashes the following commits:
    
    4774581 [Chris Fregly] updated docs, renamed retry to retryRandom to be more clear, removed retries around store() method
    0393795 [Chris Fregly] moved Kinesis examples out of examples/ and back into extras/kinesis-asl
    691a6be [Chris Fregly] fixed tests and formatting, fixed a bug with JavaKinesisWordCount during union of streams
    0e1c67b [Chris Fregly] Merge remote-tracking branch 'upstream/master'
    74e5c7c [Chris Fregly] updated per TD's feedback.  simplified examples, updated docs
    e33cbeb [Chris Fregly] Merge remote-tracking branch 'upstream/master'
    bf614e9 [Chris Fregly] per matei's feedback:  moved the kinesis examples into the examples/ dir
    d17ca6d [Chris Fregly] per TD's feedback:  updated docs, simplified the KinesisUtils api
    912640c [Chris Fregly] changed the foundKinesis class to be a publically-avail class
    db3eefd [Chris Fregly] Merge remote-tracking branch 'upstream/master'
    21de67f [Chris Fregly] Merge remote-tracking branch 'upstream/master'
    6c39561 [Chris Fregly] parameterized the versions of the aws java sdk and kinesis client
    338997e [Chris Fregly] improve build docs for kinesis
    828f8ae [Chris Fregly] more cleanup
    e7c8978 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
    cd68c0d [Chris Fregly] fixed typos and backward compatibility
    d18e680 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
    b3b0ff1 [Chris Fregly] [SPARK-1981] Add AWS Kinesis streaming support
    
    (cherry picked from commit 91f9504e6086fac05b40545099f9818949c24bca)
    Signed-off-by: Tathagata Das <ta...@gmail.com>

commit 7924d72cf8aae945d72f355c54c4fcb3d62e6c48
Author: GuoQiang Li <wi...@qq.com>
Date:   2014-08-02T20:55:28Z

    SPARK-2804: Remove scalalogging-slf4j dependency
    
    This also Closes #1701.
    
    Author: GuoQiang Li <wi...@qq.com>
    
    Closes #1208 from witgo/SPARK-1470 and squashes the following commits:
    
    422646b [GuoQiang Li] Remove scalalogging-slf4j dependency

commit 3b9f25f4259b254f3faa2a7d61e547089a69c259
Author: Michael Armbrust <mi...@databricks.com>
Date:   2014-08-02T23:33:48Z

    [SPARK-2097][SQL] UDF Support
    
    This patch adds the ability to register lambda functions written in Python, Java or Scala as UDFs for use in SQL or HiveQL.
    
    Scala:
    ```scala
    registerFunction("strLenScala", (_: String).length)
    sql("SELECT strLenScala('test')")
    ```
    Python:
    ```python
    sqlCtx.registerFunction("strLenPython", lambda x: len(x), IntegerType())
    sqlCtx.sql("SELECT strLenPython('test')")
    ```
    Java:
    ```java
    sqlContext.registerFunction("stringLengthJava", new UDF1<String, Integer>() {
      @Override
      public Integer call(String str) throws Exception {
        return str.length();
      }
    }, DataType.IntegerType);
    
    sqlContext.sql("SELECT stringLengthJava('test')");
    ```
    
    Author: Michael Armbrust <mi...@databricks.com>
    
    Closes #1063 from marmbrus/udfs and squashes the following commits:
    
    9eda0fe [Michael Armbrust] newline
    747c05e [Michael Armbrust] Add some scala UDF tests.
    d92727d [Michael Armbrust] Merge remote-tracking branch 'apache/master' into udfs
    005d684 [Michael Armbrust] Fix naming and formatting.
    d14dac8 [Michael Armbrust] Fix last line of autogened java files.
    8135c48 [Michael Armbrust] Move UDF unit tests to pyspark.
    40b0ffd [Michael Armbrust] Merge remote-tracking branch 'apache/master' into udfs
    6a36890 [Michael Armbrust] Switch logging so that SQLContext can be serializable.
    7a83101 [Michael Armbrust] Drop toString
    795fd15 [Michael Armbrust] Try to avoid capturing SQLContext.
    e54fb45 [Michael Armbrust] Docs and tests.
    437cbe3 [Michael Armbrust] Update use of dataTypes, fix some python tests, address review comments.
    01517d6 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into udfs
    8e6c932 [Michael Armbrust] WIP
    3f96a52 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into udfs
    6237c8d [Michael Armbrust] WIP
    2766f0b [Michael Armbrust] Move udfs support to SQL from hive. Add support for Java UDFs.
    0f7d50c [Michael Armbrust] Draft of native Spark SQL UDFs for Scala and Python.
    
    (cherry picked from commit 158ad0bba9382fd494b4789b5628a9cec00cfa19)
    Signed-off-by: Michael Armbrust <mi...@databricks.com>

commit 4230df4e1d6c59dc3405f46f5edf18c3825a5447
Author: Michael Armbrust <mi...@databricks.com>
Date:   2014-08-02T23:48:07Z

    [SPARK-2785][SQL] Remove assertions that throw when users try unsupported Hive commands.
    
    Author: Michael Armbrust <mi...@databricks.com>
    
    Closes #1742 from marmbrus/asserts and squashes the following commits:
    
    5182d54 [Michael Armbrust] Remove assertions that throw when users try unsupported Hive commands.
    
    (cherry picked from commit 198df11f1a9f419f820f47eba0e9f2ab371a824b)
    Signed-off-by: Michael Armbrust <mi...@databricks.com>

commit 460fad817da1fb6619d2456f637c1b7c7f5e8c7c
Author: Cheng Lian <li...@gmail.com>
Date:   2014-08-03T00:12:49Z

    [SPARK-2729][SQL] Added test case for SPARK-2729
    
    This is a follow up of #1636.
    
    Author: Cheng Lian <li...@gmail.com>
    
    Closes #1738 from liancheng/test-for-spark-2729 and squashes the following commits:
    
    b13692a [Cheng Lian] Added test case for SPARK-2729
    
    (cherry picked from commit 866cf1f822cfda22294054be026ef2d96307eb75)
    Signed-off-by: Michael Armbrust <mi...@databricks.com>

commit 5ef828273deb4713a49700c56d51bdd980917cfd
Author: Yin Huai <hu...@cse.ohio-state.edu>
Date:   2014-08-03T00:55:22Z

    [SPARK-2797] [SQL] SchemaRDDs don't support unpersist()
    
    The cause is explained in https://issues.apache.org/jira/browse/SPARK-2797.
    
    Author: Yin Huai <hu...@cse.ohio-state.edu>
    
    Closes #1745 from yhuai/SPARK-2797 and squashes the following commits:
    
    7b1627d [Yin Huai] The unpersist method of the Scala RDD cannot be called without the input parameter (blocking) from PySpark.
    
    (cherry picked from commit d210022e96804e59e42ab902e53637e50884a9ab)
    Signed-off-by: Michael Armbrust <mi...@databricks.com>

commit 5b30e001839a29e6c4bd1fc24bfa12d9166ef10c
Author: Michael Armbrust <mi...@databricks.com>
Date:   2014-08-03T01:27:04Z

    [SPARK-2739][SQL] Rename registerAsTable to registerTempTable
    
    There have been user complaints that the difference between `registerAsTable` and `saveAsTable` is too subtle.  This PR addresses this by renaming `registerAsTable` to `registerTempTable`, which more clearly reflects what is happening.  `registerAsTable` remains, but will cause a deprecation warning.
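
A rename that keeps the old name as a deprecated alias typically looks like the sketch below. The class `SchemaRDDSketch` and its `registered` list are hypothetical and only illustrate the pattern; this is not the actual Spark implementation.

```python
import warnings


class SchemaRDDSketch:
    """Hypothetical class showing a rename with a deprecated alias."""

    def __init__(self):
        self.registered = []

    def registerTempTable(self, name):
        # New, clearer name: registers a temporary table.
        self.registered.append(name)

    def registerAsTable(self, name):
        # Old name still works, but warns and forwards to the new method.
        warnings.warn(
            "registerAsTable is deprecated; use registerTempTable instead",
            DeprecationWarning,
            stacklevel=2,
        )
        self.registerTempTable(name)
```

Callers of the old name keep working and see a warning that points them to the new one.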
    
    Author: Michael Armbrust <mi...@databricks.com>
    
    Closes #1743 from marmbrus/registerTempTable and squashes the following commits:
    
    d031348 [Michael Armbrust] Merge remote-tracking branch 'apache/master' into registerTempTable
    4dff086 [Michael Armbrust] Fix .java files too
    89a2f12 [Michael Armbrust] Merge remote-tracking branch 'apache/master' into registerTempTable
    0b7b71e [Michael Armbrust] Rename registerAsTable to registerTempTable
    
    (cherry picked from commit 1a8043739dc1d9435def6ea3c6341498ba52b708)
    Signed-off-by: Michael Armbrust <mi...@databricks.com>

commit 0d47bb642f645c3c8663f4bdf869b5337ef9cb35
Author: Sean Owen <sr...@gmail.com>
Date:   2014-08-03T04:44:19Z

    SPARK-2602 [BUILD] Tests steal focus under Java 6
    
    As per https://issues.apache.org/jira/browse/SPARK-2602 , this may be resolved for Java 6 with the java.awt.headless system property, which never hurt anyone running a command line app. I tested it and it seemed to get rid of the focus stealing.
    
    Author: Sean Owen <sr...@gmail.com>
    
    Closes #1747 from srowen/SPARK-2602 and squashes the following commits:
    
    b141018 [Sean Owen] Set java.awt.headless during tests
    (cherry picked from commit 33f167d762483b55d5d874dcc1e3075f661d4375)
    
    Signed-off-by: Patrick Wendell <pw...@gmail.com>

commit c137928cbe74446254fdbd656c50c1a1c8930094
Author: Sean Owen <sr...@gmail.com>
Date:   2014-08-03T04:55:56Z

    SPARK-2414 [BUILD] Add LICENSE entry for jquery
    
    The JIRA concerned removing jquery, and this does not remove jquery. While it is distributed by Spark it should have an accompanying line in LICENSE, very technically, as per http://www.apache.org/dev/licensing-howto.html
    
    Author: Sean Owen <sr...@gmail.com>
    
    Closes #1748 from srowen/SPARK-2414 and squashes the following commits:
    
    2fdb03c [Sean Owen] Add LICENSE entry for jquery
    (cherry picked from commit 9cf429aaf529e91f619910c33cfe46bf33a66982)
    
    Signed-off-by: Patrick Wendell <pw...@gmail.com>

commit fb2a2079fa10ea8f338d68945a94238dda9fbd66
Author: Andrew Or <an...@gmail.com>
Date:   2014-08-03T05:00:46Z

    [Minor] Fixes on top of #1679
    
    Minor fixes on top of #1679.
    
    Author: Andrew Or <an...@gmail.com>
    
    Closes #1736 from andrewor14/amend-#1679 and squashes the following commits:
    
    3b46f5e [Andrew Or] Minor fixes
    (cherry picked from commit 3dc55fdf450b4237f7c592fce56d1467fd206366)
    
    Signed-off-by: Patrick Wendell <pw...@gmail.com>

commit 1992175fd93f0239e5a09e0b8db99ad9af7f380c
Author: Stephen Boesch <ja...@gmail.com>
Date:   2014-08-03T17:19:04Z

    SPARK-2712 - Add a small note to maven doc that mvn package must happen ...
    
    Per request by Reynold adding small note about proper sequencing of build then test.
    
    Author: Stephen Boesch <ja...@gmail.com>
    
    Closes #1615 from javadba/docs and squashes the following commits:
    
    6c3183e [Stephen Boesch] Moved updated testing blurb per PWendell
    5764757 [Stephen Boesch] SPARK-2712 - Add a small note to maven doc that mvn package must happen before test
    (cherry picked from commit f8cd143b6b1b4d8aac87c229e5af263b0319b3ea)
    
    Signed-off-by: Patrick Wendell <pw...@gmail.com>

commit 162fc9512018e0c592b3aaa29d405f511461795a
Author: Allan Douglas R. de Oliveira <al...@chaordicsystems.com>
Date:   2014-08-03T17:25:59Z

    SPARK-2246: Add user-data option to EC2 scripts
    
    Author: Allan Douglas R. de Oliveira <al...@chaordicsystems.com>
    
    Closes #1186 from douglaz/spark_ec2_user_data and squashes the following commits:
    
    94a36f9 [Allan Douglas R. de Oliveira] Added user data option to EC2 script
    (cherry picked from commit a0bcbc159e89be868ccc96175dbf1439461557e1)
    
    Signed-off-by: Patrick Wendell <pw...@gmail.com>

commit eaa93555a7f935b00a2f94a7fa50a12e11578bd7
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-03T17:36:52Z

    [SPARK-2197] [mllib] Java DecisionTree bug fix and ease-of-use
    
    Bug fix: Before, when an RDD was created in Java and passed to DecisionTree.train(), the fake class tag caused problems.
    * Fix: DecisionTree: Used new RDD.retag() method to allow passing RDDs from Java.
    
    Other improvements to Decision Trees for ease-of-use with Java:
    * impurity classes: Added instance() methods to help with Java interface.
    * Strategy: Added Java-friendly constructor
    --> Note: I removed quantileCalculationStrategy from the Java-friendly constructor since (a) it is a special class and (b) there is only 1 option currently.  I suspect we will redo the API before the other options are included.
    
    CC: mengxr
    
    Author: Joseph K. Bradley <jo...@gmail.com>
    
    Closes #1740 from jkbradley/dt-java-new and squashes the following commits:
    
    0805dc6 [Joseph K. Bradley] Changed Strategy to use JavaConverters instead of JavaConversions
    519b1b7 [Joseph K. Bradley] * Organized imports in JavaDecisionTreeSuite.java * Using JavaConverters instead of JavaConversions in DecisionTreeSuite.scala
    f7b5ca1 [Joseph K. Bradley] Improvements to make it easier to run DecisionTree from Java. * DecisionTree: Used new RDD.retag() method to allow passing RDDs from Java. * impurity classes: Added instance() methods to help with Java interface. * Strategy: Added Java-friendly constructor ** Note: I removed quantileCalculationStrategy from the Java-friendly constructor since (a) it is a special class and (b) there is only 1 option currently.  I suspect we will redo the API before the other options are included.
    d78ada6 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-java
    320853f [Joseph K. Bradley] Added JavaDecisionTreeSuite, partly written
    13a585e [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-java
    f1a8283 [Joseph K. Bradley] Added old JavaDecisionTreeSuite, to be updated later
    225822f [Joseph K. Bradley] Bug: In DecisionTree, the method sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is the bound for unordered categorical features, not ordered ones. The upper bound should be the arity (i.e., max value) of the feature.
    
    (cherry picked from commit 2998e38a942351974da36cb619e863c6f0316e7a)
    Signed-off-by: Xiangrui Meng <me...@databricks.com>

commit c5ed1deba6b3f3e597554a8d0f93f402ae62fab9
Author: Michael Armbrust <mi...@databricks.com>
Date:   2014-08-03T19:28:29Z

    [SPARK-2784][SQL] Deprecate hql() method in favor of a config option, 'spark.sql.dialect'
    
    Many users have reported being confused by the distinction between the `sql` and `hql` methods.  Specifically, many users think that `sql(...)` cannot be used to read Hive tables.  In this PR I introduce a new configuration option `spark.sql.dialect` that picks which dialect will be used for parsing.  For `SQLContext` this must be set to `sql`.  In `HiveContext` it defaults to `hiveql` but can also be set to `sql`.
    
    The `hql` and `hiveql` methods continue to act the same but are now marked as deprecated.
    
    **This is a possibly breaking change for some users unless they set the dialect manually, though this is unlikely.**
    
    For example: `hiveContext.sql("SELECT 1")` will now throw a parsing exception by default.
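
The dialect rules described here can be summarized in a small sketch. The class names, `set_conf`, and `dialect` are hypothetical and model only the defaults and allowed values stated in this message, not the actual parsers or the Spark API.

```python
class SQLContextSketch:
    """Hypothetical model: plain SQL context, only the `sql` dialect is allowed."""

    default_dialect = "sql"
    allowed_dialects = ("sql",)

    def __init__(self):
        self.conf = {}

    def set_conf(self, key, value):
        self.conf[key] = value

    def dialect(self):
        # Look up the configured dialect, falling back to the context default.
        chosen = self.conf.get("spark.sql.dialect", self.default_dialect)
        if chosen not in self.allowed_dialects:
            raise ValueError("unsupported dialect: %s" % chosen)
        return chosen


class HiveContextSketch(SQLContextSketch):
    """Hypothetical model: defaults to `hiveql`, but `sql` may be selected."""

    default_dialect = "hiveql"
    allowed_dialects = ("sql", "hiveql")
```

Under this model a Hive context parses with `hiveql` unless the option is set to `sql` explicitly.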
    
    Author: Michael Armbrust <mi...@databricks.com>
    
    Closes #1746 from marmbrus/sqlLanguageConf and squashes the following commits:
    
    ad375cc [Michael Armbrust] Merge remote-tracking branch 'apache/master' into sqlLanguageConf
    20c43f8 [Michael Armbrust] override function instead of just setting the value
    7e4ae93 [Michael Armbrust] Deprecate hql() method in favor of a config option, 'spark.sql.dialect'
    
    (cherry picked from commit 236dfac6769016e433b2f6517cda2d308dea74bc)
    Signed-off-by: Michael Armbrust <mi...@databricks.com>

commit 6ffdcc61fb4825f991b754c45b807192f483a4a3
Author: Cheng Lian <li...@gmail.com>
Date:   2014-08-03T19:34:46Z

    [SPARK-2814][SQL] HiveThriftServer2 throws NPE when executing native commands
    
    JIRA issue: [SPARK-2814](https://issues.apache.org/jira/browse/SPARK-2814)
    
    Author: Cheng Lian <li...@gmail.com>
    
    Closes #1753 from liancheng/spark-2814 and squashes the following commits:
    
    c74a3b2 [Cheng Lian] Fixed SPARK-2814
    
    (cherry picked from commit ac33cbbf33bd1ab29bc8165c9be02fb8934b1fdf)
    Signed-off-by: Michael Armbrust <mi...@databricks.com>

commit 7c6afdac867d52447221438ed7508123c07d17f8
Author: Yin Huai <hu...@cse.ohio-state.edu>
Date:   2014-08-03T21:54:41Z

    [SPARK-2783][SQL] Basic support for analyze in HiveContext
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-2783
    
    Author: Yin Huai <hu...@cse.ohio-state.edu>
    
    Closes #1741 from yhuai/analyzeTable and squashes the following commits:
    
    7bb5f02 [Yin Huai] Use sql instead of hql.
    4d09325 [Yin Huai] Merge remote-tracking branch 'upstream/master' into analyzeTable
    e3ebcd4 [Yin Huai] Renaming.
    c170f4e [Yin Huai] Do not use getContentSummary.
    62393b6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into analyzeTable
    db233a6 [Yin Huai] Trying to debug jenkins...
    fee84f0 [Yin Huai] Merge remote-tracking branch 'upstream/master' into analyzeTable
    f0501f3 [Yin Huai] Fix compilation error.
    24ad391 [Yin Huai] Merge remote-tracking branch 'upstream/master' into analyzeTable
    8918140 [Yin Huai] Wording.
    23df227 [Yin Huai] Add a simple analyze method to get the size of a table and update the "totalSize" property of this table in the Hive metastore.
    
    (cherry picked from commit e139e2be60ef23281327744e1b3e74904dfdf63f)
    Signed-off-by: Michael Armbrust <mi...@databricks.com>

commit a4cdb77e5ee2c80967a7b6cd7370170fabe56cd2
Author: Davies Liu <da...@gmail.com>
Date:   2014-08-03T22:52:00Z

    [SPARK-1740] [PySpark] kill the python worker
    
    Kill only the python worker related to cancelled tasks.
    
    The daemon starts a background thread that monitors the open sockets of all workers. If a socket is closed by the JVM, this thread kills the corresponding worker.
    
    When a task is cancelled, the socket to its worker is closed, and the worker is then killed by the daemon.
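
One way to detect that the other end of a socket has been closed is sketched below. It is an assumption-laden illustration, not the daemon's actual code: the mapping from worker pid to socket and both function names are hypothetical, and the sketch only reports which workers should be killed instead of killing them.

```python
import select
import socket


def closed_by_peer(sock):
    """Return True if the other end of `sock` has been closed."""
    readable, _, _ = select.select([sock], [], [], 0)
    if not readable:
        return False  # nothing pending, connection still open
    try:
        # An empty peek on a readable socket means end-of-stream.
        return sock.recv(1, socket.MSG_PEEK) == b""
    except ConnectionResetError:
        return True


def workers_to_kill(worker_sockets):
    """Given {pid: socket}, list the pids whose socket was closed by the peer."""
    return [pid for pid, sock in worker_sockets.items() if closed_by_peer(sock)]
```

A monitoring thread could call `workers_to_kill` in a loop and signal only the returned pids, leaving workers for other tasks untouched.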
    
    Author: Davies Liu <da...@gmail.com>
    
    Closes #1643 from davies/kill and squashes the following commits:
    
    8ffe9f3 [Davies Liu] kill worker by deamon, because runtime.exec() is too heavy
    46ca150 [Davies Liu] address comment
    acd751c [Davies Liu] kill the worker when task is canceled
    
    (cherry picked from commit 55349f9fe81ba5af5e4a5e4908ebf174e63c6cc9)
    Signed-off-by: Josh Rosen <jo...@apache.org>

commit 4784d24eadea2e1adf69d8fe4891bdce29188dd6
Author: Anand Avati <av...@redhat.com>
Date:   2014-08-04T00:47:49Z

    [SPARK-2810] upgrade to scala-maven-plugin 3.2.0
    
    Needed for Scala 2.11 compiler-interface
    
    Signed-off-by: Anand Avati <avatiredhat.com>
    
    Author: Anand Avati <av...@redhat.com>
    
    Closes #1711 from avati/SPARK-1812-scala-maven-plugin and squashes the following commits:
    
    9a22fc8 [Anand Avati] SPARK-1812: upgrade to scala-maven-plugin 3.2.0

commit 2152e24d64d6a07cf6c550c9f13ab0231596be98
Author: Sarah Gerweck <sa...@gmail.com>
Date:   2014-08-04T02:47:05Z

    Fix some bugs with spaces in directory name.
    
    Any time you use the directory name (`FWDIR`) it needs to be surrounded
    in quotes. If you're also using wildcards, you can safely put the quotes
    around just `$FWDIR`.
    
    Author: Sarah Gerweck <sa...@gmail.com>
    
    Closes #1756 from sarahgerweck/folderSpaces and squashes the following commits:
    
    732629d [Sarah Gerweck] Fix some bugs with spaces in directory name.
    (cherry picked from commit 5507dd8e18fbb52d5e0c64a767103b2418cb09c6)
    
    Signed-off-by: Patrick Wendell <pw...@gmail.com>

commit 9aa14598f89bb8b908222e37f965178d39c34fe6
Author: DB Tsai <db...@alpinenow.com>
Date:   2014-08-04T04:39:21Z

    SPARK-2272 [MLlib] Feature scaling which standardizes the range of independent variables or features of data
    
    Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is generally performed during the data preprocessing step.
    
    In this work, a trait called `VectorTransformer` is defined for generic transformation on a vector. It contains one method to be implemented, `transform` which applies transformation on a vector.
    
    There are two implementations of `VectorTransformer` now, and both can be easily extended with PMML transformation support.
    
    1) `StandardScaler` - Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.
    
    2) `Normalizer` - Normalizes samples individually to unit L^n norm
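
The two transformers can be sketched in plain Python as follows. This is not the MLlib code: vectors are modeled as lists of floats, the `fit` method and the parameter `p` (the exponent of the norm, the "n" in L^n above) are assumptions, and the scaler is assumed to use the sample standard deviation (dividing by n - 1).

```python
import math
from abc import ABC, abstractmethod


class VectorTransformer(ABC):
    """Generic transformation on a single vector."""

    @abstractmethod
    def transform(self, vector):
        ...


class StandardScaler(VectorTransformer):
    """Removes the column mean and scales each column to unit variance."""

    def fit(self, rows):
        n, dims = len(rows), len(rows[0])
        self.mean = [sum(r[j] for r in rows) / n for j in range(dims)]
        # Assumption: sample standard deviation per column (divide by n - 1).
        self.std = [
            math.sqrt(sum((r[j] - self.mean[j]) ** 2 for r in rows) / (n - 1))
            for j in range(dims)
        ]
        return self

    def transform(self, vector):
        # Constant columns (std == 0) are mapped to 0.0 to avoid dividing by zero.
        return [
            (x - m) / s if s > 0 else 0.0
            for x, m, s in zip(vector, self.mean, self.std)
        ]


class Normalizer(VectorTransformer):
    """Scales each vector individually to unit p-norm."""

    def __init__(self, p=2.0):
        self.p = p

    def transform(self, vector):
        norm = sum(abs(x) ** self.p for x in vector) ** (1.0 / self.p)
        return [x / norm for x in vector] if norm > 0 else list(vector)
```

The scaler needs a pass over the training set before it can transform anything, while the normalizer works on each vector in isolation.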
    
    Author: DB Tsai <db...@alpinenow.com>
    
    Closes #1207 from dbtsai/dbtsai-feature-scaling and squashes the following commits:
    
    78c15d3 [DB Tsai] Alpine Data Labs
    
    (cherry picked from commit ae58aea2d1435b5bb011e68127e1bcddc2edf5b2)
    Signed-off-by: Xiangrui Meng <me...@databricks.com>

commit 3823f6d25e2a89ca1bfa62a76f6e708c2c63f064
Author: Liquan Pei <lp...@gopivotal.com>
Date:   2014-08-04T06:55:58Z

    [MLlib] [SPARK-2510]Word2Vec: Distributed Representation of Words
    
    This is a pull request regarding SPARK-2510 at https://issues.apache.org/jira/browse/SPARK-2510. Word2Vec creates vector representation of words in a text corpus. The algorithm first constructs a vocabulary from the corpus and then learns vector representation of words in the vocabulary. The vector representation can be used as features in natural language processing and machine learning algorithms.
    
    To make our implementation more scalable, we train each partition separately and merge the model of each partition after each iteration. To make the model more accurate, multiple iterations may be needed.
    
    One way to investigate the vector representations is to find the closest words to a query word. For example, with 1 partition and 1 iteration, the top 20 closest words to "china" are:
    
    taiwan 0.8077646146334014
    korea 0.740913304563621
    japan 0.7240667798885471
    republic 0.7107151279078352
    thailand 0.6953217332072862
    tibet 0.6916782118129544
    mongolia 0.6800858715972612
    macau 0.6794925677480378
    singapore 0.6594048695593799
    manchuria 0.658989931844148
    laos 0.6512978726001666
    nepal 0.6380792327845325
    mainland 0.6365469459587788
    myanmar 0.6358614338840394
    macedonia 0.6322366180313249
    xinjiang 0.6285291551708028
    russia 0.6279951236068411
    india 0.6272874944023487
    shanghai 0.6234544135576999
    macao 0.6220588462925876
    
    The result with 10 partitions and 5 iterations is:
    taiwan 0.8310495079388313
    india 0.7737171315919039
    japan 0.756777901233668
    korea 0.7429767187102452
    indonesia 0.7407557427278356
    pakistan 0.712883426985585
    mainland 0.7053379963140822
    thailand 0.696298191073948
    mongolia 0.693690656871415
    laos 0.6913069680735292
    macau 0.6903427690029617
    republic 0.6766381604813666
    malaysia 0.676460699141784
    singapore 0.6728790997360923
    malaya 0.672345232966194
    manchuria 0.6703732292753156
    macedonia 0.6637955686322028
    myanmar 0.6589462882439646
    kazakhstan 0.657017801081494
    cambodia 0.6542383836451932
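
    Lists like the two above can be produced by ranking every vocabulary word by its similarity to the query word. The sketch below assumes cosine similarity as the score, which this message does not state explicitly; the toy vectors and function names are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def closest_words(query, vectors, k):
    """Return the k (word, score) pairs most similar to the query word."""
    scores = [(w, cosine(vectors[query], v))
              for w, v in vectors.items() if w != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Toy two-dimensional vectors, not real Word2Vec output.
toy = {"china": [1.0, 0.0], "taiwan": [0.9, 0.1], "banana": [0.0, 1.0]}
top = closest_words("china", toy, 2)
```

    With real models the same ranking would run over the whole learned vocabulary.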
    
    Author: Liquan Pei <lp...@gopivotal.com>
    Author: Xiangrui Meng <me...@databricks.com>
    Author: Liquan Pei <li...@gmail.com>
    
    Closes #1719 from Ishiihara/master and squashes the following commits:
    
    2ba9483 [Liquan Pei] minor fix for Word2Vec test
    e248441 [Liquan Pei] minor style change
    26a948d [Liquan Pei] Merge pull request #1 from mengxr/Ishiihara-master
    c14da41 [Xiangrui Meng] fix styles
    384c771 [Xiangrui Meng] remove minCount and window from constructor change model to use float instead of double
    e93e726 [Liquan Pei] use treeAggregate instead of aggregate
    1a8fb41 [Liquan Pei] use weighted sum in combOp
    7efbb6f [Liquan Pei] use broadcast version of vocab in aggregate
    6bcc8be [Liquan Pei] add multiple iteration support
    720b5a3 [Liquan Pei] Add test for Word2Vec algorithm, minor fixes
    2e92b59 [Liquan Pei] modify according to feedback
    57dc50d [Liquan Pei] code formatting
    e4a04d3 [Liquan Pei] minor fix
    0aafb1b [Liquan Pei] Add comments, minor fixes
    8d6befe [Liquan Pei] initial commit
    
    (cherry picked from commit e053c55819363fab7068bb9165e3379f0c2f570c)
    Signed-off-by: Xiangrui Meng <me...@databricks.com>

commit bfd2f39581d958d5aafaa76994f44213bcdfbb69
Author: Davies Liu <da...@gmail.com>
Date:   2014-08-04T19:13:41Z

    [SPARK-1687] [PySpark] pickable namedtuple
    
    Add a hook to replace the original namedtuple with a picklable one, so that namedtuple can be used in RDDs.
    
    PS: pyspark should be imported BEFORE "from collections import namedtuple"
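
    A minimal sketch of the general idea, assuming the class is made picklable by giving it a __reduce__ that rebuilds it from its name and field list. The helper names below are hypothetical and this is not the PySpark code. It also suggests why import order matters: a name bound by "from collections import namedtuple" before the replacement keeps pointing at the original factory.

```python
import collections


def _restore(name, fields, values):
    # Recreate the namedtuple class from its name and fields, then the instance.
    cls = collections.namedtuple(name, fields)
    return cls(*values)


def picklable_namedtuple(name, fields):
    """Hypothetical wrapper: a namedtuple class that carries its own recipe."""
    cls = collections.namedtuple(name, fields)

    def __reduce__(self):
        # pickle stores this (callable, args) pair and calls it when loading,
        # so the receiving side does not need to import the class by name.
        return (_restore, (name, cls._fields, tuple(self)))

    cls.__reduce__ = __reduce__
    return cls


Point = picklable_namedtuple("Point", ["x", "y"])
# Call the recipe by hand, the same way pickle would when loading.
func, args = Point(1, 2).__reduce__()
rebuilt = func(*args)
```

    The rebuilt object belongs to a freshly created class with the same name and fields, which is enough for field access on the receiving side.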
    
    Author: Davies Liu <da...@gmail.com>
    
    Closes #1623 from davies/namedtuple and squashes the following commits:
    
    045dad8 [Davies Liu] remove unrelated code changes
    4132f32 [Davies Liu] address comment
    55b1c1a [Davies Liu] fix tests
    61f86eb [Davies Liu] replace all the reference of namedtuple to new hacked one
    98df6c6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into namedtuple
    f7b1bde [Davies Liu] add hack for CloudPickleSerializer
    0c5c849 [Davies Liu] Merge branch 'master' of github.com:apache/spark into namedtuple
    21991e6 [Davies Liu] hack namedtuple in __main__ module, make it picklable.
    93b03b8 [Davies Liu] pickable namedtuple
    
    (cherry picked from commit 59f84a9531f7974a053fd4963ce9afd88273ea4c)
    Signed-off-by: Josh Rosen <jo...@apache.org>

commit aa7a48ee905b95e57f64051ea887d4775b427603
Author: Matei Zaharia <ma...@databricks.com>
Date:   2014-08-04T19:59:18Z

    SPARK-2792. Fix reading too much or too little data from each stream in ExternalMap / Sorter
    
    All these changes are from mridulm's work in #1609, but extracted here to fix this specific issue and make it easier to merge into 1.1. This particular set of changes makes sure that we read exactly the right range of bytes from each spill file in EAOM: some serializers can write bytes after the last object (e.g. the TC_RESET flag in Java serialization), and that would confuse the previous code into reading them as part of the next batch. There are also improvements to cleanup to make sure files are closed.
    
    In addition to bringing in the changes to ExternalAppendOnlyMap, I also copied them to the corresponding code in ExternalSorter and updated its test suite to test for the same issues.
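
    The byte-range idea can be sketched as follows: record how many bytes each batch occupied when it was written, trailing bytes included, and read back exactly that many per batch. The function names and the one-byte trailer are placeholders for this illustration, not the ExternalAppendOnlyMap code.

```python
import io


def write_batches(batches, trailer=b"\x79"):
    """Write each batch plus a trailing byte; return (data, sizes).

    The trailer stands in for bytes a serializer may append after the
    last object of a batch.
    """
    out = io.BytesIO()
    sizes = []
    for payload in batches:
        start = out.tell()
        out.write(payload)
        out.write(trailer)
        sizes.append(out.tell() - start)  # size includes the trailing byte
    return out.getvalue(), sizes


def read_batches(data, sizes):
    """Yield each batch by reading exactly its recorded number of bytes."""
    stream = io.BytesIO(data)
    for size in sizes:
        yield stream.read(size)


data, sizes = write_batches([b"ab", b"cde"])
batches = list(read_batches(data, sizes))
```

    Because every read is bounded by the recorded size, a trailing byte from one batch can never be mistaken for the start of the next.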
    
    Author: Matei Zaharia <ma...@databricks.com>
    
    Closes #1722 from mateiz/spark-2792 and squashes the following commits:
    
    5d4bfb5 [Matei Zaharia] Make objectStreamReset counter count the last object written too
    18fe865 [Matei Zaharia] Update docs on objectStreamReset
    576ee83 [Matei Zaharia] Allow objectStreamReset to be 0
    0374217 [Matei Zaharia] Remove super paranoid code to close file handles
    bda37bb [Matei Zaharia] Implement Mridul's ExternalAppendOnlyMap fixes in ExternalSorter too
    0d6dad7 [Matei Zaharia] Added Mridul's test changes for ExternalAppendOnlyMap
    9a78e4b [Matei Zaharia] Add @mridulm's fixes to ExternalAppendOnlyMap for batch sizes

commit 2225d18a751b7a4470a93f3d9edebe0d33df75c8
Author: Davies Liu <da...@gmail.com>
Date:   2014-08-04T22:54:52Z

    [SPARK-1687] [PySpark] fix unit tests related to pickable namedtuple
    
    serializer is imported multiple times during doctests, so it's better to make _hijack_namedtuple() safe to be called multiple times.
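
    One way to make such a replacement safe to call repeatedly is to mark the wrapper and return early when the mark is present. The sketch below is an assumption about the general pattern, not the actual _hijack_namedtuple() code, and it patches a stand-in namespace rather than the real collections module.

```python
import collections
import types


def hijack_namedtuple(module):
    """Replace module.namedtuple with a wrapper, at most once."""
    current = module.namedtuple
    if getattr(current, "_hijacked", False):
        return  # already replaced; a second call must not wrap the wrapper

    def wrapper(*args, **kwargs):
        wrapper.calls += 1  # counter used only to show single wrapping
        return current(*args, **kwargs)

    wrapper.calls = 0
    wrapper._hijacked = True  # hypothetical marker attribute
    module.namedtuple = wrapper


# Stand-in for the collections module, so the real one is left untouched.
fake_collections = types.SimpleNamespace(namedtuple=collections.namedtuple)
hijack_namedtuple(fake_collections)
first = fake_collections.namedtuple
hijack_namedtuple(fake_collections)  # second call is a no-op
```

    Without the early return, each repeated import would add another layer of wrapping around the factory.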
    
    Author: Davies Liu <da...@gmail.com>
    
    Closes #1771 from davies/fix and squashes the following commits:
    
    1a9e336 [Davies Liu] fix unit tests
    
    (cherry picked from commit 9fd82dbbcb8b10debbe95f1acab53ae8b340f38e)
    Signed-off-by: Josh Rosen <jo...@apache.org>

commit 4ed7b5a2ff08eccf23d90990a4d7a2663efaf204
Author: Reynold Xin <rx...@apache.org>
Date:   2014-08-05T03:39:18Z

    [SPARK-2323] Exception in accumulator update should not crash DAGScheduler & SparkContext
    
    Author: Reynold Xin <rx...@apache.org>
    
    Closes #1772 from rxin/accumulator-dagscheduler and squashes the following commits:
    
    6a58520 [Reynold Xin] [SPARK-2323] Exception in accumulator update should not crash DAGScheduler & SparkContext.
    
    (cherry picked from commit 05bf4e4aff0d052a53d3e64c43688f07e27fec50)
    Signed-off-by: Reynold Xin <rx...@apache.org>

commit a0922854909176a24cc689a7e8595303dcf62f3f
Author: Matei Zaharia <ma...@databricks.com>
Date:   2014-08-05T06:27:53Z

    SPARK-2685. Update ExternalAppendOnlyMap to avoid buffer.remove()
    
    Replaces this with an O(1) operation that does not have to shift over
    the whole tail of the array into the gap produced by the element removed.
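
    A common O(1) alternative to removing from the middle of an array-backed buffer is to overwrite the removed slot with the last element and then shrink the buffer by one. The sketch below shows that technique under the assumption that element order does not matter; the function name is hypothetical.

```python
def swap_remove(buffer, index):
    """Remove and return buffer[index] in O(1) without shifting the tail.

    Element order is not preserved: the last element takes the freed slot.
    """
    elem = buffer[index]
    buffer[index] = buffer[-1]  # also fine when index is the last position
    buffer.pop()                # dropping the last slot is O(1)
    return elem


buf = [10, 20, 30, 40]
removed = swap_remove(buf, 1)
```

    By contrast, a positional remove on an array has to move every later element one slot to the left.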
    
    Author: Matei Zaharia <ma...@databricks.com>
    
    Closes #1773 from mateiz/SPARK-2685 and squashes the following commits:
    
    1ea028a [Matei Zaharia] Update comments in StreamBuffer and EAOM, and reuse ArrayBuffers
    eb1abfd [Matei Zaharia] Update ExternalAppendOnlyMap to avoid buffer.remove()
    
    (cherry picked from commit 066765d60d21b6b9943862b788e4a4bd07396e6c)
    Signed-off-by: Matei Zaharia <ma...@databricks.com>

commit d13d253fea6dd1f666c4c94087173f734843f2b5
Author: Matei Zaharia <ma...@databricks.com>
Date:   2014-08-05T06:41:03Z

    SPARK-2711. Create a ShuffleMemoryManager to track memory for all spilling collections
    
    This tracks memory properly if there are multiple spilling collections in the same task (which was a problem before). It also implements an algorithm that lets each thread grow up to 1/(2N) of the memory pool (where N is the number of threads) before spilling, which avoids an inefficiency with small spills we had before (some threads would spill many times at 0-1 MB because the pool was allocated elsewhere).
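
    The 1/(2N) rule can be modelled with simple bookkeeping. The sketch below is a simplified, single-threaded model built on assumptions: the class and method names are hypothetical, and it returns a "should wait" flag where a real manager would presumably block the calling thread. It is not the ShuffleMemoryManager implementation.

```python
class MemoryPool:
    """Toy tracker for memory granted to threads from one shared pool."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used = {}  # thread id -> bytes currently granted

    def register(self, thread_id):
        self.used.setdefault(thread_id, 0)

    def guaranteed_share(self):
        # 1/(2N) of the pool, where N is the number of registered threads.
        return self.max_bytes // (2 * max(len(self.used), 1))

    def try_acquire(self, thread_id, num_bytes):
        """Return (granted_bytes, should_wait); grants may be partial."""
        self.register(thread_id)
        free = self.max_bytes - sum(self.used.values())
        granted = min(num_bytes, free)
        self.used[thread_id] += granted
        # A thread still below its 1/(2N) share should wait for memory to
        # be freed instead of spilling a tiny amount of data.
        should_wait = (granted < num_bytes
                       and self.used[thread_id] < self.guaranteed_share())
        return granted, should_wait

    def release(self, thread_id, num_bytes):
        self.used[thread_id] -= min(num_bytes, self.used[thread_id])


pool = MemoryPool(100)
a_first = pool.try_acquire("A", 100)   # A takes the whole pool
pool.register("B")
b_first = pool.try_acquire("B", 10)    # B holds nothing yet: wait, do not spill
a_second = pool.try_acquire("A", 10)   # A is above its share: it should spill
pool.release("A", 100)
b_second = pool.try_acquire("B", 10)   # memory is free again
```

    In this model a thread that holds less than its share is told to wait, which is the situation the message describes as spilling many times at 0-1 MB.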
    
    Author: Matei Zaharia <ma...@databricks.com>
    
    Closes #1707 from mateiz/spark-2711 and squashes the following commits:
    
    debf75b [Matei Zaharia] Review comments
    24f28f3 [Matei Zaharia] Small rename
    c8f3a8b [Matei Zaharia] Update ShuffleMemoryManager to be able to partially grant requests
    315e3a5 [Matei Zaharia] Some review comments
    b810120 [Matei Zaharia] Create central manager to track memory for all spilling collections
    
    (cherry picked from commit 4fde28c2063f673ec7f51d514ba62a73321960a1)
    Signed-off-by: Matei Zaharia <ma...@databricks.com>

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: Spark 3175: Branch-1.1 SBT build failed for Ya...

Posted by chesterxgchen <gi...@git.apache.org>.

Github user chesterxgchen commented on the pull request:

    https://github.com/apache/spark/pull/2085#issuecomment-53016327
  
    Let me close it and re-generate this


[GitHub] spark pull request: Spark 3175: Branch-1.1 SBT build failed for Ya...

Posted by vanzin <gi...@git.apache.org>.

Github user vanzin commented on the pull request:

    https://github.com/apache/spark/pull/2085#issuecomment-53009003
  
    @chesterxgchen could you check how you submitted the PR? You seem to be merging a lot of unrelated things here.


[GitHub] spark pull request: Spark 3175: Branch-1.1 SBT build failed for Ya...

Posted by chesterxgchen <gi...@git.apache.org>.

Github user chesterxgchen commented on the pull request:

    https://github.com/apache/spark/pull/2085#issuecomment-53014761
  
    I only changed one line of code in each PR.
    
    That's strange, let me take a look.
    
    > On Aug 21, 2014, at 5:56 PM, Marcelo Vanzin <no...@github.com> wrote:
    > 
    > @chesterxgchen could you check how you submitted the PR? You seem to be merging a lot of unrelated things here.


[GitHub] spark pull request: Spark 3175: Branch-1.1 SBT build failed for Ya...

Posted by chesterxgchen <gi...@git.apache.org>.

Github user chesterxgchen closed the pull request at:

    https://github.com/apache/spark/pull/2085


[GitHub] spark pull request: Spark 3175: Branch-1.1 SBT build failed for Ya...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2085#issuecomment-52983538
  
    Can one of the admins verify this patch?

