Posted to reviews@spark.apache.org by chesterxgchen <gi...@git.apache.org> on 2014/08/21 22:55:40 UTC
[GitHub] spark pull request: Spark 3175: Branch-1.1 SBT build failed for Ya...
GitHub user chesterxgchen opened a pull request:
https://github.com/apache/spark/pull/2085
Spark 3175: Branch-1.1 SBT build failed for Yarn-Alpha
The issue is that yarn/alpha/pom.xml uses version 1.1.0 instead of 1.1.1-SNAPSHOT.
Update the pom.xml to 1.1.1-SNAPSHOT (same as yarn/stable/pom.xml).
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/AlpineNow/spark SPARK-3175
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/2085.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2085
----
commit e22110879cd149e94c9a5ca7466f787033572b15
Author: Andrew Or <an...@gmail.com>
Date: 2014-08-02T19:11:50Z
[HOTFIX] Do not throw NPE if spark.test.home is not set
`spark.test.home` was introduced in #1734. This is fine for SBT but is failing maven tests. Either way it shouldn't throw an NPE.
Author: Andrew Or <an...@gmail.com>
Closes #1739 from andrewor14/fix-spark-test-home and squashes the following commits:
ce2624c [Andrew Or] Do not throw NPE if spark.test.home is not set
commit 8d6ac2b95ab48d9fffe82ef04cef3b22c2c139e0
Author: Joseph K. Bradley <jo...@gmail.com>
Date: 2014-08-02T20:07:17Z
[SPARK-2478] [mllib] DecisionTree Python API
Added experimental Python API for Decision Trees.
API:
* class DecisionTreeModel
** predict() for single examples and RDDs, taking both feature vectors and LabeledPoints
** numNodes()
** depth()
** __str__()
* class DecisionTree
** trainClassifier()
** trainRegressor()
** train()
Examples and testing:
* Added example testing classification and regression with batch prediction: examples/src/main/python/mllib/tree.py
* Have also tested example usage in doc of python/pyspark/mllib/tree.py which tests single-example prediction with dense and sparse vectors
Also: Small bug fix in python/pyspark/mllib/_common.py: In _linear_predictor_typecheck, changed check for RDD to use isinstance() instead of type() in order to catch RDD subclasses.
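To illustrate why the check was changed, here is a minimal sketch. The `RDD` and `SchemaRDD` classes below are hypothetical stand-ins, not the real PySpark classes, and the helper names are made up for this example.

```python
class RDD(object):
    """Hypothetical stand-in for an RDD base class."""


class SchemaRDD(RDD):
    """Hypothetical stand-in for an RDD subclass."""


def is_rdd_by_type(x):
    # Exact type comparison: a subclass instance does not match.
    return type(x) == RDD


def is_rdd_by_isinstance(x):
    # isinstance() also accepts instances of subclasses.
    return isinstance(x, RDD)


sub = SchemaRDD()
print(is_rdd_by_type(sub))        # False
print(is_rdd_by_isinstance(sub))  # True
```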
CC mengxr manishamde
Author: Joseph K. Bradley <jo...@gmail.com>
Closes #1727 from jkbradley/decisiontree-python-new and squashes the following commits:
3744488 [Joseph K. Bradley] Renamed test tree.py to decision_tree_runner.py Small updates based on github review.
6b86a9d [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
affceb9 [Joseph K. Bradley] * Fixed bug in doc tests in pyspark/mllib/util.py caused by change in loadLibSVMFile behavior. (It used to threshold labels at 0 to make them 0/1, but it now leaves them as they are.) * Fixed small bug in loadLibSVMFile: If a data file had no features, then loadLibSVMFile would create a single all-zero feature.
67a29bc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
cf46ad7 [Joseph K. Bradley] Python DecisionTreeModel * predict(empty RDD) returns an empty RDD instead of an error. * Removed support for calling predict() on LabeledPoint and RDD[LabeledPoint] * predict() does not cache serialized RDD any more.
aa29873 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
bf21be4 [Joseph K. Bradley] removed old run() func from DecisionTree
fa10ea7 [Joseph K. Bradley] Small style update
7968692 [Joseph K. Bradley] small braces typo fix
e34c263 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
4801b40 [Joseph K. Bradley] Small style update to DecisionTreeSuite
db0eab2 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix2' into decisiontree-python-new
6873fa9 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
225822f [Joseph K. Bradley] Bug: In DecisionTree, the method sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is the bound for unordered categorical features, not ordered ones. The upper bound should be the arity (i.e., max value) of the feature.
93953f1 [Joseph K. Bradley] Likely done with Python API.
6df89a9 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
4562c08 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
665ba78 [Joseph K. Bradley] Small updates towards Python DecisionTree API
188cb0d [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
6622247 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
b8fac57 [Joseph K. Bradley] Finished Python DecisionTree API and example but need to test a bit more.
2b20c61 [Joseph K. Bradley] Small doc and style updates
1b29c13 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
584449a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
dab0b67 [Joseph K. Bradley] Added documentation for DecisionTree internals
8bb8aa0 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
978cfcf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
6eed482 [Joseph K. Bradley] In DecisionTree: Changed from using procedural syntax for functions returning Unit to explicitly writing Unit return type.
376dca2 [Joseph K. Bradley] Updated meaning of maxDepth by 1 to fit scikit-learn and rpart. * In code, replaced usages of maxDepth <-- maxDepth + 1 * In params, replace settings of maxDepth <-- maxDepth - 1
e06e423 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
bab3f19 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
59750f8 [Joseph K. Bradley] * Updated Strategy to check numClassesForClassification only if algo=Classification. * Updates based on comments: ** DecisionTreeRunner *** Made dataFormat arg default to libsvm ** Small cleanups ** tree.Node: Made recursive helper methods private, and renamed them.
52e17c5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
f5a036c [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
da50db7 [Joseph K. Bradley] Added one more test to DecisionTreeSuite: stump with 2 continuous variables for binary classification. Caused problems in past, but fixed now.
8e227ea [Joseph K. Bradley] Changed Strategy so it only requires numClassesForClassification >= 2 for classification
cd1d933 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
8ea8750 [Joseph K. Bradley] Bug fix: Off-by-1 when finding thresholds for splits for continuous features.
8a758db [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
5fe44ed [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
2283df8 [Joseph K. Bradley] 2 bug fixes.
73fbea2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
5f920a1 [Joseph K. Bradley] Demonstration of bug before submitting fix: Updated DecisionTreeSuite so that 3 tests fail. Will describe bug in next commit.
f825352 [Joseph K. Bradley] Wrote Python API and example for DecisionTree. Also added toString, depth, and numNodes methods to DecisionTreeModel.
(cherry picked from commit 3f67382e7c9c3f6a8f6ce124ab3fcb1a9c1a264f)
Signed-off-by: Xiangrui Meng <me...@databricks.com>
commit 91de0dc1654d609dc1ff8fa9a07ba18043ad61c6
Author: Yin Huai <hu...@cse.ohio-state.edu>
Date: 2014-08-02T20:16:41Z
[SQL] Set outputPartitioning of BroadcastHashJoin correctly.
I think we will not generate the plan triggering this bug at this moment. But, let me explain it...
Right now, we are using `left.outputPartitioning` as the `outputPartitioning` of a `BroadcastHashJoin`. We may have a wrong physical plan for cases like...
```sql
SELECT l.key, count(*)
FROM (SELECT key, count(*) as cnt
FROM src
GROUP BY key) l -- This is buildPlan
JOIN r -- This is the streamedPlan
ON (l.cnt = r.value)
GROUP BY l.key
```
Let's say we have a `BroadcastHashJoin` on `l` and `r`. For this case, we will pick `l`'s `outputPartitioning` as the `outputPartitioning` of the `BroadcastHashJoin` on `l` and `r`. Also, because the last `GROUP BY` is using `l.key` as the key, we will not introduce an `Exchange` for this aggregation. However, `r`'s `outputPartitioning` may not match the required distribution of the last `GROUP BY`, and we fail to group data correctly.
JIRA is being reindexed. I will create a JIRA ticket once it is back online.
Author: Yin Huai <hu...@cse.ohio-state.edu>
Closes #1735 from yhuai/BroadcastHashJoin and squashes the following commits:
96d9cb3 [Yin Huai] Set outputPartitioning correctly.
(cherry picked from commit 67bd8e3c217a80c3117a6e3853aa60fe13d08c91)
Signed-off-by: Michael Armbrust <mi...@databricks.com>
commit bb0ac6d7c91c491a99c252e6cb4aea40efe9b190
Author: Chris Fregly <ch...@fregly.com>
Date: 2014-08-02T20:35:35Z
[SPARK-1981] Add AWS Kinesis streaming support
Author: Chris Fregly <ch...@fregly.com>
Closes #1434 from cfregly/master and squashes the following commits:
4774581 [Chris Fregly] updated docs, renamed retry to retryRandom to be more clear, removed retries around store() method
0393795 [Chris Fregly] moved Kinesis examples out of examples/ and back into extras/kinesis-asl
691a6be [Chris Fregly] fixed tests and formatting, fixed a bug with JavaKinesisWordCount during union of streams
0e1c67b [Chris Fregly] Merge remote-tracking branch 'upstream/master'
74e5c7c [Chris Fregly] updated per TD's feedback. simplified examples, updated docs
e33cbeb [Chris Fregly] Merge remote-tracking branch 'upstream/master'
bf614e9 [Chris Fregly] per matei's feedback: moved the kinesis examples into the examples/ dir
d17ca6d [Chris Fregly] per TD's feedback: updated docs, simplified the KinesisUtils api
912640c [Chris Fregly] changed the foundKinesis class to be a publically-avail class
db3eefd [Chris Fregly] Merge remote-tracking branch 'upstream/master'
21de67f [Chris Fregly] Merge remote-tracking branch 'upstream/master'
6c39561 [Chris Fregly] parameterized the versions of the aws java sdk and kinesis client
338997e [Chris Fregly] improve build docs for kinesis
828f8ae [Chris Fregly] more cleanup
e7c8978 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
cd68c0d [Chris Fregly] fixed typos and backward compatibility
d18e680 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
b3b0ff1 [Chris Fregly] [SPARK-1981] Add AWS Kinesis streaming support
(cherry picked from commit 91f9504e6086fac05b40545099f9818949c24bca)
Signed-off-by: Tathagata Das <ta...@gmail.com>
commit 7924d72cf8aae945d72f355c54c4fcb3d62e6c48
Author: GuoQiang Li <wi...@qq.com>
Date: 2014-08-02T20:55:28Z
SPARK-2804: Remove scalalogging-slf4j dependency
This also Closes #1701.
Author: GuoQiang Li <wi...@qq.com>
Closes #1208 from witgo/SPARK-1470 and squashes the following commits:
422646b [GuoQiang Li] Remove scalalogging-slf4j dependency
commit 3b9f25f4259b254f3faa2a7d61e547089a69c259
Author: Michael Armbrust <mi...@databricks.com>
Date: 2014-08-02T23:33:48Z
[SPARK-2097][SQL] UDF Support
This patch adds the ability to register lambda functions written in Python, Java or Scala as UDFs for use in SQL or HiveQL.
Scala:
```scala
registerFunction("strLenScala", (_: String).length)
sql("SELECT strLenScala('test')")
```
Python:
```python
sqlCtx.registerFunction("strLenPython", lambda x: len(x), IntegerType())
sqlCtx.sql("SELECT strLenPython('test')")
```
Java:
```java
sqlContext.registerFunction("stringLengthJava", new UDF1<String, Integer>() {
@Override
public Integer call(String str) throws Exception {
return str.length();
}
}, DataType.IntegerType);
sqlContext.sql("SELECT stringLengthJava('test')");
```
Author: Michael Armbrust <mi...@databricks.com>
Closes #1063 from marmbrus/udfs and squashes the following commits:
9eda0fe [Michael Armbrust] newline
747c05e [Michael Armbrust] Add some scala UDF tests.
d92727d [Michael Armbrust] Merge remote-tracking branch 'apache/master' into udfs
005d684 [Michael Armbrust] Fix naming and formatting.
d14dac8 [Michael Armbrust] Fix last line of autogened java files.
8135c48 [Michael Armbrust] Move UDF unit tests to pyspark.
40b0ffd [Michael Armbrust] Merge remote-tracking branch 'apache/master' into udfs
6a36890 [Michael Armbrust] Switch logging so that SQLContext can be serializable.
7a83101 [Michael Armbrust] Drop toString
795fd15 [Michael Armbrust] Try to avoid capturing SQLContext.
e54fb45 [Michael Armbrust] Docs and tests.
437cbe3 [Michael Armbrust] Update use of dataTypes, fix some python tests, address review comments.
01517d6 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into udfs
8e6c932 [Michael Armbrust] WIP
3f96a52 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into udfs
6237c8d [Michael Armbrust] WIP
2766f0b [Michael Armbrust] Move udfs support to SQL from hive. Add support for Java UDFs.
0f7d50c [Michael Armbrust] Draft of native Spark SQL UDFs for Scala and Python.
(cherry picked from commit 158ad0bba9382fd494b4789b5628a9cec00cfa19)
Signed-off-by: Michael Armbrust <mi...@databricks.com>
commit 4230df4e1d6c59dc3405f46f5edf18c3825a5447
Author: Michael Armbrust <mi...@databricks.com>
Date: 2014-08-02T23:48:07Z
[SPARK-2785][SQL] Remove assertions that throw when users try unsupported Hive commands.
Author: Michael Armbrust <mi...@databricks.com>
Closes #1742 from marmbrus/asserts and squashes the following commits:
5182d54 [Michael Armbrust] Remove assertions that throw when users try unsupported Hive commands.
(cherry picked from commit 198df11f1a9f419f820f47eba0e9f2ab371a824b)
Signed-off-by: Michael Armbrust <mi...@databricks.com>
commit 460fad817da1fb6619d2456f637c1b7c7f5e8c7c
Author: Cheng Lian <li...@gmail.com>
Date: 2014-08-03T00:12:49Z
[SPARK-2729][SQL] Added test case for SPARK-2729
This is a follow up of #1636.
Author: Cheng Lian <li...@gmail.com>
Closes #1738 from liancheng/test-for-spark-2729 and squashes the following commits:
b13692a [Cheng Lian] Added test case for SPARK-2729
(cherry picked from commit 866cf1f822cfda22294054be026ef2d96307eb75)
Signed-off-by: Michael Armbrust <mi...@databricks.com>
commit 5ef828273deb4713a49700c56d51bdd980917cfd
Author: Yin Huai <hu...@cse.ohio-state.edu>
Date: 2014-08-03T00:55:22Z
[SPARK-2797] [SQL] SchemaRDDs don't support unpersist()
The cause is explained in https://issues.apache.org/jira/browse/SPARK-2797.
Author: Yin Huai <hu...@cse.ohio-state.edu>
Closes #1745 from yhuai/SPARK-2797 and squashes the following commits:
7b1627d [Yin Huai] The unpersist method of the Scala RDD cannot be called without the input parameter (blocking) from PySpark.
(cherry picked from commit d210022e96804e59e42ab902e53637e50884a9ab)
Signed-off-by: Michael Armbrust <mi...@databricks.com>
commit 5b30e001839a29e6c4bd1fc24bfa12d9166ef10c
Author: Michael Armbrust <mi...@databricks.com>
Date: 2014-08-03T01:27:04Z
[SPARK-2739][SQL] Rename registerAsTable to registerTempTable
There have been user complaints that the difference between `registerAsTable` and `saveAsTable` is too subtle. This PR addresses this by renaming `registerAsTable` to `registerTempTable`, which more clearly reflects what is happening. `registerAsTable` remains, but will cause a deprecation warning.
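The rename-with-deprecation pattern described here might look roughly like the sketch below. The `Table` class is a hypothetical stand-in (not Spark's `SchemaRDD`), and the warning text is made up for illustration.

```python
import warnings


class Table(object):
    """Hypothetical stand-in for an object that can be registered as a table."""

    def __init__(self):
        self.registered = []

    def registerTempTable(self, name):
        # New, clearer name: the registration is temporary.
        self.registered.append(name)

    def registerAsTable(self, name):
        # Old name is kept as a thin wrapper that warns and delegates.
        warnings.warn("registerAsTable is deprecated; use registerTempTable",
                      DeprecationWarning, stacklevel=2)
        self.registerTempTable(name)
```

Existing callers keep working, and the warning points them at the new name.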
Author: Michael Armbrust <mi...@databricks.com>
Closes #1743 from marmbrus/registerTempTable and squashes the following commits:
d031348 [Michael Armbrust] Merge remote-tracking branch 'apache/master' into registerTempTable
4dff086 [Michael Armbrust] Fix .java files too
89a2f12 [Michael Armbrust] Merge remote-tracking branch 'apache/master' into registerTempTable
0b7b71e [Michael Armbrust] Rename registerAsTable to registerTempTable
(cherry picked from commit 1a8043739dc1d9435def6ea3c6341498ba52b708)
Signed-off-by: Michael Armbrust <mi...@databricks.com>
commit 0d47bb642f645c3c8663f4bdf869b5337ef9cb35
Author: Sean Owen <sr...@gmail.com>
Date: 2014-08-03T04:44:19Z
SPARK-2602 [BUILD] Tests steal focus under Java 6
As per https://issues.apache.org/jira/browse/SPARK-2602 , this may be resolved for Java 6 with the java.awt.headless system property, which never hurt anyone running a command line app. I tested it and it seemed to get rid of focus stealing.
Author: Sean Owen <sr...@gmail.com>
Closes #1747 from srowen/SPARK-2602 and squashes the following commits:
b141018 [Sean Owen] Set java.awt.headless during tests
(cherry picked from commit 33f167d762483b55d5d874dcc1e3075f661d4375)
Signed-off-by: Patrick Wendell <pw...@gmail.com>
commit c137928cbe74446254fdbd656c50c1a1c8930094
Author: Sean Owen <sr...@gmail.com>
Date: 2014-08-03T04:55:56Z
SPARK-2414 [BUILD] Add LICENSE entry for jquery
The JIRA concerned removing jquery, and this does not remove jquery. While it is distributed by Spark it should have an accompanying line in LICENSE, very technically, as per http://www.apache.org/dev/licensing-howto.html
Author: Sean Owen <sr...@gmail.com>
Closes #1748 from srowen/SPARK-2414 and squashes the following commits:
2fdb03c [Sean Owen] Add LICENSE entry for jquery
(cherry picked from commit 9cf429aaf529e91f619910c33cfe46bf33a66982)
Signed-off-by: Patrick Wendell <pw...@gmail.com>
commit fb2a2079fa10ea8f338d68945a94238dda9fbd66
Author: Andrew Or <an...@gmail.com>
Date: 2014-08-03T05:00:46Z
[Minor] Fixes on top of #1679
Minor fixes on top of #1679.
Author: Andrew Or <an...@gmail.com>
Closes #1736 from andrewor14/amend-#1679 and squashes the following commits:
3b46f5e [Andrew Or] Minor fixes
(cherry picked from commit 3dc55fdf450b4237f7c592fce56d1467fd206366)
Signed-off-by: Patrick Wendell <pw...@gmail.com>
commit 1992175fd93f0239e5a09e0b8db99ad9af7f380c
Author: Stephen Boesch <ja...@gmail.com>
Date: 2014-08-03T17:19:04Z
SPARK-2712 - Add a small note to maven doc that mvn package must happen ...
Per request by Reynold adding small note about proper sequencing of build then test.
Author: Stephen Boesch <ja...@gmail.com>
Closes #1615 from javadba/docs and squashes the following commits:
6c3183e [Stephen Boesch] Moved updated testing blurb per PWendell
5764757 [Stephen Boesch] SPARK-2712 - Add a small note to maven doc that mvn package must happen before test
(cherry picked from commit f8cd143b6b1b4d8aac87c229e5af263b0319b3ea)
Signed-off-by: Patrick Wendell <pw...@gmail.com>
commit 162fc9512018e0c592b3aaa29d405f511461795a
Author: Allan Douglas R. de Oliveira <al...@chaordicsystems.com>
Date: 2014-08-03T17:25:59Z
SPARK-2246: Add user-data option to EC2 scripts
Author: Allan Douglas R. de Oliveira <al...@chaordicsystems.com>
Closes #1186 from douglaz/spark_ec2_user_data and squashes the following commits:
94a36f9 [Allan Douglas R. de Oliveira] Added user data option to EC2 script
(cherry picked from commit a0bcbc159e89be868ccc96175dbf1439461557e1)
Signed-off-by: Patrick Wendell <pw...@gmail.com>
commit eaa93555a7f935b00a2f94a7fa50a12e11578bd7
Author: Joseph K. Bradley <jo...@gmail.com>
Date: 2014-08-03T17:36:52Z
[SPARK-2197] [mllib] Java DecisionTree bug fix and ease-of-use
Bug fix: Before, when an RDD was created in Java and passed to DecisionTree.train(), the fake class tag caused problems.
* Fix: DecisionTree: Used new RDD.retag() method to allow passing RDDs from Java.
Other improvements to Decision Trees for ease-of-use with Java:
* impurity classes: Added instance() methods to help with Java interface.
* Strategy: Added Java-friendly constructor
--> Note: I removed quantileCalculationStrategy from the Java-friendly constructor since (a) it is a special class and (b) there is only 1 option currently. I suspect we will redo the API before the other options are included.
CC: mengxr
Author: Joseph K. Bradley <jo...@gmail.com>
Closes #1740 from jkbradley/dt-java-new and squashes the following commits:
0805dc6 [Joseph K. Bradley] Changed Strategy to use JavaConverters instead of JavaConversions
519b1b7 [Joseph K. Bradley] * Organized imports in JavaDecisionTreeSuite.java * Using JavaConverters instead of JavaConversions in DecisionTreeSuite.scala
f7b5ca1 [Joseph K. Bradley] Improvements to make it easier to run DecisionTree from Java. * DecisionTree: Used new RDD.retag() method to allow passing RDDs from Java. * impurity classes: Added instance() methods to help with Java interface. * Strategy: Added Java-friendly constructor ** Note: I removed quantileCalculationStrategy from the Java-friendly constructor since (a) it is a special class and (b) there is only 1 option currently. I suspect we will redo the API before the other options are included.
d78ada6 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-java
320853f [Joseph K. Bradley] Added JavaDecisionTreeSuite, partly written
13a585e [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-java
f1a8283 [Joseph K. Bradley] Added old JavaDecisionTreeSuite, to be updated later
225822f [Joseph K. Bradley] Bug: In DecisionTree, the method sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is the bound for unordered categorical features, not ordered ones. The upper bound should be the arity (i.e., max value) of the feature.
(cherry picked from commit 2998e38a942351974da36cb619e863c6f0316e7a)
Signed-off-by: Xiangrui Meng <me...@databricks.com>
commit c5ed1deba6b3f3e597554a8d0f93f402ae62fab9
Author: Michael Armbrust <mi...@databricks.com>
Date: 2014-08-03T19:28:29Z
[SPARK-2784][SQL] Deprecate hql() method in favor of a config option, 'spark.sql.dialect'
Many users have reported being confused by the distinction between the `sql` and `hql` methods. Specifically, many users think that `sql(...)` cannot be used to read hive tables. In this PR I introduce a new configuration option `spark.sql.dialect` that picks which dialect will be used for parsing. For `SQLContext` this must be set to `sql`. In `HiveContext` it defaults to `hiveql` but can also be set to `sql`.
The `hql` and `hiveql` methods continue to act the same but are now marked as deprecated.
**This is a possibly breaking change for some users unless they set the dialect manually, though this is unlikely.**
For example: `hiveContext.sql("SELECT 1")` will now throw a parsing exception by default.
Author: Michael Armbrust <mi...@databricks.com>
Closes #1746 from marmbrus/sqlLanguageConf and squashes the following commits:
ad375cc [Michael Armbrust] Merge remote-tracking branch 'apache/master' into sqlLanguageConf
20c43f8 [Michael Armbrust] override function instead of just setting the value
7e4ae93 [Michael Armbrust] Deprecate hql() method in favor of a config option, 'spark.sql.dialect'
(cherry picked from commit 236dfac6769016e433b2f6517cda2d308dea74bc)
Signed-off-by: Michael Armbrust <mi...@databricks.com>
commit 6ffdcc61fb4825f991b754c45b807192f483a4a3
Author: Cheng Lian <li...@gmail.com>
Date: 2014-08-03T19:34:46Z
[SPARK-2814][SQL] HiveThriftServer2 throws NPE when executing native commands
JIRA issue: [SPARK-2814](https://issues.apache.org/jira/browse/SPARK-2814)
Author: Cheng Lian <li...@gmail.com>
Closes #1753 from liancheng/spark-2814 and squashes the following commits:
c74a3b2 [Cheng Lian] Fixed SPARK-2814
(cherry picked from commit ac33cbbf33bd1ab29bc8165c9be02fb8934b1fdf)
Signed-off-by: Michael Armbrust <mi...@databricks.com>
commit 7c6afdac867d52447221438ed7508123c07d17f8
Author: Yin Huai <hu...@cse.ohio-state.edu>
Date: 2014-08-03T21:54:41Z
[SPARK-2783][SQL] Basic support for analyze in HiveContext
JIRA: https://issues.apache.org/jira/browse/SPARK-2783
Author: Yin Huai <hu...@cse.ohio-state.edu>
Closes #1741 from yhuai/analyzeTable and squashes the following commits:
7bb5f02 [Yin Huai] Use sql instead of hql.
4d09325 [Yin Huai] Merge remote-tracking branch 'upstream/master' into analyzeTable
e3ebcd4 [Yin Huai] Renaming.
c170f4e [Yin Huai] Do not use getContentSummary.
62393b6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into analyzeTable
db233a6 [Yin Huai] Trying to debug jenkins...
fee84f0 [Yin Huai] Merge remote-tracking branch 'upstream/master' into analyzeTable
f0501f3 [Yin Huai] Fix compilation error.
24ad391 [Yin Huai] Merge remote-tracking branch 'upstream/master' into analyzeTable
8918140 [Yin Huai] Wording.
23df227 [Yin Huai] Add a simple analyze method to get the size of a table and update the "totalSize" property of this table in the Hive metastore.
(cherry picked from commit e139e2be60ef23281327744e1b3e74904dfdf63f)
Signed-off-by: Michael Armbrust <mi...@databricks.com>
commit a4cdb77e5ee2c80967a7b6cd7370170fabe56cd2
Author: Davies Liu <da...@gmail.com>
Date: 2014-08-03T22:52:00Z
[SPARK-1740] [PySpark] kill the python worker
Kill only the python worker related to cancelled tasks.
The daemon will start a background thread to monitor all the opened sockets for all workers. If the socket is closed by JVM, this thread will kill the worker.
When a task is cancelled, the socket to the worker will be closed, and the worker will then be killed by the daemon.
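The monitoring idea can be sketched as follows. This is not the actual daemon code: the helper names are hypothetical, a `socketpair` stands in for the JVM-to-worker connection, and the "kill" is a callback rather than a real signal sent to a process.

```python
import select
import socket


def find_closed(sockets, timeout=0.1):
    """Return the sockets whose peer has closed the connection."""
    if not sockets:
        return []
    readable, _, _ = select.select(sockets, [], [], timeout)
    closed = []
    for s in readable:
        # A readable socket that yields no bytes means the peer closed it.
        if s.recv(1, socket.MSG_PEEK) == b"":
            closed.append(s)
    return closed


def reap_workers(workers, kill):
    """workers maps a daemon-side socket to a worker id (hypothetical).

    kill is called once for each worker whose socket was closed remotely.
    """
    for s in find_closed(list(workers)):
        kill(workers.pop(s))
        s.close()


# Demo: two simulated connections; the "JVM" closes one of them.
jvm_a, daemon_a = socket.socketpair()
jvm_b, daemon_b = socket.socketpair()
workers = {daemon_a: "worker-a", daemon_b: "worker-b"}
killed = []

jvm_a.close()                         # simulates a cancelled task
reap_workers(workers, killed.append)  # one pass of the monitoring loop
print(killed)                         # ['worker-a']

jvm_b.close()
daemon_b.close()
```

A real daemon would presumably run this check in a loop on a background thread and terminate the worker process instead of appending to a list.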
Author: Davies Liu <da...@gmail.com>
Closes #1643 from davies/kill and squashes the following commits:
8ffe9f3 [Davies Liu] kill worker by daemon, because runtime.exec() is too heavy
46ca150 [Davies Liu] address comment
acd751c [Davies Liu] kill the worker when task is canceled
(cherry picked from commit 55349f9fe81ba5af5e4a5e4908ebf174e63c6cc9)
Signed-off-by: Josh Rosen <jo...@apache.org>
commit 4784d24eadea2e1adf69d8fe4891bdce29188dd6
Author: Anand Avati <av...@redhat.com>
Date: 2014-08-04T00:47:49Z
[SPARK-2810] upgrade to scala-maven-plugin 3.2.0
Needed for Scala 2.11 compiler-interface
Signed-off-by: Anand Avati <av...@redhat.com>
Author: Anand Avati <av...@redhat.com>
Closes #1711 from avati/SPARK-1812-scala-maven-plugin and squashes the following commits:
9a22fc8 [Anand Avati] SPARK-1812: upgrade to scala-maven-plugin 3.2.0
commit 2152e24d64d6a07cf6c550c9f13ab0231596be98
Author: Sarah Gerweck <sa...@gmail.com>
Date: 2014-08-04T02:47:05Z
Fix some bugs with spaces in directory name.
Any time you use the directory name (`FWDIR`) it needs to be surrounded
in quotes. If you're also using wildcards, you can safely put the quotes
around just `$FWDIR`.
Author: Sarah Gerweck <sa...@gmail.com>
Closes #1756 from sarahgerweck/folderSpaces and squashes the following commits:
732629d [Sarah Gerweck] Fix some bugs with spaces in directory name.
(cherry picked from commit 5507dd8e18fbb52d5e0c64a767103b2418cb09c6)
Signed-off-by: Patrick Wendell <pw...@gmail.com>
commit 9aa14598f89bb8b908222e37f965178d39c34fe6
Author: DB Tsai <db...@alpinenow.com>
Date: 2014-08-04T04:39:21Z
SPARK-2272 [MLlib] Feature scaling which standardizes the range of independent variables or features of data
Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is generally performed during the data preprocessing step.
In this work, a trait called `VectorTransformer` is defined for generic transformation on a vector. It contains one method to be implemented, `transform` which applies transformation on a vector.
There are two implementations of `VectorTransformer` now, and both can be easily extended with PMML transformation support.
1) `StandardScaler` - Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.
2) `Normalizer` - Normalizes samples individually to unit L^n norm
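As a rough illustration of the two transformations, here is a plain-Python sketch on lists of floats. It is not the MLlib implementation: the function names are hypothetical, and using the sample standard deviation (dividing by n - 1) is an assumption made for this example.

```python
import math


def standard_scale(rows):
    """Column-wise: subtract the mean and divide by the sample std deviation.

    Assumes at least two rows; constant columns are mapped to 0.0.
    """
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / (n - 1))
            for c, m in zip(cols, means)]
    return [[(x - m) / s if s != 0 else 0.0
             for x, m, s in zip(row, means, stds)]
            for row in rows]


def normalize(row, p=2.0):
    """Row-wise: scale a single sample to unit L^p norm."""
    norm = sum(abs(x) ** p for x in row) ** (1.0 / p)
    return [x / norm for x in row] if norm != 0 else list(row)


scaled = standard_scale([[1.0, 10.0], [3.0, 10.0]])
unit = normalize([3.0, 4.0])
```

The key difference is the axis: `standard_scale` needs statistics from the whole training set, while `normalize` works on one sample at a time.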
Author: DB Tsai <db...@alpinenow.com>
Closes #1207 from dbtsai/dbtsai-feature-scaling and squashes the following commits:
78c15d3 [DB Tsai] Alpine Data Labs
(cherry picked from commit ae58aea2d1435b5bb011e68127e1bcddc2edf5b2)
Signed-off-by: Xiangrui Meng <me...@databricks.com>
commit 3823f6d25e2a89ca1bfa62a76f6e708c2c63f064
Author: Liquan Pei <lp...@gopivotal.com>
Date: 2014-08-04T06:55:58Z
[MLlib] [SPARK-2510]Word2Vec: Distributed Representation of Words
This is a pull request regarding SPARK-2510 at https://issues.apache.org/jira/browse/SPARK-2510. Word2Vec creates vector representation of words in a text corpus. The algorithm first constructs a vocabulary from the corpus and then learns vector representation of words in the vocabulary. The vector representation can be used as features in natural language processing and machine learning algorithms.
To make our implementation more scalable, we train each partition separately and merge the model of each partition after each iteration. To make the model more accurate, multiple iterations may be needed.
One way to investigate the vector representations is to find the closest words to a query word. For example, with 1 partition and 1 iteration, the top 20 closest words to "china" are:
taiwan 0.8077646146334014
korea 0.740913304563621
japan 0.7240667798885471
republic 0.7107151279078352
thailand 0.6953217332072862
tibet 0.6916782118129544
mongolia 0.6800858715972612
macau 0.6794925677480378
singapore 0.6594048695593799
manchuria 0.658989931844148
laos 0.6512978726001666
nepal 0.6380792327845325
mainland 0.6365469459587788
myanmar 0.6358614338840394
macedonia 0.6322366180313249
xinjiang 0.6285291551708028
russia 0.6279951236068411
india 0.6272874944023487
shanghai 0.6234544135576999
macao 0.6220588462925876
The result with 10 partitions and 5 iterations is:
taiwan 0.8310495079388313
india 0.7737171315919039
japan 0.756777901233668
korea 0.7429767187102452
indonesia 0.7407557427278356
pakistan 0.712883426985585
mainland 0.7053379963140822
thailand 0.696298191073948
mongolia 0.693690656871415
laos 0.6913069680735292
macau 0.6903427690029617
republic 0.6766381604813666
malaysia 0.676460699141784
singapore 0.6728790997360923
malaya 0.672345232966194
manchuria 0.6703732292753156
macedonia 0.6637955686322028
myanmar 0.6589462882439646
kazakhstan 0.657017801081494
cambodia 0.6542383836451932
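Lists like the ones above can be produced by ranking every other word by its similarity to the query word's vector. The sketch below assumes cosine similarity as the measure (the description does not say which one is used) and uses made-up three-dimensional vectors, not real Word2Vec output.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def closest(vectors, query, k):
    """Return the k words most similar to `query`, best first."""
    q = vectors[query]
    scored = [(word, cosine(q, vec))
              for word, vec in vectors.items() if word != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up vectors for illustration only.
vectors = {
    "china": [1.0, 0.9, 0.1],
    "taiwan": [0.9, 1.0, 0.2],
    "banana": [0.0, 0.1, 1.0],
}
top = closest(vectors, "china", 2)
```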
Author: Liquan Pei <lp...@gopivotal.com>
Author: Xiangrui Meng <me...@databricks.com>
Author: Liquan Pei <li...@gmail.com>
Closes #1719 from Ishiihara/master and squashes the following commits:
2ba9483 [Liquan Pei] minor fix for Word2Vec test
e248441 [Liquan Pei] minor style change
26a948d [Liquan Pei] Merge pull request #1 from mengxr/Ishiihara-master
c14da41 [Xiangrui Meng] fix styles
384c771 [Xiangrui Meng] remove minCount and window from constructor change model to use float instead of double
e93e726 [Liquan Pei] use treeAggregate instead of aggregate
1a8fb41 [Liquan Pei] use weighted sum in combOp
7efbb6f [Liquan Pei] use broadcast version of vocab in aggregate
6bcc8be [Liquan Pei] add multiple iteration support
720b5a3 [Liquan Pei] Add test for Word2Vec algorithm, minor fixes
2e92b59 [Liquan Pei] modify according to feedback
57dc50d [Liquan Pei] code formatting
e4a04d3 [Liquan Pei] minor fix
0aafb1b [Liquan Pei] Add comments, minor fixes
8d6befe [Liquan Pei] initial commit
(cherry picked from commit e053c55819363fab7068bb9165e3379f0c2f570c)
Signed-off-by: Xiangrui Meng <me...@databricks.com>
commit bfd2f39581d958d5aafaa76994f44213bcdfbb69
Author: Davies Liu <da...@gmail.com>
Date: 2014-08-04T19:13:41Z
[SPARK-1687] [PySpark] pickable namedtuple
Add a hook to replace the original namedtuple with a picklable one, so that namedtuple can be used in RDDs.
PS: pyspark should be imported BEFORE "from collections import namedtuple"
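One way to make a namedtuple class picklable even when it cannot be found by name on the receiving side is to pickle its name and fields along with the values. The sketch below shows that general idea only; it is not the hook added by this patch, and `_restore` and `picklable_namedtuple` are hypothetical names.

```python
import collections
import pickle


def _restore(name, fields, values):
    """Rebuild the namedtuple class from its name and fields, then the instance."""
    cls = picklable_namedtuple(name, fields)
    return cls(*values)


def picklable_namedtuple(name, fields):
    """Create a namedtuple class whose instances carry their own definition."""
    cls = collections.namedtuple(name, fields)

    def __reduce__(self):
        # Pickle the class name and fields instead of a reference to the class.
        return (_restore, (name, self._fields, tuple(self)))

    cls.__reduce__ = __reduce__
    return cls


def make_point_class():
    # The class is created inside a function and never bound to a module
    # attribute, so pickle could not look it up by name.
    return picklable_namedtuple("Point", ["x", "y"])


p = make_point_class()(1, 2)
q = pickle.loads(pickle.dumps(p))
```

Only the `_restore` function has to be importable where the data is unpickled; the namedtuple class itself is rebuilt on demand.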
Author: Davies Liu <da...@gmail.com>
Closes #1623 from davies/namedtuple and squashes the following commits:
045dad8 [Davies Liu] remove unrelated code changes
4132f32 [Davies Liu] address comment
55b1c1a [Davies Liu] fix tests
61f86eb [Davies Liu] replace all the reference of namedtuple to new hacked one
98df6c6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into namedtuple
f7b1bde [Davies Liu] add hack for CloudPickleSerializer
0c5c849 [Davies Liu] Merge branch 'master' of github.com:apache/spark into namedtuple
21991e6 [Davies Liu] hack namedtuple in __main__ module, make it picklable.
93b03b8 [Davies Liu] pickable namedtuple
(cherry picked from commit 59f84a9531f7974a053fd4963ce9afd88273ea4c)
Signed-off-by: Josh Rosen <jo...@apache.org>
commit aa7a48ee905b95e57f64051ea887d4775b427603
Author: Matei Zaharia <ma...@databricks.com>
Date: 2014-08-04T19:59:18Z
SPARK-2792. Fix reading too much or too little data from each stream in ExternalMap / Sorter
All these changes are from mridulm's work in #1609, but extracted here to fix this specific issue and make it easier to merge into 1.1. This particular set of changes makes sure that we read exactly the right range of bytes from each spill file in ExternalAppendOnlyMap (EAOM): some serializers can write bytes after the last object (e.g. the TC_RESET flag in Java serialization), and the previous code would mistakenly read those bytes as part of the next batch. There are also improvements to cleanup to make sure files are closed.
In addition to bringing in the changes to ExternalAppendOnlyMap, I also copied them to the corresponding code in ExternalSorter and updated its test suite to test for the same issues.
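The byte-range idea can be sketched as follows. This is an illustrative toy, not the ExternalAppendOnlyMap code: it assumes the writer records the total size of each batch, uses Python's pickle in place of Java serialization, and uses a one-byte trailer to stand in for bytes a serializer might emit after the last object. The function names write_batches and read_batches are hypothetical.

```python
import io
import pickle


def write_batches(stream, batches, trailer=b"\x00"):
    """Write each batch plus trailing bytes; return the byte size of each batch."""
    sizes = []
    for batch in batches:
        start = stream.tell()
        stream.write(pickle.dumps(batch))
        stream.write(trailer)  # stands in for bytes written after the last object
        sizes.append(stream.tell() - start)
    return sizes


def read_batches(stream, sizes):
    """Read exactly one recorded byte range per batch, so trailing bytes
    from one batch are never interpreted as the start of the next."""
    for size in sizes:
        chunk = stream.read(size)
        # pickle.load stops at the end of the pickled object, so the
        # trailer inside this chunk is simply left unread.
        yield pickle.load(io.BytesIO(chunk))


buf = io.BytesIO()
batch_sizes = write_batches(buf, [[1, 2], [3]])
buf.seek(0)
recovered = list(read_batches(buf, batch_sizes))
```

Bounding each read by the recorded size is what keeps the batches independent: whatever the serializer appends stays inside its own range.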
Author: Matei Zaharia <ma...@databricks.com>
Closes #1722 from mateiz/spark-2792 and squashes the following commits:
5d4bfb5 [Matei Zaharia] Make objectStreamReset counter count the last object written too
18fe865 [Matei Zaharia] Update docs on objectStreamReset
576ee83 [Matei Zaharia] Allow objectStreamReset to be 0
0374217 [Matei Zaharia] Remove super paranoid code to close file handles
bda37bb [Matei Zaharia] Implement Mridul's ExternalAppendOnlyMap fixes in ExternalSorter too
0d6dad7 [Matei Zaharia] Added Mridul's test changes for ExternalAppendOnlyMap
9a78e4b [Matei Zaharia] Add @mridulm's fixes to ExternalAppendOnlyMap for batch sizes
commit 2225d18a751b7a4470a93f3d9edebe0d33df75c8
Author: Davies Liu <da...@gmail.com>
Date: 2014-08-04T22:54:52Z
[SPARK-1687] [PySpark] fix unit tests related to pickable namedtuple
The serializer module is imported multiple times during doctests, so it's better to make _hijack_namedtuple() safe to call multiple times.
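One common way to make such a patch safe to apply repeatedly is to mark the replacement function and check for the mark before patching again. The sketch below assumes that approach; it is not the actual _hijack_namedtuple() code, and the names hijack_namedtuple_once, _is_hijacked and _original are hypothetical.

```python
import collections


def hijack_namedtuple_once():
    """Wrap collections.namedtuple at most once, however often this is called."""
    if getattr(collections.namedtuple, "_is_hijacked", False):
        return  # already patched by an earlier call; do nothing
    original = collections.namedtuple

    def namedtuple(*args, **kwargs):
        cls = original(*args, **kwargs)
        # A real hook would make cls picklable here; this sketch leaves it as is.
        return cls

    namedtuple._is_hijacked = True   # marker checked on later calls
    namedtuple._original = original  # kept so the patch can be undone
    collections.namedtuple = namedtuple


hijack_namedtuple_once()
hijack_namedtuple_once()  # second call is a no-op, so wrappers never stack
```

Without the guard, every repeated import would wrap the previous wrapper, and the chain of wrappers would grow with each call.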
Author: Davies Liu <da...@gmail.com>
Closes #1771 from davies/fix and squashes the following commits:
1a9e336 [Davies Liu] fix unit tests
(cherry picked from commit 9fd82dbbcb8b10debbe95f1acab53ae8b340f38e)
Signed-off-by: Josh Rosen <jo...@apache.org>
commit 4ed7b5a2ff08eccf23d90990a4d7a2663efaf204
Author: Reynold Xin <rx...@apache.org>
Date: 2014-08-05T03:39:18Z
[SPARK-2323] Exception in accumulator update should not crash DAGScheduler & SparkContext
Author: Reynold Xin <rx...@apache.org>
Closes #1772 from rxin/accumulator-dagscheduler and squashes the following commits:
6a58520 [Reynold Xin] [SPARK-2323] Exception in accumulator update should not crash DAGScheduler & SparkContext.
(cherry picked from commit 05bf4e4aff0d052a53d3e64c43688f07e27fec50)
Signed-off-by: Reynold Xin <rx...@apache.org>
commit a0922854909176a24cc689a7e8595303dcf62f3f
Author: Matei Zaharia <ma...@databricks.com>
Date: 2014-08-05T06:27:53Z
SPARK-2685. Update ExternalAppendOnlyMap to avoid buffer.remove()
Replaces buffer.remove() with an O(1) operation that does not have to shift
the whole tail of the array into the gap left by the removed element.
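The commit message does not spell out the replacement operation. A common O(1) technique, shown here only as an assumption about the general approach, is to move the last element into the vacated slot when element order does not matter. The function name swap_remove is hypothetical.

```python
def swap_remove(items, index):
    """Remove and return items[index] in O(1) by moving the last element
    into its slot. The order of the remaining elements is not preserved."""
    if not 0 <= index < len(items):
        raise IndexError("index out of range")
    last = items.pop()       # O(1): drop the final element
    if index == len(items):  # the removed element was the last one
        return last
    removed = items[index]
    items[index] = last      # fill the gap without shifting the tail
    return removed


values = [10, 20, 30, 40]
first_removed = swap_remove(values, 1)
```

The trade-off is that ordering is lost, so this only suits buffers whose consumers do not depend on element positions.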
Author: Matei Zaharia <ma...@databricks.com>
Closes #1773 from mateiz/SPARK-2685 and squashes the following commits:
1ea028a [Matei Zaharia] Update comments in StreamBuffer and EAOM, and reuse ArrayBuffers
eb1abfd [Matei Zaharia] Update ExternalAppendOnlyMap to avoid buffer.remove()
(cherry picked from commit 066765d60d21b6b9943862b788e4a4bd07396e6c)
Signed-off-by: Matei Zaharia <ma...@databricks.com>
commit d13d253fea6dd1f666c4c94087173f734843f2b5
Author: Matei Zaharia <ma...@databricks.com>
Date: 2014-08-05T06:41:03Z
SPARK-2711. Create a ShuffleMemoryManager to track memory for all spilling collections
This tracks memory properly when there are multiple spilling collections in the same task (which was a problem before). It also implements an algorithm that lets each thread grow up to 1/(2N) of the memory pool (where N is the number of threads) before spilling. This avoids an inefficiency with small spills that we had before, where some threads would spill many times at 0-1 MB because the pool was allocated elsewhere.
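The 1/(2N) rule can be illustrated with a toy model. This is a sketch under stated assumptions, not Spark's ShuffleMemoryManager: it is single-threaded, does no blocking or waiting, and the class ToyShuffleMemoryPool with its methods is hypothetical. It only shows partial grants and the decision that a thread spills once it already holds its 1/(2N) share.

```python
class ToyShuffleMemoryPool:
    """Toy model only; names and behavior are assumptions for illustration."""

    def __init__(self, pool_bytes):
        self.pool_bytes = pool_bytes
        self.held = {}  # thread id -> bytes currently granted

    def min_share(self):
        """Bytes each of the N known threads may reach before spilling: pool / 2N."""
        return self.pool_bytes // (2 * max(1, len(self.held)))

    def try_acquire(self, thread_id, num_bytes):
        """Grant as much of num_bytes as is free; the grant may be partial or zero."""
        held = self.held.setdefault(thread_id, 0)
        free = self.pool_bytes - sum(self.held.values())
        granted = max(0, min(num_bytes, free))
        self.held[thread_id] = held + granted
        return granted

    def should_spill(self, thread_id, requested, granted):
        """Spill only if the request fell short and the thread already holds
        at least its 1/(2N) share; a smaller holder would do better to retry."""
        return granted < requested and self.held[thread_id] >= self.min_share()

    def release_all(self, thread_id):
        """Return all of a thread's memory to the pool, e.g. after a spill."""
        self.held.pop(thread_id, None)


pool = ToyShuffleMemoryPool(1000)
granted_a = pool.try_acquire("a", 900)
granted_b = pool.try_acquire("b", 200)  # only part of this request fits
```

In this model a thread holding only a sliver of memory is not told to spill, which is the situation the message describes as many tiny spills at 0-1 MB.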
Author: Matei Zaharia <ma...@databricks.com>
Closes #1707 from mateiz/spark-2711 and squashes the following commits:
debf75b [Matei Zaharia] Review comments
24f28f3 [Matei Zaharia] Small rename
c8f3a8b [Matei Zaharia] Update ShuffleMemoryManager to be able to partially grant requests
315e3a5 [Matei Zaharia] Some review comments
b810120 [Matei Zaharia] Create central manager to track memory for all spilling collections
(cherry picked from commit 4fde28c2063f673ec7f51d514ba62a73321960a1)
Signed-off-by: Matei Zaharia <ma...@databricks.com>
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: Spark 3175: Branch-1.1 SBT build failed for Ya...
Posted by chesterxgchen <gi...@git.apache.org>.
GitHub user chesterxgchen commented on the pull request:
https://github.com/apache/spark/pull/2085#issuecomment-53016327
Let me close it and re-generate this
[GitHub] spark pull request: Spark 3175: Branch-1.1 SBT build failed for Ya...
Posted by vanzin <gi...@git.apache.org>.
GitHub user vanzin commented on the pull request:
https://github.com/apache/spark/pull/2085#issuecomment-53009003
@chesterxgchen could you check how you submitted the PR? You seem to be merging a lot of unrelated things here.
[GitHub] spark pull request: Spark 3175: Branch-1.1 SBT build failed for Ya...
Posted by chesterxgchen <gi...@git.apache.org>.
GitHub user chesterxgchen commented on the pull request:
https://github.com/apache/spark/pull/2085#issuecomment-53014761
I only changed one line of code in each PR.
That's strange; let me take a look.
> On Aug 21, 2014, at 5:56 PM, Marcelo Vanzin <no...@github.com> wrote:
>
> @chesterxgchen could you check how you submitted the PR? You seem to be merging a lot of unrelated things here.
[GitHub] spark pull request: Spark 3175: Branch-1.1 SBT build failed for Ya...
Posted by chesterxgchen <gi...@git.apache.org>.
GitHub user chesterxgchen closed the pull request at:
https://github.com/apache/spark/pull/2085
[GitHub] spark pull request: Spark 3175: Branch-1.1 SBT build failed for Ya...
Posted by AmplabJenkins <gi...@git.apache.org>.
GitHub user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/2085#issuecomment-52983538
Can one of the admins verify this patch?