Posted to reviews@spark.apache.org by damnMeddlingKid <gi...@git.apache.org> on 2015/10/28 23:21:57 UTC

[GitHub] spark pull request: New spark

GitHub user damnMeddlingKid opened a pull request:

    https://github.com/apache/spark/pull/9342

    New spark

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Shopify/spark new_spark

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9342.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9342
    
----
commit 60b922795d0d6a5e0db96c11416804153e307810
Author: Zhang, Liye <li...@intel.com>
Date:   2015-01-08T18:40:26Z

    [SPARK-4989][CORE] avoid wrong eventlog conf cause cluster down in standalone mode
    
    When event logging is enabled in standalone mode, a wrong configuration can bring the whole cluster down (the Master restarts and loses its connection to the workers).
    How to reproduce: give an invalid value to "spark.eventLog.dir", for example: spark.eventLog.dir=hdfs://tmp/logdir1, hdfs://tmp/logdir2. This throws an IllegalArgumentException, which causes the Master to restart and leaves the whole cluster unavailable.
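
    A minimal sketch of the defensive pattern the fix points toward; the helper name and logging are hypothetical, not the actual patch:

    ```scala
    import java.net.URI
    import scala.util.{Failure, Success, Try}

    // Hypothetical guard: validate spark.eventLog.dir up front so a
    // malformed URI becomes a logged warning instead of an exception
    // that restarts the Master.
    def safeEventLogDir(conf: Map[String, String]): Option[URI] =
      conf.get("spark.eventLog.dir").flatMap { dir =>
        Try(new URI(dir)) match {
          case Success(uri) => Some(uri)
          case Failure(e) =>
            Console.err.println(s"Ignoring invalid spark.eventLog.dir '$dir': ${e.getMessage}")
            None
        }
      }
    ```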
    
    Author: Zhang, Liye <li...@intel.com>
    
    Closes #3824 from liyezhang556520/wrongConf4Cluster and squashes the following commits:
    
    3c24d98 [Zhang, Liye] revert change with logwarning and excetption for FileNotFoundException
    3c1ac2e [Zhang, Liye] change var to val
    a49c52f [Zhang, Liye] revert wrong modification
    12eee85 [Zhang, Liye] add more message in log and on webUI
    5c1fa33 [Zhang, Liye] cache exceptions when eventlog with wrong conf

commit a9940b5a04c905698f17940669a161fcd414284f
Author: Kousuke Saruta <sa...@oss.nttdata.co.jp>
Date:   2015-01-08T19:35:56Z

    [Minor] Fix the value represented by spark.executor.id for consistency.
    
    The property `spark.executor.id` can represent both `driver` and `<driver>` for the same driver, which is inconsistent.
    
    This issue is minor, so I didn't file it in JIRA.
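
    A small illustration of the consistent check this enables, assuming the `SparkContext.DRIVER_IDENTIFIER` constant (a sketch, not part of the patch):

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("id-check"))
    // With one canonical identifier, callers no longer need to
    // special-case both "driver" and "<driver>":
    val isDriver = sc.getConf.get("spark.executor.id") == SparkContext.DRIVER_IDENTIFIER
    ```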
    
    Author: Kousuke Saruta <sa...@oss.nttdata.co.jp>
    
    Closes #3812 from sarutak/fix-driver-identifier and squashes the following commits:
    
    d885498 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into fix-driver-identifier
    4275663 [Kousuke Saruta] Fixed the value represented by spark.executor.id of local mode

commit b4fb97df2cbdd743656e000fefe471406619220c
Author: WangTaoTheTonic <ba...@aliyun.com>
Date:   2015-01-08T19:45:42Z

    [SPARK-5130][Deploy]Take yarn-cluster as cluster mode in spark-submit
    
    https://issues.apache.org/jira/browse/SPARK-5130
    
    Author: WangTaoTheTonic <ba...@aliyun.com>
    
    Closes #3929 from WangTaoTheTonic/SPARK-5130 and squashes the following commits:
    
    c490648 [WangTaoTheTonic] take yarn-cluster as cluster mode in spark-submit

commit 31d67152c2cbbe2e076003b3ff0d0a7e2f801549
Author: Eric Moyer <er...@yahoo.com>
Date:   2015-01-08T19:55:23Z

    Document that groupByKey will OOM for large keys
    
    This pull request is my own work and I license it under Spark's open-source license.
    
    This contribution is an improvement to the documentation. I documented that the maximum number of values per key for groupByKey is limited by available RAM (see [Datablox][datablox link] and [the spark mailing list][list link]).
    
    Just saying that better performance is available elsewhere is not sufficient: sometimes you genuinely need a group-by, because your operation needs all the items for a key in order to complete. This warning explains the problem.
    
    [datablox link]: http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
    [list link]: http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-RDD-GroupBy-OutOfMemory-Exceptions-tp11427p11466.html
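
    A short sketch of the trade-off the new docs warn about (standard word-count shape; names arbitrary):

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("wordcount"))
    val pairs = sc.parallelize(Seq("a", "b", "a", "c", "b", "a")).map(w => (w, 1))

    // groupByKey materializes every value for a key in memory at once,
    // so a single very hot key can exhaust an executor's RAM:
    val viaGroup = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey combines map-side first, keeping only one running
    // total per key instead of the full list of values:
    val viaReduce = pairs.reduceByKey(_ + _)
    ```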
    
    Author: Eric Moyer <er...@yahoo.com>
    
    Closes #3936 from RadixSeven/better-group-by-docs and squashes the following commits:
    
    5b6f4e9 [Eric Moyer] groupByKey docs naming updates
    238e81b [Eric Moyer] Doc that groupByKey will OOM for large keys

commit 854319e589c89b2b6b4a9d02916f6f748fc5680a
Author: Fernando Otero (ZeoS) <fo...@gmail.com>
Date:   2015-01-08T20:42:54Z

    SPARK-5148 [MLlib] Make usersOut/productsOut storagelevel in ALS configurable
    
    Author: Fernando Otero (ZeoS) <fo...@gmail.com>
    
    Closes #3953 from zeitos/storageLevel and squashes the following commits:
    
    0f070b9 [Fernando Otero (ZeoS)] fix imports
    6869e80 [Fernando Otero (ZeoS)] fix comment length
    90c9f7e [Fernando Otero (ZeoS)] fix comment length
    18a992e [Fernando Otero (ZeoS)] changing storage level

commit d9cad94b1df0200207ba03fb0168373ccc3a8597
Author: Kousuke Saruta <sa...@oss.nttdata.co.jp>
Date:   2015-01-08T21:43:09Z

    [SPARK-4973][CORE] Local directory in the driver of client-mode continues remaining even if application finished when external shuffle is enabled
    
    When the external shuffle service is enabled, the local directories in the driver of a client-mode application remain even after the application has finished.
    I think the local directories for drivers should be deleted.
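
    A simplified sketch of the shape of the fix (the real logic lives in DiskBlockManager; the names here are illustrative):

    ```scala
    import java.io.File

    // Even when the external shuffle service keeps executor directories
    // alive after executor exit, the driver's own local directories can
    // be removed on shutdown, since no shuffle server serves from them.
    def stopAndCleanup(localDirs: Seq[File], isDriver: Boolean,
                       externalShuffleEnabled: Boolean): Unit = {
      if (!externalShuffleEnabled || isDriver) {
        localDirs.foreach(deleteRecursively)
      }
    }

    def deleteRecursively(f: File): Unit = {
      if (f.isDirectory) Option(f.listFiles).foreach(_.foreach(deleteRecursively))
      f.delete()
    }
    ```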
    
    Author: Kousuke Saruta <sa...@oss.nttdata.co.jp>
    
    Closes #3811 from sarutak/SPARK-4973 and squashes the following commits:
    
    ad944ab [Kousuke Saruta] Fixed DiskBlockManager to cleanup local directory if it's the driver
    43770da [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-4973
    88feecd [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-4973
    d99718e [Kousuke Saruta] Fixed SparkSubmit.scala and DiskBlockManager.scala in order to delete local directories of the driver of local-mode when external shuffle service is enabled

commit b14068bf7b2dff450101d48a59e79761e3ca4eb2
Author: RJ Nowling <rn...@gmail.com>
Date:   2015-01-08T23:03:43Z

    [SPARK-4891][PySpark][MLlib] Add gamma/log normal/exp dist sampling to PySpark MLlib
    
    This is a follow-up to PR #3680 (https://github.com/apache/spark/pull/3680).
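
    For reference, the Scala counterparts of the samplers this PR exposes to PySpark (a sketch assuming an existing SparkContext):

    ```scala
    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.random.RandomRDDs

    def drawSamples(sc: SparkContext): Unit = {
      val gamma   = RandomRDDs.gammaRDD(sc, 2.0, 2.0, 1000L)      // shape, scale, size
      val logNorm = RandomRDDs.logNormalRDD(sc, 0.0, 1.0, 1000L)  // mean and std of the underlying normal
      val expo    = RandomRDDs.exponentialRDD(sc, 1.0, 1000L)     // mean, size
    }
    ```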
    
    Author: RJ Nowling <rn...@gmail.com>
    
    Closes #3955 from rnowling/spark4891 and squashes the following commits:
    
    1236a01 [RJ Nowling] Fix Python style issues
    7a01a78 [RJ Nowling] Fix Python style issues
    174beab [RJ Nowling] [SPARK-4891][PySpark][MLlib] Add gamma/log normal/exp dist sampling to PySpark MLlib

commit 5a1b7a9c8a77b6d1ef5553490d0ccf291dfac06f
Author: Marcelo Vanzin <va...@cloudera.com>
Date:   2015-01-09T01:15:13Z

    [SPARK-4048] Enhance and extend hadoop-provided profile.
    
    This change does a few things to make the hadoop-provided profile more useful:
    
    - Create new profiles for other libraries / services that might be provided by the infrastructure
    - Simplify and fix the poms so that the profiles are only activated while building assemblies.
    - Fix tests so that they're able to run when the profiles are activated
    - Add a new env variable to be used by distributions that use these profiles to provide the runtime
      classpath for Spark jobs and daemons.
    
    Author: Marcelo Vanzin <va...@cloudera.com>
    
    Closes #2982 from vanzin/SPARK-4048 and squashes the following commits:
    
    82eb688 [Marcelo Vanzin] Add a comment.
    eb228c0 [Marcelo Vanzin] Fix borked merge.
    4e38f4e [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
    9ef79a3 [Marcelo Vanzin] Alternative way to propagate test classpath to child processes.
    371ebee [Marcelo Vanzin] Review feedback.
    52f366d [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
    83099fc [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
    7377e7b [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
    322f882 [Marcelo Vanzin] Fix merge fail.
    f24e9e7 [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
    8b00b6a [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
    9640503 [Marcelo Vanzin] Cleanup child process log message.
    115fde5 [Marcelo Vanzin] Simplify a comment (and make it consistent with another pom).
    e3ab2da [Marcelo Vanzin] Fix hive-thriftserver profile.
    7820d58 [Marcelo Vanzin] Fix CliSuite with provided profiles.
    1be73d4 [Marcelo Vanzin] Restore flume-provided profile.
    d1399ed [Marcelo Vanzin] Restore jetty dependency.
    82a54b9 [Marcelo Vanzin] Remove unused profile.
    5c54a25 [Marcelo Vanzin] Fix HiveThriftServer2Suite with *-provided profiles.
    1fc4d0b [Marcelo Vanzin] Update dependencies for hive-thriftserver.
    f7b3bbe [Marcelo Vanzin] Add snappy to hadoop-provided list.
    9e4e001 [Marcelo Vanzin] Remove duplicate hive profile.
    d928d62 [Marcelo Vanzin] Redirect child stderr to parent's log.
    4d67469 [Marcelo Vanzin] Propagate SPARK_DIST_CLASSPATH on Yarn.
    417d90e [Marcelo Vanzin] Introduce "SPARK_DIST_CLASSPATH".
    2f95f0d [Marcelo Vanzin] Propagate classpath to child processes during testing.
    1adf91c [Marcelo Vanzin] Re-enable maven-install-plugin for a few projects.
    284dda6 [Marcelo Vanzin] Rework the "hadoop-provided" profile, add new ones.

commit 013e031d01dca052b94a094c08b7d7f76f640711
Author: Nicholas Chammas <ni...@gmail.com>
Date:   2015-01-09T01:42:08Z

    [SPARK-5122] Remove Shark from spark-ec2
    
    I moved the Spark-Shark version map [to the wiki](https://cwiki.apache.org/confluence/display/SPARK/Spark-Shark+version+mapping).
    
    This PR has a [matching PR in mesos/spark-ec2](https://github.com/mesos/spark-ec2/pull/89).
    
    Author: Nicholas Chammas <ni...@gmail.com>
    
    Closes #3939 from nchammas/remove-shark and squashes the following commits:
    
    66e0841 [Nicholas Chammas] fix style
    ceeab85 [Nicholas Chammas] show default Spark GitHub repo
    7270126 [Nicholas Chammas] validate Spark hashes
    db4935d [Nicholas Chammas] validate spark version upfront
    fc0d5b9 [Nicholas Chammas] remove Shark

commit 8a95a3e61580b1c1f6c0a3e124aa8469255db968
Author: WangTaoTheTonic <ba...@aliyun.com>
Date:   2015-01-09T14:10:09Z

    [SPARK-5169][YARN]fetch the correct max attempts
    
    Sorry for fetching the wrong max attempts in commit https://github.com/apache/spark/commit/8fdd48959c93b9cf809f03549e2ae6c4687d1fcd.
    We need to fix it now.
    
    tgravescs
    
    If we set spark.yarn.maxAppAttempts to a value larger than `yarn.resourcemanager.am.max-attempts` on the YARN side, it will be overridden, as described here:
    >The maximum number of application attempts. It's a global setting for all application masters. Each application master can specify its individual maximum number of application attempts via the API, but the individual number cannot be more than the global upper bound. If it is, the resourcemanager will override it. The default number is set to 2, to allow at least one retry for AM.
    
    http://hadoop.apache.org/docs/r2.6.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
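
    The capping rule in code form (a sketch; `effectiveMaxAttempts` is illustrative, though the YarnConfiguration constants are real):

    ```scala
    import org.apache.hadoop.yarn.conf.YarnConfiguration

    // The ResourceManager overrides any per-application value that
    // exceeds the global bound, so the usable number of attempts is:
    def effectiveMaxAttempts(sparkMaxAppAttempts: Option[Int],
                             yarnConf: YarnConfiguration): Int = {
      val yarnMax = yarnConf.getInt(
        YarnConfiguration.RM_AM_MAX_ATTEMPTS,
        YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS) // defaults to 2
      sparkMaxAppAttempts.fold(yarnMax)(v => math.min(v, yarnMax))
    }
    ```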
    
    Author: WangTaoTheTonic <ba...@aliyun.com>
    
    Closes #3942 from WangTaoTheTonic/HOTFIX and squashes the following commits:
    
    9ac16ce [WangTaoTheTonic] fetch the correct max attempts

commit 82f1259aba249285fd271f9f20e095409cb4d20b
Author: Aaron Davidson <aa...@databricks.com>
Date:   2015-01-09T17:20:16Z

    [Minor] Fix test RetryingBlockFetcherSuite after changed config name
    
    Flaky due to the default retry interval being the same as our test's wait timeout.
    
    Author: Aaron Davidson <aa...@databricks.com>
    
    Closes #3972 from aarondav/fix-test and squashes the following commits:
    
    db77cab [Aaron Davidson] [Minor] Fix test after changed config name

commit 2f2b837e33eca6010fad3ad22c7d298fa6d042c9
Author: Sean Owen <so...@cloudera.com>
Date:   2015-01-09T17:35:46Z

    SPARK-5136 [DOCS] Improve documentation around setting up Spark IntelliJ project
    
    This PR simply points to the IntelliJ wiki page instead of also including IntelliJ notes in the docs. The intent, however, is to also update the wiki page with updated tips. This is the text I propose for the IntelliJ section on the wiki. I realize it omits some of the existing instructions on the wiki about enabling Hive, but I think those are actually optional.
    
    ------
    
    IntelliJ supports both Maven- and SBT-based projects. It is recommended, however, to import Spark as a Maven project. Choose "Import Project..." from the File menu, and select the `pom.xml` file in the Spark root directory.
    
    It is fine to leave all settings at their default values in the Maven import wizard, with two caveats. First, it is usually useful to enable "Import Maven projects automatically", since changes to the project structure will then automatically update the IntelliJ project.
    
    Second, note the step that prompts you to choose active Maven build profiles. As documented above, some build configurations require specific profiles to be enabled. The same profiles that are enabled with `-P[profile name]` above may be enabled on this screen. For example, if developing for Hadoop 2.4 with YARN support, enable the `yarn` and `hadoop-2.4` profiles.
    
    These selections can be changed later by accessing the "Maven Projects" tool window from the View menu, and expanding the Profiles section.
    
    "Rebuild Project" can fail the first time the project is compiled, because generate source files are not automatically generated. Try clicking the  "Generate Sources and Update Folders For All Projects" button in the "Maven Projects" tool window to manually generate these sources.
    
    Compilation may fail with an error like "scalac: bad option: -P:/home/jakub/.m2/repository/org/scalamacros/paradise_2.10.4/2.0.1/paradise_2.10.4-2.0.1.jar". If so, go to Preferences > Build, Execution, Deployment > Scala Compiler and clear the "Additional compiler options" field. Compilation will then work, although the option will return when the project is reimported.
    
    Author: Sean Owen <so...@cloudera.com>
    
    Closes #3952 from srowen/SPARK-5136 and squashes the following commits:
    
    f3baa66 [Sean Owen] Point to new IJ / Eclipse wiki link
    016b7df [Sean Owen] Point to IntelliJ wiki page instead of also including IntelliJ notes in the docs

commit 37fea2dde60567baa69e031ed8a7895d1b923429
Author: Patrick Wendell <pw...@gmail.com>
Date:   2015-01-09T17:40:18Z

    HOTFIX: Minor improvements to make-distribution.sh
    
    1. Renames $FWDIR to $SPARK_HOME (vast majority of diff).
    2. Use Spark-provided Maven.
    3. Logs build flags in the RELEASE file.
    
    Author: Patrick Wendell <pw...@gmail.com>
    
    Closes #3973 from pwendell/master and squashes the following commits:
    
    340a2fa [Patrick Wendell] HOTFIX: Minor improvements to make-distribution.sh

commit 0a3aa5fac073e60d09a4afa2cd2a90f6faa2982c
Author: Kay Ousterhout <ka...@gmail.com>
Date:   2015-01-09T17:47:06Z

    [SPARK-1143] Separate pool tests into their own suite.
    
    The current TaskSchedulerImplSuite includes some tests that are
    actually for the TaskSchedulerImpl, but the remainder of the tests avoid using
    the TaskSchedulerImpl entirely, and actually test the pool and scheduling
    algorithm mechanisms. This commit separates the pool/scheduling algorithm
    tests into their own suite, and also simplifies those tests.
    
    The pull request replaces #339.
    
    Author: Kay Ousterhout <ka...@gmail.com>
    
    Closes #3967 from kayousterhout/SPARK-1143 and squashes the following commits:
    
    8a898c4 [Kay Ousterhout] [SPARK-1143] Separate pool tests into their own suite.

commit d2a450c8ab1669acfe6007ae87bec4dde60fea7e
Author: Liang-Chi Hsieh <vi...@gmail.com>
Date:   2015-01-09T18:27:33Z

    [SPARK-5145][Mllib] Add BLAS.dsyr and use it in GaussianMixtureEM
    
    This PR uses BLAS.dsyr to replace a few implementations in GaussianMixtureEM.
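
    For reference, dsyr is the BLAS symmetric rank-1 update A := alpha * x * x^T + A. A plain-Scala sketch of the operation (column-major storage, upper triangle only, one common BLAS convention):

    ```scala
    // A is an n x n symmetric matrix stored column-major in a flat array;
    // only the upper triangle is touched, matching the BLAS "U" mode.
    def dsyr(n: Int, alpha: Double, x: Array[Double], a: Array[Double]): Unit = {
      var j = 0
      while (j < n) {
        var i = 0
        while (i <= j) {
          a(j * n + i) += alpha * x(i) * x(j)
          i += 1
        }
        j += 1
      }
    }
    ```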
    
    Author: Liang-Chi Hsieh <vi...@gmail.com>
    
    Closes #3949 from viirya/blas_dsyr and squashes the following commits:
    
    4e4d6cf [Liang-Chi Hsieh] Add unit test. Rename function name, modify doc and style.
    3f57fd2 [Liang-Chi Hsieh] Add BLAS.dsyr and use it in GaussianMixtureEM.

commit 831a0d287203392bead89d2c553919bb2fb4456a
Author: Jongyoul Lee <jo...@gmail.com>
Date:   2015-01-09T18:47:08Z

    [SPARK-3619] Upgrade to Mesos 0.21 to work around MESOS-1688
    
    - update version from 0.18.1 to 0.21.0
    - I'm running some tests to verify that Spark jobs work fine in a Mesos 0.21.0 environment.
    
    Author: Jongyoul Lee <jo...@gmail.com>
    
    Closes #3934 from jongyoul/SPARK-3619 and squashes the following commits:
    
    ab994fa [Jongyoul Lee] [SPARK-3619] Upgrade to Mesos 0.21 to work around MESOS-1688 - update version from 0.18.1 to 0.21.0

commit 40d8a94b1445e10a31f9dbbf7ff0757e7f159f2c
Author: Joseph K. Bradley <jo...@databricks.com>
Date:   2015-01-09T21:00:15Z

    [SPARK-5015] [mllib] Random seed for GMM + make test suite deterministic
    
    Issues:
    * From JIRA: GaussianMixtureEM uses randomness but does not take a random seed. It should take one as a parameter.
    * This also makes the test suite flaky since initialization can fail due to stochasticity.
    
    Fix:
    * Add random seed
    * Use it in test suite
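
    A sketch of the resulting usage, with setter names assumed from MLlib's builder convention (treat the exact API as an assumption):

    ```scala
    import org.apache.spark.mllib.clustering.GaussianMixtureEM
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // Fixing the seed makes initialization, and therefore the test
    // suite, deterministic.
    def fitDeterministic(data: RDD[Vector]) =
      new GaussianMixtureEM().setK(3).setSeed(42L).run(data)
    ```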
    
    CC: mengxr  tgaloppo
    
    Author: Joseph K. Bradley <jo...@databricks.com>
    
    Closes #3981 from jkbradley/gmm-seed and squashes the following commits:
    
    f0df4fd [Joseph K. Bradley] Added seed parameter to GMM.  Updated test suite to use seed to prevent flakiness

commit 7884948b953161e8df6d6a97e8ec37f69f3597e3
Author: WangTaoTheTonic <ba...@aliyun.com>
Date:   2015-01-09T21:20:32Z

    [SPARK-1953][YARN]yarn client mode Application Master memory size is same as driver memory size
    
    Ways to set the Application Master's memory in yarn-client mode:
    1.  `spark.yarn.am.memory` in SparkConf or system properties
    2.  the default value, 512m
    
    Note: this argument is only available in yarn-client mode.
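
    A minimal usage sketch for the setting described above:

    ```scala
    import org.apache.spark.SparkConf

    // In yarn-client mode the Application Master is not the driver, so
    // its memory is configured separately from spark.driver.memory:
    val conf = new SparkConf()
      .set("spark.yarn.am.memory", "1g") // falls back to 512m when unset
    ```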
    
    Author: WangTaoTheTonic <ba...@aliyun.com>
    
    Closes #3607 from WangTaoTheTonic/SPARK4181 and squashes the following commits:
    
    d5ceb1b [WangTaoTheTonic] spark.driver.memeory is used in both modes
    6c1b264 [WangTaoTheTonic] rebase
    b8410c0 [WangTaoTheTonic] minor optiminzation
    ddcd592 [WangTaoTheTonic] fix the bug produced in rebase and some improvements
    3bf70cc [WangTaoTheTonic] rebase and give proper hint
    987b99d [WangTaoTheTonic] disable --driver-memory in client mode
    2b27928 [WangTaoTheTonic] inaccurate description
    b7acbb2 [WangTaoTheTonic] incorrect method invoked
    2557c5e [WangTaoTheTonic] missing a single blank
    42075b0 [WangTaoTheTonic] arrange the args and warn logging
    69c7dba [WangTaoTheTonic] rebase
    1960d16 [WangTaoTheTonic] fix wrong comment
    7fa9e2e [WangTaoTheTonic] log a warning
    f6bee0e [WangTaoTheTonic] docs issue
    d619996 [WangTaoTheTonic] Merge branch 'master' into SPARK4181
    b09c309 [WangTaoTheTonic] use code format
    ab16bb5 [WangTaoTheTonic] fix bug and add comments
    44e48c2 [WangTaoTheTonic] minor fix
    6fd13e1 [WangTaoTheTonic] add overhead mem and remove some configs
    0566bb8 [WangTaoTheTonic] yarn client mode Application Master memory size is same as driver memory size

commit a4f1946e4c42d1e350199b927018bfe9ed337929
Author: mcheah <mc...@palantir.com>
Date:   2015-01-09T22:16:20Z

    [SPARK-4737] Task set manager properly handles serialization errors
    
    Dealing with [SPARK-4737], the handling of serialization errors should not be the DAGScheduler's responsibility. The task set manager now catches the error and aborts the stage.
    
    If the TaskSetManager throws a TaskNotSerializableException, the TaskSchedulerImpl will return an empty list of task descriptions, because no tasks were started. The scheduler should abort the stage gracefully.
    
    Note that I'm not too familiar with this part of the codebase or its place in the overall architecture of the Spark stack. If implementing it this way will have any adverse side effects, please voice that loudly.
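
    An illustrative sketch of the behaviour described above (the names of the surrounding pieces are hypothetical):

    ```scala
    import java.io.NotSerializableException
    import java.nio.ByteBuffer

    // Try to serialize a task before launching it; if that fails, abort
    // the stage instead of leaving the DAGScheduler to handle it later.
    def trySerialize(serialize: AnyRef => ByteBuffer,
                     task: AnyRef,
                     abortStage: String => Unit): Option[ByteBuffer] =
      try Some(serialize(task))
      catch {
        case e: NotSerializableException =>
          abortStage(s"Task not serializable: ${e.getMessage}")
          None
      }
    ```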
    
    Author: mcheah <mc...@palantir.com>
    
    Closes #3638 from mccheah/task-set-manager-properly-handle-ser-err and squashes the following commits:
    
    1545984 [mcheah] Some more style fixes from Andrew Or.
    5267929 [mcheah] Fixing style suggestions from Andrew Or.
    dfa145b [mcheah] Fixing style from Josh Rosen's feedback
    b2a430d [mcheah] Not returning empty seq when a task set cannot be serialized.
    94844d7 [mcheah] Fixing compilation error, one brace too many
    5f486f4 [mcheah] Adding license header for fake task class
    bf5e706 [mcheah] Fixing indentation.
    097e7a2 [mcheah] [SPARK-4737] Catching task serialization exception in TaskSetManager

commit 30f7f1744c6441fae1e8299a27046d06d105b2e6
Author: Kousuke Saruta <sa...@oss.nttdata.co.jp>
Date:   2015-01-09T22:40:45Z

    [DOC] Fixed Mesos version in doc from 0.18.1 to 0.21.0
    
    #3934 upgraded the Mesos version, so we should also fix the docs, right?
    
    This issue is really minor, so I didn't file it in JIRA.
    
    Author: Kousuke Saruta <sa...@oss.nttdata.co.jp>
    
    Closes #3982 from sarutak/fix-mesos-version and squashes the following commits:
    
    9a86ee3 [Kousuke Saruta] Fixed mesos version from 0.18.1 to 0.21.0

commit a675d98ffec5054c1e0818b737609a34be9be983
Author: bilna <bi...@am.amrita.edu>
Date:   2015-01-09T22:45:28Z

    [Minor] Fix import order and other coding style
    
    Fixed import order and other coding-style issues.
    
    Author: bilna <bi...@am.amrita.edu>
    Author: Bilna P <bi...@gmail.com>
    
    Closes #3966 from Bilna/master and squashes the following commits:
    
    5e76f04 [bilna] fix import order and other coding style
    5718d66 [bilna] Merge remote-tracking branch 'upstream/master'
    ae56514 [bilna] Merge remote-tracking branch 'upstream/master'
    acea3a3 [bilna] Adding dependency with scope test
    28681fa [bilna] Merge remote-tracking branch 'upstream/master'
    fac3904 [bilna] Correction in Indentation and coding style
    ed9db4c [bilna] Merge remote-tracking branch 'upstream/master'
    4b34ee7 [Bilna P] Update MQTTStreamSuite.scala
    04503cf [bilna] Added embedded broker service for mqtt test
    89d804e [bilna] Merge remote-tracking branch 'upstream/master'
    fc8eb28 [bilna] Merge remote-tracking branch 'upstream/master'
    4b58094 [Bilna P] Update MQTTStreamSuite.scala
    b1ac4ad [bilna] Added BeforeAndAfter
    5f6bfd2 [bilna] Added BeforeAndAfter
    e8b6623 [Bilna P] Update MQTTStreamSuite.scala
    5ca6691 [Bilna P] Update MQTTStreamSuite.scala
    8616495 [bilna] [SPARK-4631] unit test for MQTT

commit 37a27b427dc7ae8fe731907472b38a2e5ff54ae8
Author: WangTaoTheTonic <ba...@aliyun.com>
Date:   2015-01-10T01:10:02Z

    [SPARK-4990][Deploy]to find default properties file, search SPARK_CONF_DIR first
    
    https://issues.apache.org/jira/browse/SPARK-4990
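
    A sketch of the lookup order this change introduces (the helper name is hypothetical; the real logic lives in the spark-submit scripts):

    ```scala
    import java.io.File

    // An explicit --properties-file wins, then SPARK_CONF_DIR, then the
    // bundled SPARK_HOME/conf directory.
    def defaultPropertiesFile(explicit: Option[String]): Option[File] = {
      val candidates =
        explicit.toSeq ++
        sys.env.get("SPARK_CONF_DIR").map(_ + "/spark-defaults.conf") ++
        sys.env.get("SPARK_HOME").map(_ + "/conf/spark-defaults.conf")
      candidates.map(new File(_)).find(_.isFile)
    }
    ```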
    
    Author: WangTaoTheTonic <ba...@aliyun.com>
    Author: WangTao <ba...@aliyun.com>
    
    Closes #3823 from WangTaoTheTonic/SPARK-4990 and squashes the following commits:
    
    133c43e [WangTao] Update spark-submit2.cmd
    b1ab402 [WangTao] Update spark-submit
    4cc7f34 [WangTaoTheTonic] rebase
    55300bc [WangTaoTheTonic] use export to make it global
    d8d3cb7 [WangTaoTheTonic] remove blank line
    07b9ebf [WangTaoTheTonic] check SPARK_CONF_DIR instead of checking properties file
    c5a85eb [WangTaoTheTonic] to find default properties file, search SPARK_CONF_DIR first

commit 0a9c325e6a2d0028c30f3e13e6bc6c7e71170929
Author: MechCoder <ma...@gmail.com>
Date:   2015-01-10T01:45:18Z

    [SPARK-4406] [MLib] FIX: Validate k in SVD
    
    Raise an exception when k is non-positive in SVD.
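
    The gist of the fix, as a sketch (not the exact patch):

    ```scala
    // Fail fast with a clear message instead of a confusing error
    // deep inside the decomposition.
    def validateK(k: Int): Unit =
      require(k > 0, s"Requested $k singular values but k must be positive.")
    ```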
    
    Author: MechCoder <ma...@gmail.com>
    
    Closes #3945 from MechCoder/spark-4406 and squashes the following commits:
    
    64e6d2d [MechCoder] TST: Add better test errors and messages
    12dae73 [MechCoder] [SPARK-4406] FIX: Validate k in SVD

commit 29534b6bf401043123aba92473389946bb84946a
Author: luogankun <lu...@gmail.com>
Date:   2015-01-10T04:38:41Z

    [SPARK-5141][SQL]CaseInsensitiveMap throws java.io.NotSerializableException
    
    CaseInsensitiveMap throws java.io.NotSerializableException.
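
    The likely shape of the fix, sketched: the wrapper is captured in closures shipped to executors, so it must be Serializable.

    ```scala
    // A case-insensitive view over a String map; without "with
    // Serializable" it fails to ship to executors.
    class CaseInsensitiveMap(map: Map[String, String])
      extends Map[String, String] with Serializable {

      private val baseMap = map.map(kv => kv.copy(_1 = kv._1.toLowerCase))

      override def get(k: String): Option[String] = baseMap.get(k.toLowerCase)
      override def iterator: Iterator[(String, String)] = baseMap.iterator
      override def -(key: String): Map[String, String] = baseMap - key.toLowerCase
      override def +[B1 >: String](kv: (String, B1)): Map[String, B1] = baseMap + kv
    }
    ```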
    
    Author: luogankun <lu...@gmail.com>
    
    Closes #3944 from luogankun/SPARK-5141 and squashes the following commits:
    
    b6d63d5 [luogankun] [SPARK-5141]CaseInsensitiveMap throws java.io.NotSerializableException

commit 5d2bb0fffeb2e3cae744b410b55cef99595f0af1
Author: Alex Liu <al...@yahoo.com>
Date:   2015-01-10T21:19:12Z

    [SPARK-4925][SQL] Publish Spark SQL hive-thriftserver maven artifact
    
    Author: Alex Liu <al...@yahoo.com>
    
    Closes #3766 from alexliu68/SPARK-SQL-4925 and squashes the following commits:
    
    3137b51 [Alex Liu] [SPARK-4925][SQL] Remove sql/hive-thriftserver module from pom.xml
    15f2e38 [Alex Liu] [SPARK-4925][SQL] Publish Spark SQL hive-thriftserver maven artifact

commit cf5686b922a90612cea185c882033989b391a021
Author: Alex Liu <al...@yahoo.com>
Date:   2015-01-10T21:23:09Z

    [SPARK-4943][SQL] Allow table name having dot for db/catalog
    
    This pull request only fixes the parsing error and changes the API to use tableIdentifier. Changes related to joining data sources from different catalogs are not included in this pull request.
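
    A small illustration of the now-parseable form (the names are hypothetical):

    ```scala
    import org.apache.spark.sql.SQLContext

    def query(sqlContext: SQLContext) =
      // A table name qualified with a dot (db/catalog prefix) now parses:
      sqlContext.sql("SELECT * FROM mydb.mytable")
    ```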
    
    Author: Alex Liu <al...@yahoo.com>
    
    Closes #3941 from alexliu68/SPARK-SQL-4943-3 and squashes the following commits:
    
    343ae27 [Alex Liu] [SPARK-4943][SQL] refactoring according to review
    29e5e55 [Alex Liu] [SPARK-4943][SQL] fix failed Hive CTAS tests
    6ae77ce [Alex Liu] [SPARK-4943][SQL] fix TestHive matching error
    3652997 [Alex Liu] [SPARK-4943][SQL] Allow table name having dot to support db/catalog ...

commit 37a79554360b7809a1b7413f831a8e91d68400d6
Author: scwf <wa...@huawei.com>
Date:   2015-01-10T21:53:21Z

    [SPARK-4574][SQL] Adding support for defining schema in foreign DDL commands.
    
    Adding support for defining a schema in foreign DDL commands. Foreign DDL now supports commands like:
    ```
    CREATE TEMPORARY TABLE avroTable
    USING org.apache.spark.sql.avro
    OPTIONS (path "../hive/src/test/resources/data/files/episodes.avro")
    ```
    With this PR the user can define a schema instead of inferring it from the file, supporting DDL commands as follows:
    ```
    CREATE TEMPORARY TABLE avroTable(a int, b string)
    USING org.apache.spark.sql.avro
    OPTIONS (path "../hive/src/test/resources/data/files/episodes.avro")
    ```
    
    Author: scwf <wa...@huawei.com>
    Author: Yin Huai <yh...@databricks.com>
    Author: Fei Wang <wa...@huawei.com>
    Author: wangfei <wa...@huawei.com>
    
    Closes #3431 from scwf/ddl and squashes the following commits:
    
    7e79ce5 [Fei Wang] Merge pull request #22 from yhuai/pr3431yin
    38f634e [Yin Huai] Remove Option from createRelation.
    65e9c73 [Yin Huai] Revert all changes since applying a given schema has not been testd.
    a852b10 [scwf] remove cleanIdentifier
    f336a16 [Fei Wang] Merge pull request #21 from yhuai/pr3431yin
    baf79b5 [Yin Huai] Test special characters quoted by backticks.
    50a03b0 [Yin Huai] Use JsonRDD.nullTypeToStringType to convert NullType to StringType.
    1eeb769 [Fei Wang] Merge pull request #20 from yhuai/pr3431yin
    f5c22b0 [Yin Huai] Refactor code and update test cases.
    f1cffe4 [Yin Huai] Revert "minor refactory"
    b621c8f [scwf] minor refactory
    d02547f [scwf] fix HiveCompatibilitySuite test failure
    8dfbf7a [scwf] more tests for complex data type
    ddab984 [Fei Wang] Merge pull request #19 from yhuai/pr3431yin
    91ad91b [Yin Huai] Parse data types in DDLParser.
    cf982d2 [scwf] fixed test failure
    445b57b [scwf] address comments
    02a662c [scwf] style issue
    44eb70c [scwf] fix decimal parser issue
    83b6fc3 [scwf] minor fix
    9bf12f8 [wangfei] adding test case
    7787ec7 [wangfei] added SchemaRelationProvider
    0ba70df [wangfei] draft version

commit 94b489f8d3966f5133b75be4d79818a3b19a717d
Author: scwf <wa...@huawei.com>
Date:   2015-01-10T22:08:04Z

    [SPARK-4861][SQL] Refactory command in spark sql
    
    Follow-up for #3712.
    This PR finally removes ```CommandStrategy``` and makes all commands follow ```RunnableCommand```, so they can all go through ```case r: RunnableCommand => ExecutedCommand(r) :: Nil```.
    
    The one exception is Hive's ```DescribeCommand```, which is a special case that needs to distinguish Hive tables from temporary tables, so ```HiveCommandStrategy``` is kept for it.
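
    The shape of the unified interface, simplified (the real trait also extends a Catalyst logical node):

    ```scala
    import org.apache.spark.sql.{Row, SQLContext}

    // Every command implements run(), so the planner can wrap any of
    // them uniformly as ExecutedCommand(r).
    trait RunnableCommand {
      def run(sqlContext: SQLContext): Seq[Row]
    }
    ```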
    
    Author: scwf <wa...@huawei.com>
    
    Closes #3948 from scwf/followup-SPARK-4861 and squashes the following commits:
    
    6b48e64 [scwf] minor style fix
    2c62e9d [scwf] fix for hive module
    5a7a819 [scwf] Refactory command in spark sql

commit 447f643adf7ea2f89018ac380412eb5dc7133af5
Author: Yanbo Liang <ya...@gmail.com>
Date:   2015-01-10T22:16:37Z

    SPARK-4963 [SQL] Add copy to SQL's Sample operator
    
    https://issues.apache.org/jira/browse/SPARK-4963
    SchemaRDD.sample() returns wrong results because GapSamplingIterator operates on mutable rows.
    HiveTableScan builds its RDD from SpecificMutableRow instances, and SchemaRDD.sample() returns a GapSamplingIterator for iteration.
    
    override def next(): T = {
      val r = data.next()
      advance
      r
    }
    
    GapSamplingIterator.next() returns the current underlying element and assigns it to r.
    However, if the underlying iterator yields mutable rows, as HiveTableScan's does, then the underlying iterator and r point to the same object.
    The advance operation then drops some underlying elements and, unexpectedly, also mutates r, so we return a value different from the initial r.
    
    The most direct way to fix this issue is to make HiveTableScan return mutable rows with copy, as in my initial commit. With this solution HiveTableScan cannot take full advantage of the reusable MutableRow, but the sample operation returns correct results.
    Furthermore, we should investigate GapSamplingIterator.next() and implement a copy operation inside it. To achieve this, every element an RDD can store would have to implement something like Cloneable, which would be a huge change.
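
    A sketch of the failure mode and the direct fix, with a simple mutable class standing in for SpecificMutableRow:

    ```scala
    import org.apache.spark.rdd.RDD

    // Because the scan reuses one mutable object per partition, each
    // element must be copied before it reaches the sampling iterator.
    case class ReusedRow(var value: Int)

    def safeSample(rows: RDD[ReusedRow]): RDD[ReusedRow] =
      rows.map(_.copy()).sample(withReplacement = false, fraction = 0.1)
    ```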
    
    Author: Yanbo Liang <ya...@gmail.com>
    
    Closes #3827 from yanbohappy/spark-4963 and squashes the following commits:
    
    0912ca0 [Yanbo Liang] code format keep
    65c4e7c [Yanbo Liang] import file and clear annotation
    55c7c56 [Yanbo Liang] better output of test case
    cea7e2e [Yanbo Liang] SchemaRDD add copy operation before Sample operator
    e840829 [Yanbo Liang] HiveTableScan return mutable row with copy

commit 63729e175b4aa2ee25f05e2598785719c1e4acb7
Author: Michael Armbrust <mi...@databricks.com>
Date:   2015-01-10T22:25:45Z

    [SPARK-5187][SQL] Fix caching of tables with HiveUDFs in the WHERE clause
    
    Author: Michael Armbrust <mi...@databricks.com>
    
    Closes #3987 from marmbrus/hiveUdfCaching and squashes the following commits:
    
    8bca2fa [Michael Armbrust] [SPARK-5187][SQL] Fix caching of tables with HiveUDFs in the WHERE clause

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: New spark

Posted by damnMeddlingKid <gi...@git.apache.org>.
Github user damnMeddlingKid closed the pull request at:

    https://github.com/apache/spark/pull/9342




[GitHub] spark pull request: New spark

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9342#issuecomment-152015744
  
    Can one of the admins verify this patch?

