Posted to reviews@spark.apache.org by chlyzzo <gi...@git.apache.org> on 2017/07/18 07:39:42 UTC

[GitHub] spark pull request #18669: tfidf-new edit

GitHub user chlyzzo opened a pull request:

    https://github.com/apache/spark/pull/18669

    tfidf-new edit

    ## What changes were proposed in this pull request?
    
    I added a TfIdf.scala that computes a TF-IDF vector for each document. My use case is computing document similarity, so I first used Spark MLlib, with code like the following:
    ~~~scala
    val hashingTF = new HashingTF()
    val tf = hashingTF.transform(dataSeg)
    val idfIgnore = new IDF().fit(tf)
    val tfidfIgnore = idfIgnore.transform(tf)
    val data = docIds.zip(tfidfIgnore) // RDD[(String, Vector)]
    ~~~
    This works on a small dataset, though it takes a long time; on a larger dataset (250,000 documents) it did not finish: the job produced no result within an hour. The Spark configuration was:
    ~~~bash
    --driver-memory 8G
    --conf spark.yarn.executor.memoryOverhead=6144
    --conf spark.akka.frameSize=300
    --num-executors 20
    --executor-cores 5
    --executor-memory 10g
    ~~~
    So I implemented the TF-IDF computation myself and tested it on the same dataset (250,000 documents); it produces the result.
    ## How was this patch tested?
    
    I wrote TfIdf.scala, which computes per-document TF-IDF values and turns them into vectors; cosine similarity can then be computed on those vectors.
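
    For reference, a minimal sketch of what a hand-rolled TF-IDF plus cosine similarity could look like. This is not the TfIdf.scala in this PR; the input type `RDD[(String, Seq[String])]` of (docId, tokens) and all names are illustrative, and it assumes the vocabulary fits in driver memory:
    ~~~scala
    import org.apache.spark.rdd.RDD

    // Hedged sketch, not the TfIdf.scala added by this PR.
    def tfIdf(docs: RDD[(String, Seq[String])]): RDD[(String, Map[String, Double])] = {
      val numDocs = docs.count().toDouble
      // document frequency of each term (assumes the vocabulary fits on the driver)
      val df = docs.flatMap { case (_, terms) => terms.distinct.map((_, 1L)) }
        .reduceByKey(_ + _)
        .collectAsMap()
      val bcIdf = docs.sparkContext.broadcast(
        df.mapValues(n => math.log((numDocs + 1.0) / (n + 1.0))).toMap)
      docs.mapValues { terms =>
        val total = terms.size.toDouble
        terms.groupBy(identity).map { case (t, ts) =>
          t -> (ts.size / total) * bcIdf.value.getOrElse(t, 0.0)
        }
      }
    }

    // Cosine similarity between two sparse TF-IDF maps.
    def cosine(a: Map[String, Double], b: Map[String, Double]): Double = {
      val dot = a.iterator.map { case (t, w) => w * b.getOrElse(t, 0.0) }.sum
      val na = math.sqrt(a.values.map(w => w * w).sum)
      val nb = math.sqrt(b.values.map(w => w * w).sum)
      if (na == 0.0 || nb == 0.0) 0.0 else dot / (na * nb)
    }
    ~~~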
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark branch-2.2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18669.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18669
    
----
commit 7cb566abc27d41d5816dee16c6ecb749da2adf46
Author: Yuming Wang <wg...@gmail.com>
Date:   2017-05-05T10:31:59Z

    [SPARK-19660][SQL] Replace the deprecated property name fs.default.name with the newly introduced fs.defaultFS
    
    ## What changes were proposed in this pull request?
    
    Replace the deprecated property name `fs.default.name` with the newly introduced `fs.defaultFS`.
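
    For context only, a hedged illustration of the property switch in user code (the hostname and port are placeholders; this commit itself only updates Spark's own usages):

    ```scala
    import org.apache.hadoop.conf.Configuration

    // fs.defaultFS supersedes the deprecated fs.default.name
    val hadoopConf = new Configuration()
    hadoopConf.set("fs.defaultFS", "hdfs://namenode:8020")  // placeholder address
    ```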
    
    ## How was this patch tested?
    
    Existing tests
    
    Author: Yuming Wang <wg...@gmail.com>
    
    Closes #17856 from wangyum/SPARK-19660.
    
    (cherry picked from commit 37cdf077cd3f436f777562df311e3827b0727ce7)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit dbb54a7b39568cc9e8046a86113b98c3c69b7d11
Author: jyu00 <je...@us.ibm.com>
Date:   2017-05-05T10:36:51Z

    [SPARK-20546][DEPLOY] spark-class gets syntax error in posix mode
    
    ## What changes were proposed in this pull request?
    
    Updated spark-class to turn off posix mode so the process substitution doesn't cause a syntax error.
    
    ## How was this patch tested?
    
    Existing unit tests, manual spark-shell testing with posix mode on
    
    Author: jyu00 <je...@us.ibm.com>
    
    Closes #17852 from jyu00/master.
    
    (cherry picked from commit 5773ab121d5d7cbefeef17ff4ac6f8af36cc1251)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 1fa3c86a740e072957a2104dbd02ca3c158c508d
Author: Jarrett Meyer <ja...@gmail.com>
Date:   2017-05-05T15:30:42Z

    [SPARK-20613] Remove excess quotes in Windows executable
    
    ## What changes were proposed in this pull request?
    
    Quotes are already added to the RUNNER variable on line 54. There is no need to put quotes on line 67. If you do, you will get an error when launching Spark.
    
    '""C:\Program' is not recognized as an internal or external command, operable program or batch file.
    
    ## How was this patch tested?
    
    Tested manually on Windows 10.
    
    Author: Jarrett Meyer <ja...@gmail.com>
    
    Closes #17861 from jarrettmeyer/fix-windows-cmd.
    
    (cherry picked from commit b9ad2d1916af5091c8585d06ccad8219e437e2bc)
    Signed-off-by: Felix Cheung <fe...@apache.org>

commit f71aea6a0be6eda24623d8563d971687ecd04caf
Author: Yucai <yu...@intel.com>
Date:   2017-05-05T16:51:57Z

    [SPARK-20381][SQL] Add SQL metrics of numOutputRows for ObjectHashAggregateExec
    
    ## What changes were proposed in this pull request?
    
    ObjectHashAggregateExec is missing numOutputRows; add this metric for it.
    
    ## How was this patch tested?
    
    Added unit tests for the new metrics.
    
    Author: Yucai <yu...@intel.com>
    
    Closes #17678 from yucai/objectAgg_numOutputRows.
    
    (cherry picked from commit 41439fd52dd263b9f7d92e608f027f193f461777)
    Signed-off-by: Xiao Li <ga...@gmail.com>

commit 24fffacad709c553e0f24ae12a8cca3ab980af3c
Author: Shixiong Zhu <sh...@databricks.com>
Date:   2017-05-05T18:08:26Z

    [SPARK-20603][SS][TEST] Set default number of topic partitions to 1 to reduce the load
    
    ## What changes were proposed in this pull request?
    
    I checked the logs of https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.2-test-maven-hadoop-2.7/47/ and found it took several seconds to create the Kafka internal topic `__consumer_offsets`. As Kafka creates this topic lazily, the topic creation happens in the first test, `deserialization of initial offset with Spark 2.1.0`, and causes it to time out.
    
    This PR changes `offsets.topic.num.partitions` from the default value 50 to 1 to make creating `__consumer_offsets` (50 partitions -> 1 partition) much faster.
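
    For reference, a hedged sketch of the override in test broker properties (the surrounding test harness setup is assumed; only the property name and value come from this PR):

    ```scala
    import java.util.Properties

    // __consumer_offsets is created lazily; 1 partition instead of the default 50
    // makes that first creation fast in tests.
    val brokerProps = new Properties()
    brokerProps.setProperty("offsets.topic.num.partitions", "1")
    ```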
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <sh...@databricks.com>
    
    Closes #17863 from zsxwing/fix-kafka-flaky-test.
    
    (cherry picked from commit bd5788287957d8610a6d19c273b75bd4cdd2d166)
    Signed-off-by: Shixiong Zhu <sh...@databricks.com>

commit f59c74a9460b0db4e6c3ecbe872e2eaadc43e2cc
Author: Michael Patterson <ma...@gmail.com>
Date:   2017-04-23T02:58:54Z

    [SPARK-20132][DOCS] Add documentation for column string functions
    
    ## What changes were proposed in this pull request?
    Add docstrings to column.py for the Column functions `rlike`, `like`, `startswith`, and `endswith`. Pass these docstrings through `_bin_op`
    
    There may be a better place to put the docstrings. I put them immediately above the Column class.
    
    ## How was this patch tested?
    
    I ran `make html` on my local computer to rebuild the documentation and verified that the HTML pages displayed the docstrings correctly. I ran `dev-tests`, and the formatting tests passed; however, my mvn build didn't work, I think due to issues on my machine.
    
    These docstrings are my original work and are freely licensed.
    
    davies has done the most recent work reorganizing `_bin_op`.
    
    Author: Michael Patterson <ma...@gmail.com>
    
    Closes #17469 from map222/patterson-documentation.

commit 1d9b7a74a839021814ab28d3eba3636c64483130
Author: Juliusz Sompolski <ju...@databricks.com>
Date:   2017-05-05T22:31:06Z

    [SPARK-20616] RuleExecutor logDebug of batch results should show diff to start of batch
    
    ## What changes were proposed in this pull request?
    
    Due to a likely typo, the logDebug message printing the diff of query plans shows a diff to the initial plan, not a diff to the start of the batch.
    
    ## How was this patch tested?
    
    Now the debug message prints the diff between start and end of batch.
    
    Author: Juliusz Sompolski <ju...@databricks.com>
    
    Closes #17875 from juliuszsompolski/SPARK-20616.
    
    (cherry picked from commit 5d75b14bf0f4c1f0813287efaabf49797908ed55)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

commit 423a78625620523ab6a51b2274548a985fc18ed0
Author: zero323 <ze...@users.noreply.github.com>
Date:   2017-04-27T07:34:20Z

    [SPARK-20208][DOCS][FOLLOW-UP] Add FP-Growth to SparkR programming guide
    
    ## What changes were proposed in this pull request?
    
    Add `spark.fpGrowth` to SparkR programming guide.
    
    ## How was this patch tested?
    
    Manual tests.
    
    Author: zero323 <ze...@users.noreply.github.com>
    
    Closes #17775 from zero323/SPARK-20208-FOLLOW-UP.
    
    (cherry picked from commit ba7666274e71f1903e5050a5e53fbdcd21debde5)
    Signed-off-by: Felix Cheung <fe...@apache.org>

commit 048e9890ca6e67c40d298b5dda20742790f5530c
Author: Felix Cheung <fe...@hotmail.com>
Date:   2017-05-07T20:10:10Z

    [SPARK-20543][SPARKR][FOLLOWUP] Don't skip tests on AppVeyor
    
    ## What changes were proposed in this pull request?
    
    add environment
    
    ## How was this patch tested?
    
    wait for appveyor run
    
    Author: Felix Cheung <fe...@hotmail.com>
    
    Closes #17878 from felixcheung/appveyorrcran.
    
    (cherry picked from commit 7087e01194964a1aad0b45bdb41506a17100eacf)
    Signed-off-by: Felix Cheung <fe...@apache.org>

commit 6c5b7e106895302a87cf6522d3c64c3badac699f
Author: Felix Cheung <fe...@hotmail.com>
Date:   2017-05-08T06:10:18Z

    [SPARK-20626][SPARKR] address date test warning with timezone on windows
    
    ## What changes were proposed in this pull request?
    
    set timezone on windows
    
    ## How was this patch tested?
    
    unit test, AppVeyor
    
    Author: Felix Cheung <fe...@hotmail.com>
    
    Closes #17892 from felixcheung/rtimestamptest.
    
    (cherry picked from commit c24bdaab5a234d18b273544cefc44cc4005bf8fc)
    Signed-off-by: Felix Cheung <fe...@apache.org>

commit d8a5a0d3420abbb911d8a80dc7165762eb08d779
Author: Wayne Zhang <ac...@uber.com>
Date:   2017-05-08T06:16:30Z

    [SPARKR][DOC] fix typo in vignettes
    
    ## What changes were proposed in this pull request?
    Fix typo in vignettes
    
    Author: Wayne Zhang <ac...@uber.com>
    
    Closes #17884 from actuaryzhang/typo.
    
    (cherry picked from commit 2fdaeb52bbe2ed1a9127ac72917286e505303c85)
    Signed-off-by: Felix Cheung <fe...@apache.org>

commit 7b9d05ad00455daa53ae4ef1a602a6c64c2c95a4
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-05-08T10:45:00Z

    [SPARK-20596][ML][TEST] Consolidate and improve ALS recommendAll test cases
    
    Existing test cases for `recommendForAllX` methods (added in [SPARK-19535](https://issues.apache.org/jira/browse/SPARK-19535)) test `k < num items` and `k = num items`. Technically we should also test that `k > num items` returns the same results as `k = num items`.
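
    A hedged sketch of that extra case (`model` and `numItems` are assumed from the existing suite, not defined here):

    ```scala
    // k larger than the number of items should return the same recommendations as k == numItems
    val exact = model.recommendForAllUsers(numItems).collect().toSet
    val over  = model.recommendForAllUsers(numItems + 10).collect().toSet
    assert(exact == over)
    ```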
    
    ## How was this patch tested?
    
    Updated existing unit tests.
    
    Author: Nick Pentreath <ni...@za.ibm.com>
    
    Closes #17860 from MLnick/SPARK-20596-als-rec-tests.
    
    (cherry picked from commit 58518d070777fc0665c4d02bad8adf910807df98)
    Signed-off-by: Nick Pentreath <ni...@za.ibm.com>

commit 23681e9ca0042328f93962701d19ca371727b0b7
Author: Xianyang Liu <xi...@intel.com>
Date:   2017-05-08T17:25:24Z

    [SPARK-20621][DEPLOY] Delete deprecated config parameter in 'spark-env.sh'
    
    ## What changes were proposed in this pull request?
    
    Currently, `spark.executor.instances` is deprecated in `spark-env.sh`, because we suggest configuring it in `spark-defaults.conf` or another config file. This parameter also has no effect even if you set it in `spark-env.sh`, so remove it in this patch.
    
    ## How was this patch tested?
    
    Existing tests.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Xianyang Liu <xi...@intel.com>
    
    Closes #17881 from ConeyLiu/deprecatedParam.
    
    (cherry picked from commit aeb2ecc0cd898f5352df0a04be1014b02ea3e20e)
    Signed-off-by: Marcelo Vanzin <va...@cloudera.com>

commit 4179ffc031a0dbca6a93255c673de800ce7393fe
Author: Hossein <ho...@databricks.com>
Date:   2017-05-08T21:48:11Z

    [SPARK-20661][SPARKR][TEST] SparkR tableNames() test fails
    
    ## What changes were proposed in this pull request?
    Cleaning existing temp tables before running tableNames tests
    
    ## How was this patch tested?
    SparkR Unit tests
    
    Author: Hossein <ho...@databricks.com>
    
    Closes #17903 from falaki/SPARK-20661.
    
    (cherry picked from commit 2abfee18b6511482b916c36f00bf3abf68a59e19)
    Signed-off-by: Yin Huai <yh...@databricks.com>

commit 54e07434968624dbb0fb80773356e614b954e52f
Author: Felix Cheung <fe...@hotmail.com>
Date:   2017-05-09T05:49:40Z

    [SPARK-20661][SPARKR][TEST][FOLLOWUP] SparkR tableNames() test fails
    
    ## What changes were proposed in this pull request?
    
    Change it to check for relative count like in this test https://github.com/apache/spark/blame/master/R/pkg/inst/tests/testthat/test_sparkSQL.R#L3355 for catalog APIs
    
    ## How was this patch tested?
    
    unit tests; this needs to be combined with another commit containing a SQL change to verify
    
    Author: Felix Cheung <fe...@hotmail.com>
    
    Closes #17905 from felixcheung/rtabletests.
    
    (cherry picked from commit b952b44af4d243f1e3ad88bccf4af7d04df3fc81)
    Signed-off-by: Felix Cheung <fe...@apache.org>

commit 72fca9a0a7a6dd2ab7c338fab9666b51cd981cce
Author: Peng <pe...@intel.com>
Date:   2017-05-09T08:05:49Z

    [SPARK-11968][MLLIB] Optimize MLLIB ALS recommendForAll
    
    The recommendForAll method of MLlib ALS is very slow; GC is a key problem with the current implementation.
    The task keeps its temporary result in the following array:
    
        val output = new Array[(Int, (Int, Double))](m * n)
    
    with m = n = 4096 (the default block size, with no way to set it), so the output is about 4096 * 4096 * (4 + 4 + 8) bytes = 256 MB. This large allocation causes serious GC pressure and frequently OOMs.
    
    Actually, we don't need to keep all of the temporary results. Suppose we recommend the top K products for each user (K around 10 or 20); then we only need about 4096 * K * (4 + 4 + 8) bytes for the temporary result.
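
    To illustrate the bounded-memory idea only (this is a self-contained sketch, not the patch; the patch works on ALS's internal blocked representation):

    ```scala
    import scala.collection.mutable

    // Keep only the top-k scored items per user instead of materializing all m * n pairs.
    def topKPerUser(
        scores: Iterator[(Int, (Int, Double))],  // (userId, (itemId, score))
        k: Int): Iterator[(Int, Array[(Int, Double)])] = {
      // with this ordering the queue's "max" element is the lowest score, so dequeue() evicts it
      val byNegScore: Ordering[(Int, Double)] = Ordering.by[(Int, Double), Double](p => -p._2)
      val queues = mutable.Map.empty[Int, mutable.PriorityQueue[(Int, Double)]]
      scores.foreach { case (user, rec) =>
        val q = queues.getOrElseUpdate(user, mutable.PriorityQueue.empty[(Int, Double)](byNegScore))
        q.enqueue(rec)
        if (q.size > k) q.dequeue()  // evict the lowest-scoring entry
      }
      queues.iterator.map { case (user, q) => (user, q.toArray.sortBy(-_._2)) }
    }
    ```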
    
    Test environment:
    3 workers, each with 10 cores, 30 GB memory, and 1 executor.
    Data: 480,000 users and 17,000 items.
    
    BlockSize:       1024  2048  4096  8192
    Old method:      245s  332s  488s  OOM
    This solution:   121s  118s  117s  120s
    
    Tested with the existing unit tests.
    
    Author: Peng <pe...@intel.com>
    Author: Peng Meng <pe...@intel.com>
    
    Closes #17742 from mpjlu/OptimizeAls.
    
    (cherry picked from commit 8079424763c2043264f30a6898ce964379bd9b56)
    Signed-off-by: Nick Pentreath <ni...@za.ibm.com>

commit ca3f7edbad6a2e7fcd1c1d3dbd1a522cd0d7c476
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-05-09T08:13:15Z

    [SPARK-20587][ML] Improve performance of ML ALS recommendForAll
    
    This PR is a `DataFrame` version of #17742 for [SPARK-11968](https://issues.apache.org/jira/browse/SPARK-11968), for improving the performance of `recommendAll` methods.
    
    ## How was this patch tested?
    
    Existing unit tests.
    
    Author: Nick Pentreath <ni...@za.ibm.com>
    
    Closes #17845 from MLnick/ml-als-perf.
    
    (cherry picked from commit 10b00abadf4a3473332eef996db7b66f491316f2)
    Signed-off-by: Nick Pentreath <ni...@za.ibm.com>

commit 4bbfad44e426365ad9f4941d68c110523b17ea6d
Author: Jon McLean <jo...@atsid.com>
Date:   2017-05-09T08:47:50Z

    [SPARK-20615][ML][TEST] SparseVector.argmax throws IndexOutOfBoundsException
    
    ## What changes were proposed in this pull request?
    
    Added a check for the number of defined values. Previously the argmax function assumed that at least one value was defined if the vector size was greater than zero.
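
    A hedged sketch of the kind of guard involved (names and return conventions are approximate, not the exact patch):

    ```scala
    // argmax over a sparse vector must not assume that any values are stored
    def argmaxSketch(size: Int, indices: Array[Int], values: Array[Double]): Int = {
      if (size == 0) {
        -1    // empty vector: no valid index
      } else if (values.isEmpty) {
        0     // non-empty vector whose entries are all implicit zeros: index 0 is a max
      } else {
        // simplified scan over the stored values (the real code also weighs implicit zeros)
        indices(values.indexOf(values.max))
      }
    }
    ```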
    
    ## How was this patch tested?
    
    Tests were added to the existing VectorsSuite to cover this case.
    
    Author: Jon McLean <jo...@atsid.com>
    
    Closes #17877 from jonmclean/vectorArgmaxIndexBug.
    
    (cherry picked from commit be53a78352ae7c70d8a07d0df24574b3e3129b4a)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 4b7aa0b1dbd85e2238acba45e8f94c097358fb72
Author: Yanbo Liang <yb...@gmail.com>
Date:   2017-05-09T09:30:37Z

    [SPARK-20606][ML] ML 2.2 QA: Remove deprecated methods for ML
    
    ## What changes were proposed in this pull request?
    Remove ML methods we deprecated in 2.1.
    
    ## How was this patch tested?
    Existing tests.
    
    Author: Yanbo Liang <yb...@gmail.com>
    
    Closes #17867 from yanboliang/spark-20606.
    
    (cherry picked from commit b8733e0ad9f5a700f385e210450fd2c10137293e)
    Signed-off-by: Yanbo Liang <yb...@gmail.com>

commit b3309676bb83a80d38b916066d046866a6f42ef0
Author: Xiao Li <ga...@gmail.com>
Date:   2017-05-09T12:10:50Z

    [SPARK-20667][SQL][TESTS] Cleanup the cataloged metadata after completing the package of sql/core and sql/hive
    
    ## What changes were proposed in this pull request?
    
    So far, we do not drop all the cataloged objects after each package. Sometimes, we might hit strange test case errors because the previous test suite did not drop the cataloged/temporary objects (tables/functions/database). At least, we can first clean up the environment when completing the package of `sql/core` and `sql/hive`.
    
    ## How was this patch tested?
    N/A
    
    Author: Xiao Li <ga...@gmail.com>
    
    Closes #17908 from gatorsmile/reset.
    
    (cherry picked from commit 0d00c768a860fc03402c8f0c9081b8147c29133e)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit 272d2a10d70588e1f80cc6579d4ec3c44b5bbfc2
Author: Takeshi Yamamuro <ya...@apache.org>
Date:   2017-05-09T12:22:51Z

    [SPARK-20311][SQL] Support aliases for table value functions
    
    ## What changes were proposed in this pull request?
    This PR adds parsing rules to support aliases for table value functions.
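
    An illustrative use of the syntax this enables, assuming a `spark` session (the column name is an example):

    ```scala
    // alias the table value function's output and refer to it by the alias
    spark.sql("SELECT t.x FROM range(5) t(x)").show()
    ```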
    
    ## How was this patch tested?
    Added tests in `PlanParserSuite`.
    
    Author: Takeshi Yamamuro <ya...@apache.org>
    
    Closes #17666 from maropu/SPARK-20311.
    
    (cherry picked from commit 714811d0b5bcb5d47c39782ff74f898d276ecc59)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit 08e1b78f01955c7151d9e984d392d45deced6e34
Author: Wenchen Fan <we...@databricks.com>
Date:   2017-05-09T16:09:35Z

    [SPARK-20548][FLAKY-TEST] share one REPL instance among REPL test cases
    
    `ReplSuite.newProductSeqEncoder with REPL defined class` was flaky and frequently threw OOM exceptions. By analyzing the heap dump, we found the reason: in each test case of `ReplSuite` we create a REPL instance, which creates a classloader and loads a lot of classes related to `SparkContext`. For more details, see https://github.com/apache/spark/pull/17833#issuecomment-298711435.
    
    In this PR, we create a new test suite, `SingletonReplSuite`, which shares one REPL instances among all the test cases. Then we move most of the tests from `ReplSuite` to `SingletonReplSuite`, to avoid creating a lot of REPL instances and reduce memory footprint.
    
    test only change
    
    Author: Wenchen Fan <we...@databricks.com>
    
    Closes #17844 from cloud-fan/flaky-test.
    
    (cherry picked from commit f561a76b2f895dea52f228a9376948242c3331ad)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit 73aa23b8ef64960e7f171aa07aec396667a2339d
Author: Reynold Xin <rx...@databricks.com>
Date:   2017-05-09T16:24:28Z

    [SPARK-20674][SQL] Support registering UserDefinedFunction as named UDF
    
    ## What changes were proposed in this pull request?
    For some reason we don't have an API to register a UserDefinedFunction as a named UDF. It is a no-brainer to add one, in addition to the existing register functions we have.
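
    An illustrative use of the new API, assuming a `spark` session (function name and logic are examples):

    ```scala
    import org.apache.spark.sql.functions.udf

    // register an existing UserDefinedFunction under a name so it can be called from SQL
    val plusOne = udf((x: Int) => x + 1)
    spark.udf.register("plus_one", plusOne)
    spark.sql("SELECT plus_one(41)").show()
    ```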
    
    ## How was this patch tested?
    Added a test case in UDFSuite for the new API.
    
    Author: Reynold Xin <rx...@databricks.com>
    
    Closes #17915 from rxin/SPARK-20674.
    
    (cherry picked from commit d099f414d2cb53f5a61f6e77317c736be6f953a0)
    Signed-off-by: Xiao Li <ga...@gmail.com>

commit c7bd909f67209b4d1354c3d5b0a0fb1d4e28f205
Author: Sean Owen <so...@cloudera.com>
Date:   2017-05-09T17:22:23Z

    [SPARK-19876][BUILD] Move Trigger.java to java source hierarchy
    
    ## What changes were proposed in this pull request?
    
    Simply moves `Trigger.java` to `src/main/java` from `src/main/scala`.
    See https://github.com/apache/spark/pull/17219
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Sean Owen <so...@cloudera.com>
    
    Closes #17921 from srowen/SPARK-19876.2.
    
    (cherry picked from commit 25ee816e090c42f0e35be2d2cb0f8ec60726317c)
    Signed-off-by: Herman van Hovell <hv...@databricks.com>

commit 9e8d23b3a2f99985ffb3c4eb67ac0a2774fa5b02
Author: Holden Karau <ho...@us.ibm.com>
Date:   2017-05-09T18:25:29Z

    [SPARK-20627][PYSPARK] Drop the hadoop distribution name from the Python version
    
    ## What changes were proposed in this pull request?
    
    Drop the hadoop distribution name from the Python version (PEP 440 - https://www.python.org/dev/peps/pep-0440/). We've been using the local version string to disambiguate between different hadoop versions packaged with PySpark, but PEP 440 states that local versions should not be used when publishing upstream. Since we no longer make PySpark pip packages for different hadoop versions, we can simply drop the hadoop information. If at a later point we need to start publishing different hadoop versions, we can look at making different packages or similar.
    
    ## How was this patch tested?
    
    Ran `make-distribution` locally
    
    Author: Holden Karau <ho...@us.ibm.com>
    
    Closes #17885 from holdenk/SPARK-20627-remove-pip-local-version-string.
    
    (cherry picked from commit 1b85bcd9298cf84dd746fe8e91ab0b0df69ef17e)
    Signed-off-by: Holden Karau <ho...@us.ibm.com>

commit d191b962dc81c015fa92a38d882a8c7ea620ef06
Author: Yin Huai <yh...@databricks.com>
Date:   2017-05-09T21:47:45Z

    Revert "[SPARK-20311][SQL] Support aliases for table value functions"
    
    This reverts commit 714811d0b5bcb5d47c39782ff74f898d276ecc59.

commit 7600a7ab65777a59f3a33edef40328b6a5d864ef
Author: uncleGen <hu...@gmail.com>
Date:   2017-05-09T22:08:09Z

    [SPARK-20373][SQL][SS] Batch queries with `Dataset/DataFrame.withWatermark()` do not execute
    
    ## What changes were proposed in this pull request?
    
    Any Dataset/DataFrame batch query with the operation `withWatermark` does not execute because the batch planner does not have any rule to explicitly handle the EventTimeWatermark logical plan.
    The right solution is to simply remove the plan node, as the watermark should not affect any batch query in any way.
    
    Changes:
    - In this PR, we add a new rule `EliminateEventTimeWatermark` to check if we need to ignore the event time watermark. We will ignore watermark in any batch query.
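
    A hedged sketch of the shape of such a rule (Catalyst internals assumed; signatures approximate, not the exact patch):

    ```scala
    import org.apache.spark.sql.catalyst.plans.logical.{EventTimeWatermark, LogicalPlan}
    import org.apache.spark.sql.catalyst.rules.Rule

    // In a batch (non-streaming) plan the watermark node carries no meaning,
    // so it is simply replaced by its child.
    object EliminateEventTimeWatermarkSketch extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
        case EventTimeWatermark(_, _, child) if !child.isStreaming => child
      }
    }
    ```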
    
    Depends upon:
    - [SPARK-20672](https://issues.apache.org/jira/browse/SPARK-20672). We cannot add this rule to the analyzer directly, because the streaming query will be copied to `triggerLogicalPlan` in every trigger, and the rule would be applied to `triggerLogicalPlan` mistakenly.
    
    Others:
    - A typo fix in example.
    
    ## How was this patch tested?
    
    Added a new unit test.
    
    Author: uncleGen <hu...@gmail.com>
    
    Closes #17896 from uncleGen/SPARK-20373.
    
    (cherry picked from commit c0189abc7c6ddbecc1832d2ff0cfc5546a010b60)
    Signed-off-by: Shixiong Zhu <sh...@databricks.com>

commit 6a996b36283dcd22ff7aa38968a80f575d2f151e
Author: Yuming Wang <wg...@gmail.com>
Date:   2017-05-10T02:45:00Z

    [SPARK-17685][SQL] Make SortMergeJoinExec's currentVars null when calling createJoinKey
    
    ## What changes were proposed in this pull request?
    
    The following SQL query causes an `IndexOutOfBoundsException` when `LIMIT > 1310720`:
    ```sql
    CREATE TABLE tab1(int int, int2 int, str string);
    CREATE TABLE tab2(int int, int2 int, str string);
    INSERT INTO tab1 values(1,1,'str');
    INSERT INTO tab1 values(2,2,'str');
    INSERT INTO tab2 values(1,1,'str');
    INSERT INTO tab2 values(2,3,'str');
    
    SELECT
      count(*)
    FROM
      (
        SELECT t1.int, t2.int2
        FROM (SELECT * FROM tab1 LIMIT 1310721) t1
        INNER JOIN (SELECT * FROM tab2 LIMIT 1310721) t2
        ON (t1.int = t2.int AND t1.int2 = t2.int2)
      ) t;
    ```
    
    This pull request fixes the issue.
    
    ## How was this patch tested?
    
    unit tests
    
    Author: Yuming Wang <wg...@gmail.com>
    
    Closes #17920 from wangyum/SPARK-17685.
    
    (cherry picked from commit 771abeb46f637592aba2e63db2ed05b6cabfd0be)
    Signed-off-by: Herman van Hovell <hv...@databricks.com>

commit 7b6f3a118e973216264bbf356af2bb1e7870466e
Author: hyukjinkwon <gu...@gmail.com>
Date:   2017-05-10T05:44:47Z

    [SPARK-20590][SQL] Use Spark internal datasource if multiples are found for the same shorten name
    
    ## What changes were proposed in this pull request?
    
    One of the common usability problems around reading data in spark (particularly CSV) is that there can often be a conflict between different readers in the classpath.
    
    As an example, if someone launches a 2.x spark shell with the spark-csv package in the classpath, Spark currently fails in an extremely unfriendly way (see databricks/spark-csv#367):
    
    ```bash
    ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
    scala> val df = spark.read.csv("/foo/bar.csv")
    java.lang.RuntimeException: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), please specify the fully qualified class name.
      at scala.sys.package$.error(package.scala:27)
      at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:574)
      at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:85)
      at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:85)
      at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:295)
      at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
      at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
      at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)
      ... 48 elided
    ```
    
    This PR proposes a simple fix for this error: when multiple sources match, pick the internal datasource if there is exactly one (the datasource whose class name has the "org.apache.spark" prefix).
    
    ```scala
    scala> spark.range(1).write.format("csv").mode("overwrite").save("/tmp/abc")
    17/05/10 09:47:44 WARN DataSource: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat,
    com.databricks.spark.csv.DefaultSource15), defaulting to the internal datasource (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat).
    ```
    
    ```scala
    scala> spark.range(1).write.format("Csv").mode("overwrite").save("/tmp/abc")
    17/05/10 09:47:52 WARN DataSource: Multiple sources found for Csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat,
    com.databricks.spark.csv.DefaultSource15), defaulting to the internal datasource (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat).
    ```
    
    ## How was this patch tested?
    
    Manually tested as below:
    
    ```bash
    ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
    ```
    
    ```scala
    spark.sparkContext.setLogLevel("WARN")
    ```
    
    **positive cases**:
    
    ```scala
    scala> spark.range(1).write.format("csv").mode("overwrite").save("/tmp/abc")
    17/05/10 09:47:44 WARN DataSource: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat,
    com.databricks.spark.csv.DefaultSource15), defaulting to the internal datasource (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat).
    ```
    
    ```scala
    scala> spark.range(1).write.format("Csv").mode("overwrite").save("/tmp/abc")
    17/05/10 09:47:52 WARN DataSource: Multiple sources found for Csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat,
    com.databricks.spark.csv.DefaultSource15), defaulting to the internal datasource (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat).
    ```
    
    (newlines were inserted for readability).
    
    ```scala
    scala> spark.range(1).write.format("com.databricks.spark.csv").mode("overwrite").save("/tmp/abc")
    ```
    
    ```scala
    scala> spark.range(1).write.format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat").mode("overwrite").save("/tmp/abc")
    ```
    
    **negative cases**:
    
    ```scala
    scala> spark.range(1).write.format("com.databricks.spark.csv.CsvRelation").save("/tmp/abc")
    java.lang.InstantiationException: com.databricks.spark.csv.CsvRelation
    ...
    ```
    
    ```scala
    scala> spark.range(1).write.format("com.databricks.spark.csv.CsvRelatio").save("/tmp/abc")
    java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv.CsvRelatio. Please find packages at http://spark.apache.org/third-party-projects.html
    ...
    ```
    
    Author: hyukjinkwon <gu...@gmail.com>
    
    Closes #17916 from HyukjinKwon/datasource-detect.
    
    (cherry picked from commit 3d2131ab4ddead29601fb3c597b798202ac25fdd)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit ef50a954882fa1911f7ede3f0aefc8fcf09c6059
Author: Josh Rosen <jo...@databricks.com>
Date:   2017-05-10T06:36:36Z

    [SPARK-20686][SQL] PropagateEmptyRelation incorrectly handles aggregate without grouping
    
    ## What changes were proposed in this pull request?
    
    The query
    
    ```
    SELECT 1 FROM (SELECT COUNT(*) WHERE FALSE) t1
    ```
    
    should return a single row of output because the subquery is an aggregate without a group-by and thus should return a single row. However, Spark incorrectly returns zero rows.
    
    This is caused by SPARK-16208 / #13906, a patch which added an optimizer rule to propagate EmptyRelation through operators. The logic for handling aggregates is wrong: it checks whether aggregate expressions are non-empty for deciding whether the output should be empty, whereas it should be checking grouping expressions instead:
    
    An aggregate with non-empty grouping expression will return one output row per group. If the input to the grouped aggregate is empty then all groups will be empty and thus the output will be empty. It doesn't matter whether the aggregation output columns include aggregate expressions since that won't affect the number of output rows.
    
    If the grouping expressions are empty, however, then the aggregate will always produce a single output row and thus we cannot propagate the EmptyRelation.
    
    The current implementation is incorrect and also misses an optimization opportunity by not propagating EmptyRelation in the case where a grouped aggregate has aggregate expressions (in other words, `SELECT COUNT(*) from emptyRelation GROUP BY x` would _not_ be optimized to `EmptyRelation` in the old code, even though it safely could be).
    
    This patch resolves this issue by modifying `PropagateEmptyRelation` to consider only the presence/absence of grouping expressions, not the aggregate functions themselves, when deciding whether to propagate EmptyRelation.
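
    A hedged sketch of the corrected condition (Catalyst internals assumed; the real rule covers more operators and cases):

    ```scala
    import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LocalRelation, LogicalPlan}
    import org.apache.spark.sql.catalyst.rules.Rule

    // An Aggregate may only be collapsed to an empty relation when it actually groups;
    // a global aggregate over empty input still produces exactly one row.
    object PropagateEmptyAggregateSketch extends Rule[LogicalPlan] {
      private def isEmptyRelation(plan: LogicalPlan): Boolean = plan match {
        case l: LocalRelation => l.data.isEmpty
        case _ => false
      }

      override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
        case agg: Aggregate if isEmptyRelation(agg.child) && agg.groupingExpressions.nonEmpty =>
          LocalRelation(agg.output)
      }
    }
    ```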
    
    ## How was this patch tested?
    
    - Added end-to-end regression tests in `SQLQueryTest`'s `group-by.sql` file.
    - Updated unit tests in `PropagateEmptyRelationSuite`.
    
    Author: Josh Rosen <jo...@databricks.com>
    
    Closes #17929 from JoshRosen/fix-PropagateEmptyRelation.
    
    (cherry picked from commit a90c5cd8226146a58362732171b92cb99a7bc4c7)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18669: tfidf-new edit

Posted by chlyzzo <gi...@git.apache.org>.
Github user chlyzzo closed the pull request at:

    https://github.com/apache/spark/pull/18669




[GitHub] spark issue #18669: tfidf-new edit

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18669
  
    Can one of the admins verify this patch?




[GitHub] spark issue #18669: tfidf-new edit

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/18669
  
    @chlyzzo  close this




[GitHub] spark issue #18669: tfidf-new edit

Posted by chlyzzo <gi...@git.apache.org>.
Github user chlyzzo commented on the issue:

    https://github.com/apache/spark/pull/18669
  
    Closed.
    ----- Original Message -----
    From: Sean Owen <no...@github.com>
    To: apache/spark <sp...@noreply.github.com>
    Cc: chlyzzo <ri...@sina.cn>, Mention <me...@noreply.github.com>
    Subject: Re: [apache/spark] tfidf-new edit (#18669)
    Date: 2017-07-18 15:41
    
    @chlyzzo  close this
    

