Posted to reviews@spark.apache.org by nischay21 <gi...@git.apache.org> on 2017/03/10 07:43:42 UTC

[GitHub] spark pull request #17239: Using map function in spark for huge operation

GitHub user nischay21 opened a pull request:

    https://github.com/apache/spark/pull/17239

    Using map function in spark for huge operation

    We need to compute a distance measure such as Jaccard similarity over a very large Dataset in Spark.
    We are facing a couple of issues; kindly point us in the right direction.
    
    Issue 1.
    
    		import java.util.Arrays;
    		import java.util.List;
    		import org.apache.spark.api.java.function.MapFunction;
    		import org.apache.spark.sql.Dataset;
    		import org.apache.spark.sql.Encoders;
    		import org.apache.spark.sql.Row;
    		import org.apache.spark.sql.RowFactory;
    		import org.apache.spark.sql.types.DataTypes;
    		import org.apache.spark.sql.types.Metadata;
    		import org.apache.spark.sql.types.StructField;
    		import org.apache.spark.sql.types.StructType;
    		import info.debatty.java.stringsimilarity.Jaccard;

    		// Sample dataset creation
    		List<Row> data = Arrays.asList(
    			RowFactory.create("Hi I heard about Spark", "Hi I Know about Spark"),
    			RowFactory.create("I wish Java could use case classes", "I wish C# could use case classes"),
    			RowFactory.create("Logistic,regression,models,are,neat", "Logistic,regression,models,are,neat"));

    		StructType schema = new StructType(new StructField[] {
    			new StructField("label", DataTypes.StringType, false, Metadata.empty()),
    			new StructField("sentence", DataTypes.StringType, false, Metadata.empty()) });
    		Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);

    		// Distance measure object creation (on the driver)
    		Jaccard jaccard = new Jaccard();

    		// Apply the distance measure to each element of the Dataset
    		Dataset<String> sentenceDataFrame1 = sentenceDataFrame.map(
    			(MapFunction<Row, String>) row ->
    				"Name: " + jaccard.similarity(row.getString(0), row.getString(1)),
    			Encoders.STRING());
    		sentenceDataFrame1.show();
    
    There are no compile-time errors, but at run time we get an exception: org.apache.spark.SparkException: Task not serializable.
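    
    One workaround we are considering (a sketch only, not yet verified on the full dataset; it assumes the problem is that the lambda captures the driver-side jaccard object, or its enclosing class, which is not serializable) is to construct the Jaccard instance inside the map function so that the closure captures nothing from the driver. It reuses the schema and sentenceDataFrame defined above.
    
    		// Sketch: the Jaccard object is created inside the lambda, so it is
    		// instantiated on the executors instead of being serialized from the driver.
    		Dataset<String> scored = sentenceDataFrame.map(
    			(MapFunction<Row, String>) row -> {
    				Jaccard jaccard = new Jaccard();
    				return "Score: " + jaccard.similarity(row.getString(0), row.getString(1));
    			},
    			Encoders.STRING());
    		scored.show();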
    
    Issue 2.
    Moreover, we need to find which pair has the highest score, which requires declaring some variables, and we also need to perform other calculations; we are facing a lot of difficulty with this. Even when we declare a simple counter variable inside the map block, we cannot capture the incremented value. If we declare it outside the map block, we get compile-time errors, because a local variable captured by the lambda must be effectively final.
    		
    		
    		int counter = 0;
    		Dataset<String> sentenceDataFrame1 = sentenceDataFrame.map(
    			(MapFunction<Row, String>) row -> {
    				System.out.println("Name: " + row.getString(1));
    				//int counter = 0;
    				counter++;   // compile-time error: the captured local must be effectively final
    				System.out.println("Counter: " + counter);
    				return counter + "";
    			},
    			Encoders.STRING());
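    
    One direction we are considering for the counter (a sketch only; it assumes counting processed rows is the goal, and the accumulator name "rowCounter" is just an example) is to use a LongAccumulator registered on the SparkContext, since driver-side local variables cannot be mutated from code running on executors, and to read its value only after an action has run.
    
    		import org.apache.spark.util.LongAccumulator;

    		LongAccumulator rowCounter = spark.sparkContext().longAccumulator("rowCounter");

    		Dataset<String> counted = sentenceDataFrame.map(
    			(MapFunction<Row, String>) row -> {
    				rowCounter.add(1);   // accumulators may be updated from tasks on executors
    				return "Name: " + row.getString(1);
    			},
    			Encoders.STRING());

    		counted.show();   // run an action first; accumulator updates arrive as tasks finish
    		System.out.println("Rows processed: " + rowCounter.value());
    
    For finding the pair with the highest score, another sketch (the UDF name "jaccardScore" is our own, and the column names follow the schema defined above) is to keep the score as a column and let Spark sort it, instead of tracking a "best so far" value in driver-side variables:
    
    		import static org.apache.spark.sql.functions.callUDF;
    		import static org.apache.spark.sql.functions.col;
    		import org.apache.spark.sql.api.java.UDF2;

    		// Register a UDF that wraps the Jaccard similarity, then sort by the score column.
    		spark.udf().register("jaccardScore",
    			(UDF2<String, String, Double>) (a, b) -> new Jaccard().similarity(a, b),
    			DataTypes.DoubleType);

    		Row bestPair = sentenceDataFrame
    			.withColumn("score", callUDF("jaccardScore", col("label"), col("sentence")))
    			.orderBy(col("score").desc())
    			.first();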
    			
    Please give us some direction.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark branch-2.1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17239.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17239
    
----
commit 1cafc76ea1e9eef40b24060d1cd7c4aaf9f16a49
Author: Shixiong Zhu <sh...@databricks.com>
Date:   2016-12-09T01:58:44Z

    [SPARK-18774][CORE][SQL] Ignore non-existing files when ignoreCorruptFiles is enabled (branch 2.1)
    
    ## What changes were proposed in this pull request?
    
    Backport #16203 to branch 2.1.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <sh...@databricks.com>
    
    Closes #16216 from zsxwing/SPARK-18774-2.1.

commit ef5646b4c6792a96e85d1dd4bb3103ba8306949b
Author: Shivaram Venkataraman <sh...@cs.berkeley.edu>
Date:   2016-12-09T02:26:54Z

    [SPARKR][PYSPARK] Fix R source package name to match Spark version. Remove pip tar.gz from distribution
    
    ## What changes were proposed in this pull request?
    
    Fixes name of R source package so that the `cp` in release-build.sh works correctly.
    
    Issue discussed in https://github.com/apache/spark/pull/16014#issuecomment-265867125
    
    Author: Shivaram Venkataraman <sh...@cs.berkeley.edu>
    
    Closes #16221 from shivaram/fix-sparkr-release-build-name.
    
    (cherry picked from commit 4ac8b20bf2f962d9b8b6b209468896758d49efe3)
    Signed-off-by: Shivaram Venkataraman <sh...@cs.berkeley.edu>

commit 4ceed95b43d0cd9665004865095a40926efcc289
Author: wm624@hotmail.com <wm...@hotmail.com>
Date:   2016-12-09T06:08:19Z

    [SPARK-18349][SPARKR] Update R API documentation on ml model summary
    
    ## What changes were proposed in this pull request?
    In this PR, the document of `summary` method is improved in the format:
    
    returns summary information of the fitted model, which is a list. The list includes .......
    
    Since `summary` in R is mainly about the model, which is not the same as `summary` object on scala side, if there is one, the scala API doc is not pointed here.
    
    In the current document, some `return` descriptions end with `.` and some do not; a `.` is added to the ones that were missing it.
    
    Since spark.logit `summary` has a big refactoring, this PR doesn't include this one. It will be changed when the `spark.logit` PR is merged.
    
    ## How was this patch tested?
    
    Manual build.
    
    Author: wm624@hotmail.com <wm...@hotmail.com>
    
    Closes #16150 from wangmiao1981/audit2.
    
    (cherry picked from commit 86a96034ccb47c5bba2cd739d793240afcfc25f6)
    Signed-off-by: Felix Cheung <fe...@apache.org>

commit e8f351f9a670fc4d43f15c8d7cd57e49fb9ceba2
Author: Shivaram Venkataraman <sh...@cs.berkeley.edu>
Date:   2016-12-09T06:21:24Z

    Copy the SparkR source package with LFTP
    
    This PR adds a line in release-build.sh to copy the SparkR source archive using LFTP
    
    Author: Shivaram Venkataraman <sh...@cs.berkeley.edu>
    
    Closes #16226 from shivaram/fix-sparkr-copy-build.
    
    (cherry picked from commit 934035ae7cb648fe61665d8efe0b7aa2bbe4ca47)
    Signed-off-by: Shivaram Venkataraman <sh...@cs.berkeley.edu>

commit 2c88e1dc31e1b90605ad8ab85b20b131b4b3c722
Author: Felix Cheung <fe...@hotmail.com>
Date:   2016-12-09T06:52:34Z

    Copy pyspark and SparkR packages to latest release dir too
    
    ## What changes were proposed in this pull request?
    
    Copy pyspark and SparkR packages to latest release dir, as per comment [here](https://github.com/apache/spark/pull/16226#discussion_r91664822)
    
    Author: Felix Cheung <fe...@hotmail.com>
    
    Closes #16227 from felixcheung/pyrftp.
    
    (cherry picked from commit c074c96dc57bf18b28fafdcac0c768d75c642cba)
    Signed-off-by: Shivaram Venkataraman <sh...@cs.berkeley.edu>

commit 72bf5199738c7ab0361b2b55eb4f4299048a21fa
Author: Zhan Zhang <zh...@fb.com>
Date:   2016-12-09T08:35:06Z

    [SPARK-18637][SQL] Stateful UDF should be considered as nondeterministic
    
    Mark stateful UDFs as nondeterministic.
    
    Add new test cases with both Stateful and Stateless UDF.
    Without the patch, the test cases will throw exception:
    
    1 did not equal 10
    ScalaTestFailureLocation: org.apache.spark.sql.hive.execution.HiveUDFSuite$$anonfun$21 at (HiveUDFSuite.scala:501)
    org.scalatest.exceptions.TestFailedException: 1 did not equal 10
            at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
            at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
            ...
    
    Author: Zhan Zhang <zh...@fb.com>
    
    Closes #16068 from zhzhan/state.
    
    (cherry picked from commit 67587d961d5f94a8639c20cb80127c86bf79d5a8)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit b226f10e3df8b789da6ef820b256f994b178fbbe
Author: Jacek Laskowski <ja...@japila.pl>
Date:   2016-12-09T10:45:57Z

    [MINOR][CORE][SQL][DOCS] Typo fixes
    
    ## What changes were proposed in this pull request?
    
    Typo fixes
    
    ## How was this patch tested?
    
    Local build. Awaiting the official build.
    
    Author: Jacek Laskowski <ja...@japila.pl>
    
    Closes #16144 from jaceklaskowski/typo-fixes.
    
    (cherry picked from commit b162cc0c2810c1a9fa2eee8e664ffae84f9eea11)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 0c6415aeca7a5c2fc5462c483c60d770f0236efe
Author: Xiangrui Meng <me...@databricks.com>
Date:   2016-12-09T15:51:46Z

    [SPARK-17822][R] Make JVMObjectTracker a member variable of RBackend
    
    ## What changes were proposed in this pull request?
    
    * This PR changes `JVMObjectTracker` from `object` to `class` and let its instance associated with each RBackend. So we can manage the lifecycle of JVM objects when there are multiple `RBackend` sessions. `RBackend.close` will clear the object tracker explicitly.
    * I assume that `SQLUtils` and `RRunner` do not need to track JVM instances, which could be wrong.
    * Small refactor of `SerDe.sqlSerDe` to increase readability.
    
    ## How was this patch tested?
    
    * Added unit tests for `JVMObjectTracker`.
    * Wait for Jenkins to run full tests.
    
    Author: Xiangrui Meng <me...@databricks.com>
    
    Closes #16154 from mengxr/SPARK-17822.
    
    (cherry picked from commit fd48d80a6145ea94f03e7fc6e4d724a0fbccac58)
    Signed-off-by: Xiangrui Meng <me...@databricks.com>

commit eb2d9bfd4e100789604ca0810929b42694ea7377
Author: Shivaram Venkataraman <sh...@cs.berkeley.edu>
Date:   2016-12-09T18:12:56Z

    [MINOR][SPARKR] Fix SparkR regex in copy command
    
    Fix SparkR package copy regex. The existing code leads to
    ```
    Copying release tarballs to /home/****/public_html/spark-nightly/spark-branch-2.1-bin/spark-2.1.1-SNAPSHOT-2016_12_08_22_38-e8f351f-bin
    mput: SparkR-*: no files found
    ```
    
    Author: Shivaram Venkataraman <sh...@cs.berkeley.edu>
    
    Closes #16231 from shivaram/typo-sparkr-build.
    
    (cherry picked from commit be5fc6ef72c7eb586b184b0f42ac50ef32843208)
    Signed-off-by: Shivaram Venkataraman <sh...@cs.berkeley.edu>

commit 562507ef038f09ff422e9831416af5119282a9d0
Author: Kazuaki Ishizaki <is...@jp.ibm.com>
Date:   2016-12-09T22:13:36Z

    [SPARK-18745][SQL] Fix signed integer overflow due to toInt cast
    
    ## What changes were proposed in this pull request?
    
    This PR avoids a `toInt` cast producing a negative result due to signed integer overflow (e.g. 0x0000_0000_1???????L.toInt < 0). The cast is now performed only after we can ensure the value is within the range of a signed integer (the result of `max(array.length, ???)` is always an integer).
    
    ## How was this patch tested?
    
    Manually executed query68 of TPC-DS with 100TB
    
    Author: Kazuaki Ishizaki <is...@jp.ibm.com>
    
    Closes #16235 from kiszk/SPARK-18745.
    
    (cherry picked from commit d60ab5fd9b6af9aa5080a2d13b3589d8b79c5c5c)
    Signed-off-by: Herman van Hovell <hv...@databricks.com>

commit e45345d91e333e0b5f9219e857affeda461863c6
Author: Xiangrui Meng <me...@databricks.com>
Date:   2016-12-10T01:34:52Z

    [SPARK-18812][MLLIB] explain "Spark ML"
    
    ## What changes were proposed in this pull request?
    
    There has been some confusion around "Spark ML" vs. "MLlib". This PR adds some FAQ-like entries to the MLlib user guide to explain "Spark ML" and reduce the confusion.
    
    I check the [Spark FAQ page](http://spark.apache.org/faq.html), which seems too high-level for the content here. So I added it to the MLlib user guide instead.
    
    cc: mateiz
    
    Author: Xiangrui Meng <me...@databricks.com>
    
    Closes #16241 from mengxr/SPARK-18812.
    
    (cherry picked from commit d2493a203e852adf63dde4e1fc993e8d11efec3d)
    Signed-off-by: Xiangrui Meng <me...@databricks.com>

commit 8bf56cc46b96874565ebd8109f62e69e6c0cf151
Author: Felix Cheung <fe...@hotmail.com>
Date:   2016-12-10T03:06:05Z

    [SPARK-18807][SPARKR] Should suppress output print for calls to JVM methods with void return values
    
    ## What changes were proposed in this pull request?
    
    Several SparkR APIs calling into JVM methods with void return values are getting their results printed out, especially when running in a REPL or IDE.
    example:
    ```
    > setLogLevel("WARN")
    NULL
    ```
    We should fix this to make the result more clear.
    
    Also found a small change to return value of dropTempView in 2.1 - adding doc and test for it.
    
    ## How was this patch tested?
    
    manually - I didn't find an expect_*() method in testthat for this
    
    Author: Felix Cheung <fe...@hotmail.com>
    
    Closes #16237 from felixcheung/rinvis.
    
    (cherry picked from commit 3e11d5bfef2f05bd6d42c4d6188eae6d63c963ef)
    Signed-off-by: Shivaram Venkataraman <sh...@cs.berkeley.edu>

commit b020ce408507d7fd57f6d357054a2b3530a5b95e
Author: Burak Yavuz <br...@gmail.com>
Date:   2016-12-10T06:49:51Z

    [SPARK-18811] StreamSource resolution should happen in stream execution thread
    
    ## What changes were proposed in this pull request?
    
    When you start a stream, resolving the source of the stream (for example, resolving partition columns) can take a long time. This should not block the main thread on which `query.start()` was called; it should happen in the stream execution thread, possibly before starting any triggers.
    
    ## How was this patch tested?
    
    Unit test added. Made sure test fails with no code changes.
    
    Author: Burak Yavuz <br...@gmail.com>
    
    Closes #16238 from brkyvz/SPARK-18811.
    
    (cherry picked from commit 63c9159870ee274c68e24360594ca01d476b9ace)
    Signed-off-by: Shixiong Zhu <sh...@databricks.com>

commit 2b36f4943051fafea0b12b662b4f4dab54806d26
Author: Huaxin Gao <hu...@us.ibm.com>
Date:   2016-12-10T14:41:40Z

    [SPARK-17460][SQL] Make sure sizeInBytes in Statistics will not overflow
    
    ## What changes were proposed in this pull request?
    
    1. In SparkStrategies.canBroadcast, I will add the check   plan.statistics.sizeInBytes >= 0
    2. In LocalRelations.statistics, when calculating the statistics, I will change the size to BigInt so it won't overflow.
    
    ## How was this patch tested?
    
    I will add a test case to make sure the statistics.sizeInBytes won't overflow.
    
    Author: Huaxin Gao <hu...@us.ibm.com>
    
    Closes #16175 from huaxingao/spark-17460.
    
    (cherry picked from commit c5172568b59b4cf1d3dc7ed8c17a9bea2ea2ab79)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit 83822df02fcd541068dd9cd462293f3cddfb6631
Author: Dongjoon Hyun <do...@apache.org>
Date:   2016-12-10T16:40:10Z

    [MINOR][DOCS] Remove Apache Spark Wiki address
    
    ## What changes were proposed in this pull request?
    
    According to the notice on the Wiki front page below, we can safely remove the obsolete wiki pointers in `README.md` and `docs/index.md`, too. These two lines are the last occurrences of those links.
    
    ```
    All current wiki content has been merged into pages at http://spark.apache.org as of November 2016.
    Each page links to the new location of its information on the Spark web site.
    Obsolete wiki content is still hosted here, but carries a notice that it is no longer current.
    ```
    
    ## How was this patch tested?
    
    Manual.
    
    - `README.md`: https://github.com/dongjoon-hyun/spark/tree/remove_wiki_from_readme
    - `docs/index.md`:
    ```
    cd docs
    SKIP_API=1 jekyll build
    ```
    ![screen shot 2016-12-09 at 2 53 29 pm](https://cloud.githubusercontent.com/assets/9700541/21067323/517252e2-be1f-11e6-85b1-2a4471131c5d.png)
    
    Author: Dongjoon Hyun <do...@apache.org>
    
    Closes #16239 from dongjoon-hyun/remove_wiki_from_readme.
    
    (cherry picked from commit f3a3fed76cb74ecd0f46031f337576ce60f54fb2)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 5151dafaaa6533ea88f7173c136e004ad87abd04
Author: Michal Senkyr <mi...@gmail.com>
Date:   2016-12-10T19:54:07Z

    [SPARK-3359][DOCS] Fix greater-than symbols in Javadoc to allow building with Java 8
    
    ## What changes were proposed in this pull request?
    
    The API documentation build was failing when using Java 8 due to incorrect character `>` in Javadoc.
    
    Replace `>` with literals in Javadoc to allow the build to pass.
    
    ## How was this patch tested?
    
    Documentation was built and inspected manually to ensure it still displays correctly in the browser
    
    ```
    cd docs && jekyll serve
    ```
    
    Author: Michal Senkyr <mi...@gmail.com>
    
    Closes #16201 from michalsenkyr/javadoc8-gt-fix.
    
    (cherry picked from commit 114324832abce1fbb2c5f5b84a66d39dd2d4398a)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit de21ca46e5d992dd950b6dcec71d7aee0cf6532e
Author: wangzhenhua <wa...@huawei.com>
Date:   2016-12-11T05:25:29Z

    [SPARK-18815][SQL] Fix NPE when collecting column stats for string/binary column having only null values
    
    ## What changes were proposed in this pull request?
    
    During column stats collection, average and max length will be null if a column of string/binary type has only null values. To fix this, I use default size when avg/max length is null.
    
    ## How was this patch tested?
    
    Add a test for handling null columns
    
    Author: wangzhenhua <wa...@huawei.com>
    
    Closes #16243 from wzhfy/nullStats.
    
    (cherry picked from commit a29ee55aaadfe43ac9abb0eaf8b022b1e6d7babb)
    Signed-off-by: Reynold Xin <rx...@databricks.com>

commit d4c03f8769f063b0dfac7d000513a2bc20989549
Author: Wenchen Fan <we...@databricks.com>
Date:   2016-12-11T09:12:46Z

    [SQL][MINOR] simplify a test to fix the maven tests
    
    ## What changes were proposed in this pull request?
    
    After https://github.com/apache/spark/pull/15620 , all of the Maven-based 2.0 Jenkins jobs time out consistently. As I pointed out in https://github.com/apache/spark/pull/15620#discussion_r91829129 , it seems that the regression test is overkill and may hit the constant pool size limitation, which is a known issue that hasn't been fixed yet.
    
    Since #15620 only fixes the code size limitation problem, we can simplify the test to avoid hitting the constant pool size limitation.
    
    ## How was this patch tested?
    
    test only change
    
    Author: Wenchen Fan <we...@databricks.com>
    
    Closes #16244 from cloud-fan/minor.
    
    (cherry picked from commit 9abd05b6b94eda31c47bce1f913af988c35f1cb1)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit d5f14168d39433a02d065206c3910595339ff3dc
Author: krishnakalyan3 <kr...@gmail.com>
Date:   2016-12-11T09:28:16Z

    [SPARK-18628][ML] Update Scala param and Python param to have quotes
    
    ## What changes were proposed in this pull request?
    
    Updated Scala param and Python param to have quotes around the options making it easier for users to read.
    
    ## How was this patch tested?
    
    Manually checked the docstrings
    
    Author: krishnakalyan3 <kr...@gmail.com>
    
    Closes #16242 from krishnakalyan3/doc-string.
    
    (cherry picked from commit c802ad87182520662be51eb611ea1c64f4874c4e)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 63693c17e4407ec61052553d563218787c6f0dd6
Author: Tyson Condie <tc...@gmail.com>
Date:   2016-12-12T07:38:31Z

    [SPARK-18790][SS] Keep a general offset history of stream batches
    
    ## What changes were proposed in this pull request?
    
    Instead of only keeping the minimum number of offsets around, we should keep enough information to allow us to roll back n batches and re-execute the stream starting from a given point. In particular, we should create a config in SQLConf, spark.sql.streaming.retainedBatches, defaulting to 100, and ensure that we keep enough log files in the following places to roll back the specified number of batches:
    - the offsets that are present in each batch
    - versions of the state store
    - the file lists stored for the FileStreamSource
    - the metadata log stored by the FileStreamSink
    
    marmbrus zsxwing
    
    ## How was this patch tested?
    
    The following tests were added.
    
    ### StreamExecution offset metadata
    Test added to StreamingQuerySuite that ensures offset metadata is garbage collected according to minBatchesRetain
    
    ### CompactibleFileStreamLog
    Tests added in CompactibleFileStreamLogSuite to ensure that logs are purged starting before the first compaction file that precedes the current batch id - minBatchesToRetain.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Tyson Condie <tc...@gmail.com>
    
    Closes #16219 from tcondie/offset_hist.
    
    (cherry picked from commit 83a42897ae90d84a54373db386a985e3e2d5903a)
    Signed-off-by: Shixiong Zhu <sh...@databricks.com>

commit 35011608f492ddcb19144954ba96c45ca6f87784
Author: Bill Chambers <bi...@databricks.com>
Date:   2016-12-12T13:33:17Z

    [DOCS][MINOR] Clarify Where AccumulatorV2s are Displayed
    
    ## What changes were proposed in this pull request?
    
    This PR clarifies where accumulators will be displayed.
    
    ## How was this patch tested?
    
    No testing.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Bill Chambers <bi...@databricks.com>
    Author: anabranch <wa...@gmail.com>
    Author: Bill Chambers <wc...@ischool.berkeley.edu>
    
    Closes #16180 from anabranch/improve-acc-docs.
    
    (cherry picked from commit 70ffff21f769b149bee787fe5901d9844a4d97b8)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 523071f3fae72909b64c7f405868bbc85f5c3cde
Author: Yuming Wang <wg...@gmail.com>
Date:   2016-12-12T22:38:36Z

    [SPARK-18681][SQL] Fix filtering to compatible with partition keys of type int
    
    ## What changes were proposed in this pull request?
    
    Cloudera put `/var/run/cloudera-scm-agent/process/15000-hive-HIVEMETASTORE/hive-site.xml` as the configuration file for the Hive Metastore Server, where `hive.metastore.try.direct.sql=false`. But Spark isn't reading this configuration file and gets the default value `hive.metastore.try.direct.sql=true`. As mallman said, we should use the `getMetaConf` method to obtain the original configuration from the Hive Metastore Server. I have tested this method a few times and the return value is always consistent with the Hive Metastore Server.
    
    ## How was this patch tested?
    
    The existing tests.
    
    Author: Yuming Wang <wg...@gmail.com>
    
    Closes #16122 from wangyum/SPARK-18681.
    
    (cherry picked from commit 90abfd15f4b3f612a7b0ff65f03bf319c78a0243)
    Signed-off-by: Herman van Hovell <hv...@databricks.com>

commit 1aeb7f427d31bfd44f7abb7c56dd7661be8bbaa6
Author: Felix Cheung <fe...@hotmail.com>
Date:   2016-12-12T22:40:41Z

    [SPARK-18810][SPARKR] SparkR install.spark does not work for RCs, snapshots
    
    ## What changes were proposed in this pull request?
    
    Support overriding the download url (include version directory) in an environment variable, `SPARKR_RELEASE_DOWNLOAD_URL`
    
    ## How was this patch tested?
    
    unit test, manually testing
    - snapshot build url
      - download when spark jar not cached
      - when spark jar is cached
    - RC build url
      - download when spark jar not cached
      - when spark jar is cached
    - multiple cached spark versions
    - starting with sparkR shell
    
    To use this,
    ```
    SPARKR_RELEASE_DOWNLOAD_URL=http://this_is_the_url_to_spark_release_tgz R
    ```
    then in R,
    ```
    library(SparkR) # or specify lib.loc
    sparkR.session()
    ```
    
    Author: Felix Cheung <fe...@hotmail.com>
    
    Closes #16248 from felixcheung/rinstallurl.
    
    (cherry picked from commit 8a51cfdcad5f8397558ed2e245eb03650f37ce66)
    Signed-off-by: Shivaram Venkataraman <sh...@cs.berkeley.edu>

commit 9dc5fa5f77d910e44746c5866cb77565c4b761d9
Author: Shixiong Zhu <sh...@databricks.com>
Date:   2016-12-13T06:31:22Z

    [SPARK-18796][SS] StreamingQueryManager should not block when starting a query
    
    ## What changes were proposed in this pull request?
    
    Major change in this PR:
    - Add `pendingQueryNames` and `pendingQueryIds` to track queries that are going to start but have not yet been put into `activeQueries`, so that we don't need to hold a lock when starting a query.
    
    Minor changes:
    - Fix a potential NPE when the user sets `checkpointLocation` using SQLConf but doesn't specify a query name.
    - Add missing docs in `StreamingQueryListener`
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <sh...@databricks.com>
    
    Closes #16220 from zsxwing/SPARK-18796.
    
    (cherry picked from commit 417e45c58484a6b984ad2ce9ba8f47aa0a9983fd)
    Signed-off-by: Tathagata Das <ta...@gmail.com>

commit 9f0e3be622c77f7a677ce2c930b6dba2f652df00
Author: wm624@hotmail.com <wm...@hotmail.com>
Date:   2016-12-13T06:41:11Z

    [SPARK-18797][SPARKR] Update spark.logit in sparkr-vignettes
    
    ## What changes were proposed in this pull request?
    spark.logit is added in 2.1. We need to update sparkr-vignettes to reflect the changes. This is part of the SparkR QA work.
    
    ## How was this patch tested?
    
    Manual build html. Please see attached image for the result.
    ![test](https://cloud.githubusercontent.com/assets/5033592/21032237/01b565fe-bd5d-11e6-8b59-4de4b6ef611d.jpeg)
    
    Author: wm624@hotmail.com <wm...@hotmail.com>
    
    Closes #16222 from wangmiao1981/veg.
    
    (cherry picked from commit 2aa16d03db79a642cbe21f387441c34fc51a8236)
    Signed-off-by: Xiangrui Meng <me...@databricks.com>

commit 207107bca5e550657b02892eef74230787972d10
Author: Marcelo Vanzin <va...@cloudera.com>
Date:   2016-12-13T18:02:19Z

    [SPARK-18835][SQL] Don't expose Guava types in the JavaTypeInference API.
    
    This avoids issues during maven tests because of shading.
    
    Author: Marcelo Vanzin <va...@cloudera.com>
    
    Closes #16260 from vanzin/SPARK-18835.
    
    (cherry picked from commit f280ccf449f62a00eb4042dfbcf7a0715850fd4c)
    Signed-off-by: Marcelo Vanzin <va...@cloudera.com>

commit d5c4a5d06b3282aec8300d27510393161773061b
Author: jerryshao <ss...@hortonworks.com>
Date:   2016-12-13T18:37:45Z

    [SPARK-18840][YARN] Avoid throw exception when getting token renewal interval in non HDFS security environment
    
    ## What changes were proposed in this pull request?
    
    Fix `java.util.NoSuchElementException` when running Spark in non-hdfs security environment.
    
    In the current code, we assume `HDFS_DELEGATION_KIND` token will be found in Credentials. But in some cloud environments, HDFS is not required, so we should avoid this exception.
    
    ## How was this patch tested?
    
    Manually verified in local environment.
    
    Author: jerryshao <ss...@hortonworks.com>
    
    Closes #16265 from jerryshao/SPARK-18840.
    
    (cherry picked from commit 43298d157d58d5d03ffab818f8cdfc6eac783c55)
    Signed-off-by: Marcelo Vanzin <va...@cloudera.com>

commit 292a37f2455b12ef8dfbdaf5b905a69b8b5e3728
Author: Alex Bozarth <aj...@us.ibm.com>
Date:   2016-12-13T21:37:46Z

    [SPARK-18816][WEB UI] Executors Logs column only ran visibility check on initial table load
    
    ## What changes were proposed in this pull request?
    
    When I added a visibility check for the logs column on the executors page in #14382, the check only ran on the initial DataTable creation and not on subsequent page loads. I moved the check out of the table definition so that it now runs on each page load. The jQuery DataTable functionality used is the same.
    
    ## How was this patch tested?
    
    Tested Manually
    
    No visible UI changes to screenshot.
    
    Author: Alex Bozarth <aj...@us.ibm.com>
    
    Closes #16256 from ajbozarth/spark18816.
    
    (cherry picked from commit aebf44e50b6b04b848829adbbe08b0f74f31eb32)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit f672bfdf9689c0ab74226b11785ada50b72cd488
Author: Shixiong Zhu <sh...@databricks.com>
Date:   2016-12-13T22:09:25Z

    [SPARK-18843][CORE] Fix timeout in awaitResultInForkJoinSafely (branch 2.1, 2.0)
    
    ## What changes were proposed in this pull request?
    
    This PR fixes the timeout value in `awaitResultInForkJoinSafely` for 2.1 and 2.0. Master has been fixed by https://github.com/apache/spark/pull/16230.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <sh...@databricks.com>
    
    Closes #16268 from zsxwing/SPARK-18843.

commit 25b97589e32ddc424df500059cd9962eb1b2fa6b
Author: Tathagata Das <ta...@gmail.com>
Date:   2016-12-13T22:14:25Z

    [SPARK-18834][SS] Expose event time stats through StreamingQueryProgress
    
    ## What changes were proposed in this pull request?
    
    - Changed `StreamingQueryProgress.watermark` to `StreamingQueryProgress.queryTimestamps` which is a `Map[String, String]` containing the following keys: "eventTime.max", "eventTime.min", "eventTime.avg", "processingTime", "watermark". All of them UTC formatted strings.
    
    - Renamed `StreamingQuery.timestamp` to `StreamingQueryProgress.triggerTimestamp` to differentiate from `queryTimestamps`. It has the timestamp of when the trigger was started.
    
    ## How was this patch tested?
    
    Updated tests
    
    Author: Tathagata Das <ta...@gmail.com>
    
    Closes #16258 from tdas/SPARK-18834.
    
    (cherry picked from commit c68fb426d4ac05414fb402aa1f30f4c98df103ad)
    Signed-off-by: Tathagata Das <ta...@gmail.com>

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17239: Using map function in spark for huge operation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17239
  
    Can one of the admins verify this patch?




[GitHub] spark issue #17239: Using map function in spark for huge operation

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/17239
  
    Please close this @nischay21 




[GitHub] spark issue #17239: Using map function in spark for huge operation

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/17239
  
    @nischay21 Could you click "Close pull request" below? This is not the place where we are supposed to report issues. This causes a build failure mark on some branches.




[GitHub] spark pull request #17239: Using map function in spark for huge operation

Posted by nischay21 <gi...@git.apache.org>.
Github user nischay21 closed the pull request at:

    https://github.com/apache/spark/pull/17239

