Posted to reviews@spark.apache.org by AtulKumVerma <gi...@git.apache.org> on 2018/02/20 15:37:24 UTC

[GitHub] spark pull request #20642: i m not able to open Spark UI on local using loca...

GitHub user AtulKumVerma opened a pull request:

    https://github.com/apache/spark/pull/20642

    I'm not able to open the Spark UI locally using localhost:4040

    20-02-18 20:54:24:118 [WARN] org.spark_project.jetty.server.HttpChannel /jobs/ org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:384) 
    java.lang.NoSuchMethodError: javax.servlet.http.HttpServletRequest.isAsyncStarted()Z
    	at org.spark_project.jetty.servlets.gzip.GzipHandler.handle(GzipHandler.java:484)
    	at org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
    	at org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
    	at org.spark_project.jetty.server.Server.handle(Server.java:499)
    	at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
    	at org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
    	at org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
    	at org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
    	at org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
    	at java.lang.Thread.run(Thread.java:745)
    20-02-18 20:54:24:122 [WARN] org.spark_project.jetty.util.thread.QueuedThreadPool  org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:610) 
    java.lang.NoSuchMethodError: javax.servlet.http.HttpServletResponse.getStatus()I
    	at org.spark_project.jetty.server.handler.ErrorHandler.handle(ErrorHandler.java:112)
    	at org.spark_project.jetty.server.Response.sendError(Response.java:597)
    	at org.spark_project.jetty.server.HttpChannel.handleException(HttpChannel.java:487)
    	at org.spark_project.jetty.server.HttpConnection$HttpChannelOverHttp.handleException(HttpConnection.java:594)
    	at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:387)
    	at org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
    	at org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
    	at org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
    	at org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
    	at java.lang.Thread.run(Thread.java:745)


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark branch-2.3

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20642.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20642
    
----
commit 2db523959658d8cd04f83c176e46f6bcdb745335
Author: sethah <sh...@...>
Date:   2018-01-10T07:32:47Z

    [SPARK-22993][ML] Clarify HasCheckpointInterval param doc
    
    ## What changes were proposed in this pull request?
    
    Add a note to the `HasCheckpointInterval` parameter doc that clarifies that this setting is ignored when no checkpoint directory has been set on the SparkContext.
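
    A minimal sketch of the interaction, assuming a SparkSession `spark` is in scope and using an illustrative checkpoint path:

    ```scala
    import org.apache.spark.ml.recommendation.ALS

    // Without this line, any checkpointInterval setting below is silently ignored.
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  // hypothetical path

    // checkpointInterval now takes effect during ALS training.
    val als = new ALS().setCheckpointInterval(10)
    ```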
    
    ## How was this patch tested?
    
    No tests necessary, just a doc update.
    
    Author: sethah <sh...@cloudera.com>
    
    Closes #20188 from sethah/als_checkpoint_doc.
    
    (cherry picked from commit 70bcc9d5ae33d6669bb5c97db29087ccead770fb)
    Signed-off-by: Felix Cheung <fe...@apache.org>

commit 60d4d79bb40f13c68773a0224f2003cdca28c138
Author: Josh Rosen <jo...@...>
Date:   2018-01-10T08:45:47Z

    [SPARK-22997] Add additional defenses against use of freed MemoryBlocks
    
    ## What changes were proposed in this pull request?
    
    This patch modifies Spark's `MemoryAllocator` implementations so that `free(MemoryBlock)` mutates the passed block to clear pointers (in the off-heap case) or null out references to backing `long[]` arrays (in the on-heap case). The goal of this change is to add an extra layer of defense against use-after-free bugs because currently it's hard to detect corruption caused by blind writes to freed memory blocks.
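
    The following is an illustrative sketch of the defensive pattern only (not Spark's actual `MemoryAllocator` code): freeing a block clears its backing reference, so a later blind write through the freed block fails fast instead of silently corrupting reused memory.

    ```scala
    // Hypothetical, simplified types for illustration.
    final class Block(var data: Array[Long])

    final class Allocator {
      def allocate(words: Int): Block = new Block(new Array[Long](words))

      def free(block: Block): Unit = {
        // Null out the reference so use-after-free throws instead of writing
        // into memory that may have been handed to another consumer.
        block.data = null
      }
    }
    ```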
    
    ## How was this patch tested?
    
    New unit tests in `PlatformSuite`, including new tests for existing functionality because we did not have sufficient mutation coverage of the on-heap memory allocator's pooling logic.
    
    Author: Josh Rosen <jo...@databricks.com>
    
    Closes #20191 from JoshRosen/SPARK-22997-add-defenses-against-use-after-free-bugs-in-memory-allocator.
    
    (cherry picked from commit f340b6b3066033d40b7e163fd5fb68e9820adfb1)
    Signed-off-by: Josh Rosen <jo...@databricks.com>

commit 5b5851cb685f395574c94174d45a47c4fbf946c8
Author: Wang Gengliang <lt...@...>
Date:   2018-01-10T17:44:30Z

    [SPARK-23019][CORE] Wait until SparkContext.stop() finished in SparkLauncherSuite
    
    ## What changes were proposed in this pull request?
    In the current code, the `waitFor` call at https://github.com/apache/spark/blob/cfcd746689c2b84824745fa6d327ffb584c7a17d/core/src/test/java/org/apache/spark/launcher/SparkLauncherSuite.java#L155 only waits until the DAGScheduler is stopped, while SparkContext.clearActiveContext may not have been called yet.
    https://github.com/apache/spark/blob/1c9f95cb771ac78775a77edd1abfeb2d8ae2a124/core/src/main/scala/org/apache/spark/SparkContext.scala#L1924
    
    Thus, in the Jenkins test
    https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.6/ , `JdbcRDDSuite` failed because the previous test `SparkLauncherSuite` exited before SparkContext.stop() had finished.
    
    To reproduce:
    ```
    $ build/sbt
    > project core
    > testOnly *SparkLauncherSuite *JavaJdbcRDDSuite
    ```
    
    To fix:
    Wait for a reasonable amount of time to avoid creating two active SparkContexts in the same JVM in SparkLauncherSuite.
    Can't come up with any better solution for now.
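
    A minimal sketch of the kind of bounded wait described above; the helper, predicate, and timeout are illustrative and not the suite's actual code:

    ```scala
    // Poll a condition until it holds or a deadline passes.
    def waitUntil(condition: () => Boolean, timeoutMs: Long = 10000, intervalMs: Long = 100): Boolean = {
      val deadline = System.currentTimeMillis() + timeoutMs
      while (System.currentTimeMillis() < deadline) {
        if (condition()) return true
        Thread.sleep(intervalMs)
      }
      condition()
    }

    // e.g. waitUntil(() => noActiveSparkContext())  // noActiveSparkContext is hypothetical
    ```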
    
    ## How was this patch tested?
    
    Unit test
    
    Author: Wang Gengliang <lt...@gmail.com>
    
    Closes #20221 from gengliangwang/SPARK-23019.
    
    (cherry picked from commit 344e3aab87178e45957333479a07e07f202ca1fd)
    Signed-off-by: Marcelo Vanzin <va...@cloudera.com>

commit eb4fa551e60800269a939b2c1c0ad69e3a801264
Author: Feng Liu <fe...@...>
Date:   2018-01-10T22:25:04Z

    [SPARK-22951][SQL] fix aggregation after dropDuplicates on empty data frames
    
    ## What changes were proposed in this pull request?
    
    (courtesy of liancheng)
    
    Spark SQL supports both global aggregation and grouping aggregation. Global aggregation always returns a single row with the initial aggregation state as the output, even when there are zero input rows. Spark implements this by simply checking the number of grouping keys and treating an aggregation as a global aggregation if it has zero grouping keys.
    
    However, this simple principle drops the ball in the following case:
    
    ```scala
    spark.emptyDataFrame.dropDuplicates().agg(count($"*") as "c").show()
    // +---+
    // | c |
    // +---+
    // | 1 |
    // +---+
    ```
    
    The reason is that:
    
    1. `df.dropDuplicates()` is roughly translated into something equivalent to:
    
    ```scala
    val allColumns = df.columns.map { col }
    df.groupBy(allColumns: _*).agg(allColumns.head, allColumns.tail: _*)
    ```
    
    This translation is implemented in the rule `ReplaceDeduplicateWithAggregate`.
    
    2. `spark.emptyDataFrame` contains zero columns and zero rows.
    
    Therefore, rule `ReplaceDeduplicateWithAggregate` makes a confusing transformation roughly equivalent to the following one:
    
    ```scala
    spark.emptyDataFrame.dropDuplicates()
    => spark.emptyDataFrame.groupBy().agg(Map.empty[String, String])
    ```
    
    The above transformation is confusing because the resulting aggregate operator contains no grouping keys (because `emptyDataFrame` contains no columns), and gets recognized as a global aggregation. As a result, Spark SQL allocates a single row filled by the initial aggregation state and uses it as the output, and returns a wrong result.
    
    To fix this issue, this PR tweaks `ReplaceDeduplicateWithAggregate` by appending a literal `1` to the grouping key list of the resulting `Aggregate` operator when the input plan contains zero output columns. In this way, `spark.emptyDataFrame.dropDuplicates()` is now translated into a grouping aggregation, roughly depicted as:
    
    ```scala
    spark.emptyDataFrame.dropDuplicates()
    => spark.emptyDataFrame.groupBy(lit(1)).agg(Map.empty[String, String])
    ```
    
    This is now properly treated as a grouping aggregation and returns the correct answer.
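
    A quick check of the expected behavior after the fix, assuming a `spark` session is in scope (as in spark-shell):

    ```scala
    import org.apache.spark.sql.functions.count
    import spark.implicits._

    // With the grouping key appended, zero input rows yield zero groups,
    // so this should now print a result with column `c` and no rows.
    spark.emptyDataFrame.dropDuplicates().agg(count($"*") as "c").show()
    ```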
    
    ## How was this patch tested?
    
    New unit tests added
    
    Author: Feng Liu <fe...@databricks.com>
    
    Closes #20174 from liufengdb/fix-duplicate.
    
    (cherry picked from commit 9b33dfc408de986f4203bb0ac0c3f5c56effd69d)
    Signed-off-by: Cheng Lian <li...@gmail.com>

commit 551ccfba529996e987c4d2e8d4dd61c4ab9a2e95
Author: Bryan Cutler <cu...@...>
Date:   2018-01-10T05:55:24Z

    [SPARK-23009][PYTHON] Fix for non-str col names to createDataFrame from Pandas
    
    ## What changes were proposed in this pull request?
    
    This fixes the case of calling `SparkSession.createDataFrame` with a Pandas DataFrame that has non-str column labels.
    
    The column name conversion logic to handle non-string or unicode column labels in Python 2 is:
    ```
    if column is not any type of string:
        name = str(column)
    else if column is unicode in Python 2:
        name = column.encode('utf-8')
    ```
    
    ## How was this patch tested?
    
    Added a new test with a Pandas DataFrame that has int column labels
    
    Author: Bryan Cutler <cu...@gmail.com>
    
    Closes #20210 from BryanCutler/python-createDataFrame-int-col-error-SPARK-23009.

commit 317b0aaed83e4bbf66f63ddc0d618da9f1f85085
Author: Mingjie Tang <mt...@...>
Date:   2018-01-11T03:51:03Z

    [SPARK-22587] Spark job fails if fs.defaultFS and application jar are different url
    
    ## What changes were proposed in this pull request?
    
    The filesystem comparison does not take the authority of the URI into account. This matters for the
    WASB file storage system, where userInfo is honored to differentiate filesystems.
    For example, wasbs://user1xyz.net and wasbs://user2xyz.net should be considered two different filesystems.
    Therefore, the authority has to be included when comparing two filesystems, and two filesystems with different authorities cannot be the same FS.
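
    An illustrative sketch of the comparison idea only (not Spark's actual helper): two filesystem URIs match only when both the scheme and the authority agree, case-insensitively.

    ```scala
    import java.net.URI

    def sameFileSystem(a: URI, b: URI): Boolean =
      a.getScheme.equalsIgnoreCase(b.getScheme) &&
        Option(a.getAuthority).map(_.toLowerCase) == Option(b.getAuthority).map(_.toLowerCase)

    sameFileSystem(new URI("wasbs://user1xyz.net"), new URI("wasbs://user2xyz.net"))  // false
    ```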
    
    
    Author: Mingjie Tang <mt...@hortonworks.com>
    
    Closes #19885 from merlintang/EAR-7377.
    
    (cherry picked from commit a6647ffbf7a312a3e119a9beef90880cc915aa60)
    Signed-off-by: jerryshao <ss...@hortonworks.com>

commit d9a973d65c52169e3c3b2223d4a55b07ee82b88e
Author: gatorsmile <ga...@...>
Date:   2018-01-11T10:17:34Z

    [SPARK-23001][SQL] Fix NullPointerException when DESC a database with NULL description
    
    ## What changes were proposed in this pull request?
    When a user's DB description is NULL, they might hit a `NullPointerException`. This PR fixes the issue.
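
    A minimal sketch of the null-safe pattern involved (the `Database` case class and values are hypothetical, not the actual code path):

    ```scala
    case class Database(name: String, description: String)

    val db = Database("sales", null)
    // Guard the possibly-NULL description before using it.
    val desc = Option(db.description).getOrElse("")
    ```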
    
    ## How was this patch tested?
    Added test cases
    
    Author: gatorsmile <ga...@gmail.com>
    
    Closes #20215 from gatorsmile/SPARK-23001.
    
    (cherry picked from commit 87c98de8b23f0e978958fc83677fdc4c339b7e6a)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit b78130123baba87554503e81b8aee3121666ba91
Author: Marcelo Vanzin <va...@...>
Date:   2018-01-11T11:41:48Z

    [SPARK-20657][CORE] Speed up rendering of the stages page.
    
    There are two main changes to speed up rendering of the tasks list
    when rendering the stage page.
    
    The first one makes the code only load the tasks being shown in the
    current page of the tasks table, and information related to only
    those tasks. One side-effect of this change is that the graph that
    shows task-related events now only shows events for the tasks in
    the current page, instead of the previously hardcoded limit of "events
    for the first 1000 tasks". That ends up helping with readability,
    though.
    
    To make sorting efficient when using a disk store, the task wrapper
    was extended to include many new indices, one for each of the sortable
    columns in the UI, and metrics for which quantiles are calculated.
    
    The second changes the way metric quantiles are calculated for stages.
    Instead of using the "Distribution" class to process data for all task
    metrics, which requires scanning all tasks of a stage, the code now
    uses the KVStore "skip()" functionality to only read tasks that contain
    interesting information for the quantiles that are desired.
    
    This is still not cheap; because there are many metrics that the UI
    and API track, the code needs to scan the index for each metric to
    gather the information. Savings come mainly from skipping deserialization
    when using the disk store, but the in-memory code also seems to be
    faster than before (most probably because of other changes in this
    patch).
    
    To make subsequent calls faster, some quantiles are cached in the
    status store. This makes UIs much faster after the first time a stage
    has been loaded.
    
    With the above changes, a lot of code in the UI layer could be simplified.
    
    Author: Marcelo Vanzin <va...@cloudera.com>
    
    Closes #20013 from vanzin/SPARK-20657.
    
    (cherry picked from commit 1c70da3bfbb4016e394de2c73eb0db7cdd9a6968)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit 79959890570d216c33069c8382b29d53977665b1
Author: wuyi5 <ng...@...>
Date:   2018-01-11T13:17:15Z

    [SPARK-22967][TESTS] Fix VersionSuite's unit tests by change Windows path into URI path
    
    ## What changes were proposed in this pull request?
    
    Two unit tests fail due to Windows-format paths:
    
    1. test(s"$version: read avro file containing decimal")
    ```
    org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
    ```
    
    2. test(s"$version: SPARK-17920: Insert into/overwrite avro table")
    ```
    Unable to infer the schema. The schema specification is required to create the table `default`.`tab2`.;
    org.apache.spark.sql.AnalysisException: Unable to infer the schema. The schema specification is required to create the table `default`.`tab2`.;
    ```
    
    This PR fixes these two unit tests by changing the Windows paths into URI paths.
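
    A minimal sketch of the conversion idea (the path is a placeholder): turn a Windows-style path into a URI path that Hadoop and Hive accept on any platform.

    ```scala
    import java.io.File

    val dir = new File("C:\\Users\\me\\warehouse")   // hypothetical Windows path
    val uriPath = dir.toURI.toString                 // e.g. "file:/C:/Users/me/warehouse/"
    ```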
    
    ## How was this patch tested?
    Existing tests.
    
    
    Author: wuyi5 <ng...@163.com>
    
    Closes #20199 from Ngone51/SPARK-22967.
    
    (cherry picked from commit 0552c36e02434c60dad82024334d291f6008b822)
    Signed-off-by: hyukjinkwon <gu...@gmail.com>

commit 9ca0f6eaf6744c090cab4ac6720cf11c9b83915e
Author: gatorsmile <ga...@...>
Date:   2018-01-11T13:32:36Z

    [SPARK-23000][TEST-HADOOP2.6] Fix Flaky test suite DataSourceWithHiveMetastoreCatalogSuite
    
    ## What changes were proposed in this pull request?
    The Spark 2.3 branch still failed due to the flaky test suite `DataSourceWithHiveMetastoreCatalogSuite`. https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-sbt-hadoop-2.6/

    Although https://github.com/apache/spark/pull/20207 was unable to reproduce it in Spark 2.3, the following stacktrace suggests that the current database of Spark's catalog had been changed. Thus, we just need to reset it (see the sketch after the stacktrace).
    
    ```
    [info] DataSourceWithHiveMetastoreCatalogSuite:
    02:40:39.486 ERROR org.apache.hadoop.hive.ql.parse.CalcitePlanner: org.apache.hadoop.hive.ql.parse.SemanticException: Line 1:14 Table not found 't'
    	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1594)
    	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1545)
    	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genResolvedParseTree(SemanticAnalyzer.java:10077)
    	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10128)
    	at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:209)
    	at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:227)
    	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:424)
    	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)
    	at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1122)
    	at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1170)
    	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)
    	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
    	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:694)
    	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:683)
    	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:272)
    	at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:210)
    	at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:209)
    	at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:255)
    	at org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:683)
    	at org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:673)
    	at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite$$anonfun$9$$anonfun$apply$1$$anonfun$apply$mcV$sp$3.apply$mcV$sp(HiveMetastoreCatalogSuite.scala:185)
    	at org.apache.spark.sql.test.SQLTestUtilsBase$class.withTable(SQLTestUtils.scala:273)
    	at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite.withTable(HiveMetastoreCatalogSuite.scala:139)
    	at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite$$anonfun$9$$anonfun$apply$1.apply$mcV$sp(HiveMetastoreCatalogSuite.scala:163)
    	at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite$$anonfun$9$$anonfun$apply$1.apply(HiveMetastoreCatalogSuite.scala:163)
    	at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite$$anonfun$9$$anonfun$apply$1.apply(HiveMetastoreCatalogSuite.scala:163)
    	at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
    	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
    	at org.scalatest.Transformer.apply(Transformer.scala:22)
    	at org.scalatest.Transformer.apply(Transformer.scala:20)
    	at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
    	at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
    	at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
    	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
    	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
    	at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
    	at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196)
    	at org.scalatest.FunSuite.runTest(FunSuite.scala:1560)
    	at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
    	at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
    	at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
    	at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
    	at scala.collection.immutable.List.foreach(List.scala:381)
    	at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
    	at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
    	at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
    	at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
    	at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
    	at org.scalatest.Suite$class.run(Suite.scala:1147)
    	at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
    	at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
    	at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
    	at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
    	at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233)
    	at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31)
    	at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213)
    	at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
    	at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:31)
    	at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
    	at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480)
    	at sbt.ForkMain$Run$2.call(ForkMain.java:296)
    	at sbt.ForkMain$Run$2.call(ForkMain.java:286)
    	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    	at java.lang.Thread.run(Thread.java:745)
    ```
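
    A minimal sketch of the reset, assuming a `spark` session is in scope; the suite's actual cleanup hook may differ:

    ```scala
    // Switch the session back to the default database so later tests
    // are not affected by a test that changed the current database.
    spark.catalog.setCurrentDatabase("default")
    // equivalently: spark.sql("USE default")
    ```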
    
    ## How was this patch tested?
    N/A
    
    Author: gatorsmile <ga...@gmail.com>
    
    Closes #20218 from gatorsmile/testFixAgain.
    
    (cherry picked from commit 76892bcf2c08efd7e9c5b16d377e623d82fe695e)
    Signed-off-by: gatorsmile <ga...@gmail.com>

commit f624850fe8acce52240217f376316734a23be00b
Author: gatorsmile <ga...@...>
Date:   2018-01-11T13:33:42Z

    [SPARK-19732][FOLLOW-UP] Document behavior changes made in na.fill and fillna
    
    ## What changes were proposed in this pull request?
    https://github.com/apache/spark/pull/18164 introduced behavior changes. We need to document them.
    
    ## How was this patch tested?
    N/A
    
    Author: gatorsmile <ga...@gmail.com>
    
    Closes #20234 from gatorsmile/docBehaviorChange.
    
    (cherry picked from commit b46e58b74c82dac37b7b92284ea3714919c5a886)
    Signed-off-by: hyukjinkwon <gu...@gmail.com>

commit b94debd2b01b87ef1d2a34d48877e38ade0969e6
Author: Marcelo Vanzin <va...@...>
Date:   2018-01-11T18:37:35Z

    [SPARK-22994][K8S] Use a single image for all Spark containers.
    
    This change allows a user to submit a Spark application on Kubernetes
    by providing a single image, instead of one image for each type
    of container. The image's entry point now takes an extra argument that
    identifies the process that is being started.
    
    The configuration still allows the user to provide different images
    for each container type if they so desire.
    
    On top of that, the entry point was simplified a bit to share more
    code; mainly, the same env variable is used to propagate the user-defined
    classpath to the different containers.
    
    Aside from being modified to match the new behavior, the
    'build-push-docker-images.sh' script was renamed to 'docker-image-tool.sh'
    to more closely match its purpose; the old name was a little awkward
    and now also not entirely correct, since there is a single image. It
    was also moved to 'bin' since it's not necessarily an admin tool.
    
    Docs have been updated to match the new behavior.
    
    Tested locally with minikube.
    
    Author: Marcelo Vanzin <va...@cloudera.com>
    
    Closes #20192 from vanzin/SPARK-22994.
    
    (cherry picked from commit 0b2eefb674151a0af64806728b38d9410da552ec)
    Signed-off-by: Marcelo Vanzin <va...@cloudera.com>

commit f891ee3249e04576dd579cbab6f8f1632550e6bd
Author: Jose Torres <jo...@...>
Date:   2018-01-11T18:52:12Z

    [SPARK-22908] Add kafka source and sink for continuous processing.
    
    ## What changes were proposed in this pull request?
    
    Add kafka source and sink for continuous processing. This involves two small changes to the execution engine:
    
    * Bring data reader close() into the normal data reader thread to avoid thread safety issues.
    * Fix up the semantics of the RECONFIGURING StreamExecution state. State updates are now atomic, and we don't have to deal with swallowing an exception.
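
    A minimal usage sketch of the new capability, assuming a `spark` session is in scope; broker addresses, topic names, and the checkpoint path are placeholders:

    ```scala
    import org.apache.spark.sql.streaming.Trigger

    val in = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:9092")
      .option("subscribe", "input-topic")
      .load()

    val query = in
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:9092")
      .option("topic", "output-topic")
      .option("checkpointLocation", "/tmp/cp")
      .trigger(Trigger.Continuous("1 second"))
      .start()
    ```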
    
    ## How was this patch tested?
    
    new unit tests
    
    Author: Jose Torres <jo...@databricks.com>
    
    Closes #20096 from jose-torres/continuous-kafka.
    
    (cherry picked from commit 6f7aaed805070d29dcba32e04ca7a1f581fa54b9)
    Signed-off-by: Tathagata Das <ta...@gmail.com>

commit 2ec302658c98038962c9b7a90fd2cff751a35ffa
Author: Bago Amirbekian <ba...@...>
Date:   2018-01-11T21:57:15Z

    [SPARK-23046][ML][SPARKR] Have RFormula include VectorSizeHint in pipeline
    
    ## What changes were proposed in this pull request?
    
    Including VectorSizeHint in RFormula pipelines will allow them to be applied to streaming DataFrames.
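
    A minimal sketch of the intended use; `trainingDf` (a static DataFrame) and `streamingDf` (a streaming DataFrame) are assumed placeholders:

    ```scala
    import org.apache.spark.ml.feature.RFormula

    val formula = new RFormula().setFormula("clicked ~ country + hour")
    val model = formula.fit(trainingDf)
    // With VectorSizeHint included in the pipeline, the transform can be
    // applied to a streaming DataFrame as well.
    val scored = model.transform(streamingDf)
    ```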
    
    ## How was this patch tested?
    
    Unit tests.
    
    Author: Bago Amirbekian <ba...@databricks.com>
    
    Closes #20238 from MrBago/rFormulaVectorSize.
    
    (cherry picked from commit 186bf8fb2e9ff8a80f3f6bcb5f2a0327fa79a1c9)
    Signed-off-by: Joseph K. Bradley <jo...@databricks.com>

commit 964cc2e31b2862bca0bd968b3e9e2cbf8d3ba5ea
Author: Sameer Agarwal <sa...@...>
Date:   2018-01-11T23:23:10Z

    Preparing Spark release v2.3.0-rc1

commit 6bb22961c0c9df1a1f22e9491894895b297f5288
Author: Sameer Agarwal <sa...@...>
Date:   2018-01-11T23:23:17Z

    Preparing development version 2.3.1-SNAPSHOT

commit 55695c7127cb2f357dfdf677cab4d21fc840aa3d
Author: WeichenXu <we...@...>
Date:   2018-01-12T00:20:30Z

    [SPARK-23008][ML] OnehotEncoderEstimator python API
    
    ## What changes were proposed in this pull request?
    
    OneHotEncoderEstimator Python API.
    
    ## How was this patch tested?
    
    doctest
    
    Author: WeichenXu <we...@databricks.com>
    
    Closes #20209 from WeichenXu123/ohe_py.
    
    (cherry picked from commit b5042d75c2faa5f15bc1e160d75f06dfdd6eea37)
    Signed-off-by: Joseph K. Bradley <jo...@databricks.com>

commit 3ae3e1bb71aa88be1c963b4416986ef679d7c8a2
Author: ho3rexqj <ho...@...>
Date:   2018-01-12T07:27:00Z

    [SPARK-22986][CORE] Use a cache to avoid instantiating multiple instances of broadcast variable values
    
    When resources happen to be constrained on an executor the first time a broadcast variable is instantiated it is persisted to disk by the BlockManager. Consequently, every subsequent call to TorrentBroadcast::readBroadcastBlock from other instances of that broadcast variable spawns another instance of the underlying value. That is, broadcast variables are spawned once per executor **unless** memory is constrained, in which case every instance of a broadcast variable is provided with a unique copy of the underlying value.
    
    This patch fixes the above by explicitly caching the underlying values using weak references in a ReferenceMap.
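
    An illustrative sketch of the caching idea only (not `TorrentBroadcast` itself): readers share one deserialized value per broadcast id, held behind a weak reference so it can still be garbage collected.

    ```scala
    import java.lang.ref.WeakReference
    import scala.collection.mutable

    object BroadcastValueCache {
      private val cache = mutable.Map.empty[Long, WeakReference[AnyRef]]

      def getOrLoad(id: Long)(load: => AnyRef): AnyRef = synchronized {
        cache.get(id).flatMap(ref => Option(ref.get())) match {
          case Some(v) => v                     // reuse the existing instance
          case None =>
            val v = load                        // deserialize / fetch once
            cache(id) = new WeakReference[AnyRef](v)
            v
        }
      }
    }
    ```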
    
    Author: ho3rexqj <ho...@gmail.com>
    
    Closes #20183 from ho3rexqj/fix/cache-broadcast-values.
    
    (cherry picked from commit cbe7c6fbf9dc2fc422b93b3644c40d449a869eea)
    Signed-off-by: Wenchen Fan <we...@databricks.com>

commit d512d873b3f445845bd113272d7158388427f8a6
Author: WeichenXu <we...@...>
Date:   2018-01-12T09:27:02Z

    [SPARK-23008][ML][FOLLOW-UP] mark OneHotEncoder python API deprecated
    
    ## What changes were proposed in this pull request?
    
    mark OneHotEncoder python API deprecated
    
    ## How was this patch tested?
    
    N/A
    
    Author: WeichenXu <we...@databricks.com>
    
    Closes #20241 from WeichenXu123/mark_ohe_deprecated.
    
    (cherry picked from commit a7d98d53ceaf69cabaecc6c9113f17438c4e61f6)
    Signed-off-by: Nick Pentreath <ni...@za.ibm.com>

commit 6152da3893a05b3f8dc0f13895af9be9548e5895
Author: Marco Gaido <ma...@...>
Date:   2018-01-12T10:04:44Z

    [SPARK-23025][SQL] Support Null type in scala reflection
    
    ## What changes were proposed in this pull request?
    
    Add support for `Null` type in the `schemaFor` method for Scala reflection.
    
    ## How was this patch tested?
    
    Added UT
    
    Author: Marco Gaido <ma...@gmail.com>
    
    Closes #20219 from mgaido91/SPARK-23025.
    
    (cherry picked from commit 505086806997b4331d4a8c2fc5e08345d869a23c)
    Signed-off-by: gatorsmile <ga...@gmail.com>

commit db27a93652780f234f3c5fe750ef07bc5525d177
Author: Dongjoon Hyun <do...@...>
Date:   2018-01-12T18:18:42Z

    [MINOR][BUILD] Fix Java linter errors
    
    ## What changes were proposed in this pull request?
    
    This PR cleans up the java-lint errors (for v2.3.0-rc1 tag). Hopefully, this will be the final one.
    
    ```
    $ dev/lint-java
    Using `mvn` from path: /usr/local/bin/mvn
    Checkstyle checks failed at following occurrences:
    [ERROR] src/main/java/org/apache/spark/unsafe/memory/HeapMemoryAllocator.java:[85] (sizes) LineLength: Line is longer than 100 characters (found 101).
    [ERROR] src/main/java/org/apache/spark/launcher/InProcessAppHandle.java:[20,8] (imports) UnusedImports: Unused import - java.io.IOException.
    [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnVector.java:[41,9] (modifier) ModifierOrder: 'private' modifier out of order with the JLS suggestions.
    [ERROR] src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java:[464] (sizes) LineLength: Line is longer than 100 characters (found 102).
    ```
    
    ## How was this patch tested?
    
    Manual.
    
    ```
    $ dev/lint-java
    Using `mvn` from path: /usr/local/bin/mvn
    Checkstyle checks passed.
    ```
    
    Author: Dongjoon Hyun <do...@apache.org>
    
    Closes #20242 from dongjoon-hyun/fix_lint_java_2.3_rc1.
    
    (cherry picked from commit 7bd14cfd40500a0b6462cda647bdbb686a430328)
    Signed-off-by: Sameer Agarwal <sa...@apache.org>

commit 02176f4c2f60342068669b215485ffd443346aed
Author: Marco Gaido <ma...@...>
Date:   2018-01-12T19:25:37Z

    [SPARK-22975][SS] MetricsReporter should not throw exception when there was no progress reported
    
    ## What changes were proposed in this pull request?
    
    `MetricsReporter` assumes that there has been some progress for the query, i.e. `lastProgress` is not null. If this is not true, as may happen under particular conditions, a `NullPointerException` can be thrown.

    The PR checks whether there is a `lastProgress` and, if not, returns a default value for the metrics.
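
    A minimal sketch of the guard pattern (the gauge shown is illustrative, not the reporter's actual metrics):

    ```scala
    import org.apache.spark.sql.streaming.StreamingQuery

    // Fall back to a default when no progress has been reported yet.
    def inputRate(query: StreamingQuery): Double =
      Option(query.lastProgress).map(_.inputRowsPerSecond).getOrElse(0.0)
    ```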
    
    ## How was this patch tested?
    
    added UT
    
    Author: Marco Gaido <ma...@gmail.com>
    
    Closes #20189 from mgaido91/SPARK-22975.
    
    (cherry picked from commit 54277398afbde92a38ba2802f4a7a3e5910533de)
    Signed-off-by: Shixiong Zhu <zs...@gmail.com>

commit 60bcb4685022c29a6ddcf707b505369687ec7da6
Author: Sameer Agarwal <sa...@...>
Date:   2018-01-12T23:07:14Z

    Revert "[SPARK-22908] Add kafka source and sink for continuous processing."
    
    This reverts commit f891ee3249e04576dd579cbab6f8f1632550e6bd.

commit ca27d9cb5e30b6a50a4c8b7d10ac28f4f51d44ee
Author: hyukjinkwon <gu...@...>
Date:   2018-01-13T07:13:44Z

    [SPARK-22980][PYTHON][SQL] Clarify the length of each series is of each batch within scalar Pandas UDF
    
    ## What changes were proposed in this pull request?
    
    This PR proposes to add a note saying that the length of a scalar Pandas UDF's `Series` is not the length of the whole input column but of each batch.
    
    This is fine for a grouped map UDF because its usage is different from a typical UDF, but scalar UDFs might be confused with normal UDFs.
    
    For example, please consider this example:
    
    ```python
    from pyspark.sql.functions import pandas_udf, col, lit
    from pyspark.sql.types import LongType
    
    df = spark.range(1)
    f = pandas_udf(lambda x, y: len(x) + y, LongType())
    df.select(f(lit('text'), col('id'))).show()
    ```
    
    ```
    +------------------+
    |<lambda>(text, id)|
    +------------------+
    |                 1|
    +------------------+
    ```
    
    ```python
    from pyspark.sql.functions import udf, col, lit
    
    df = spark.range(1)
    f = udf(lambda x, y: len(x) + y, "long")
    df.select(f(lit('text'), col('id'))).show()
    ```
    
    ```
    +------------------+
    |<lambda>(text, id)|
    +------------------+
    |                 4|
    +------------------+
    ```
    
    ## How was this patch tested?
    
    Manually built the doc and checked the output.
    
    Author: hyukjinkwon <gu...@gmail.com>
    
    Closes #20237 from HyukjinKwon/SPARK-22980.
    
    (cherry picked from commit cd9f49a2aed3799964976ead06080a0f7044a0c3)
    Signed-off-by: hyukjinkwon <gu...@gmail.com>

commit 801ffd799922e1c2751d3331874b88a67da8cf01
Author: Yuming Wang <yu...@...>
Date:   2018-01-13T16:01:44Z

    [SPARK-22870][CORE] Dynamic allocation should allow 0 idle time
    
    ## What changes were proposed in this pull request?
    
    This PR makes `0` a valid value for `spark.dynamicAllocation.executorIdleTimeout`.
    For details, see the JIRA description: https://issues.apache.org/jira/browse/SPARK-22870.
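
    A minimal configuration sketch; the surrounding settings are illustrative, and `0s` is the newly accepted value:

    ```scala
    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")
      // Release idle executors immediately; 0 is now a valid value.
      .set("spark.dynamicAllocation.executorIdleTimeout", "0s")
    ```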
    
    ## How was this patch tested?
    
    N/A
    
    Author: Yuming Wang <yu...@ebay.com>
    Author: Yuming Wang <wg...@gmail.com>
    
    Closes #20080 from wangyum/SPARK-22870.
    
    (cherry picked from commit fc6fe8a1d0f161c4788f3db94de49a8669ba3bcc)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 8d32ed5f281317ba380aa6b8b3f3f041575022cb
Author: xubo245 <60...@...>
Date:   2018-01-13T18:28:57Z

    [SPARK-23036][SQL][TEST] Add withGlobalTempView for testing
    
    ## What changes were proposed in this pull request?
    
    Add `withGlobalTempView` for tests that create global temp views, like the existing `withTempView` and `withView`,
    and correct some improper usage.
    Please see the JIRA for details.
    There are other similar places; I will fix them if the community needs it. Please confirm.
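
    A minimal sketch of such a helper (illustrative, assuming a `spark` session in scope; not the exact test-utils code):

    ```scala
    // Run a test body and drop the named global temp views even if it fails.
    def withGlobalTempView(viewNames: String*)(f: => Unit): Unit = {
      try f finally {
        viewNames.foreach(spark.catalog.dropGlobalTempView)
      }
    }
    ```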
    ## How was this patch tested?
    
    No new tests.
    
    Author: xubo245 <60...@qq.com>
    
    Closes #20228 from xubo245/DropTempView.
    
    (cherry picked from commit bd4a21b4820c4ebaf750131574a6b2eeea36907e)
    Signed-off-by: gatorsmile <ga...@gmail.com>

commit 0fc5533e53ad03eb67590ddd231f40c2713150c3
Author: CodingCat <zh...@...>
Date:   2018-01-13T18:36:32Z

    [SPARK-22790][SQL] add a configurable factor to describe HadoopFsRelation's size
    
    ## What changes were proposed in this pull request?
    
    As per the discussion in https://github.com/apache/spark/pull/19864#discussion_r156847927:

    The current HadoopFsRelation size estimate is based purely on the underlying file size, which is not accurate and makes execution vulnerable to errors like OOM.

    Users can enable CBO with the functionality in https://github.com/apache/spark/pull/19864 to avoid this issue.

    This JIRA proposes adding a configurable factor to the sizeInBytes method of the HadoopFsRelation class so that users can mitigate this problem without CBO.
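
    A minimal sketch of using such a factor, assuming a `spark` session is in scope; the configuration key below is an assumption based on this change and may differ in a given Spark version:

    ```scala
    // Tell the planner that files expand roughly 3x when scanned in memory
    // (key name assumed; verify against your Spark version's documentation).
    spark.conf.set("spark.sql.sources.fileCompressionFactor", "3.0")
    ```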
    
    ## How was this patch tested?
    
    Existing tests
    
    Author: CodingCat <zh...@gmail.com>
    Author: Nan Zhu <na...@uber.com>
    
    Closes #20072 from CodingCat/SPARK-22790.
    
    (cherry picked from commit ba891ec993c616dc4249fc786c56ea82ed04a827)
    Signed-off-by: gatorsmile <ga...@gmail.com>

commit bcd87ae0775d16b7c3b9de0c4f2db36eb3679476
Author: Takeshi Yamamuro <ya...@...>
Date:   2018-01-13T21:39:38Z

    [SPARK-21213][SQL][FOLLOWUP] Use compatible types for comparisons in compareAndGetNewStats
    
    ## What changes were proposed in this pull request?
    This PR fixes the value comparison in `compareAndGetNewStats`.
    The test below fails on the current master:
    ```
        val oldStats2 = CatalogStatistics(sizeInBytes = BigInt(Long.MaxValue) * 2)
        val newStats5 = CommandUtils.compareAndGetNewStats(
          Some(oldStats2), newTotalSize = BigInt(Long.MaxValue) * 2, None)
        assert(newStats5.isEmpty)
    ```
    
    ## How was this patch tested?
    Added some tests in `CommandUtilsSuite`.
    
    Author: Takeshi Yamamuro <ya...@apache.org>
    
    Closes #20245 from maropu/SPARK-21213-FOLLOWUP.
    
    (cherry picked from commit 0066d6f6fa604817468471832968d4339f71c5cb)
    Signed-off-by: gatorsmile <ga...@gmail.com>

commit 1f4a08b15ab47cf6c3bb08c783497422f30d0709
Author: foxish <ra...@...>
Date:   2018-01-14T05:34:28Z

    [SPARK-23063][K8S] K8s changes for publishing scripts (and a couple of other misses)
    
    ## What changes were proposed in this pull request?
    
    Includes the `-Pkubernetes` flag in a few places where it was missed.
    
    ## How was this patch tested?
    
    Checkstyle and MiMa, via manual tests.
    
    Author: foxish <ra...@google.com>
    
    Closes #20256 from foxish/SPARK-23063.
    
    (cherry picked from commit c3548d11c3c57e8f2c6ebd9d2d6a3924ddcd3cba)
    Signed-off-by: Felix Cheung <fe...@apache.org>

commit a335a49ce4672b44e5f818145214040a67c722ba
Author: Dongjoon Hyun <do...@...>
Date:   2018-01-14T07:26:12Z

    [SPARK-23038][TEST] Update docker/spark-test (JDK/OS)
    
    ## What changes were proposed in this pull request?
    
    This PR aims to update the following in `docker/spark-test`.
    
    - JDK7 -> JDK8
    Spark 2.2+ supports JDK8 only.
    
    - Ubuntu 12.04.5 LTS (precise) -> Ubuntu 16.04.3 LTS (xenial)
    The end of life of `precise` was April 28, 2017.
    
    ## How was this patch tested?
    
    Manual.
    
    * Master
    ```
    $ cd external/docker
    $ ./build
    $ export SPARK_HOME=...
    $ docker run -v $SPARK_HOME:/opt/spark spark-test-master
    CONTAINER_IP=172.17.0.3
    ...
    18/01/11 06:50:25 INFO MasterWebUI: Bound MasterWebUI to 172.17.0.3, and started at http://172.17.0.3:8080
    18/01/11 06:50:25 INFO Utils: Successfully started service on port 6066.
    18/01/11 06:50:25 INFO StandaloneRestServer: Started REST server for submitting applications on port 6066
    18/01/11 06:50:25 INFO Master: I have been elected leader! New state: ALIVE
    ```
    
    * Slave
    ```
    $ docker run -v $SPARK_HOME:/opt/spark spark-test-worker spark://172.17.0.3:7077
    CONTAINER_IP=172.17.0.4
    ...
    18/01/11 06:51:54 INFO Worker: Successfully registered with master spark://172.17.0.3:7077
    ```
    
    After slave starts, master will show
    ```
    18/01/11 06:51:54 INFO Master: Registering worker 172.17.0.4:8888 with 4 cores, 1024.0 MB RAM
    ```
    
    Author: Dongjoon Hyun <do...@apache.org>
    
    Closes #20230 from dongjoon-hyun/SPARK-23038.
    
    (cherry picked from commit 7a3d0aad2b89aef54f7dd580397302e9ff984d9d)
    Signed-off-by: Felix Cheung <fe...@apache.org>

----


---



[GitHub] spark issue #20642: i m not able to open Spark UI on local using localhost:4...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on the issue:

    https://github.com/apache/spark/pull/20642
  
    @AtulKumVerma please close this PR. You can ask questions on the user list or on Stack Overflow.


---



[GitHub] spark pull request #20642: i m not able to open Spark UI on local using loca...

Posted by AtulKumVerma <gi...@git.apache.org>.
Github user AtulKumVerma closed the pull request at:

    https://github.com/apache/spark/pull/20642


---



[GitHub] spark issue #20642: i m not able to open Spark UI on local using localhost:4...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20642
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #20642: i m not able to open Spark UI on local using localhost:4...

Posted by AtulKumVerma <gi...@git.apache.org>.
Github user AtulKumVerma commented on the issue:

    https://github.com/apache/spark/pull/20642
  
    I resolved it by removing `javax.servlet.servlet-api-2.5.jar` from the classpath and using the Servlet API 3.0 or above.
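
    A minimal sbt sketch of the same idea (the group and artifact names of the offending dependency are placeholders): keep the Servlet 2.5 jar off the classpath so Jetty sees the 3.x API that provides `HttpServletRequest.isAsyncStarted()`.

    ```scala
    // Exclude the old servlet-api pulled in transitively by a legacy dependency.
    libraryDependencies += ("some.group" % "legacy-lib" % "1.0")
      .exclude("javax.servlet", "servlet-api")

    // Provide the Servlet 3.x API instead.
    libraryDependencies += "javax.servlet" % "javax.servlet-api" % "3.1.0" % "provided"
    ```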


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org