Posted to reviews@spark.apache.org by steveloughran <gi...@git.apache.org> on 2017/08/17 20:05:55 UTC

[GitHub] spark pull request #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTas...

GitHub user steveloughran opened a pull request:

    https://github.com/apache/spark/pull/18979

    [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsTracker metrics collection fails if a new file isn't yet visible

    ## What changes were proposed in this pull request?
    
    Change `BasicWriteTaskStatsTracker.getFileSize()` to catch `FileNotFoundException`, log at info level and then return 0 as the file size.
    
    This ensures that if a newly created file isn't visible because the store doesn't always offer create consistency, the metric collection doesn't cause the task to fail.
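
    A minimal sketch of the guarded lookup (the helper name `getFileSizeOrZero` is illustrative; the actual change lives in `BasicWriteStatsTracker.scala` and logs the event rather than silently dropping it):
    
        import java.io.FileNotFoundException
        
        import org.apache.hadoop.conf.Configuration
        import org.apache.hadoop.fs.Path
        
        // Sketch only: treat a not-yet-visible file as 0 bytes instead of failing the task.
        def getFileSizeOrZero(hadoopConf: Configuration, filePath: String): Long = {
          val path = new Path(filePath)
          val fs = path.getFileSystem(hadoopConf)
          try {
            fs.getFileStatus(path).getLen()
          } catch {
            case _: FileNotFoundException =>
              // may arise against eventually consistent object stores
              0L
          }
        }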
    
    ## How was this patch tested?
    
    New test suite included, `BasicWriteTaskStatsTrackerSuite`. This not only checks the resilience to missing files, but also verifies the existing logic for how file statistics are gathered.
    
    Note that in the current implementation:
    
    1. If you call `Tracker.getFinalStats()` more than once, the byte count will increase by the size of the last file on every call. This could be fixed by clearing the filename field inside `getFinalStats()` itself (see the toy sketch after this list).
    
    2. If you pass an empty or null string to `Tracker.newFile(path)`, an IllegalArgumentException is raised, but only in `getFinalStats()`, rather than in `newFile()`. There's a test for this behaviour in the new suite, as it verifies that only FNFEs get swallowed.
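
    A toy sketch of the idempotence fix suggested in (1), assuming the tracker keeps the pending filename and running totals as simple fields; the class and the `sizeOf` callback here are hypothetical stand-ins, not the real tracker:
    
        // Fold the pending file in once, then clear it, so a second
        // getFinalStats() call returns the same totals instead of re-adding it.
        class ToyStatsTracker(sizeOf: String => Long) {
          private var curFile: String = null
          private var numFiles: Int = 0
          private var numBytes: Long = 0L
        
          def newFile(path: String): Unit = {
            if (curFile != null) { numBytes += sizeOf(curFile); numFiles += 1 }
            curFile = path
          }
        
          def getFinalStats(): (Int, Long) = {
            if (curFile != null) {
              numBytes += sizeOf(curFile)
              numFiles += 1
              curFile = null // clearing here makes repeated calls safe
            }
            (numFiles, numBytes)
          }
        }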

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/steveloughran/spark cloud/SPARK-21762-missing-files-in-metrics

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18979.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18979
    
----
commit 8ad28b9bcd6a56b963ab57a5b4937d10f492de33
Author: Steve Loughran <st...@hortonworks.com>
Date:   2017-08-17T19:35:35Z

    SPARK-21762 handle FNFE events in BasicWriteStatsTracker; add a suite of tests for various file states.
    
    Change-Id: I3269cb901a38b33e399ebef10b2dbcd51ccf9b75

commit 2a113fde1653743a3543df8ada395f320b826a3e
Author: Steve Loughran <st...@hortonworks.com>
Date:   2017-08-17T20:01:50Z

    SPARK-21762 add tests for "" and null filenames
    
    Change-Id: I38ac11c808849e2fd91f4931f4cb5cdfad43e2af

----


---


[GitHub] spark pull request #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTas...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18979#discussion_r133913173
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteTaskStatsTrackerSuite.scala ---
    @@ -0,0 +1,212 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources
    +
    +import java.nio.charset.Charset
    +
    +import org.apache.hadoop.conf.Configuration
    +import org.apache.hadoop.fs.Path
    +
    +import org.apache.spark.SparkFunSuite
    +import org.apache.spark.util.Utils
    +
    +/**
    + * Test how BasicWriteTaskStatsTracker handles files.
    + */
    +class BasicWriteTaskStatsTrackerSuite extends SparkFunSuite {
    +
    +  private val tempDir = Utils.createTempDir()
    +  private val tempDirPath = new Path(tempDir.toURI)
    +  private val conf = new Configuration()
    +  private val localfs = tempDirPath.getFileSystem(conf)
    +  private val data1 = "0123456789".getBytes(Charset.forName("US-ASCII"))
    +  private val data2 = "012".getBytes(Charset.forName("US-ASCII"))
    +  private val len1 = data1.length
    +  private val len2 = data2.length
    +
    +  /**
    +   * In teardown delete the temp dir.
    +   */
    +  protected override def afterAll(): Unit = {
    +    Utils.deleteRecursively(tempDir)
    +  }
    +
    +  /**
    +   * Assert that the stats match that expected.
    +   * @param tracker tracker to check
    +   * @param files number of files expected
    +   * @param bytes total number of bytes expected
    +   */
    +  private def assertStats(
    +      tracker: BasicWriteTaskStatsTracker,
    +      files: Int,
    +      bytes: Int): Unit = {
    +    val stats = finalStatus(tracker)
    +    assert(files === stats.numFiles, "Wrong number of files")
    +    assert(bytes === stats.numBytes, "Wrong byte count of file size")
    +  }
    +
    +  private def finalStatus(tracker: BasicWriteTaskStatsTracker): BasicWriteTaskStats = {
    +    tracker.getFinalStats().asInstanceOf[BasicWriteTaskStats]
    +  }
    +
    +  test("No files in run") {
    +    val tracker = new BasicWriteTaskStatsTracker(conf)
    +    assertStats(tracker, 0, 0)
    +  }
    +
    +  test("Missing File") {
    +    val missing = new Path(tempDirPath, "missing")
    +    val tracker = new BasicWriteTaskStatsTracker(conf)
    +    tracker.newFile(missing.toString)
    +    assertStats(tracker, 1, 0)
    +  }
    +
    +  test("Empty filename is forwarded") {
    +    val tracker = new BasicWriteTaskStatsTracker(conf)
    +    tracker.newFile("")
    +    intercept[IllegalArgumentException] {
    +      finalStatus(tracker)
    +    }
    +  }
    +
    +  test("Null filename is only picked up in final status") {
    +    val tracker = new BasicWriteTaskStatsTracker(conf)
    +    tracker.newFile(null)
    +    intercept[IllegalArgumentException] {
    +      finalStatus(tracker)
    +    }
    +  }
    +
    +  test("0 byte file") {
    +    val file = new Path(tempDirPath, "file0")
    +    val tracker = new BasicWriteTaskStatsTracker(conf)
    +    tracker.newFile(file.toString)
    +    touch(file)
    +    assertStats(tracker, 1, 0)
    +  }
    +
    +  test("File with data") {
    +    val file = new Path(tempDirPath, "file-with-data")
    +    val tracker = new BasicWriteTaskStatsTracker(conf)
    +    tracker.newFile(file.toString)
    +    write1(file)
    +    assertStats(tracker, 1, len1)
    +  }
    +
    +  test("Open file") {
    +    val file = new Path(tempDirPath, "file-open")
    +    val tracker = new BasicWriteTaskStatsTracker(conf)
    +    tracker.newFile(file.toString)
    +    val stream = localfs.create(file, true)
    +    try {
    +      assertStats(tracker, 1, 0)
    +      stream.write(data1)
    +      stream.flush()
    +      assert(1 === finalStatus(tracker).numFiles, "Wrong number of files")
    +    } finally {
    +      stream.close()
    +    }
    +  }
    +
    +  test("Two files") {
    +    val file1 = new Path(tempDirPath, "f-2-1")
    +    val file2 = new Path(tempDirPath, "f-2-2")
    +    val tracker = new BasicWriteTaskStatsTracker(conf)
    +    tracker.newFile(file1.toString)
    +    write1(file1)
    +    tracker.newFile(file2.toString)
    +    write2(file2)
    +    assertStats(tracker, 2, len1 + len2)
    +  }
    +
    +  test("Three files, last one empty") {
    +    val file1 = new Path(tempDirPath, "f-3-1")
    +    val file2 = new Path(tempDirPath, "f-3-2")
    +    val file3 = new Path(tempDirPath, "f-3-2")
    +    val tracker = new BasicWriteTaskStatsTracker(conf)
    +    tracker.newFile(file1.toString)
    +    write1(file1)
    +    tracker.newFile(file2.toString)
    +    write2(file2)
    +    tracker.newFile(file3.toString)
    +    touch(file3)
    +    assertStats(tracker, 3, len1 + len2)
    +  }
    +
    +  test("Three files, one not found") {
    +    val file1 = new Path(tempDirPath, "f-4-1")
    +    val file2 = new Path(tempDirPath, "f-4-2")
    +    val file3 = new Path(tempDirPath, "f-3-2")
    +    val tracker = new BasicWriteTaskStatsTracker(conf)
    +    // file 1
    +    tracker.newFile(file1.toString)
    +    write1(file1)
    +
    +    // file 2 is noted, but not visible
    +    tracker.newFile(file2.toString)
    +    touch(file3)
    --- End diff --
    
    Yeah, you are right: that call is spurious. Harmless, but wrong. I was playing around with different sequences to see if I could confuse things. Will cut it.


---


[GitHub] spark pull request #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTas...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18979#discussion_r144505454
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala ---
    @@ -57,7 +60,14 @@ class BasicWriteTaskStatsTracker(hadoopConf: Configuration)
       private def getFileSize(filePath: String): Long = {
         val path = new Path(filePath)
         val fs = path.getFileSystem(hadoopConf)
    -    fs.getFileStatus(path).getLen()
    +    try {
    +      fs.getFileStatus(path).getLen()
    +    } catch {
    +      case e: FileNotFoundException =>
    +        // may arise against eventually consistent object stores
    +        logInfo(s"File $path is not yet visible", e)
    --- End diff --
    
    say "Reported file size in job summary may be invalid"?


---



[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    +1. This solves the regression on writing an empty dataset with the ORC format, too!


---



[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    **[Test build #82730 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82730/testReport)** for PR 18979 at commit [`adab985`](https://github.com/apache/spark/commit/adab985d0455f3f549ebceb03528f0a6a45a31c0).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82730/
    Test PASSed.


---



[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    @adrian-ionescu wrote
    > is there a need for calling getFinalStats() more than once?
    
    No. As long as everyone is aware of it, it won't be an issue.


---


[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    Related to this, updated spec on [Hadoop output stream, Syncable and StreamCapabilities](https://github.com/steveloughran/hadoop/blob/s3/HADOOP-13327-outputstream-trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/outputstream.md).
    
    As the doc notes, object stores != filesystems, and while a lot can be done to preserve the metaphor on input, it's on output where CRUD inconsistencies surface, along with questions like "does a 0-byte file get created in create()?", "when is data written?", etc.


---


[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    **[Test build #80803 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80803/testReport)** for PR 18979 at commit [`2a113fd`](https://github.com/apache/spark/commit/2a113fde1653743a3543df8ada395f320b826a3e).


---


[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    **[Test build #82732 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82732/testReport)** for PR 18979 at commit [`d3f96f6`](https://github.com/apache/spark/commit/d3f96f63f263f134a79e76e8bdd1961f333b7c7a).


---



[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by adrian-ionescu <gi...@git.apache.org>.
Github user adrian-ionescu commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    To me, this looks good.


---



[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    Will review it tomorrow


---



[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    Done. Not writing 0-byte files will offer a significant speedup against object stores, where a single getFileStatus() call can take hundreds of milliseconds. I look forward to it.


---



[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    **[Test build #82732 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82732/testReport)** for PR 18979 at commit [`d3f96f6`](https://github.com/apache/spark/commit/d3f96f63f263f134a79e76e8bdd1961f333b7c7a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    The latest PR update pulls in @dongjoon-hyun's new test; to avoid a merge conflict in the Insert suite I've rebased against master.
    
    1. Everything handles missing files on output.
    2. There's only one logInfo at the end of the execute call, so if many empty files are created, the logs aren't too noisy.
    3. There is now some implicit counting of how many files were missing (`submittedFiles - numFiles`); this isn't aggregated or reported, though.


---



[GitHub] spark pull request #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTas...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18979#discussion_r133919035
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteTaskStatsTrackerSuite.scala ---
    @@ -0,0 +1,212 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources
    +
    +import java.nio.charset.Charset
    +
    +import org.apache.hadoop.conf.Configuration
    +import org.apache.hadoop.fs.Path
    +
    +import org.apache.spark.SparkFunSuite
    +import org.apache.spark.util.Utils
    +
    +/**
    + * Test how BasicWriteTaskStatsTracker handles files.
    + */
    +class BasicWriteTaskStatsTrackerSuite extends SparkFunSuite {
    +
    +  private val tempDir = Utils.createTempDir()
    +  private val tempDirPath = new Path(tempDir.toURI)
    +  private val conf = new Configuration()
    +  private val localfs = tempDirPath.getFileSystem(conf)
    +  private val data1 = "0123456789".getBytes(Charset.forName("US-ASCII"))
    +  private val data2 = "012".getBytes(Charset.forName("US-ASCII"))
    +  private val len1 = data1.length
    +  private val len2 = data2.length
    +
    +  /**
    +   * In teardown delete the temp dir.
    +   */
    +  protected override def afterAll(): Unit = {
    +    Utils.deleteRecursively(tempDir)
    +  }
    +
    +  /**
    +   * Assert that the stats match that expected.
    +   * @param tracker tracker to check
    +   * @param files number of files expected
    +   * @param bytes total number of bytes expected
    +   */
    +  private def assertStats(
    +      tracker: BasicWriteTaskStatsTracker,
    +      files: Int,
    +      bytes: Int): Unit = {
    +    val stats = finalStatus(tracker)
    +    assert(files === stats.numFiles, "Wrong number of files")
    +    assert(bytes === stats.numBytes, "Wrong byte count of file size")
    +  }
    +
    +  private def finalStatus(tracker: BasicWriteTaskStatsTracker): BasicWriteTaskStats = {
    +    tracker.getFinalStats().asInstanceOf[BasicWriteTaskStats]
    +  }
    +
    +  test("No files in run") {
    +    val tracker = new BasicWriteTaskStatsTracker(conf)
    +    assertStats(tracker, 0, 0)
    +  }
    +
    +  test("Missing File") {
    +    val missing = new Path(tempDirPath, "missing")
    +    val tracker = new BasicWriteTaskStatsTracker(conf)
    +    tracker.newFile(missing.toString)
    +    assertStats(tracker, 1, 0)
    +  }
    +
    +  test("Empty filename is forwarded") {
    +    val tracker = new BasicWriteTaskStatsTracker(conf)
    +    tracker.newFile("")
    +    intercept[IllegalArgumentException] {
    +      finalStatus(tracker)
    +    }
    +  }
    +
    +  test("Null filename is only picked up in final status") {
    +    val tracker = new BasicWriteTaskStatsTracker(conf)
    +    tracker.newFile(null)
    +    intercept[IllegalArgumentException] {
    +      finalStatus(tracker)
    +    }
    +  }
    +
    +  test("0 byte file") {
    +    val file = new Path(tempDirPath, "file0")
    +    val tracker = new BasicWriteTaskStatsTracker(conf)
    +    tracker.newFile(file.toString)
    +    touch(file)
    +    assertStats(tracker, 1, 0)
    --- End diff --
    
    I'm assuming that the file will *eventually* come into existence; that its absence straight after collection is simply a transient create inconsistency of the endpoint, like a brief caching of negative HEAD/GET requests (which AWS S3 does do as part of its DoS defences). The files will be there later.
    
    One option: count the number of missing files and include that in the report. It shouldn't be a nonzero metric most of the time, though: never on a "real" FS or a consistent object store, and only rarely on an inconsistent one.


---


[GitHub] spark pull request #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTas...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18979#discussion_r144595826
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala ---
    @@ -44,20 +47,32 @@ case class BasicWriteTaskStats(
      * @param hadoopConf
      */
     class BasicWriteTaskStatsTracker(hadoopConf: Configuration)
    -  extends WriteTaskStatsTracker {
    +  extends WriteTaskStatsTracker with Logging {
     
       private[this] var numPartitions: Int = 0
       private[this] var numFiles: Int = 0
    +  private[this] var submittedFiles: Int = 0
       private[this] var numBytes: Long = 0L
       private[this] var numRows: Long = 0L
     
    -  private[this] var curFile: String = null
    -
    +  private[this] var curFile: Option[String] = None
     
    -  private def getFileSize(filePath: String): Long = {
    +  /**
    +   * Get the size of the file expected to have been written by a worker.
    +   * @param filePath path to the file
    +   * @return the file size or None if the file was not found.
    +   */
    +  private def getFileSize(filePath: String): Option[Long] = {
         val path = new Path(filePath)
         val fs = path.getFileSystem(hadoopConf)
    -    fs.getFileStatus(path).getLen()
    +    try {
    +      Some(fs.getFileStatus(path).getLen())
    +    } catch {
    +      case e: FileNotFoundException =>
    +        // may arise against eventually consistent object stores
    +        logDebug(s"File $path is not yet visible", e)
    --- End diff --
    
    For the error message, it looks okay to me. First, it's a debug message. Second, the ORC writer bug will be fixed in Spark 2.3 anyway.


---



[GitHub] spark pull request #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTas...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18979#discussion_r133896006
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteTaskStatsTrackerSuite.scala ---
    @@ -0,0 +1,212 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources
    +
    +import java.nio.charset.Charset
    +
    +import org.apache.hadoop.conf.Configuration
    +import org.apache.hadoop.fs.Path
    +
    +import org.apache.spark.SparkFunSuite
    +import org.apache.spark.util.Utils
    +
    +/**
    + * Test how BasicWriteTaskStatsTracker handles files.
    + */
    +class BasicWriteTaskStatsTrackerSuite extends SparkFunSuite {
    +
    +  private val tempDir = Utils.createTempDir()
    +  private val tempDirPath = new Path(tempDir.toURI)
    +  private val conf = new Configuration()
    +  private val localfs = tempDirPath.getFileSystem(conf)
    +  private val data1 = "0123456789".getBytes(Charset.forName("US-ASCII"))
    +  private val data2 = "012".getBytes(Charset.forName("US-ASCII"))
    +  private val len1 = data1.length
    +  private val len2 = data2.length
    +
    +  /**
    +   * In teardown delete the temp dir.
    +   */
    +  protected override def afterAll(): Unit = {
    +    Utils.deleteRecursively(tempDir)
    +  }
    +
    +  /**
    +   * Assert that the stats match that expected.
    +   * @param tracker tracker to check
    +   * @param files number of files expected
    +   * @param bytes total number of bytes expected
    +   */
    +  private def assertStats(
    +      tracker: BasicWriteTaskStatsTracker,
    +      files: Int,
    +      bytes: Int): Unit = {
    +    val stats = finalStatus(tracker)
    +    assert(files === stats.numFiles, "Wrong number of files")
    +    assert(bytes === stats.numBytes, "Wrong byte count of file size")
    +  }
    +
    +  private def finalStatus(tracker: BasicWriteTaskStatsTracker): BasicWriteTaskStats = {
    +    tracker.getFinalStats().asInstanceOf[BasicWriteTaskStats]
    +  }
    +
    +  test("No files in run") {
    +    val tracker = new BasicWriteTaskStatsTracker(conf)
    +    assertStats(tracker, 0, 0)
    +  }
    +
    +  test("Missing File") {
    +    val missing = new Path(tempDirPath, "missing")
    +    val tracker = new BasicWriteTaskStatsTracker(conf)
    +    tracker.newFile(missing.toString)
    +    assertStats(tracker, 1, 0)
    +  }
    +
    +  test("Empty filename is forwarded") {
    +    val tracker = new BasicWriteTaskStatsTracker(conf)
    +    tracker.newFile("")
    +    intercept[IllegalArgumentException] {
    +      finalStatus(tracker)
    +    }
    +  }
    +
    +  test("Null filename is only picked up in final status") {
    +    val tracker = new BasicWriteTaskStatsTracker(conf)
    +    tracker.newFile(null)
    +    intercept[IllegalArgumentException] {
    +      finalStatus(tracker)
    +    }
    +  }
    +
    +  test("0 byte file") {
    +    val file = new Path(tempDirPath, "file0")
    +    val tracker = new BasicWriteTaskStatsTracker(conf)
    +    tracker.newFile(file.toString)
    +    touch(file)
    +    assertStats(tracker, 1, 0)
    --- End diff --
    
    We may not be able to differentiate between `0 byte file` and `Missing File` in final metrics.


---


[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    **[Test build #80841 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80841/testReport)** for PR 18979 at commit [`f778213`](https://github.com/apache/spark/commit/f778213d3adee1f2d6c977f92c093930f3d6c013).


---


[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    **[Test build #80841 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80841/testReport)** for PR 18979 at commit [`f778213`](https://github.com/apache/spark/commit/f778213d3adee1f2d6c977f92c093930f3d6c013).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82731/
    Test PASSed.


---



[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80841/
    Test PASSed.


---


[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    Currently *nobody should be using s3a:// as the temp file destination*, which is the same as saying "nobody should be using s3a:// as the direct destination of work", not without a special committer (Netflix, IBM's stocator, ...) or something to give S3 list consistency. Because today, task commit relies on a list & rename of all files in the task attempt dir, and if you don't get list consistency, you can miss out on files. If you ever hear anyone complaining "it takes too long to commit to s3", then they are using it this way. Tell them to use a consistency layer or to stop it :)


---


[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by adrian-ionescu <gi...@git.apache.org>.
Github user adrian-ionescu commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    Thanks for the fix and tests, @steveloughran!
    Re 1: is there a need for calling `getFinalStats()` more than once? The function doc clearly states that it's not supported and may lead to undefined behaviour. Could be fixed, of course, but depending on the implementation of the stats tracker, that can be at the expense of additional memory or code complexity.


---


[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    Merged build finished. Test PASSed.


---


[GitHub] spark pull request #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTas...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18979#discussion_r133918269
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala ---
    @@ -57,7 +60,14 @@ class BasicWriteTaskStatsTracker(hadoopConf: Configuration)
       private def getFileSize(filePath: String): Long = {
         val path = new Path(filePath)
         val fs = path.getFileSystem(hadoopConf)
    -    fs.getFileStatus(path).getLen()
    +    try {
    +      fs.getFileStatus(path).getLen()
    +    } catch {
    +      case e: FileNotFoundException =>
    +        // may arise against eventually consistent object stores
    +        logInfo(s"File $path is not yet visible", e)
    +        0
    --- End diff --
    
    The problem is: what can be picked up if the file isn't yet reported as present by the endpoint? Adding a bool saying "results are unreliable" could be used as a warning.
    
    One thing to consider long term: if Hadoop FS output streams added a simple <String, Long> map of statistics, could they be picked up by committers and then aggregated in job reports? Hadoop filesystems have statistics (simple ones in Hadoop <= 2.7, an arbitrary map of String -> Long in 2.8, with [standard key names across filesystems](https://github.com/steveloughran/hadoop/blob/s3guard/HADOOP-13786-committer/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/StorageStatistics.java)), and certainly today S3ABlockOutputStream [collects stats on individual streams](https://github.com/steveloughran/hadoop/blob/s3guard/HADOOP-13786-committer/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInstrumentation.java#L756). If that was made visible and collected, you could get a lot more detail on what is going on. Thoughts?
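
    A minimal sketch of the kind of aggregation being proposed, assuming each task surfaces its output-stream statistics as a plain Map[String, Long]; the helper name is illustrative, not an existing API:
    
        // Sum per-task statistic maps key by key into one job-level summary.
        def mergeTaskStats(taskStats: Seq[Map[String, Long]]): Map[String, Long] =
          taskStats.foldLeft(Map.empty[String, Long]) { (summary, stats) =>
            stats.foldLeft(summary) { case (acc, (key, value)) =>
              acc.updated(key, acc.getOrElse(key, 0L) + value)
            }
          }
        
        // e.g. mergeTaskStats(Seq(Map("bytes_uploaded" -> 10L), Map("bytes_uploaded" -> 3L)))
        //      == Map("bytes_uploaded" -> 13L)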


---


[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    Has anyone had a look at this recently? 
    
    The problem still exists, and while downstream filesystems can address it if they recognise the use case and lie about values, they will be returning invalid values to the caller: Spark will be reporting the wrong values. At least with this PR, Spark gets to decide how to react itself.


---



[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82745/
    Test PASSed.


---



[GitHub] spark pull request #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTas...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18979#discussion_r144094437
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala ---
    @@ -57,7 +60,14 @@ class BasicWriteTaskStatsTracker(hadoopConf: Configuration)
       private def getFileSize(filePath: String): Long = {
         val path = new Path(filePath)
         val fs = path.getFileSystem(hadoopConf)
    -    fs.getFileStatus(path).getLen()
    +    try {
    +      fs.getFileStatus(path).getLen()
    +    } catch {
    +      case e: FileNotFoundException =>
    +        // may arise against eventually consistent object stores
    +        logInfo(s"File $path is not yet visible", e)
    --- End diff --
    
    Could you update the log message to indicate that the reported size of zero might be wrong, for example due to negative caching in S3?


---



[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82732/
    Test PASSed.


---



[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    Hi, @steveloughran .
    > is the issue with ORC that if there's nothing to write, it doesn't generate a file (so avoiding that issue with sometimes you get 0-byte ORC files & things downstream fail)?
    
    Yes. So far, Spark leaves an empty directory in the case of ORC.


---



[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    **[Test build #82731 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82731/testReport)** for PR 18979 at commit [`649f8da`](https://github.com/apache/spark/commit/649f8da245443567b842b697a4d47e5241eb5946).
     * This patch passes all tests.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.


---



[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    > To mimic S3-like behavior, you can overwrite the file system spark.hadoop.fs.$scheme.impl"
    
    @gatorsmile: you will be able to do something better soon, as S3A is adding an inconsistent AWS client into the `hadoop-aws` JAR, which you can then enable to guarantee consistency delays and inject intermittent faults into the system (throttling, transient network events). All it will take is a config option to switch to this client, plus the chaos-monkey-esque probabilities and delays. This is what I'm already using; you will be able to as well. That is, no need to switch clients, just set `spark.hadoop.fs.s3a.s3.client.factory.impl=org.apache.hadoop.fs.s3a.InconsistentS3ClientFactory` and wait for the stack traces (a minimal example of setting this from Spark follows below).
    
    The S3A FS itself [needs to do more](https://issues.apache.org/jira/browse/HADOOP-14531) to handle throttling & failures (retry, add failure metrics so throttling & error rates can be measured). Knowing throttling rates is important, as it will help identify perf problems due to bad distribution of work across a bucket, excess use of KMS key lookups..., things that surface in support calls.
    
    This patch restores to Spark 2.3 the behaviour it had in Spark 2.2: a brief delay between object creation and visibility does not cause the task to fail.
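
    A minimal sketch of enabling the inconsistent client from a Spark job via the `spark.hadoop.*` passthrough; the factory class name is the one quoted above, and whether it is actually available depends on the `hadoop-aws` version on the classpath:
    
        import org.apache.spark.sql.SparkSession
        
        // Sketch only: route the InconsistentS3ClientFactory setting through
        // Spark's spark.hadoop.* prefix so it lands in the Hadoop Configuration.
        val spark = SparkSession.builder()
          .appName("s3a-inconsistency-injection-test")
          .config("spark.hadoop.fs.s3a.s3.client.factory.impl",
            "org.apache.hadoop.fs.s3a.InconsistentS3ClientFactory")
          .getOrCreate()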


---


[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    @viirya: the new data writer API will allow a broader set of stats to be propagated back from workers. When you are working with object stores, useful stats to get back are the throttle count and retry count, as they can be the cause of why things are slow ... and if it is due to throttling, throwing more workers at the job will actually slow things down. They'd be the ones to look at first.


---



[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    Could you also add the [test cases](https://github.com/dongjoon-hyun/spark/blob/b545f281b19120cc2c9e4197cae4b1315969247d/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala#L2054-L2060) to [InsertSuite.scala](https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala)?


---



[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80803/
    Test PASSed.


---


[GitHub] spark pull request #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTas...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18979#discussion_r134460176
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala ---
    @@ -57,7 +60,14 @@ class BasicWriteTaskStatsTracker(hadoopConf: Configuration)
       private def getFileSize(filePath: String): Long = {
         val path = new Path(filePath)
         val fs = path.getFileSystem(hadoopConf)
    -    fs.getFileStatus(path).getLen()
    +    try {
    +      fs.getFileStatus(path).getLen()
    +    } catch {
    +      case e: FileNotFoundException =>
    +        // may arise against eventually consistent object stores
    +        logInfo(s"File $path is not yet visible", e)
    +        0
    --- End diff --
    
    I could add a comment in the docs somewhere to state that metrics in the cloud may not be consistent.


---


[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    Gentle ping, @steveloughran !


---



[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    Thanks! Merged to master.


---



[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    Could you resolve the conflicts again?


---



[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    Btw, as the file path passed to the stats tracker should be a task temp file, is it common to directly use S3 as the temp file output destination?


---


[GitHub] spark pull request #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTas...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/18979


---



[GitHub] spark pull request #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTas...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18979#discussion_r133822436
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteTaskStatsTrackerSuite.scala ---
    @@ -0,0 +1,212 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources
    +
    +import java.nio.charset.Charset
    +
    +import org.apache.hadoop.conf.Configuration
    +import org.apache.hadoop.fs.Path
    +
    +import org.apache.spark.SparkFunSuite
    +import org.apache.spark.util.Utils
    +
    +/**
    + * Test how BasicWriteTaskStatsTracker handles files.
    + */
    +class BasicWriteTaskStatsTrackerSuite extends SparkFunSuite {
    +
    +  private val tempDir = Utils.createTempDir()
    +  private val tempDirPath = new Path(tempDir.toURI)
    +  private val conf = new Configuration()
    +  private val localfs = tempDirPath.getFileSystem(conf)
    +  private val data1 = "0123456789".getBytes(Charset.forName("US-ASCII"))
    +  private val data2 = "012".getBytes(Charset.forName("US-ASCII"))
    +  private val len1 = data1.length
    +  private val len2 = data2.length
    +
    +  /**
    +   * In teardown delete the temp dir.
    +   */
    +  protected override def afterAll(): Unit = {
    +    Utils.deleteRecursively(tempDir)
    +  }
    +
    +  /**
    +   * Assert that the stats match that expected.
    +   * @param tracker tracker to check
    +   * @param files number of files expected
    +   * @param bytes total number of bytes expected
    +   */
    +  private def assertStats(
    +      tracker: BasicWriteTaskStatsTracker,
    +      files: Int,
    +      bytes: Int): Unit = {
    +    val stats = finalStatus(tracker)
    +    assert(files === stats.numFiles, "Wrong number of files")
    +    assert(bytes === stats.numBytes, "Wrong byte count of file size")
    +  }
    +
    +  private def finalStatus(tracker: BasicWriteTaskStatsTracker): BasicWriteTaskStats = {
    +    tracker.getFinalStats().asInstanceOf[BasicWriteTaskStats]
    +  }
    +
    +  test("No files in run") {
    +    val tracker = new BasicWriteTaskStatsTracker(conf)
    +    assertStats(tracker, 0, 0)
    +  }
    +
    +  test("Missing File") {
    +    val missing = new Path(tempDirPath, "missing")
    +    val tracker = new BasicWriteTaskStatsTracker(conf)
    +    tracker.newFile(missing.toString)
    +    assertStats(tracker, 1, 0)
    +  }
    +
    +  test("Empty filename is forwarded") {
    +    val tracker = new BasicWriteTaskStatsTracker(conf)
    +    tracker.newFile("")
    +    intercept[IllegalArgumentException] {
    +      finalStatus(tracker)
    +    }
    +  }
    +
    +  test("Null filename is only picked up in final status") {
    +    val tracker = new BasicWriteTaskStatsTracker(conf)
    +    tracker.newFile(null)
    +    intercept[IllegalArgumentException] {
    +      finalStatus(tracker)
    +    }
    +  }
    +
    +  test("0 byte file") {
    +    val file = new Path(tempDirPath, "file0")
    +    val tracker = new BasicWriteTaskStatsTracker(conf)
    +    tracker.newFile(file.toString)
    +    touch(file)
    +    assertStats(tracker, 1, 0)
    +  }
    +
    +  test("File with data") {
    +    val file = new Path(tempDirPath, "file-with-data")
    +    val tracker = new BasicWriteTaskStatsTracker(conf)
    +    tracker.newFile(file.toString)
    +    write1(file)
    +    assertStats(tracker, 1, len1)
    +  }
    +
    +  test("Open file") {
    +    val file = new Path(tempDirPath, "file-open")
    +    val tracker = new BasicWriteTaskStatsTracker(conf)
    +    tracker.newFile(file.toString)
    +    val stream = localfs.create(file, true)
    +    try {
    +      assertStats(tracker, 1, 0)
    +      stream.write(data1)
    +      stream.flush()
    +      assert(1 === finalStatus(tracker).numFiles, "Wrong number of files")
    +    } finally {
    +      stream.close()
    +    }
    +  }
    +
    +  test("Two files") {
    +    val file1 = new Path(tempDirPath, "f-2-1")
    +    val file2 = new Path(tempDirPath, "f-2-2")
    +    val tracker = new BasicWriteTaskStatsTracker(conf)
    +    tracker.newFile(file1.toString)
    +    write1(file1)
    +    tracker.newFile(file2.toString)
    +    write2(file2)
    +    assertStats(tracker, 2, len1 + len2)
    +  }
    +
    +  test("Three files, last one empty") {
    +    val file1 = new Path(tempDirPath, "f-3-1")
    +    val file2 = new Path(tempDirPath, "f-3-2")
    +    val file3 = new Path(tempDirPath, "f-3-2")
    +    val tracker = new BasicWriteTaskStatsTracker(conf)
    +    tracker.newFile(file1.toString)
    +    write1(file1)
    +    tracker.newFile(file2.toString)
    +    write2(file2)
    +    tracker.newFile(file3.toString)
    +    touch(file3)
    +    assertStats(tracker, 3, len1 + len2)
    +  }
    +
    +  test("Three files, one not found") {
    +    val file1 = new Path(tempDirPath, "f-4-1")
    +    val file2 = new Path(tempDirPath, "f-4-2")
    +    val file3 = new Path(tempDirPath, "f-3-2")
    +    val tracker = new BasicWriteTaskStatsTracker(conf)
    +    // file 1
    +    tracker.newFile(file1.toString)
    +    write1(file1)
    +
    +    // file 2 is noted, but not visible
    +    tracker.newFile(file2.toString)
    +    touch(file3)
    --- End diff --
    
    `file3`?


---


[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    **[Test build #82731 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82731/testReport)** for PR 18979 at commit [`649f8da`](https://github.com/apache/spark/commit/649f8da245443567b842b697a4d47e5241eb5946).


---



[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    Noted :)
    @dongjoon-hyun: is the issue with ORC that if there's nothing to write, it doesn't generate a file (so avoiding the problem where you sometimes get 0-byte ORC files and things downstream fail)?
    
    If so, the warning message which @gatorsmile has proposed is potentially going to mislead people into worrying about a problem which isn't there, and the numFiles metric will be misleading too.
    
    I'm starting to worry about how noisy the log would be, both there and when working with S3 while it's exhibiting delayed visibility (rarer).
    
    1. What if this patch just logged at debug: less noise, but still something there if people are trying to debug a mismatch?
    1. If there's no file found, numFiles doesn't get incremented.
    1. I count the number of files actually submitted.
    1. And in `getFinalStats()`, log at info level if there is a mismatch.
    
    This would line things up in the future for actually returning the list of expected vs. actual files as a metric where it could be reported; a rough sketch of this approach follows.
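    
    A minimal, illustrative sketch of that proposal (names such as `LenientSizeTracker` are made up for this example, and it measures sizes eagerly rather than deferring them the way the real tracker does):
    
    ```scala
    import java.io.FileNotFoundException
    
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    
    // Illustrative only: count files submitted vs. files whose sizes could be
    // resolved, keep per-file misses quiet, and report one mismatch message
    // when the final stats are read.
    class LenientSizeTracker(hadoopConf: Configuration) {
      private var submittedFiles = 0
      private var resolvedFiles = 0
      private var totalBytes = 0L
    
      def newFile(filePath: String): Unit = {
        submittedFiles += 1
        val path = new Path(filePath)
        val fs = path.getFileSystem(hadoopConf)
        try {
          totalBytes += fs.getFileStatus(path).getLen
          resolvedFiles += 1
        } catch {
          case e: FileNotFoundException =>
            // Low noise: an eventually consistent store may not show the file yet.
            println(s"DEBUG file $path not yet visible: ${e.getMessage}")
        }
      }
    
      def finalStats(): (Int, Long) = {
        if (resolvedFiles < submittedFiles) {
          println(s"INFO ${submittedFiles - resolvedFiles} of $submittedFiles files were " +
            "not visible when their sizes were collected; the byte count may be low.")
        }
        (resolvedFiles, totalBytes)
      }
    }
    ```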


---



[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    **[Test build #82745 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82745/testReport)** for PR 18979 at commit [`c0e81a1`](https://github.com/apache/spark/commit/c0e81a1c87011efdc010f1c9ba28dde003458667).


---



[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    **[Test build #80803 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80803/testReport)** for PR 18979 at commit [`2a113fd`](https://github.com/apache/spark/commit/2a113fde1653743a3543df8ada395f320b826a3e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark pull request #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTas...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18979#discussion_r133887920
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala ---
    @@ -57,7 +60,14 @@ class BasicWriteTaskStatsTracker(hadoopConf: Configuration)
       private def getFileSize(filePath: String): Long = {
         val path = new Path(filePath)
         val fs = path.getFileSystem(hadoopConf)
    -    fs.getFileStatus(path).getLen()
    +    try {
    +      fs.getFileStatus(path).getLen()
    +    } catch {
    +      case e: FileNotFoundException =>
    +        // may arise against eventually consistent object stores
    +        logInfo(s"File $path is not yet visible", e)
    +        0
    --- End diff --
    
    Hm, I feel this would be dangerous; nowhere did we document that this would return an incorrect size...
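    
    One hypothetical alternative (not what this PR implements) would be to surface "size unknown" to the caller instead of silently substituting 0, so the aggregating code can decide how an unresolved size is counted and documented; `FileSizeProbe` below is a made-up name for illustration:
    
    ```scala
    import java.io.FileNotFoundException
    
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    
    object FileSizeProbe {
      // Illustrative helper: None means the file was not visible when queried.
      def getFileSizeOpt(hadoopConf: Configuration, filePath: String): Option[Long] = {
        val path = new Path(filePath)
        val fs = path.getFileSystem(hadoopConf)
        try {
          Some(fs.getFileStatus(path).getLen)
        } catch {
          case _: FileNotFoundException => None
        }
      }
    }
    ```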



---


[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    I don't have a strong opinion against this. An incorrect size is an issue, but I can't think of a better solution for now...


---



[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    thanks for the review everyone!


---



[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    To mimic S3-like behavior, you can override the file system implementation via `spark.hadoop.fs.$scheme.impl`.
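    
    For example, a rough test-only sketch (the class and object names are made up here, and in practice it would also need to be registered for a dedicated scheme through that property) could wrap the local filesystem and hide freshly written files to simulate delayed visibility:
    
    ```scala
    import java.io.FileNotFoundException
    
    import org.apache.hadoop.fs.{FileStatus, Path, RawLocalFileSystem}
    
    // Illustrative test-only filesystem: behaves like the local FS except that
    // getFileStatus() fails for any file name the test has marked as hidden,
    // mimicking the delayed visibility of an eventually consistent store.
    class DelayedVisibilityFileSystem extends RawLocalFileSystem {
      override def getFileStatus(path: Path): FileStatus = {
        if (DelayedVisibilityFileSystem.hidden.contains(path.getName)) {
          throw new FileNotFoundException(s"Simulated delayed visibility: $path")
        }
        super.getFileStatus(path)
      }
    }
    
    object DelayedVisibilityFileSystem {
      // File names the test pretends are not yet visible.
      val hidden = scala.collection.mutable.Set[String]()
    }
    ```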


---


[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    **[Test build #82745 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82745/testReport)** for PR 18979 at commit [`c0e81a1`](https://github.com/apache/spark/commit/c0e81a1c87011efdc010f1c9ba28dde003458667).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    **[Test build #82730 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82730/testReport)** for PR 18979 at commit [`adab985`](https://github.com/apache/spark/commit/adab985d0455f3f549ebceb03528f0a6a45a31c0).


---



[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    LGTM except a minor comment. 


---



[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    cc @adrian-ionescu 


---


[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18979
  
    Build finished. Test PASSed.


---
