Posted to reviews@spark.apache.org by raofu <gi...@git.apache.org> on 2018/08/20 17:12:40 UTC

[GitHub] spark pull request #22157: [SPARK-25126] Avoid creating Reader for all orc f...

GitHub user raofu opened a pull request:

    https://github.com/apache/spark/pull/22157

    [SPARK-25126] Avoid creating Reader for all orc files

    In OrcFileOperator.readSchema, a Reader is created for every file
    although only the first valid one is used. This uses a significant
    amount of memory when `paths` contains a lot of files. In 2.3,
    a different code path, OrcUtils.readSchema, is used for inferring
    the schema of ORC files. This commit changes both functions to
    create the Reader lazily.
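
    Below is a minimal, self-contained Scala sketch of the change
    (illustrative only, not the actual Spark code; `readSchema` here is a
    stand-in that counts how many Readers would be created):

    ```scala
    var readersCreated = 0

    // Stand-in for creating an ORC Reader and extracting its schema.
    def readSchema(path: String): Option[String] = {
      readersCreated += 1
      Some("struct<a:bigint>")
    }

    val paths = Seq("part-0.orc", "part-1.orc", "part-2.orc")

    // Before: every path is evaluated eagerly, creating three Readers.
    paths.map(readSchema).flatten.headOption
    assert(readersCreated == 3)

    readersCreated = 0
    // After: collectFirst on an iterator stops at the first Some(...),
    // so only one Reader is created.
    paths.toIterator.map(readSchema).collectFirst { case Some(s) => s }
    assert(readersCreated == 1)
    ```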
    
    ## What changes were proposed in this pull request?
    
    (Please fill in changes proposed in this fix)
    
    ## How was this patch tested?
    
    (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
    (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/raofu/spark SPARK-25126

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22157.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22157
    
----
commit 5a86b3618da695431c01ddbe4bb102a45f93b3b1
Author: Rao Fu <ra...@...>
Date:   2018-08-17T23:40:05Z

    [SPARK-25126] Avoid creating Reader for all orc files
    
    In OrcFileOperator.readSchema, a Reader is created for every file
    although only the first valid one is used. This uses a significant
    amount of memory when `paths` contains a lot of files. In 2.3,
    a different code path, OrcUtils.readSchema, is used for inferring
    the schema of ORC files. This commit changes both functions to
    create the Reader lazily.

----


---



[GitHub] spark pull request #22157: [SPARK-25126][SQL] Avoid creating Reader for all ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22157#discussion_r212229343
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala ---
    @@ -79,9 +79,10 @@ object OrcUtils extends Logging {
         val ignoreCorruptFiles = sparkSession.sessionState.conf.ignoreCorruptFiles
         val conf = sparkSession.sessionState.newHadoopConf()
         // TODO: We need to support merge schema. Please see SPARK-11412.
    -    files.map(_.getPath).flatMap(readSchema(_, conf, ignoreCorruptFiles)).headOption.map { schema =>
    -      logDebug(s"Reading schema from file $files, got Hive schema string: $schema")
    -      CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType]
    +    files.toIterator.map(file => readSchema(file.getPath, conf, ignoreCorruptFiles)).collectFirst {
    +      case Some(schema) =>
    +        logDebug(s"Reading schema from file $files, got Hive schema string: $schema")
    +        CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType]
    --- End diff --
    
    Yeah, I think so. But in Parquet, schema merging is done in parallel, so it doesn't create all the readers in one place at once.
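
    A rough sketch of that idea (illustrative only; `readFooterSchema` and `merge` below are hypothetical stand-ins, not Parquet's actual API; the real logic lives in ParquetFileFormat):

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val paths = Seq("part-0.parquet", "part-1.parquet")  // illustrative

    // Hypothetical stand-ins for reading one file's footer schema and
    // merging two schemas pairwise.
    def readFooterSchema(path: String): Set[String] = Set("a: bigint")
    def merge(s1: Set[String], s2: Set[String]): Set[String] = s1 ++ s2

    // Footer reading is distributed across executors, so no single place
    // creates a Reader for every file at once.
    val merged = spark.sparkContext
      .parallelize(paths)
      .map(readFooterSchema)
      .reduce(merge)
    ```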


---



[GitHub] spark issue #22157: [SPARK-25126][SQL] Avoid creating Reader for all orc fil...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22157
  
    Thank you for merging, @HyukjinKwon . And, thank you for making a PR, @raofu .


---



[GitHub] spark pull request #22157: [SPARK-25126][SQL] Avoid creating Reader for all ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22157#discussion_r212373914
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala ---
    @@ -79,9 +79,10 @@ object OrcUtils extends Logging {
         val ignoreCorruptFiles = sparkSession.sessionState.conf.ignoreCorruptFiles
         val conf = sparkSession.sessionState.newHadoopConf()
         // TODO: We need to support merge schema. Please see SPARK-11412.
    -    files.map(_.getPath).flatMap(readSchema(_, conf, ignoreCorruptFiles)).headOption.map { schema =>
    -      logDebug(s"Reading schema from file $files, got Hive schema string: $schema")
    -      CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType]
    +    files.toIterator.map(file => readSchema(file.getPath, conf, ignoreCorruptFiles)).collectFirst {
    +      case Some(schema) =>
    +        logDebug(s"Reading schema from file $files, got Hive schema string: $schema")
    +        CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType]
    --- End diff --
    
    @viirya . The corrupt files are not ignored; Spark will throw a `SparkException` while reading the content.
    > Now, once the Orc source reads the first valid schema, it doesn't read the other Orc files. So the corrupt files are ignored when `SQLConf.IGNORE_CORRUPT_FILES` is false.


---



[GitHub] spark pull request #22157: [SPARK-25126][SQL] Avoid creating Reader for all ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22157#discussion_r212304433
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcQuerySuite.scala ---
    @@ -562,20 +562,57 @@ abstract class OrcQueryTest extends OrcTest {
           }
         }
     
    +    def testAllCorruptFiles(): Unit = {
    +      withTempDir { dir =>
    +        val basePath = dir.getCanonicalPath
    +        spark.range(1).toDF("a").write.json(new Path(basePath, "first").toString)
    +        spark.range(1, 2).toDF("a").write.json(new Path(basePath, "second").toString)
    +        val df = spark.read.orc(
    +          new Path(basePath, "first").toString,
    +          new Path(basePath, "second").toString)
    +        assert(df.count() == 0)
    +      }
    +    }
    +
    +    def testAllCorruptFilesWithoutSchemaInfer(): Unit = {
    +      withTempDir { dir =>
    +        val basePath = dir.getCanonicalPath
    +        spark.range(1).toDF("a").write.json(new Path(basePath, "first").toString)
    +        spark.range(1, 2).toDF("a").write.json(new Path(basePath, "second").toString)
    +        val df = spark.read.schema("a long").orc(
    +          new Path(basePath, "first").toString,
    +          new Path(basePath, "second").toString)
    +        assert(df.count() == 0)
    +      }
    +    }
    +
         withSQLConf(SQLConf.IGNORE_CORRUPT_FILES.key -> "true") {
           testIgnoreCorruptFiles()
           testIgnoreCorruptFilesWithoutSchemaInfer()
    +      val m1 = intercept[AnalysisException] {
    +        testAllCorruptFiles()
    +      }.getMessage
    +      assert(m1.contains("Unable to infer schema for ORC"))
    +      testAllCorruptFilesWithoutSchemaInfer()
         }
     
         withSQLConf(SQLConf.IGNORE_CORRUPT_FILES.key -> "false") {
           val m1 = intercept[SparkException] {
             testIgnoreCorruptFiles()
           }.getMessage
    -      assert(m1.contains("Could not read footer for file"))
    +      assert(m1.contains("Malformed ORC file"))
    --- End diff --
    
    Ok. It's reasonable.


---



[GitHub] spark issue #22157: [SPARK-25126] Avoid creating Reader for all orc files

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/22157
  
    The failure in OrcQuerySuite looks legitimate. It's because the test corrupts the third file of three and sets the reader to not ignore corrupt files, but with this change the third file is never actually read. I think that might be a good thing. @dongjoon-hyun, do you have an opinion?


---



[GitHub] spark pull request #22157: [SPARK-25126][SQL] Avoid creating Reader for all ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22157#discussion_r212223394
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala ---
    @@ -79,9 +79,10 @@ object OrcUtils extends Logging {
         val ignoreCorruptFiles = sparkSession.sessionState.conf.ignoreCorruptFiles
         val conf = sparkSession.sessionState.newHadoopConf()
         // TODO: We need to support merge schema. Please see SPARK-11412.
    -    files.map(_.getPath).flatMap(readSchema(_, conf, ignoreCorruptFiles)).headOption.map { schema =>
    -      logDebug(s"Reading schema from file $files, got Hive schema string: $schema")
    -      CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType]
    +    files.toIterator.map(file => readSchema(file.getPath, conf, ignoreCorruptFiles)).collectFirst {
    +      case Some(schema) =>
    +        logDebug(s"Reading schema from file $files, got Hive schema string: $schema")
    +        CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType]
    --- End diff --
    
    This might be a behavior change.
    
    Previously, if there are corrupt files and `SQLConf.IGNORE_CORRUPT_FILES` is false, the Orc source will throw an exception when reading those files.
    
    Now, once the Orc source reads the first valid schema, it doesn't read the other Orc files. So the corrupt files are ignored when `SQLConf.IGNORE_CORRUPT_FILES` is false.
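
    A standalone Scala sketch of that difference (illustrative only; `readSchema` is a stand-in for the real footer-reading helper):

    ```scala
    def readSchema(path: String, ignoreCorruptFiles: Boolean): Option[String] =
      if (path.contains("corrupt")) {
        if (ignoreCorruptFiles) None
        else throw new RuntimeException(s"Could not read footer for file $path")
      } else Some("struct<a:bigint>")

    val paths = Seq("part-0.orc", "part-corrupt.orc")

    // Before: every path was read eagerly, so the corrupt file threw here
    // even though only the first schema was used:
    // paths.map(p => readSchema(p, ignoreCorruptFiles = false)).flatten.headOption

    // After: collectFirst stops at part-0.orc; part-corrupt.orc is never
    // opened, so schema inference no longer throws.
    paths.toIterator
      .map(p => readSchema(p, ignoreCorruptFiles = false))
      .collectFirst { case Some(s) => s }   // Some("struct<a:bigint>")
    ```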


---



[GitHub] spark issue #22157: [SPARK-25126] Avoid creating Reader for all orc files

Posted by raofu <gi...@git.apache.org>.
Github user raofu commented on the issue:

    https://github.com/apache/spark/pull/22157
  
    I fixed the test by making the first file the corrupted file. @srowen, can you help kick off a Jenkins run?


---



[GitHub] spark pull request #22157: [SPARK-25126][SQL] Avoid creating Reader for all ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/22157


---



[GitHub] spark issue #22157: [SPARK-25126][SQL] Avoid creating Reader for all orc fil...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22157
  
    Jenkins is usually retriggered when it detects a change. It seems to be busy at the moment.
    - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/SparkPullRequestBuilder/
    
    BTW, it seems to be your first contribution. Welcome! You need to update the PR description like the other commits do, e.g., https://github.com/apache/spark/commit/883f3aff67aac25c9d9a3bdf8d47fadefbf9645b. Please put your description into the placeholder `(Please fill in changes proposed in this fix)`, and write `Pass the Jenkins with a newly added test case` instead of the following placeholders.
    ```
    (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
    (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    ```


---



[GitHub] spark issue #22157: [SPARK-25126] Avoid creating Reader for all orc files

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22157
  
    **[Test build #4285 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4285/testReport)** for PR 22157 at commit [`9afdac6`](https://github.com/apache/spark/commit/9afdac6da180b9c8959696941701c734e9a3fe8e).


---



[GitHub] spark pull request #22157: [SPARK-25126][SQL] Avoid creating Reader for all ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22157#discussion_r212225224
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcQuerySuite.scala ---
    @@ -562,20 +562,57 @@ abstract class OrcQueryTest extends OrcTest {
           }
         }
     
    +    def testAllCorruptFiles(): Unit = {
    +      withTempDir { dir =>
    +        val basePath = dir.getCanonicalPath
    +        spark.range(1).toDF("a").write.json(new Path(basePath, "first").toString)
    +        spark.range(1, 2).toDF("a").write.json(new Path(basePath, "second").toString)
    +        val df = spark.read.orc(
    +          new Path(basePath, "first").toString,
    +          new Path(basePath, "second").toString)
    +        assert(df.count() == 0)
    +      }
    +    }
    +
    +    def testAllCorruptFilesWithoutSchemaInfer(): Unit = {
    +      withTempDir { dir =>
    +        val basePath = dir.getCanonicalPath
    +        spark.range(1).toDF("a").write.json(new Path(basePath, "first").toString)
    +        spark.range(1, 2).toDF("a").write.json(new Path(basePath, "second").toString)
    +        val df = spark.read.schema("a long").orc(
    +          new Path(basePath, "first").toString,
    +          new Path(basePath, "second").toString)
    +        assert(df.count() == 0)
    +      }
    +    }
    +
         withSQLConf(SQLConf.IGNORE_CORRUPT_FILES.key -> "true") {
           testIgnoreCorruptFiles()
           testIgnoreCorruptFilesWithoutSchemaInfer()
    +      val m1 = intercept[AnalysisException] {
    +        testAllCorruptFiles()
    +      }.getMessage
    +      assert(m1.contains("Unable to infer schema for ORC"))
    +      testAllCorruptFilesWithoutSchemaInfer()
         }
     
         withSQLConf(SQLConf.IGNORE_CORRUPT_FILES.key -> "false") {
           val m1 = intercept[SparkException] {
             testIgnoreCorruptFiles()
           }.getMessage
    -      assert(m1.contains("Could not read footer for file"))
    +      assert(m1.contains("Malformed ORC file"))
    --- End diff --
    
    I think this is because of the behavior change https://github.com/apache/spark/pull/22157#discussion_r212223394.
    
    Previously, the Orc source read the third file, which is corrupt, and threw the `Could not read footer for file` exception.
    
    Now the Orc source reads the first file to get a valid schema and skips the other two files during inference. When the Orc source later uses that schema to read the corrupt file, the content is not consistent with it, so the `Malformed ORC file` exception is thrown.
    



---



[GitHub] spark pull request #22157: [SPARK-25126][SQL] Avoid creating Reader for all ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22157#discussion_r212178321
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcQuerySuite.scala ---
    @@ -562,20 +562,57 @@ abstract class OrcQueryTest extends OrcTest {
           }
         }
     
    +    def testAllCorruptFiles(): Unit = {
    +      withTempDir { dir =>
    +        val basePath = dir.getCanonicalPath
    +        spark.range(1).toDF("a").write.json(new Path(basePath, "first").toString)
    +        spark.range(1, 2).toDF("a").write.json(new Path(basePath, "second").toString)
    +        val df = spark.read.orc(
    +          new Path(basePath, "first").toString,
    +          new Path(basePath, "second").toString)
    +        assert(df.count() == 0)
    +      }
    +    }
    +
    +    def testAllCorruptFilesWithoutSchemaInfer(): Unit = {
    +      withTempDir { dir =>
    +        val basePath = dir.getCanonicalPath
    +        spark.range(1).toDF("a").write.json(new Path(basePath, "first").toString)
    +        spark.range(1, 2).toDF("a").write.json(new Path(basePath, "second").toString)
    +        val df = spark.read.schema("a long").orc(
    +          new Path(basePath, "first").toString,
    +          new Path(basePath, "second").toString)
    +        assert(df.count() == 0)
    +      }
    +    }
    +
         withSQLConf(SQLConf.IGNORE_CORRUPT_FILES.key -> "true") {
           testIgnoreCorruptFiles()
           testIgnoreCorruptFilesWithoutSchemaInfer()
    +      val m1 = intercept[AnalysisException] {
    +        testAllCorruptFiles()
    +      }.getMessage
    +      assert(m1.contains("Unable to infer schema for ORC"))
    +      testAllCorruptFilesWithoutSchemaInfer()
         }
     
         withSQLConf(SQLConf.IGNORE_CORRUPT_FILES.key -> "false") {
           val m1 = intercept[SparkException] {
             testIgnoreCorruptFiles()
           }.getMessage
    -      assert(m1.contains("Could not read footer for file"))
    +      assert(m1.contains("Malformed ORC file"))
    --- End diff --
    
    Why did the error message change?


---



[GitHub] spark issue #22157: [SPARK-25126] Avoid creating Reader for all orc files

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/22157
  
    Do we have a similar issue for Parquet?



---



[GitHub] spark issue #22157: [SPARK-25126] Avoid creating Reader for all orc files

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22157
  
    **[Test build #4282 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4282/testReport)** for PR 22157 at commit [`5a86b36`](https://github.com/apache/spark/commit/5a86b3618da695431c01ddbe4bb102a45f93b3b1).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22157: [SPARK-25126] Avoid creating Reader for all orc files

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22157
  
    > Do we have a similar issue for Parquet?
    
    Looks like we don't, since we explicitly pick up one file before reading it during schema inference:
    
    https://github.com/apache/spark/blob/f984ec75ed6162ee6f5881716a8311c883aca22a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L229-L239


---



[GitHub] spark issue #22157: [SPARK-25126] Avoid creating Reader for all orc files

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22157
  
    **[Test build #4283 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4283/testReport)** for PR 22157 at commit [`5a86b36`](https://github.com/apache/spark/commit/5a86b3618da695431c01ddbe4bb102a45f93b3b1).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22157: [SPARK-25126] Avoid creating Reader for all orc files

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22157
  
    Thank you for pinging me, @srowen .
    
    @raofu Instead of changing the existing test coverage, we had better add additional test cases in which all the files are corrupted.


---



[GitHub] spark issue #22157: [SPARK-25126][SQL] Avoid creating Reader for all orc fil...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22157
  
    Merged to master.


---



[GitHub] spark issue #22157: [SPARK-25126] Avoid creating Reader for all orc files

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22157
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #22157: [SPARK-25126][SQL] Avoid creating Reader for all orc fil...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22157
  
    cc @gatorsmile and @cloud-fan .


---



[GitHub] spark issue #22157: [SPARK-25126] Avoid creating Reader for all orc files

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22157
  
    **[Test build #4282 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4282/testReport)** for PR 22157 at commit [`5a86b36`](https://github.com/apache/spark/commit/5a86b3618da695431c01ddbe4bb102a45f93b3b1).


---



[GitHub] spark issue #22157: [SPARK-25126] Avoid creating Reader for all orc files

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22157
  
    **[Test build #4283 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4283/testReport)** for PR 22157 at commit [`5a86b36`](https://github.com/apache/spark/commit/5a86b3618da695431c01ddbe4bb102a45f93b3b1).


---



[GitHub] spark issue #22157: [SPARK-25126][SQL] Avoid creating Reader for all orc fil...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22157
  
    Retest this please.


---



[GitHub] spark pull request #22157: [SPARK-25126][SQL] Avoid creating Reader for all ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22157#discussion_r212227737
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala ---
    @@ -79,9 +79,10 @@ object OrcUtils extends Logging {
         val ignoreCorruptFiles = sparkSession.sessionState.conf.ignoreCorruptFiles
         val conf = sparkSession.sessionState.newHadoopConf()
         // TODO: We need to support merge schema. Please see SPARK-11412.
    -    files.map(_.getPath).flatMap(readSchema(_, conf, ignoreCorruptFiles)).headOption.map { schema =>
    -      logDebug(s"Reading schema from file $files, got Hive schema string: $schema")
    -      CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType]
    +    files.toIterator.map(file => readSchema(file.getPath, conf, ignoreCorruptFiles)).collectFirst {
    +      case Some(schema) =>
    +        logDebug(s"Reading schema from file $files, got Hive schema string: $schema")
    +        CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType]
    --- End diff --
    
    BTW, I think we have to create a reader for each file when implementing schema merging like parquet, right?


---



[GitHub] spark pull request #22157: [SPARK-25126][SQL] Avoid creating Reader for all ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22157#discussion_r212475271
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala ---
    @@ -79,9 +79,10 @@ object OrcUtils extends Logging {
         val ignoreCorruptFiles = sparkSession.sessionState.conf.ignoreCorruptFiles
         val conf = sparkSession.sessionState.newHadoopConf()
         // TODO: We need to support merge schema. Please see SPARK-11412.
    -    files.map(_.getPath).flatMap(readSchema(_, conf, ignoreCorruptFiles)).headOption.map { schema =>
    -      logDebug(s"Reading schema from file $files, got Hive schema string: $schema")
    -      CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType]
    +    files.toIterator.map(file => readSchema(file.getPath, conf, ignoreCorruptFiles)).collectFirst {
    +      case Some(schema) =>
    +        logDebug(s"Reading schema from file $files, got Hive schema string: $schema")
    +        CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType]
    --- End diff --
    
    Yeah, it is only ignored during schema reading.
    
    The change is in the timing of when the corrupt files are detected: it is now postponed to actually reading the file contents.
    
    That might not be a big deal, though for the user experience it is better to throw such an exception early.
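
    A hedged sketch of the user-visible timing, assuming one valid ORC file and one corrupt file (the paths are illustrative):

    ```scala
    spark.conf.set("spark.sql.files.ignoreCorruptFiles", "false")

    // Schema inference now succeeds: only the first valid file is opened.
    val df = spark.read.orc("/tmp/data/valid.orc", "/tmp/data/corrupt.orc")

    // The corrupt file is detected only here, when its contents are read;
    // the error surfaces as a SparkException ("Malformed ORC file").
    df.count()
    ```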


---





[GitHub] spark pull request #22157: [SPARK-25126][SQL] Avoid creating Reader for all ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22157#discussion_r212252932
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcQuerySuite.scala ---
    @@ -562,20 +562,57 @@ abstract class OrcQueryTest extends OrcTest {
           }
         }
     
    +    def testAllCorruptFiles(): Unit = {
    +      withTempDir { dir =>
    +        val basePath = dir.getCanonicalPath
    +        spark.range(1).toDF("a").write.json(new Path(basePath, "first").toString)
    +        spark.range(1, 2).toDF("a").write.json(new Path(basePath, "second").toString)
    +        val df = spark.read.orc(
    +          new Path(basePath, "first").toString,
    +          new Path(basePath, "second").toString)
    +        assert(df.count() == 0)
    +      }
    +    }
    +
    +    def testAllCorruptFilesWithoutSchemaInfer(): Unit = {
    +      withTempDir { dir =>
    +        val basePath = dir.getCanonicalPath
    +        spark.range(1).toDF("a").write.json(new Path(basePath, "first").toString)
    +        spark.range(1, 2).toDF("a").write.json(new Path(basePath, "second").toString)
    +        val df = spark.read.schema("a long").orc(
    +          new Path(basePath, "first").toString,
    +          new Path(basePath, "second").toString)
    +        assert(df.count() == 0)
    +      }
    +    }
    +
         withSQLConf(SQLConf.IGNORE_CORRUPT_FILES.key -> "true") {
           testIgnoreCorruptFiles()
           testIgnoreCorruptFilesWithoutSchemaInfer()
    +      val m1 = intercept[AnalysisException] {
    +        testAllCorruptFiles()
    +      }.getMessage
    +      assert(m1.contains("Unable to infer schema for ORC"))
    +      testAllCorruptFilesWithoutSchemaInfer()
         }
     
         withSQLConf(SQLConf.IGNORE_CORRUPT_FILES.key -> "false") {
           val m1 = intercept[SparkException] {
             testIgnoreCorruptFiles()
           }.getMessage
    -      assert(m1.contains("Could not read footer for file"))
    +      assert(m1.contains("Malformed ORC file"))
    --- End diff --
    
    Let's make sure we don't backport it ... then I think it's fine. It sounds rather like a bug to read and validate all schemas (which is inconsistent with Parquet) when we only need to pick up a single file. I don't think we make any guarantee about the picking order.
    
    The possible behaviour change is when we only read the schema: the previous code would throw an exception, but after this PR it wouldn't.
    
    The previous behaviour is something we should expect once the mergeSchema option is implemented on the ORC side, as you discussed below.


---



[GitHub] spark issue #22157: [SPARK-25126][SQL] Avoid creating Reader for all orc fil...

Posted by raofu <gi...@git.apache.org>.
Github user raofu commented on the issue:

    https://github.com/apache/spark/pull/22157
  
    @dongjoon-hyun, thanks a lot for the pointers! I've updated the PR description. Please let me know if there is any other information you'd like me to add.


---



[GitHub] spark issue #22157: [SPARK-25126][SQL] Avoid creating Reader for all orc fil...

Posted by raofu <gi...@git.apache.org>.
Github user raofu commented on the issue:

    https://github.com/apache/spark/pull/22157
  
    @dongjoon-hyun Title updated. Thanks for adding the test coverage! I've merged your commit. Can you help kick off another Jenkins run? I don't think I have the permission to do it.


---



[GitHub] spark issue #22157: [SPARK-25126][SQL] Avoid creating Reader for all orc fil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22157
  
    **[Test build #4285 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4285/testReport)** for PR 22157 at commit [`9afdac6`](https://github.com/apache/spark/commit/9afdac6da180b9c8959696941701c734e9a3fe8e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #22157: [SPARK-25126][SQL] Avoid creating Reader for all ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22157#discussion_r212302555
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcQuerySuite.scala ---
    @@ -562,20 +562,57 @@ abstract class OrcQueryTest extends OrcTest {
           }
         }
     
    +    def testAllCorruptFiles(): Unit = {
    +      withTempDir { dir =>
    +        val basePath = dir.getCanonicalPath
    +        spark.range(1).toDF("a").write.json(new Path(basePath, "first").toString)
    +        spark.range(1, 2).toDF("a").write.json(new Path(basePath, "second").toString)
    +        val df = spark.read.orc(
    +          new Path(basePath, "first").toString,
    +          new Path(basePath, "second").toString)
    +        assert(df.count() == 0)
    +      }
    +    }
    +
    +    def testAllCorruptFilesWithoutSchemaInfer(): Unit = {
    +      withTempDir { dir =>
    +        val basePath = dir.getCanonicalPath
    +        spark.range(1).toDF("a").write.json(new Path(basePath, "first").toString)
    +        spark.range(1, 2).toDF("a").write.json(new Path(basePath, "second").toString)
    +        val df = spark.read.schema("a long").orc(
    +          new Path(basePath, "first").toString,
    +          new Path(basePath, "second").toString)
    +        assert(df.count() == 0)
    +      }
    +    }
    +
         withSQLConf(SQLConf.IGNORE_CORRUPT_FILES.key -> "true") {
           testIgnoreCorruptFiles()
           testIgnoreCorruptFilesWithoutSchemaInfer()
    +      val m1 = intercept[AnalysisException] {
    +        testAllCorruptFiles()
    +      }.getMessage
    +      assert(m1.contains("Unable to infer schema for ORC"))
    +      testAllCorruptFilesWithoutSchemaInfer()
         }
     
         withSQLConf(SQLConf.IGNORE_CORRUPT_FILES.key -> "false") {
           val m1 = intercept[SparkException] {
             testIgnoreCorruptFiles()
           }.getMessage
    -      assert(m1.contains("Could not read footer for file"))
    +      assert(m1.contains("Malformed ORC file"))
    --- End diff --
    
    I agree with this take


---
