You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by viirya <gi...@git.apache.org> on 2017/01/05 05:05:52 UTC

[GitHub] spark pull request #16474: [SPARK-19082][SQL] Make ignoreCorruptFiles work f...

GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/16474

    [SPARK-19082][SQL] Make ignoreCorruptFiles work for Parquet

    ## What changes were proposed in this pull request?
    
    We have a config `spark.sql.files.ignoreCorruptFiles` which can be used to ignore corrupt files when reading files in SQL. Currently the `ignoreCorruptFiles` config has two issues and can't work for Parquet:
    
    1. We only ignore corrupt files in `FileScanRDD` . Actually, we begin to read those files as early as inferring data schema from the files. For corrupt files, we can't read the schema and fail the program. A related issue reported at http://apache-spark-developers-list.1001551.n3.nabble.com/Skip-Corrupted-Parquet-blocks-footer-tc20418.html
    2. In `FileScanRDD`, we assume that we only begin to read the files when starting to consume the iterator. However, it is possibly the files are read before that. In this case, `ignoreCorruptFiles` config doesn't work too.
    
    This patch targets Parquet datasource. If this direction is ok, we can address the same issue for other datasources like Orc.
    
    Two main changes in this patch:
    
    1. Replace `ParquetFileReader.readAllFootersInParallel` by implementing the logic to read footers in multi-threaded manner
    
        We can't ignore corrupt files if we use `ParquetFileReader.readAllFootersInParallel`. So this patch implements the logic to do the similar thing in `readParquetFootersInParallel`.
    
    2. In `FileScanRDD`, we need to ignore corrupt file too when we call `readFunction` to return iterator.
    
    One thing to notice is:
    
    We read schema from Parquet file's footer. The method to read footer `ParquetFileReader.readFooter` throws `RuntimeException`, instead of `IOException`, if it can't successfully read the footer. Please check out https://github.com/apache/parquet-mr/blob/df9d8e415436292ae33e1ca0b8da256640de9710/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L470. So this patch catches `RuntimeException`.  One concern is that it might also shadow other runtime exceptions other than reading corrupt files.
    
    ## How was this patch tested?
    
    Jenkins tests.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 fix-ignorecorrupted-parquet-files

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16474.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16474
    
----
commit 586b347b04b64ddf2b70e4fb16035f80ad5a400e
Author: Liang-Chi Hsieh <vi...@gmail.com>
Date:   2017-01-05T04:02:13Z

    Make ignoreCorruptFiles work for Parquet.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16474: [SPARK-19082][SQL] Make ignoreCorruptFiles work for Parq...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/16474
  
    LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16474: [SPARK-19082][SQL] Make ignoreCorruptFiles work for Parq...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16474
  
    **[Test build #70903 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70903/testReport)** for PR 16474 at commit [`586b347`](https://github.com/apache/spark/commit/586b347b04b64ddf2b70e4fb16035f80ad5a400e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16474: [SPARK-19082][SQL] Make ignoreCorruptFiles work for Parq...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/16474
  
    thanks, merging to master/2.1!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16474: [SPARK-19082][SQL] Make ignoreCorruptFiles work for Parq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16474
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71053/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16474: [SPARK-19082][SQL] Make ignoreCorruptFiles work f...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16474#discussion_r95778417
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala ---
    @@ -135,11 +135,21 @@ class FileScanRDD(
               try {
                 if (ignoreCorruptFiles) {
                   currentIterator = new NextIterator[Object] {
    -                private val internalIter = readFunction(currentFile)
    +                private val internalIter = {
    +                  try {
    +                    // The readFunction may read files before consuming the iterator.
    +                    // E.g., vectorized Parquet reader.
    +                    readFunction(currentFile)
    --- End diff --
    
    is it possible that we can make `readFunction` guarantee that data reading must happen after the first `Iterator.next`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16474: [SPARK-19082][SQL] Make ignoreCorruptFiles work for Parq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16474
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16474: [SPARK-19082][SQL] Make ignoreCorruptFiles work f...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16474#discussion_r94892677
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -543,6 +546,58 @@ object ParquetFileFormat extends Logging {
       }
     
       /**
    +   * Reads Parquet footers in multi-threaded manner.
    +   * If the config "spark.sql.files.ignoreCorruptFiles" is set to true, we will ignore the corrupted
    +   * files when reading footers.
    +   */
    +  private def readParquetFootersInParallel(
    +      conf: Configuration,
    +      partFiles: Seq[FileStatus],
    +      ignoreCorruptFiles: Boolean): Seq[Footer] = {
    +    val footers = partFiles.map { currentFile =>
    +      new Callable[Option[Footer]]() {
    --- End diff --
    
    ok. let me try to change this to parallel collections.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16474: [SPARK-19082][SQL] Make ignoreCorruptFiles work for Parq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16474
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16474: [SPARK-19082][SQL] Make ignoreCorruptFiles work for Parq...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16474
  
    **[Test build #71417 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71417/testReport)** for PR 16474 at commit [`261e1b5`](https://github.com/apache/spark/commit/261e1b5f295ca35ed2635c75aa9f1b91d8805bd7).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16474: [SPARK-19082][SQL] Make ignoreCorruptFiles work f...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16474#discussion_r94892571
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -543,6 +546,58 @@ object ParquetFileFormat extends Logging {
       }
     
       /**
    +   * Reads Parquet footers in multi-threaded manner.
    +   * If the config "spark.sql.files.ignoreCorruptFiles" is set to true, we will ignore the corrupted
    +   * files when reading footers.
    +   */
    +  private def readParquetFootersInParallel(
    +      conf: Configuration,
    +      partFiles: Seq[FileStatus],
    +      ignoreCorruptFiles: Boolean): Seq[Footer] = {
    +    val footers = partFiles.map { currentFile =>
    +      new Callable[Option[Footer]]() {
    --- End diff --
    
    this seems pretty convoluted. can we jsut use parallel collections to do this?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16474: [SPARK-19082][SQL] Make ignoreCorruptFiles work f...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16474#discussion_r94892633
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -593,13 +650,10 @@ object ParquetFileFormat extends Logging {
                 new FileStatus(length, false, 0, 0, 0, 0, null, null, null, new Path(path))
               }.toSeq
     
    -          // Skips row group information since we only need the schema
    -          val skipRowGroups = true
    -
               // Reads footers in multi-threaded manner within each task
               val footers =
    -            ParquetFileReader.readAllFootersInParallel(
    -              serializedConf.value, fakeFileStatuses.asJava, skipRowGroups).asScala
    +            ParquetFileFormat.readParquetFootersInParallel(
    --- End diff --
    
    ok.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16474: [SPARK-19082][SQL] Make ignoreCorruptFiles work for Parq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16474
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16474: [SPARK-19082][SQL] Make ignoreCorruptFiles work for Parq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16474
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70903/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16474: [SPARK-19082][SQL] Make ignoreCorruptFiles work f...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16474#discussion_r96162649
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala ---
    @@ -135,11 +135,21 @@ class FileScanRDD(
               try {
                 if (ignoreCorruptFiles) {
                   currentIterator = new NextIterator[Object] {
    -                private val internalIter = readFunction(currentFile)
    +                private val internalIter = {
    +                  try {
    +                    // The readFunction may read files before consuming the iterator.
    +                    // E.g., vectorized Parquet reader.
    +                    readFunction(currentFile)
    +                  } catch {
    +                    case e @(_: RuntimeException | _: IOException) =>
    --- End diff --
    
    yeah, I have this concern too in the pr description.
    
    One problem is the error message is varying across data sources. To list all error messages here looks not a good idea.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16474: [SPARK-19082][SQL] Make ignoreCorruptFiles work for Parq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16474
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16474: [SPARK-19082][SQL] Make ignoreCorruptFiles work for Parq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16474
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71417/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16474: [SPARK-19082][SQL] Make ignoreCorruptFiles work for Parq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16474
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71055/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16474: [SPARK-19082][SQL] Make ignoreCorruptFiles work f...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16474#discussion_r95777107
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala ---
    @@ -135,11 +135,21 @@ class FileScanRDD(
               try {
                 if (ignoreCorruptFiles) {
                   currentIterator = new NextIterator[Object] {
    -                private val internalIter = readFunction(currentFile)
    +                private val internalIter = {
    +                  try {
    +                    // The readFunction may read files before consuming the iterator.
    +                    // E.g., vectorized Parquet reader.
    +                    readFunction(currentFile)
    +                  } catch {
    +                    case e @(_: RuntimeException | _: IOException) =>
    +                      logWarning(s"Skipped the rest content in the corrupted file: $currentFile", e)
    +                      null
    --- End diff --
    
    return `Iterator.empty`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16474: [SPARK-19082][SQL] Make ignoreCorruptFiles work f...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16474#discussion_r96162528
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala ---
    @@ -135,11 +135,21 @@ class FileScanRDD(
               try {
                 if (ignoreCorruptFiles) {
                   currentIterator = new NextIterator[Object] {
    -                private val internalIter = readFunction(currentFile)
    +                private val internalIter = {
    +                  try {
    +                    // The readFunction may read files before consuming the iterator.
    +                    // E.g., vectorized Parquet reader.
    +                    readFunction(currentFile)
    --- End diff --
    
    I think it is hard to guarantee this because `readFunction` is coming from individual data source. Even we can modify current data sources, we may not be able to prevent other data sources doing this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16474: [SPARK-19082][SQL] Make ignoreCorruptFiles work for Parq...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16474
  
    **[Test build #71055 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71055/testReport)** for PR 16474 at commit [`6b562eb`](https://github.com/apache/spark/commit/6b562eba8b7d2a508cfd6f972bc18a108c3ed044).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16474: [SPARK-19082][SQL] Make ignoreCorruptFiles work f...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16474#discussion_r94892194
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -593,13 +650,10 @@ object ParquetFileFormat extends Logging {
                 new FileStatus(length, false, 0, 0, 0, 0, null, null, null, new Path(path))
               }.toSeq
     
    -          // Skips row group information since we only need the schema
    -          val skipRowGroups = true
    -
               // Reads footers in multi-threaded manner within each task
               val footers =
    -            ParquetFileReader.readAllFootersInParallel(
    -              serializedConf.value, fakeFileStatuses.asJava, skipRowGroups).asScala
    +            ParquetFileFormat.readParquetFootersInParallel(
    --- End diff --
    
    We can't make it ignore corrupt files. So it successfully reads all footers or completely fails even just one footer is corrupt.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16474: [SPARK-19082][SQL] Make ignoreCorruptFiles work f...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16474#discussion_r94892533
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -593,13 +650,10 @@ object ParquetFileFormat extends Logging {
                 new FileStatus(length, false, 0, 0, 0, 0, null, null, null, new Path(path))
               }.toSeq
     
    -          // Skips row group information since we only need the schema
    -          val skipRowGroups = true
    -
               // Reads footers in multi-threaded manner within each task
               val footers =
    -            ParquetFileReader.readAllFootersInParallel(
    -              serializedConf.value, fakeFileStatuses.asJava, skipRowGroups).asScala
    +            ParquetFileFormat.readParquetFootersInParallel(
    --- End diff --
    
    Can we add some unit tests for readParquetFootersInParallel


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16474: [SPARK-19082][SQL] Make ignoreCorruptFiles work for Parq...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16474
  
    **[Test build #70903 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70903/testReport)** for PR 16474 at commit [`586b347`](https://github.com/apache/spark/commit/586b347b04b64ddf2b70e4fb16035f80ad5a400e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16474: [SPARK-19082][SQL] Make ignoreCorruptFiles work for Parq...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16474
  
    **[Test build #71055 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71055/testReport)** for PR 16474 at commit [`6b562eb`](https://github.com/apache/spark/commit/6b562eba8b7d2a508cfd6f972bc18a108c3ed044).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16474: [SPARK-19082][SQL] Make ignoreCorruptFiles work for Parq...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16474
  
    **[Test build #71053 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71053/testReport)** for PR 16474 at commit [`d6878e1`](https://github.com/apache/spark/commit/d6878e1087eb9d7c32c1084e908a967326c75087).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16474: [SPARK-19082][SQL] Make ignoreCorruptFiles work f...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16474#discussion_r94892001
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -593,13 +650,10 @@ object ParquetFileFormat extends Logging {
                 new FileStatus(length, false, 0, 0, 0, 0, null, null, null, new Path(path))
               }.toSeq
     
    -          // Skips row group information since we only need the schema
    -          val skipRowGroups = true
    -
               // Reads footers in multi-threaded manner within each task
               val footers =
    -            ParquetFileReader.readAllFootersInParallel(
    -              serializedConf.value, fakeFileStatuses.asJava, skipRowGroups).asScala
    +            ParquetFileFormat.readParquetFootersInParallel(
    --- End diff --
    
    what's happening to readAllFootersInParallel?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16474: [SPARK-19082][SQL] Make ignoreCorruptFiles work f...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/16474


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16474: [SPARK-19082][SQL] Make ignoreCorruptFiles work for Parq...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/16474
  
    ping @rxin I do address the previous comments. Can you review again?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16474: [SPARK-19082][SQL] Make ignoreCorruptFiles work f...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16474#discussion_r95777029
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala ---
    @@ -135,11 +135,21 @@ class FileScanRDD(
               try {
                 if (ignoreCorruptFiles) {
                   currentIterator = new NextIterator[Object] {
    -                private val internalIter = readFunction(currentFile)
    +                private val internalIter = {
    +                  try {
    +                    // The readFunction may read files before consuming the iterator.
    +                    // E.g., vectorized Parquet reader.
    +                    readFunction(currentFile)
    +                  } catch {
    +                    case e @(_: RuntimeException | _: IOException) =>
    --- End diff --
    
    shall we also check the error message? or `RuntimeException` may catch other unexpected exceptions.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16474: [SPARK-19082][SQL] Make ignoreCorruptFiles work for Parq...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16474
  
    **[Test build #71417 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71417/testReport)** for PR 16474 at commit [`261e1b5`](https://github.com/apache/spark/commit/261e1b5f295ca35ed2635c75aa9f1b91d8805bd7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16474: [SPARK-19082][SQL] Make ignoreCorruptFiles work for Parq...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16474
  
    **[Test build #71053 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71053/testReport)** for PR 16474 at commit [`d6878e1`](https://github.com/apache/spark/commit/d6878e1087eb9d7c32c1084e908a967326c75087).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org