Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/23 00:29:04 UTC

[GitHub] [arrow-datafusion] andygrove opened a new issue #1648: Cannot query parquet files generated by Apache Spark from datafusion-cli

andygrove opened a new issue #1648:
URL: https://github.com/apache/arrow-datafusion/issues/1648


   **Describe the bug**
   
   I have a data set created by Apache Spark and I tried to query it from the DataFusion CLI. It failed, saying that a parquet file was corrupt.
   
   ```
   ❯ CREATE EXTERNAL TABLE store_sales STORED AS PARQUET LOCATION 'store_sales.dat';
   0 rows in set. Query took 0.002 seconds.
   ❯ select count(*) from store_sales;
   Parquet reader thread terminated due to error: ParquetError(General("Invalid Parquet file. Corrupt footer"))
   ```
   
   I added some debug logging and found that it was actually trying to read the following file, which is not a Parquet file.
   
   ```
   store_sales.dat/.part-00005-5142b177-bacb-499d-b14f-12de4b94d9d9-c000.snappy.parquet.crc
   ```
   
   **To Reproduce**
   Create a non-Parquet file with a non-Parquet extension and put it in a directory along with some valid Parquet files. Then register the directory as an external Parquet table and query it.
   
   **Expected behavior**
   DataFusion should only try to read files with the `.parquet` file extension.
   
   **Additional context**
   None
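
The expected behavior above can be illustrated with a minimal sketch that uses only the Rust standard library. The helper names `has_suffix` and `filter_by_suffix` are hypothetical and are not part of DataFusion's API. The sketch assumes the directory listing is already available as a list of paths.

```rust
use std::path::{Path, PathBuf};

/// Returns true when the file name ends with `suffix` (for example ".parquet").
/// Hypothetical helper for illustration, not a DataFusion function.
fn has_suffix(path: &Path, suffix: &str) -> bool {
    path.file_name()
        .and_then(|name| name.to_str())
        .map(|name| name.ends_with(suffix))
        .unwrap_or(false)
}

/// Keeps only the paths whose file name ends with `suffix`.
/// Sidecar files such as `*.parquet.crc` are dropped because their
/// names end with ".crc", not ".parquet".
fn filter_by_suffix(paths: &[PathBuf], suffix: &str) -> Vec<PathBuf> {
    paths
        .iter()
        .filter(|p| has_suffix(p.as_path(), suffix))
        .cloned()
        .collect()
}
```

Calling `filter_by_suffix(&paths, ".parquet")` on a listing that contains a data file, its `.crc` checksum file, and a marker file such as `_SUCCESS` would keep only the data file.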
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] andygrove commented on issue #1648: Cannot query parquet files generated by Apache Spark from datafusion-cli

Posted by GitBox <gi...@apache.org>.

andygrove commented on issue #1648:
URL: https://github.com/apache/arrow-datafusion/issues/1648#issuecomment-1019520616


   It would also be nice if the error message included the name of the corrupt file, to make these issues easier to debug.
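
One way to attach the file name to a read error is sketched below, using only the Rust standard library. The `FileReadError` type and the `with_path` helper are hypothetical and only illustrate the idea. They are not the error types used by DataFusion or the parquet crate.

```rust
use std::fmt;
use std::path::{Path, PathBuf};

/// Hypothetical error type that records which file failed.
#[derive(Debug)]
struct FileReadError {
    path: PathBuf,
    message: String,
}

impl fmt::Display for FileReadError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "error reading '{}': {}", self.path.display(), self.message)
    }
}

impl std::error::Error for FileReadError {}

/// Wraps any displayable error with the path of the file being read.
/// Successful results pass through unchanged.
fn with_path<T, E: fmt::Display>(
    path: &Path,
    result: Result<T, E>,
) -> Result<T, FileReadError> {
    result.map_err(|e| FileReadError {
        path: path.to_path_buf(),
        message: e.to_string(),
    })
}
```

With a wrapper like this, a "Corrupt footer" failure would carry the offending path in its message, which would have pointed directly at the `.crc` file in this report.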



[GitHub] [arrow-datafusion] Ted-Jiang commented on issue #1648: Cannot query parquet files generated by Apache Spark from datafusion-cli

Posted by GitBox <gi...@apache.org>.

Ted-Jiang commented on issue #1648:
URL: https://github.com/apache/arrow-datafusion/issues/1648#issuecomment-1019490399


   @houqp please assign this to me 😊.



[GitHub] [arrow-datafusion] alamb closed issue #1648: Cannot query parquet files generated by Apache Spark from datafusion-cli

Posted by GitBox <gi...@apache.org>.

alamb closed issue #1648:
URL: https://github.com/apache/arrow-datafusion/issues/1648


   



[GitHub] [arrow-datafusion] houqp commented on issue #1648: Cannot query parquet files generated by Apache Spark from datafusion-cli

Posted by GitBox <gi...@apache.org>.

houqp commented on issue #1648:
URL: https://github.com/apache/arrow-datafusion/issues/1648#issuecomment-1019426734


   This happens because we do not pass a file extension as the search suffix in https://github.com/apache/arrow-datafusion/blob/9c5ccae240ce38b084128e8d7ff0752d0e2318a6/datafusion/src/execution/context.rs#L232
   
   I think the right behavior is to provide a default extension suffix and let users override it if they are using something different.
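
The "default suffix with user override" idea could look roughly like the sketch below. `ListingConfig`, `with_file_suffix`, `effective_suffix`, `accepts`, and `DEFAULT_PARQUET_SUFFIX` are hypothetical names chosen for illustration. They are not taken from the DataFusion code linked above.

```rust
/// Assumed default for Parquet tables in this sketch.
const DEFAULT_PARQUET_SUFFIX: &str = ".parquet";

/// Hypothetical listing configuration. `None` means "use the default suffix".
#[derive(Debug, Clone, Default)]
struct ListingConfig {
    file_suffix: Option<String>,
}

impl ListingConfig {
    /// Overrides the default suffix, for users whose files use another extension.
    fn with_file_suffix(mut self, suffix: &str) -> Self {
        self.file_suffix = Some(suffix.to_string());
        self
    }

    /// The suffix actually used when listing files.
    fn effective_suffix(&self) -> &str {
        self.file_suffix.as_deref().unwrap_or(DEFAULT_PARQUET_SUFFIX)
    }

    /// Returns true when `file_name` should be read as part of the table.
    fn accepts(&self, file_name: &str) -> bool {
        file_name.ends_with(self.effective_suffix())
    }
}
```

Keeping the override as an `Option` lets the default change in one place without affecting users who set a suffix explicitly.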

