Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/08 12:14:38 UTC

[GitHub] [arrow-datafusion] capkurmagati opened a new issue #1527: Error reading Parquet files after schema evolution

capkurmagati opened a new issue #1527:
URL: https://github.com/apache/arrow-datafusion/issues/1527

**Describe the bug**

(I'm not sure whether this is an arrow-rs or an arrow-datafusion bug.)
Reading Parquet files whose schema has evolved can fail with an error at
https://github.com/apache/arrow-rs/blob/6.0.0/parquet/src/schema/types.rs#L886-L895
It seems that the physical plan doesn't pass the desired schema to the Parquet reader
https://github.com/apache/arrow-datafusion/blob/6.0.0/datafusion/src/physical_plan/file_format/parquet.rs#L408-L422
and that `ParquetFileArrowReader` can only infer the schema from the file
https://github.com/apache/arrow-rs/blob/6.4.0/parquet/src/arrow/arrow_reader.rs#L86-L92
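
To make the gap between the table schema and a file's own schema concrete, here is a minimal sketch using only the Rust standard library. It is not DataFusion or arrow-rs code: the function name `map_table_to_file` and the use of plain string slices for schemas are assumptions made for illustration. It resolves each table column by name against one file's schema, so a column the file never had shows up as `None` rather than as an out-of-range index.

```rust
/// For each column of the table schema, find its position in the file schema.
/// `None` means the file does not contain that column (hypothetical helper,
/// not part of DataFusion or arrow-rs).
fn map_table_to_file(table_schema: &[&str], file_schema: &[&str]) -> Vec<Option<usize>> {
    table_schema
        .iter()
        .map(|name| file_schema.iter().position(|f| f == name))
        .collect()
}

fn main() {
    // The table schema after evolution.
    let table_schema = ["col_1", "col_2"];
    // One file written before `col_2` existed, one written after.
    let old_file = ["col_1"];
    let new_file = ["col_1", "col_2"];

    // prints "[Some(0), None]"
    println!("{:?}", map_table_to_file(&table_schema, &old_file));
    // prints "[Some(0), Some(1)]"
    println!("{:?}", map_table_to_file(&table_schema, &new_file));
}
```

A reader that only knows the file's schema has no way to produce the `None` entry, which is consistent with the behavior described above.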

**To Reproduce**
Steps to reproduce the behavior:

1. Create a Parquet file with schema `col_1 int`
2. Create another Parquet file with schema `col_1 int, col_2 int`
3. Implement a `TableProvider` that uses `ParquetExec` and specifies the schema `col_1 int, col_2 int` in `scan`
4. Register the table and run `select * from the_table` (`*` includes `col_2`, but one of the files doesn't have that column)

Or:
1. Create a Parquet file with schema `col_1 int`
2. Create another Parquet file with schema `col_1 int, col_2 int`
3. Create an external table via the CLI and run `select * from the_table`

This produces the following error:
> Parquet reader thread terminated due to error: ParquetError(General("Invalid Parquet file. Corrupt footer"))

**Expected behavior**

The query executes without error and returns `NULL` for `col_2` when a file doesn't contain that column.


--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] houqp edited a comment on issue #1527: Error reading Parquet files after schema evolution

Posted by GitBox <gi...@apache.org>.

houqp edited a comment on issue #1527:
URL: https://github.com/apache/arrow-datafusion/issues/1527#issuecomment-1009175393


   Probably caused by https://github.com/apache/arrow-datafusion/issues/132? If so, the fix should be fairly straightforward.



[GitHub] [arrow-datafusion] capkurmagati closed issue #1527: Error reading Parquet files after schema evolution

Posted by GitBox <gi...@apache.org>.

capkurmagati closed issue #1527:
URL: https://github.com/apache/arrow-datafusion/issues/1527


   



[GitHub] [arrow-datafusion] capkurmagati commented on issue #1527: Error reading Parquet files after schema evolution

Posted by GitBox <gi...@apache.org>.

capkurmagati commented on issue #1527:
URL: https://github.com/apache/arrow-datafusion/issues/1527#issuecomment-1020186958


   @tustvold @alamb Thanks for the pointer and sorry that I couldn't respond quickly.
   I wrote some tests and verified that my problem got resolved by #1622. Let me close this issue.
   Thanks again.



[GitHub] [arrow-datafusion] houqp commented on issue #1527: Error reading Parquet files after schema evolution

Posted by GitBox <gi...@apache.org>.

houqp commented on issue #1527:
URL: https://github.com/apache/arrow-datafusion/issues/1527#issuecomment-1009175393


   Probably caused by https://github.com/apache/arrow-datafusion/issues/132?



[GitHub] [arrow-datafusion] tustvold commented on issue #1527: Error reading Parquet files after schema evolution

Posted by GitBox <gi...@apache.org>.

tustvold commented on issue #1527:
URL: https://github.com/apache/arrow-datafusion/issues/1527#issuecomment-1013687826


   Not sure if this is related, but in IOx we handle this at the query layer with a component we call [SchemaAdapterStream](https://github.com/influxdata/influxdb_iox/blob/f3f6f335a93d2910a5cc55e12662dfda82143701/query/src/provider/adapter.rs). It is created with an output schema and inserts null columns, as needed, into the RecordBatches that pass through it.
   
   There are some IOx-specific details, but I suspect a generic version could be extracted for use by DataFusion.
   
   @alamb might have more thoughts on this as the original author of that component
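
The idea of an adapter that is created with an output schema and pads missing columns with nulls can be sketched roughly as follows. This is a simplified, standard-library-only model and not the IOx `SchemaAdapterStream` itself: the `Batch` and `SchemaAdapter` types, the `adapt` method, and the restriction to nullable `i32` columns are all assumptions made for illustration.

```rust
/// A toy record batch: named columns of nullable 32-bit integers.
/// (Hypothetical stand-in for an Arrow RecordBatch.)
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    columns: Vec<(String, Vec<Option<i32>>)>,
    num_rows: usize,
}

/// Hypothetical adapter, built once with the desired output schema.
struct SchemaAdapter {
    output_schema: Vec<String>,
}

impl SchemaAdapter {
    fn new(output_schema: &[&str]) -> Self {
        Self {
            output_schema: output_schema.iter().map(|s| s.to_string()).collect(),
        }
    }

    /// Rebuild the batch in output-schema order, inserting an all-null
    /// column wherever the input batch lacks a column of that name.
    fn adapt(&self, batch: &Batch) -> Batch {
        let columns = self
            .output_schema
            .iter()
            .map(|name| {
                let values = batch
                    .columns
                    .iter()
                    .find(|(n, _)| n == name)
                    .map(|(_, v)| v.clone())
                    .unwrap_or_else(|| vec![None; batch.num_rows]);
                (name.clone(), values)
            })
            .collect();
        Batch { columns, num_rows: batch.num_rows }
    }
}

fn main() {
    let adapter = SchemaAdapter::new(&["col_1", "col_2"]);

    // A batch read from a file written before `col_2` existed.
    let old = Batch {
        columns: vec![("col_1".to_string(), vec![Some(1), Some(2)])],
        num_rows: 2,
    };

    let adapted = adapter.adapt(&old);
    // `col_2` is now present and filled with nulls.
    println!("{:?}", adapted.columns);
}
```

A real implementation would also have to handle data types and would typically avoid copying column data; the sketch clones values only to stay short.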



[GitHub] [arrow-datafusion] alamb commented on issue #1527: Error reading Parquet files after schema evolution

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1527:
URL: https://github.com/apache/arrow-datafusion/issues/1527#issuecomment-1014610661


   Thanks for the report @capkurmagati -- I am not sure whether your use case ever worked (if it did, this is a bug).
   
   Regardless, as @tustvold mentions, we have basically the same use case in IOx, where some Parquet files have a subset of the unified schema and we pad the remaining columns with NULLs.
   
   This picture might help: https://github.com/influxdata/influxdb_iox/blob/f3f6f335a93d2910a5cc55e12662dfda82143701/query/src/provider/adapter.rs#L45-L72
   
   We would be happy to contribute this to DataFusion / the file reader. @capkurmagati is there any chance you could write an end-to-end test (i.e. create the two Parquet files you refer to above)? If so, bringing in the `SchemaAdapter` stream would be pretty straightforward.

