Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/29 08:31:27 UTC

[GitHub] [arrow-datafusion] stuartcarnie opened a new issue, #2982: External parquet table fails when schema contains differing key / value metadata

stuartcarnie opened a new issue, #2982:
URL: https://github.com/apache/arrow-datafusion/issues/2982

   **Describe the bug**
   
   Calling `create external table` for a set of `parquet` files fails with an error:
   
   ```
   ArrowError(SchemaError("Fail to merge schema due to conflicting metadata."))
   ```
   
   This happens even though the Arrow schema (columns and data types) matches across all files.
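   
   To illustrate why differing key / value metadata can make a merge fail while the columns still match, here is a minimal sketch in plain Rust. It is a simplified model, not the arrow-rs implementation: the `Metadata` alias and the `merge_metadata` and `meta` helpers are hypothetical names, and metadata is modelled as a plain string map.
   
   ```rust
   use std::collections::HashMap;

   /// Simplified stand-in for a file's schema-level key/value metadata (assumption).
   type Metadata = HashMap<String, String>;

   /// Merge the metadata of several files. The merge fails when two files
   /// store different values under the same key.
   fn merge_metadata(files: &[Metadata]) -> Result<Metadata, String> {
       let mut merged = Metadata::new();
       for metadata in files {
           for (key, value) in metadata {
               match merged.get(key) {
                   // Same key, different value: there is no obvious winner.
                   Some(existing) if existing != value => {
                       return Err(format!("conflicting metadata for key {:?}", key));
                   }
                   // Same key, same value: nothing to do.
                   Some(_) => {}
                   // New key: carry it over.
                   None => {
                       merged.insert(key.clone(), value.clone());
                   }
               }
           }
       }
       Ok(merged)
   }

   /// Hypothetical helper that builds a metadata map from string pairs.
   fn meta(pairs: &[(&str, &str)]) -> Metadata {
       pairs.iter().map(|(k, v)| (k.to_string(), v.to_string())).collect()
   }

   fn main() {
       // Different keys in each file: the merge succeeds.
       let different = merge_metadata(&[meta(&[("a", "1")]), meta(&[("b", "2")])]);
       assert!(different.is_ok());

       // Same custom key with a different value in each file: the merge fails.
       let conflicting = merge_metadata(&[meta(&[("my_key", "1")]), meta(&[("my_key", "2")])]);
       assert!(conflicting.is_err());
   }
   ```
   
   In this model, files whose metadata merely differs (disjoint keys) merge cleanly. Only a shared key with differing values is rejected, which is the situation described under "To Reproduce".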
   
   **To Reproduce**
   
   Create two parquet files with the same structure (columns and data types), but store a different value for the same custom key in each file's key / value metadata.
   
   **Expected behavior**
   
   The following should succeed as the file data schemas match:
   
   ```
   create external table docker_container_cpu stored as parquet location 'mydata/1/13/1/';
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #2982: External parquet table fails when schema contains differing key / value metadata

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #2982:
URL: https://github.com/apache/arrow-datafusion/issues/2982#issuecomment-1199941680

   Amusingly / sadly there is a test for this (that I wrote) 🤦 -- it checks for different metadata but not for incompatible metadata (the same key with differing values):
   
   https://github.com/apache/arrow-datafusion/blob/3d4c7ef/datafusion/core/tests/sql/parquet.rs#L176-L226
   
   I will fix this.




[GitHub] [arrow-datafusion] stuartcarnie commented on issue #2982: External parquet table fails when schema contains differing key / value metadata

Posted by GitBox <gi...@apache.org>.
stuartcarnie commented on issue #2982:
URL: https://github.com/apache/arrow-datafusion/issues/2982#issuecomment-1199968275

   @alamb I have a hacky solution that removes all the metadata except the `ARROW:schema` key in this code path. 
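   
   A minimal sketch of that kind of workaround, assuming the metadata is available as a plain string map. This is not the actual patch: `keep_only_arrow_schema` is a hypothetical helper, and the `ARROW:schema` values below are placeholders rather than real encoded schemas.
   
   ```rust
   use std::collections::HashMap;

   /// The metadata key mentioned above; every other key is dropped.
   const ARROW_SCHEMA_KEY: &str = "ARROW:schema";

   /// Hypothetical helper: remove every metadata entry except `ARROW:schema`.
   fn keep_only_arrow_schema(metadata: &mut HashMap<String, String>) {
       metadata.retain(|key, _| key.as_str() == ARROW_SCHEMA_KEY);
   }

   fn main() {
       // Two files with the same (placeholder) schema entry but a differing custom key.
       let mut first = HashMap::new();
       first.insert(ARROW_SCHEMA_KEY.to_string(), "<encoded schema>".to_string());
       first.insert("my_key".to_string(), "1".to_string());

       let mut second = HashMap::new();
       second.insert(ARROW_SCHEMA_KEY.to_string(), "<encoded schema>".to_string());
       second.insert("my_key".to_string(), "2".to_string());

       keep_only_arrow_schema(&mut first);
       keep_only_arrow_schema(&mut second);

       // With the custom key gone, the two metadata maps no longer disagree.
       assert_eq!(first, second);
   }
   ```
   
   The trade-off of this approach is that any custom metadata is lost to later consumers of the merged schema.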
   
   I saw there is an option in the arrow-rs repo for the `ParquetFileArrowReader` to skip metadata:
   
   https://github.com/apache/arrow-rs/blob/bedeb4f66663a868846c713bdd8ff0c5bd0983d4/parquet/src/arrow/arrow_reader.rs#L268-L270
   
   However, a similar option wasn't available here. 
   
   If you have a suggested direction for a fix, I am happy to work on a PR.




[GitHub] [arrow-datafusion] alamb commented on issue #2982: External parquet table fails when schema contains differing key / value metadata

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #2982:
URL: https://github.com/apache/arrow-datafusion/issues/2982#issuecomment-1200001227

   Thanks @stuartcarnie! It was such a good one that I was working on a PR as well when you wrote this message (well, mostly the test). The PR does basically the same thing: it adds an option to skip metadata.
   
   The draft is here, but I ran out of time today: https://github.com/apache/arrow-datafusion/pull/2985
   
   I think I like your naming of "skip" better than what I have there now. I will also look at the `ParquetFileArrowReader` 👍 




[GitHub] [arrow-datafusion] alamb commented on issue #2982: External parquet table fails when schema contains differing key / value metadata

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #2982:
URL: https://github.com/apache/arrow-datafusion/issues/2982#issuecomment-1200134437

   https://github.com/apache/arrow-datafusion/pull/2985 is ready for review, and I have verified that the use case described in this ticket now works.




[GitHub] [arrow-datafusion] stuartcarnie commented on issue #2982: External parquet table fails when schema contains differing key / value metadata

Posted by GitBox <gi...@apache.org>.
stuartcarnie commented on issue #2982:
URL: https://github.com/apache/arrow-datafusion/issues/2982#issuecomment-1200045845

   Excellent, thanks @alamb!




[GitHub] [arrow-datafusion] yjshen closed issue #2982: External parquet table fails when schema contains differing key / value metadata

Posted by GitBox <gi...@apache.org>.
yjshen closed issue #2982: External parquet table fails when schema contains differing key / value metadata
URL: https://github.com/apache/arrow-datafusion/issues/2982

