Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/02/03 14:02:02 UTC

[GitHub] [arrow-datafusion] tustvold opened a new issue #1736: File Extension Agnostic Schema Inference

tustvold opened a new issue #1736:
URL: https://github.com/apache/arrow-datafusion/issues/1736


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   I wrote a test approximating
   
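   ```
   // NamedTempFile (unlike tempfile::tempfile()) exposes a path; it has no ".parquet" extension
   let file = tempfile::NamedTempFile::new()?;
   
   // ... write parquet data ...
   
   let mut context = ExecutionContext::new();
   // Registration succeeds, but the inferred schema is empty
   context.register_parquet("t", file.path().to_str().unwrap()).await?;
   // Fails with "Invalid identifier" because the column is missing from the schema
   context.sql("select column from t").await?;
   ```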
   
   This would result in "Invalid identifier" errors, effectively claiming the column didn't exist. I verified the file existed, had the correct columns, etc... I was very confused :laughing: 
   
   Eventually I tracked this down to the schema being inferred as empty if the extension is not ".parquet", which feels unexpected.
   
   **Describe the solution you'd like**
   
   Either "register_parquet" should return an error if the extension is missing, or `FileFormat::infer_schema` should be more agnostic to file extensions.
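   
   As a rough illustration of the extension-agnostic option, a check could look at the file contents instead of the file name. The sketch below relies on the fact that a Parquet file with a plaintext footer starts and ends with the 4-byte magic `PAR1`. The helper name `looks_like_parquet` is hypothetical and is not part of DataFusion; files with an encrypted footer use a different magic and are not handled here.
   
   ```rust
   use std::fs::File;
   use std::io::{self, Read, Seek, SeekFrom};
   use std::path::Path;
   
   /// Magic bytes that open and close a Parquet file with a plaintext footer.
   const PARQUET_MAGIC: &[u8; 4] = b"PAR1";
   
   /// Hypothetical helper (not a DataFusion API): returns true when the file
   /// starts and ends with the Parquet magic bytes, whatever its name is.
   pub fn looks_like_parquet(path: &Path) -> io::Result<bool> {
       let mut file = File::open(path)?;
       let len = file.metadata()?.len();
       // Lower bound: header magic + 4-byte footer length + footer magic.
       if len < 12 {
           return Ok(false);
       }
       let mut head = [0u8; 4];
       file.read_exact(&mut head)?;
       // Jump to the last four bytes of the file and read the trailing magic.
       file.seek(SeekFrom::End(-4))?;
       let mut tail = [0u8; 4];
       file.read_exact(&mut tail)?;
       Ok(&head == PARQUET_MAGIC && &tail == PARQUET_MAGIC)
   }
   ```
   
   A check like this only says that a file is plausibly Parquet; reading the footer metadata would still be needed to obtain the schema itself.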
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #1736: Parquet files without `.parquet` extension inferred as having no schema

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1736:
URL: https://github.com/apache/arrow-datafusion/issues/1736#issuecomment-1031521360


   >  but I do think silently inferring an empty schema makes for a pretty poor UX 😅
   
   I agree -- an error about "can not infer schema" seems much better than silently ignoring
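   
   A minimal sketch of that behaviour, assuming the file list is filtered by extension before inference: when nothing matches, return an error rather than carrying on with zero files. The names `InferError` and `files_for_inference` are hypothetical and do not mirror DataFusion's actual types.
   
   ```rust
   use std::fmt;
   
   /// Hypothetical error type for this sketch; not DataFusion's own error enum.
   #[derive(Debug, PartialEq)]
   pub enum InferError {
       NoMatchingFiles { path: String, extension: String },
   }
   
   impl fmt::Display for InferError {
       fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
           match self {
               InferError::NoMatchingFiles { path, extension } => write!(
                   f,
                   "cannot infer schema: no files with extension '{}' found at '{}'",
                   extension, path
               ),
           }
       }
   }
   
   impl std::error::Error for InferError {}
   
   /// Keeps the files that end with the expected extension and fails loudly
   /// when none remain, so the caller never infers a schema from zero files.
   pub fn files_for_inference<'a>(
       table_path: &str,
       files: &[&'a str],
       extension: &str,
   ) -> Result<Vec<&'a str>, InferError> {
       let matching: Vec<&'a str> = files
           .iter()
           .copied()
           .filter(|f| f.ends_with(extension))
           .collect();
       if matching.is_empty() {
           return Err(InferError::NoMatchingFiles {
               path: table_path.to_string(),
               extension: extension.to_string(),
           });
       }
       Ok(matching)
   }
   ```
   
   With this shape, registering a temporary file that lacks the extension would surface the problem at registration time instead of at query time.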



[GitHub] [arrow-datafusion] tustvold commented on issue #1736: Parquet files without `.parquet` extension inferred as having no schema

Posted by GitBox <gi...@apache.org>.

tustvold commented on issue #1736:
URL: https://github.com/apache/arrow-datafusion/issues/1736#issuecomment-1031255521


   @Ted-Jiang Removing this line in the SQL benchmark https://github.com/apache/arrow-datafusion/pull/1738/files#diff-d1dbff8af63c3a3fe4d918432f982181b40fa4b7e1641522a6a48904f521fc89R143 is what caused the issues.
   
   I don't feel strongly whether the `.parquet` extension should be mandatory or not, but I do think silently inferring an empty schema makes for a pretty poor UX :sweat_smile: 



[GitHub] [arrow-datafusion] tustvold edited a comment on issue #1736: Parquet files without `.parquet` extension inferred as having no schema

Posted by GitBox <gi...@apache.org>.

tustvold edited a comment on issue #1736:
URL: https://github.com/apache/arrow-datafusion/issues/1736#issuecomment-1031255521


   @Ted-Jiang Removing this line in the SQL benchmark https://github.com/apache/arrow-datafusion/pull/1738/files#diff-d1dbff8af63c3a3fe4d918432f982181b40fa4b7e1641522a6a48904f521fc89R143 is what caused the issues.
   
   I don't feel strongly whether the `.parquet` extension should be mandatory or not, but I do think silently inferring an empty schema makes for a pretty poor UX :sweat_smile: 



[GitHub] [arrow-datafusion] alamb commented on issue #1736: Parquet files without `.parquet` extension inferred as having no schema

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1736:
URL: https://github.com/apache/arrow-datafusion/issues/1736#issuecomment-1029410536


   Changed this description to be a bug, which I think better reflects what is going on.



[GitHub] [arrow-datafusion] Ted-Jiang commented on issue #1736: Parquet files without `.parquet` extension inferred as having no schema

Posted by GitBox <gi...@apache.org>.

Ted-Jiang commented on issue #1736:
URL: https://github.com/apache/arrow-datafusion/issues/1736#issuecomment-1031048883


   @tustvold may I ask how you wrote the temporary Parquet file? IMHO, all Parquet data files should end in `.parquet`. 

