Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/06/08 16:11:03 UTC

[GitHub] [arrow] lidavidm opened a new pull request #10483: ARROW-12827: [C++] Improve error message for dataset discovery failure

lidavidm opened a new pull request #10483:
URL: https://github.com/apache/arrow/pull/10483


   This adds a bit more context to the error messages, though maybe this is a bit wordy?
   
   ```
   >>> ds.dataset('dataset4', format="ipc")
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/home/lidavidm/Code/upstream/arrow-12827/python/pyarrow/dataset.py", line 655, in dataset
       return _filesystem_dataset(source, **kwargs)
     File "/home/lidavidm/Code/upstream/arrow-12827/python/pyarrow/dataset.py", line 410, in _filesystem_dataset
       return factory.finish(schema)
     File "pyarrow/_dataset.pyx", line 2262, in pyarrow._dataset.DatasetFactory.finish
       return Dataset.wrap(GetResultValue(result))
     File "pyarrow/error.pxi", line 141, in pyarrow.lib.pyarrow_internal_check_status
       return check_status(status)
     File "pyarrow/error.pxi", line 97, in pyarrow.lib.check_status
       raise ArrowInvalid(message)
   pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from 'dataset4/foo.parquet': Could not open IPC input source 'dataset4/foo.parquet': File is too small: 9. Is this a 'ipc' file?
   >>> ds.dataset('dataset5', format="parquet")
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/home/lidavidm/Code/upstream/arrow-12827/python/pyarrow/dataset.py", line 655, in dataset
       return _filesystem_dataset(source, **kwargs)
     File "/home/lidavidm/Code/upstream/arrow-12827/python/pyarrow/dataset.py", line 410, in _filesystem_dataset
       return factory.finish(schema)
     File "pyarrow/_dataset.pyx", line 2262, in pyarrow._dataset.DatasetFactory.finish
       return Dataset.wrap(GetResultValue(result))
     File "pyarrow/error.pxi", line 141, in pyarrow.lib.pyarrow_internal_check_status
       return check_status(status)
     File "pyarrow/error.pxi", line 112, in pyarrow.lib.check_status
       raise IOError(message)
   OSError: Error creating dataset. Could not read schema from 'dataset5/foo.parquet': Could not open Parquet input source 'dataset5/foo.parquet': Invalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?
   ```
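The layering in the two messages above can be sketched in a few lines of Python. This is only an illustration of how the final text appears to be composed, not the C++ change in this PR; `with_discovery_context` is a hypothetical name. It also suggests where the doubled period in the Parquet message comes from: the inner error already ends with one.

```python
def with_discovery_context(path: str, format_name: str, inner: str) -> str:
    """Hypothetical helper: wrap a low-level reader error with discovery context.

    Mirrors the shape of the messages shown above; it is not pyarrow API.
    """
    return (
        f"Error creating dataset. Could not read schema from '{path}': "
        f"{inner}. Is this a '{format_name}' file?"
    )


# Inner error text copied from the IPC traceback above.
ipc_message = with_discovery_context(
    "dataset4/foo.parquet",
    "ipc",
    "Could not open IPC input source 'dataset4/foo.parquet': File is too small: 9",
)
print(ipc_message)
```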


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on pull request #10483: ARROW-12827: [C++] Improve error message for dataset discovery failure

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #10483:
URL: https://github.com/apache/arrow/pull/10483#issuecomment-861682228


   Is everyone happy with the error message here? :slightly_smiling_face: 





[GitHub] [arrow] lidavidm commented on pull request #10483: ARROW-12827: [C++] Improve error message for dataset discovery failure

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #10483:
URL: https://github.com/apache/arrow/pull/10483#issuecomment-857688984


   Alright, I added back the 'Is this a XYZ file' message.





[GitHub] [arrow] pitrou commented on pull request #10483: ARROW-12827: [C++] Improve error message for dataset discovery failure

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10483:
URL: https://github.com/apache/arrow/pull/10483#issuecomment-857686569


   I think the "Is this a XYZ file?" conveys the information quite clearly (and invites the user to check).





[GitHub] [arrow] github-actions[bot] commented on pull request #10483: ARROW-12827: [C++] Improve error message for dataset discovery failure

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10483:
URL: https://github.com/apache/arrow/pull/10483#issuecomment-856906270


   https://issues.apache.org/jira/browse/ARROW-12827





[GitHub] [arrow] lidavidm closed pull request #10483: ARROW-12827: [C++] Improve error message for dataset discovery failure

Posted by GitBox <gi...@apache.org>.
lidavidm closed pull request #10483:
URL: https://github.com/apache/arrow/pull/10483


   





[GitHub] [arrow] lidavidm commented on pull request #10483: ARROW-12827: [C++] Improve error message for dataset discovery failure

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #10483:
URL: https://github.com/apache/arrow/pull/10483#issuecomment-857649332


   This is even more wordy, but perhaps `If reading a different format than 'Parquet', pass the intended format to the dataset/factory`?





[GitHub] [arrow] jorisvandenbossche commented on pull request #10483: ARROW-12827: [C++] Improve error message for dataset discovery failure

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #10483:
URL: https://github.com/apache/arrow/pull/10483#issuecomment-857614846


   I think the "Is this a XYZ file?" is actually quite useful, and not too verbose, because this is an easy error to hit when you read a non-Parquet file and forget to specify the format (the default is "parquet"; the format is not inferred from the file, as users might expect).
   
   This relates to a PR @thisisnic did for improving this error message on the R side -> https://github.com/apache/arrow/pull/10326 (this PR might cover the custom handling you added in R?)
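Since the format is not inferred, a caller who is unsure what a file contains could check its leading magic bytes before calling `ds.dataset`. Below is a minimal sketch, assuming only that Parquet files begin with `PAR1` and Arrow IPC files begin with `ARROW1`. `guess_format` is a hypothetical helper, not part of pyarrow, and it does not validate the footer or handle other formats.

```python
from typing import Optional

PARQUET_MAGIC = b"PAR1"      # leading magic bytes of a Parquet file
ARROW_FILE_MAGIC = b"ARROW1"  # leading magic bytes of an Arrow IPC file


def guess_format(path: str) -> Optional[str]:
    """Hypothetical helper: guess a dataset format name from the file header.

    Returns "parquet", "ipc", or None when neither magic string matches.
    Only the first bytes are inspected, so a truncated or corrupted file
    can still be misreported.
    """
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(PARQUET_MAGIC):
        return "parquet"
    if head.startswith(ARROW_FILE_MAGIC):
        return "ipc"
    return None


# Possible usage (not executed here; requires pyarrow):
#   import pyarrow.dataset as ds
#   fmt = guess_format("dataset4/foo.parquet")
#   if fmt is not None:
#       dataset = ds.dataset("dataset4", format=fmt)
```

A file named `foo.parquet` that is only 9 bytes long, as in the first traceback of this PR, may match neither header, in which case the helper returns `None` and the caller has to decide what to do.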





[GitHub] [arrow] jorisvandenbossche commented on pull request #10483: ARROW-12827: [C++] Improve error message for dataset discovery failure

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #10483:
URL: https://github.com/apache/arrow/pull/10483#issuecomment-857617366


   > I think the "Is this a XYZ file?" is actually quite useful, and not too verbose
   
   Of course, the "Could not open Parquet input source" part already gives a hint for that.

