Posted to issues@arrow.apache.org by "bveeramani (via GitHub)" <gi...@apache.org> on 2023/06/24 02:45:29 UTC

[GitHub] [arrow] bveeramani opened a new issue, #36278: [Python] Add `exclude_invalid_files` to `ParquetDatasource`

bveeramani opened a new issue, #36278:
URL: https://github.com/apache/arrow/issues/36278

   ### Describe the enhancement requested
   
   Add an `exclude_invalid_files` parameter to `pyarrow.parquet.ParquetDataset`.
   
   I want to read Parquet files from a bucket that also contains a JSON metadata file. Because `ParquetDataset` doesn't expose `exclude_invalid_files`, creating the dataset fails with this error:
   
   ```
     File "/Users/balaji/Documents/GitHub/ray/python/ray/data/datasource/parquet_datasource.py", line 204, in __init__
       pq_ds = pq.ParquetDataset(
               ^^^^^^^^^^^^^^^^^^
     File "/Users/balaji/Documents/GitHub/ray/.venv/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 1776, in __new__
       return _ParquetDatasetV2(
              ^^^^^^^^^^^^^^^^^^
     File "/Users/balaji/Documents/GitHub/ray/.venv/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 2490, in __init__
       self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/Users/balaji/Documents/GitHub/ray/.venv/lib/python3.11/site-packages/pyarrow/dataset.py", line 763, in dataset
       return _filesystem_dataset(source, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/Users/balaji/Documents/GitHub/ray/.venv/lib/python3.11/site-packages/pyarrow/dataset.py", line 456, in _filesystem_dataset
       return factory.finish(schema)
              ^^^^^^^^^^^^^^^^^^^^^^
     File "pyarrow/_dataset.pyx", line 2752, in pyarrow._dataset.DatasetFactory.finish
     File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
   pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from 'ray-example-data/iris.json'. Is this a 'parquet' file?: Could not open Parquet input source 'ray-example-data/iris.json': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
   ```
   
   This issue is motivated by https://github.com/ray-project/ray/issues/36753.
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [Python] Add `exclude_invalid_files` to `ParquetDataset` [arrow]

Posted by "bveeramani (via GitHub)" <gi...@apache.org>.
bveeramani commented on issue #36278:
URL: https://github.com/apache/arrow/issues/36278#issuecomment-1793603679

   Ah, we were able to find a workaround, so we aren't working on this.
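
The thread does not say what the workaround was. One possible approach, shown here purely as an assumption, is to drop non-Parquet paths by file extension before constructing the `ParquetDataset`. The helper names `keep_parquet_paths` and `read_parquet_only` are hypothetical.

```python
from pathlib import PurePosixPath


def keep_parquet_paths(paths, suffixes=(".parquet",)):
    """Return only the paths whose file extension looks like Parquet."""
    return [p for p in paths if PurePosixPath(p).suffix.lower() in suffixes]


def read_parquet_only(paths, filesystem=None):
    """Read the Parquet files among `paths`, ignoring everything else."""
    # Imported lazily so the path filter above has no pyarrow dependency.
    import pyarrow.parquet as pq

    parquet_paths = keep_parquet_paths(paths)
    return pq.ParquetDataset(parquet_paths, filesystem=filesystem).read()
```

Filtering by extension is cheaper than validating file contents, but it misses Parquet files without the expected suffix and keeps corrupt files that do have it.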




[GitHub] [arrow] AlenkaF commented on issue #36278: [Python] Add `exclude_invalid_files` to `ParquetDataset`

Posted by "AlenkaF (via GitHub)" <gi...@apache.org>.
AlenkaF commented on issue #36278:
URL: https://github.com/apache/arrow/issues/36278#issuecomment-1619829110

   I think having an option for `exclude_invalid_files` in `pyarrow.parquet.ParquetDataset` is a good feature. Happy to review once you get the PR submitted 👍  




[GitHub] [arrow] bveeramani commented on issue #36278: [Python] Add `exclude_invalid_files` to `ParquetDatasource`

Posted by "bveeramani (via GitHub)" <gi...@apache.org>.
bveeramani commented on issue #36278:
URL: https://github.com/apache/arrow/issues/36278#issuecomment-1605241167

   I'm happy to open a PR for this.




Re: [I] [Python] Add `exclude_invalid_files` to `ParquetDataset` [arrow]

Posted by "juhlie (via GitHub)" <gi...@apache.org>.
juhlie commented on issue #36278:
URL: https://github.com/apache/arrow/issues/36278#issuecomment-1783118686

   Hi, is this feature still in progress?

