Posted to github@arrow.apache.org by "mmwinther (via GitHub)" <gi...@apache.org> on 2023/02/23 14:23:05 UTC
[GitHub] [arrow] mmwinther commented on issue #30481: [Python] Reading Hive-style partitioned parquet files from GCS
mmwinther commented on issue #30481:
URL: https://github.com/apache/arrow/issues/30481#issuecomment-1441883373
We see the same bug with PyArrow v11. The same partitioned directory reads without error with either of the following:
- PyArrow v10
- Passing `use_legacy_dataset=False`
## Code example
```python
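import gcsfs
import pyarrow.parquet as pq  # provides ParquetDataset

# Uses the default credentials available in the environment
gcs = gcsfs.GCSFileSystem()
# Fails with PyArrow v11 when pointed at the partitioned directory
parquet_ds = pq.ParquetDataset("<redacted-bucket-name>/partition/dir", filesystem=gcs)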
```
## Details of filesystem object
```
gcs.isdir(partition_base_dir_path)
>> True
```
```
gcs.exists(partition_base_dir_path)
>> True
```
```
gcs.info(partition_base_dir_path)
>> {'bucket': '<redacted-bucket-name>',
'name': '<redacted-bucket-name>/partition/dir',
'size': 0,
'storageClass': 'DIRECTORY',
'type': 'directory'}
```
```
gcs.find(partition_base_dir_path, maxdepth=None, withdirs=True, detail=True)
>> {'<redacted-bucket-name>/partition/dir': {'Key': 'partition/dir',
'Size': 0,
'name': '<redacted-bucket-name>/partition/dir',
'StorageClass': 'DIRECTORY',
'type': 'directory',
'size': 0},
'<redacted-bucket-name>/partition/dir/': {'kind': 'storage#object',
'id': '<redacted-bucket-name>/partition/dir//1645614930653539',
'selfLink': 'https://www.googleapis.com/storage/v1/b/<redacted-bucket-name>/o/partition%2Fdir%2F',
'mediaLink': 'https://storage.googleapis.com/download/storage/v1/b/<redacted-bucket-name>/o/partition%2Fdir%2F?generation=1645614930653539&alt=media',
'name': '<redacted-bucket-name>/partition/dir/',
'bucket': '<redacted-bucket-name>',
'generation': '1645614930653539',
'metageneration': '1',
'contentType': 'application/octet-stream',
'storageClass': 'STANDARD',
'size': 0,
'md5Hash': '1B2M2Y8AsgTpgAmY7PhCfg==',
'crc32c': 'AAAAAA==',
'etag': 'COPig6vZlfYCEAE=',
'timeCreated': '2022-02-23T11:15:30.656Z',
'updated': '2022-02-23T11:15:30.656Z',
'timeStorageClassUpdated': '2022-02-23T11:15:30.656Z',
'type': 'file'},
'<redacted-bucket-name>/partition/dir/_SUCCESS': {'kind': 'storage#object',
'id': '<redacted-bucket-name>/partition/dir/_SUCCESS/1645614930849546',
'selfLink': 'https://www.googleapis.com/storage/v1/b/<redacted-bucket-name>/o/partition%2Fdir%2F_SUCCESS',
'mediaLink': 'https://storage.googleapis.com/download/storage/v1/b/<redacted-bucket-name>/o/partition%2Fdir%2F_SUCCESS?generation=1645614930849546&alt=media',
'name': '<redacted-bucket-name>/partition/dir/_SUCCESS',
'bucket': '<redacted-bucket-name>',
'generation': '1645614930849546',
'metageneration': '1',
'contentType': 'application/octet-stream',
'storageClass': 'STANDARD',
'size': 0,
'md5Hash': '1B2M2Y8AsgTpgAmY7PhCfg==',
'crc32c': 'AAAAAA==',
'etag': 'CIrej6vZlfYCEAE=',
'timeCreated': '2022-02-23T11:15:30.851Z',
'updated': '2022-02-23T11:15:30.851Z',
'timeStorageClassUpdated': '2022-02-23T11:15:30.851Z',
'type': 'file'},
'<redacted-bucket-name>/partition/dir/part-00000-d2924cb4-ae44-4401-8bfd-b67a4c9ab9ce-c000.snappy.parquet': {'kind': 'storage#object',
'id': '<redacted-bucket-name>/partition/dir/part-00000-d2924cb4-ae44-4401-8bfd-b67a4c9ab9ce-c000.snappy.parquet/1645614930433213',
'selfLink': 'https://www.googleapis.com/storage/v1/b/<redacted-bucket-name>/o/partition%2Fdir%2Fpart-00000-d2924cb4-ae44-4401-8bfd-b67a4c9ab9ce-c000.snappy.parquet',
'mediaLink': 'https://storage.googleapis.com/download/storage/v1/b/<redacted-bucket-name>/o/partition%2Fdir%2Fpart-00000-d2924cb4-ae44-4401-8bfd-b67a4c9ab9ce-c000.snappy.parquet?generation=1645614930433213&alt=media',
'name': '<redacted-bucket-name>/partition/dir/part-00000-d2924cb4-ae44-4401-8bfd-b67a4c9ab9ce-c000.snappy.parquet',
'bucket': '<redacted-bucket-name>',
'generation': '1645614930433213',
'metageneration': '1',
'contentType': 'application/octet-stream',
'storageClass': 'STANDARD',
'size': 993,
'md5Hash': 'U/QwH2SM2un91crFW2lqWA==',
'crc32c': '1iQeEw==',
'etag': 'CL2p9qrZlfYCEAE=',
'timeCreated': '2022-02-23T11:15:30.435Z',
'updated': '2022-02-23T11:15:30.435Z',
'timeStorageClassUpdated': '2022-02-23T11:15:30.435Z',
'type': 'file'},
'<redacted-bucket-name>/partition/dir/part-00001-d2924cb4-ae44-4401-8bfd-b67a4c9ab9ce-c000.snappy.parquet': {'kind': 'storage#object',
'id': '<redacted-bucket-name>/partition/dir/part-00001-d2924cb4-ae44-4401-8bfd-b67a4c9ab9ce-c000.snappy.parquet/1645614930179084',
'selfLink': 'https://www.googleapis.com/storage/v1/b/<redacted-bucket-name>/o/partition%2Fdir%2Fpart-00001-d2924cb4-ae44-4401-8bfd-b67a4c9ab9ce-c000.snappy.parquet',
'mediaLink': 'https://storage.googleapis.com/download/storage/v1/b/<redacted-bucket-name>/o/partition%2Fdir%2Fpart-00001-d2924cb4-ae44-4401-8bfd-b67a4c9ab9ce-c000.snappy.parquet?generation=1645614930179084&alt=media',
'name': '<redacted-bucket-name>/partition/dir/part-00001-d2924cb4-ae44-4401-8bfd-b67a4c9ab9ce-c000.snappy.parquet',
'bucket': '<redacted-bucket-name>',
'generation': '1645614930179084',
'metageneration': '1',
'contentType': 'application/octet-stream',
'storageClass': 'STANDARD',
'size': 993,
'md5Hash': 'b5UF9wxWOK9lzAl9HnSMjw==',
'crc32c': '1Es0QA==',
'etag': 'CIzo5qrZlfYCEAE=',
'timeCreated': '2022-02-23T11:15:30.188Z',
'updated': '2022-02-23T11:15:30.188Z',
'timeStorageClassUpdated': '2022-02-23T11:15:30.188Z',
'type': 'file'}}
```
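One thing that stands out in the listing above is the zero-byte object `<redacted-bucket-name>/partition/dir/`, which is reported with `'type': 'file'` next to the `'type': 'directory'` entry for the same prefix. Whether this placeholder object is what triggers the bug is not confirmed here. As a rough sketch only, a listing of this shape could be reduced to the actual data files as follows. The helper name `data_files` and the trimmed sample listing are hypothetical and exist only for illustration:

```python
def data_files(listing, suffix=".parquet"):
    """Return sorted paths of non-empty files ending in `suffix`.

    `listing` is assumed to be a dict of path -> info dict, shaped like
    the output of `gcs.find(..., withdirs=True, detail=True)`.
    """
    selected = []
    for path, info in listing.items():
        if info.get("type") != "file":
            continue  # skip entries reported as directories
        if path.endswith("/"):
            continue  # skip placeholder objects named like a directory
        if info.get("size", 0) == 0:
            continue  # skip empty markers such as _SUCCESS
        if not path.endswith(suffix):
            continue  # skip anything that is not a data file
        selected.append(path)
    return sorted(selected)


# Trimmed, hypothetical listing that mirrors the structure shown above
sample = {
    "bucket/partition/dir": {"type": "directory", "size": 0},
    "bucket/partition/dir/": {"type": "file", "size": 0},
    "bucket/partition/dir/_SUCCESS": {"type": "file", "size": 0},
    "bucket/partition/dir/part-00000.snappy.parquet": {"type": "file", "size": 993},
    "bucket/partition/dir/part-00001.snappy.parquet": {"type": "file", "size": 993},
}

print(data_files(sample))
```

With the sample above, only the two `part-*.snappy.parquet` paths remain. We have not tested whether reading from such a filtered list avoids the error.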