Posted to github@arrow.apache.org by "mmwinther (via GitHub)" <gi...@apache.org> on 2023/02/23 14:23:05 UTC

[GitHub] [arrow] mmwinther commented on issue #30481: [Python] Reading Hive-style partitioned parquet files from GCS

mmwinther commented on issue #30481:
URL: https://github.com/apache/arrow/issues/30481#issuecomment-1441883373

   We see the same bug with PyArrow v11. The same partitioned directory reads correctly with either of:
   
   - PyArrow v10
   - Use of `use_legacy_dataset=False`
   
   ## Code example
   ```python
   import gcsfs
   import pyarrow.parquet as pq
    
   gcs = gcsfs.GCSFileSystem()
   parquet_ds = pq.ParquetDataset("<redacted-bucket-name>/partition/dir", filesystem=gcs)
   ```
   
   ## Details of filesystem object
   ```
   gcs.isdir(partition_base_dir_path)
   >> True
   ```
   ```
   gcs.exists(partition_base_dir_path)
   >> True
   ```
   ```
   gcs.info(partition_base_dir_path)
   >> {'bucket': '<redacted-bucket-name>',
       'name': '<redacted-bucket-name>/partition/dir',
       'size': 0,
       'storageClass': 'DIRECTORY',
       'type': 'directory'}
   ```
   ```
   gcs.find(partition_base_dir_path, maxdepth=None, withdirs=True, detail=True)
   {'<redacted-bucket-name>/partition/dir': {'Key': 'partition/dir',
     'Size': 0,
     'name': '<redacted-bucket-name>/partition/dir',
     'StorageClass': 'DIRECTORY',
     'type': 'directory',
     'size': 0},
    '<redacted-bucket-name>/partition/dir/': {'kind': 'storage#object',
     'id': '<redacted-bucket-name>/partition/dir//1645614930653539',
     'selfLink': 'https://www.googleapis.com/storage/v1/b/<redacted-bucket-name>/o/partition%2Fdir%2F',
     'mediaLink': 'https://storage.googleapis.com/download/storage/v1/b/<redacted-bucket-name>/o/partition%2Fdir%2F?generation=1645614930653539&alt=media',
     'name': '<redacted-bucket-name>/partition/dir/',
     'bucket': '<redacted-bucket-name>',
     'generation': '1645614930653539',
     'metageneration': '1',
     'contentType': 'application/octet-stream',
     'storageClass': 'STANDARD',
     'size': 0,
     'md5Hash': '1B2M2Y8AsgTpgAmY7PhCfg==',
     'crc32c': 'AAAAAA==',
     'etag': 'COPig6vZlfYCEAE=',
     'timeCreated': '2022-02-23T11:15:30.656Z',
     'updated': '2022-02-23T11:15:30.656Z',
     'timeStorageClassUpdated': '2022-02-23T11:15:30.656Z',
     'type': 'file'},
    '<redacted-bucket-name>/partition/dir/_SUCCESS': {'kind': 'storage#object',
     'id': '<redacted-bucket-name>/partition/dir/_SUCCESS/1645614930849546',
     'selfLink': 'https://www.googleapis.com/storage/v1/b/<redacted-bucket-name>/o/partition%2Fdir%2F_SUCCESS',
     'mediaLink': 'https://storage.googleapis.com/download/storage/v1/b/<redacted-bucket-name>/o/partition%2Fdir%2F_SUCCESS?generation=1645614930849546&alt=media',
     'name': '<redacted-bucket-name>/partition/dir/_SUCCESS',
     'bucket': '<redacted-bucket-name>',
     'generation': '1645614930849546',
     'metageneration': '1',
     'contentType': 'application/octet-stream',
     'storageClass': 'STANDARD',
     'size': 0,
     'md5Hash': '1B2M2Y8AsgTpgAmY7PhCfg==',
     'crc32c': 'AAAAAA==',
     'etag': 'CIrej6vZlfYCEAE=',
     'timeCreated': '2022-02-23T11:15:30.851Z',
     'updated': '2022-02-23T11:15:30.851Z',
     'timeStorageClassUpdated': '2022-02-23T11:15:30.851Z',
     'type': 'file'},
    '<redacted-bucket-name>/partition/dir/part-00000-d2924cb4-ae44-4401-8bfd-b67a4c9ab9ce-c000.snappy.parquet': {'kind': 'storage#object',
     'id': '<redacted-bucket-name>/partition/dir/part-00000-d2924cb4-ae44-4401-8bfd-b67a4c9ab9ce-c000.snappy.parquet/1645614930433213',
     'selfLink': 'https://www.googleapis.com/storage/v1/b/<redacted-bucket-name>/o/partition%2Fdir%2Fpart-00000-d2924cb4-ae44-4401-8bfd-b67a4c9ab9ce-c000.snappy.parquet',
     'mediaLink': 'https://storage.googleapis.com/download/storage/v1/b/<redacted-bucket-name>/o/partition%2Fdir%2Fpart-00000-d2924cb4-ae44-4401-8bfd-b67a4c9ab9ce-c000.snappy.parquet?generation=1645614930433213&alt=media',
     'name': '<redacted-bucket-name>/partition/dir/part-00000-d2924cb4-ae44-4401-8bfd-b67a4c9ab9ce-c000.snappy.parquet',
     'bucket': '<redacted-bucket-name>',
     'generation': '1645614930433213',
     'metageneration': '1',
     'contentType': 'application/octet-stream',
     'storageClass': 'STANDARD',
     'size': 993,
     'md5Hash': 'U/QwH2SM2un91crFW2lqWA==',
     'crc32c': '1iQeEw==',
     'etag': 'CL2p9qrZlfYCEAE=',
     'timeCreated': '2022-02-23T11:15:30.435Z',
     'updated': '2022-02-23T11:15:30.435Z',
     'timeStorageClassUpdated': '2022-02-23T11:15:30.435Z',
     'type': 'file'},
    '<redacted-bucket-name>/partition/dir/part-00001-d2924cb4-ae44-4401-8bfd-b67a4c9ab9ce-c000.snappy.parquet': {'kind': 'storage#object',
     'id': '<redacted-bucket-name>/partition/dir/part-00001-d2924cb4-ae44-4401-8bfd-b67a4c9ab9ce-c000.snappy.parquet/1645614930179084',
     'selfLink': 'https://www.googleapis.com/storage/v1/b/<redacted-bucket-name>/o/partition%2Fdir%2Fpart-00001-d2924cb4-ae44-4401-8bfd-b67a4c9ab9ce-c000.snappy.parquet',
     'mediaLink': 'https://storage.googleapis.com/download/storage/v1/b/<redacted-bucket-name>/o/partition%2Fdir%2Fpart-00001-d2924cb4-ae44-4401-8bfd-b67a4c9ab9ce-c000.snappy.parquet?generation=1645614930179084&alt=media',
     'name': '<redacted-bucket-name>/partition/dir/part-00001-d2924cb4-ae44-4401-8bfd-b67a4c9ab9ce-c000.snappy.parquet',
     'bucket': '<redacted-bucket-name>',
     'generation': '1645614930179084',
     'metageneration': '1',
     'contentType': 'application/octet-stream',
     'storageClass': 'STANDARD',
     'size': 993,
     'md5Hash': 'b5UF9wxWOK9lzAl9HnSMjw==',
     'crc32c': '1Es0QA==',
     'etag': 'CIzo5qrZlfYCEAE=',
     'timeCreated': '2022-02-23T11:15:30.188Z',
     'updated': '2022-02-23T11:15:30.188Z',
     'timeStorageClassUpdated': '2022-02-23T11:15:30.188Z',
     'type': 'file'}}
   ```
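
The `gcs.find` listing above contains two entries for the base directory: `.../partition/dir` with `'type': 'directory'`, and a zero-byte object `.../partition/dir/` (trailing slash) reported with `'type': 'file'`. Whether that placeholder object is what trips up the reader is an assumption here, not something confirmed in this thread. A minimal sketch for spotting such entries in a listing follows; `find_directory_markers` is a hypothetical helper, and the sample dict is trimmed from the output above to the keys the check needs.

```python
def find_directory_markers(listing):
    """Return names of zero-byte entries typed as 'file' whose name ends in '/'."""
    return [
        name
        for name, info in listing.items()
        if info.get("type") == "file"
        and info.get("size") == 0
        and name.endswith("/")
    ]


# Trimmed copy of the gcs.find(...) output above (illustrative, not a live call).
listing = {
    "<redacted-bucket-name>/partition/dir": {"type": "directory", "size": 0},
    "<redacted-bucket-name>/partition/dir/": {"type": "file", "size": 0},
    "<redacted-bucket-name>/partition/dir/_SUCCESS": {"type": "file", "size": 0},
    "<redacted-bucket-name>/partition/dir/part-00000-d2924cb4-ae44-4401-8bfd-b67a4c9ab9ce-c000.snappy.parquet": {
        "type": "file",
        "size": 993,
    },
}

markers = find_directory_markers(listing)
print(markers)
```

The `_SUCCESS` file is also zero bytes but is not flagged, since its name does not end in `/`. Against a real bucket, the same helper could be applied to the dict returned by `gcs.find(..., detail=True)`.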

