Posted to issues@arrow.apache.org by "swyatt7 (via GitHub)" <gi...@apache.org> on 2023/03/01 16:59:22 UTC

[GitHub] [arrow] swyatt7 opened a new issue, #34403: Is there a way to construct the metadata_collector for an existing partitioned dataset?

swyatt7 opened a new issue, #34403:
URL: https://github.com/apache/arrow/issues/34403

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   Hello,
   
   I am working with an already partitioned dataset that exists in the Hive format, structured like this:
   
   ```
   dataset/
      partition_col1=XX/
         partition_col2=YY/
            file.parquet
   ```
   
   We have already designed methods to construct this partitioned format, and are wondering if it is possible to create the `_metadata` and `_common_metadata` files after the fact with pyarrow. It seems the only way to create these `_metadata` files so that they retain the partitioning schema is through `pyarrow.parquet.write_to_dataset()`, passing in the `metadata_collector` kwarg and populating it that way.
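   
   For reference, that write-time pattern looks roughly like the following sketch based on the pyarrow docs example; `table` and `root_path` are placeholders, not our actual code:
   
   ```
   import pyarrow.parquet as pq
   
   metadata_collector = []
   # writing the dataset is what populates metadata_collector
   pq.write_to_dataset(table, root_path,
                       partition_cols=['partition_col1', 'partition_col2'],
                       metadata_collector=metadata_collector)
   pq.write_metadata(table.schema, f'{root_path}/_metadata',
                     metadata_collector=metadata_collector)
   ```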
   
   Since our datasets are already partitioned in the manner we desire, is there a way to ONLY create the `metadata_collector` object and write those files WITHOUT using the `write_to_dataset` function? These datasets are large, and on abstract file systems (abfs, s3, etc.) it takes around 4-5 minutes to lazily load our datasets (around 4000 partitions) with `dask.dataframe.read_parquet('abfs://dataset/partition_col1=*/partition_col2=*/file.parquet')`.
   
   Any help would be appreciated :) 
   
   ### Component(s)
   
   Parquet, Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] swyatt7 commented on issue #34403: [Python] Is there a way to construct the metadata_collector for an existing partitioned dataset?

Posted by "swyatt7 (via GitHub)" <gi...@apache.org>.
swyatt7 commented on issue #34403:
URL: https://github.com/apache/arrow/issues/34403#issuecomment-1453038670

   Hello and thanks for the prompt reply.
   
   I implemented what you said and was able to get it to work :). I did run into a few tweaks that need to be made and aren't well documented: you have to set each partitioned file's metadata file path with `.set_file_path()` before appending it to the `metadata_collector`.
   
   ```
   import os
   
   import pyarrow.dataset as ds
   import pyarrow.parquet as pq
   
   root_dir = '/path/to/partitioned/dataset'
   dataset = ds.dataset(root_dir, partitioning='hive', format='parquet')
   metadata_collector = []
   
   for f in dataset.files:
       md = pq.read_metadata(f)
       # the file path must be relative to the dataset root
       md.set_file_path(f.split(f'{root_dir}/')[1])
       metadata_collector.append(md)
   
   _meta_data_path = os.path.join(root_dir, '_metadata')
   _common_metadata_path = os.path.join(root_dir, '_common_metadata')
   
   pq.write_metadata(dataset.schema, _meta_data_path, metadata_collector=metadata_collector)
   pq.write_metadata(dataset.schema, _common_metadata_path)
   ```
   
   Another issue I ran into was that if a partition contains an empty parquet file (with only the header information), `append_row_groups` will throw an error saying that the schemas don't match, even though they do match; one of the files is just empty. Once I got rid of those files, the metadata was written correctly.
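   
   For reference, a minimal sketch of how those empty files could be skipped while collecting metadata (assuming "empty" means zero rows; the names mirror the snippet above):
   
   ```
   for f in dataset.files:
       md = pq.read_metadata(f)
       # skip data-less files, which otherwise break append_row_groups
       if md.num_rows == 0:
           continue
       md.set_file_path(f.split(f'{root_dir}/')[1])
       metadata_collector.append(md)
   ```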
   
   Again, thanks for the help; I'll mark this as closed.




[GitHub] [arrow] swyatt7 closed issue #34403: [Python] Is there a way to construct the metadata_collector for an existing partitioned dataset?

Posted by "swyatt7 (via GitHub)" <gi...@apache.org>.
swyatt7 closed issue #34403: [Python] Is there a way to construct the metadata_collector for an existing partitioned dataset?
URL: https://github.com/apache/arrow/issues/34403




[GitHub] [arrow] jorisvandenbossche commented on issue #34403: [Python] Is there a way to construct the metadata_collector for an existing partitioned dataset?

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #34403:
URL: https://github.com/apache/arrow/issues/34403#issuecomment-1451740151

   Yes, that's certainly possible. 
   
   If you check the example in the docs about this (https://arrow.apache.org/docs/dev/python/parquet.html#writing-metadata-and-common-metadata-files), you can see that in the end the `metadata_collector` is simply a list of FileMetaData objects. That list is passed to `pq.write_metadata(..)` to write the actual `_metadata` file, which combines all the metadata objects in the list into a single metadata object.
   
   In the example, this `metadata_collector` list is built up while writing, but you can also create such a list by reading the metadata of existing files. Something like this (I didn't run this code, but the general idea should work):
   
   ```
   import pyarrow.dataset as ds
   import pyarrow.parquet as pq
   
   # e.g. discover the existing files through pyarrow.dataset
   dataset = ds.dataset("/path/to/dataset", partitioning="hive", format="parquet")
   
   metadata_collector = []
   for path in dataset.files:
       meta = pq.read_metadata(path)
       metadata_collector.append(meta)
   ```
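   
   As noted above, the collected list is then passed to `pq.write_metadata(..)`. A minimal sketch, assuming the dataset schema and root path from the snippet above:
   
   ```
   # combines all collected FileMetaData objects into a single _metadata file
   pq.write_metadata(dataset.schema, "/path/to/dataset/_metadata",
                     metadata_collector=metadata_collector)
   ```
   
   Note that, as mentioned in the reply above, each FileMetaData's path relative to the dataset root needs to be set with `.set_file_path()` before writing.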
       
   
   

