Posted to github@arrow.apache.org by "swyatt7 (via GitHub)" <gi...@apache.org> on 2023/03/03 06:20:34 UTC

[GitHub] [arrow] swyatt7 commented on issue #34403: [Python] Is there a way to construct the metadata_collector for an existing partitioned dataset?

swyatt7 commented on issue #34403:
URL: https://github.com/apache/arrow/issues/34403#issuecomment-1453038670

   Hello and thanks for the prompt reply.
   
   I implemented what you suggested and was able to get it to work :). I did run into a few tweaks that needed to be made and aren't well documented: before appending each partitioned file's metadata to the `metadata_collector`, you have to set its relative file path with `.set_file_path()`.
   
   ```
   import os

   import pyarrow.dataset as ds
   import pyarrow.parquet as pq

   root_dir = '/path/to/partitioned/dataset'
   dataset = ds.dataset(root_dir, partitioning='hive', format='parquet')
   metadata_collector = []

   for f in dataset.files:
       md = pq.read_metadata(f)
       # record each file's path relative to the dataset root before collecting it
       md.set_file_path(f.split(f'{root_dir}/')[1])
       metadata_collector.append(md)

   _meta_data_path = os.path.join(root_dir, '_metadata')
   _common_metadata_path = os.path.join(root_dir, '_common_metadata')

   pq.write_metadata(dataset.schema, _meta_data_path, metadata_collector=metadata_collector)
   pq.write_metadata(dataset.schema, _common_metadata_path)
   ```
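
   For what it's worth, once the `_metadata` file exists, something like the following (a sketch, assuming pyarrow's `ds.parquet_dataset` factory and the same hive partitioning) should reopen the dataset directly from that file instead of re-listing the individual parquet files:

   ```
   # build a dataset from the collected row-group metadata (sketch)
   dataset_from_meta = ds.parquet_dataset(_meta_data_path, partitioning='hive')
   table = dataset_from_meta.to_table()
   ```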
   
   Another issue I ran into: if a partition contains an empty parquet file (one with only the header/schema information and no rows), `append_row_groups` throws an error saying the schemas don't match, even though they do match; one file is simply empty. Once I got rid of those files, the metadata was written correctly.
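
   A minimal alternative to deleting those files (a sketch, assuming the header-only files can safely be skipped) would be to filter on `num_rows` while collecting the metadata:

   ```
   for f in dataset.files:
       md = pq.read_metadata(f)
       if md.num_rows == 0:
           # header-only file with no data; skip it so the row groups can be merged
           continue
       md.set_file_path(f.split(f'{root_dir}/')[1])
       metadata_collector.append(md)
   ```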
   
   Again thanks for the help, I'll mark as closed.

