Posted to issues@iceberg.apache.org by "maxdebayser (via GitHub)" <gi...@apache.org> on 2023/06/13 20:48:34 UTC

[GitHub] [iceberg] maxdebayser commented on pull request #7831: Compute parquet stats

maxdebayser commented on PR #7831:
URL: https://github.com/apache/iceberg/pull/7831#issuecomment-1590005818

   @Fokko, I understand your concern; I think it stems from us having different use cases in mind.
   
   If I understand correctly, you want to write a pyarrow.Table to a partitioned dataset with write_dataset. In that case computing min/max on the whole Table is not what you need, because you actually need the min/max of the columns of each individual file. (Just pointing out that with the metadata collector you get the stats per row group, so you'll still have to aggregate those into file-level stats.)
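   
   For example, the per-file min/max can be folded from the row-group statistics in a file_visitor callback. This is just a minimal sketch (toy table, made-up column names and partitioning), not what this PR does:
   
   ```python
   import pyarrow as pa
   import pyarrow.dataset as ds
   
   table = pa.table({"part": ["a", "a", "b"], "value": [1, 5, 3]})
   file_stats = {}
   
   def collect_stats(written_file):
       # written_file.metadata is the Parquet FileMetaData of one output file;
       # fold its per-row-group statistics into per-file min/max.
       md = written_file.metadata
       stats = {}
       for rg in range(md.num_row_groups):
           for col in range(md.num_columns):
               column = md.row_group(rg).column(col)
               s = column.statistics
               if s is None or not s.has_min_max:
                   continue
               lo, hi = stats.get(column.path_in_schema, (s.min, s.max))
               stats[column.path_in_schema] = (min(lo, s.min), max(hi, s.max))
       file_stats[written_file.path] = stats
   
   ds.write_dataset(
       table,
       "out",
       format="parquet",
       partitioning=ds.partitioning(pa.schema([("part", pa.string())]), flavor="hive"),
       file_visitor=collect_stats,
   )
   print(file_stats)
   ```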
   
   I'm coming from a different use case. I would like to write from Ray using something like https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.write_parquet.html#ray.data.Dataset.write_parquet . In this case there is no global pyarrow.Table that represents the dataset; pyarrow Tables are the blocks of the dataset that each individual Ray task sees, for example in `map_batches`. In this scenario pyarrow's write_dataset cannot be used, because the full dataset is never entirely loaded into the memory of any single compute node. The GIL is also not a big concern here, because Ray uses multiple worker processes.
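   
   To make that concrete, here is a rough sketch of the Ray side (assuming a recent Ray version; the toy dataset and output path are arbitrary), where each task only ever sees one block as a pyarrow.Table, which is where per-block stats would have to be computed:
   
   ```python
   import ray
   import pyarrow.compute as pc
   
   ds = ray.data.range(1_000_000)  # toy dataset; blocks are distributed across workers
   
   def stats_per_block(batch):
       # batch is a single pyarrow.Table block, never the whole dataset
       print({name: pc.min_max(batch[name]).as_py() for name in batch.column_names})
       return batch
   
   ds.map_batches(stats_per_block, batch_format="pyarrow").write_parquet("/tmp/out")
   ```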
   
   I think we have to see whether there is a way to have a single API for both use cases or whether we'll need separate API calls for each. In the latter case it would be better to share part of the implementation to keep the behavior consistent, though that could come at a performance cost.
   
   Regarding efficiency, pyarrow.compute.min is implemented in C++, so I don't think performance is a huge concern here. But I can try to compare both approaches on a large enough dataset to measure it.
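   
   Something along these lines is what I had in mind for the comparison (a back-of-the-envelope sketch; the file name and table size are arbitrary):
   
   ```python
   import time
   import pyarrow as pa
   import pyarrow.compute as pc
   import pyarrow.parquet as pq
   
   table = pa.table({"value": list(range(2_000_000))})
   pq.write_table(table, "bench.parquet")
   
   # Approach 1: min/max computed on the in-memory table with the C++ kernels.
   t0 = time.perf_counter()
   direct = pc.min_max(table["value"]).as_py()
   t1 = time.perf_counter()
   
   # Approach 2: min/max aggregated from the row-group statistics in the footer.
   md = pq.read_metadata("bench.parquet")
   stats = [md.row_group(rg).column(0).statistics for rg in range(md.num_row_groups)]
   from_footer = {"min": min(s.min for s in stats), "max": max(s.max for s in stats)}
   t2 = time.perf_counter()
   
   print("compute kernels:", direct, f"{t1 - t0:.4f}s")
   print("footer stats:   ", from_footer, f"{t2 - t1:.4f}s")
   ```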



