Posted to issues@iceberg.apache.org by "Fokko (via GitHub)" <gi...@apache.org> on 2023/06/13 20:03:04 UTC

[GitHub] [iceberg] Fokko commented on pull request #7831: Compute parquet stats

Fokko commented on PR #7831:
URL: https://github.com/apache/iceberg/pull/7831#issuecomment-1589945848

   Hey @maxdebayser, thanks for raising this PR. I had something different in mind for the issue. The main problem with this approach is that it pulls a lot of data through Python, which brings its own issues: the binary conversion is probably expensive, and the GIL limits scalability.
   
   I just took a quick stab at it (and should have done that sooner), and found the following:
   
   ```python
   ➜  Desktop python3 
   Python 3.11.3 (main, Apr  7 2023, 20:13:31) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
   Type "help", "copyright", "credits" or "license" for more information.
   >>> import pyarrow as pa
   >>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
   ...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
   ...                              "Brittle stars", "Centipede"]})
   >>> metadata_collector = []
   >>> import pyarrow.parquet as pq
   >>> pq.write_to_dataset(
   ...     table, '/tmp/table',
   ...      metadata_collector=metadata_collector)
   >>> metadata_collector
   [<pyarrow._parquet.FileMetaData object at 0x11f955850>
     created_by: parquet-cpp-arrow version 11.0.0
     num_columns: 2
     num_rows: 6
     num_row_groups: 1
     format_version: 1.0
     serialized_size: 0]
   
   >>> metadata_collector[0].row_group(0)
   <pyarrow._parquet.RowGroupMetaData object at 0x105837d80>
     num_columns: 2
     num_rows: 6
     total_byte_size: 256
   
   >>> metadata_collector[0].row_group(0).to_dict()
   {
   	'num_columns': 2,
   	'num_rows': 6,
   	'total_byte_size': 256,
   	'columns': [{
   		'file_offset': 119,
   		'file_path': 'c569c5eaf90c4395885f31e012068b69-0.parquet',
   		'physical_type': 'INT64',
   		'num_values': 6,
   		'path_in_schema': 'n_legs',
   		'is_stats_set': True,
   		'statistics': {
   			'has_min_max': True,
   			'min': 2,
   			'max': 100,
   			'null_count': 0,
   			'distinct_count': 0,
   			'num_values': 6,
   			'physical_type': 'INT64'
   		},
   		'compression': 'SNAPPY',
   		'encodings': ('PLAIN_DICTIONARY', 'PLAIN', 'RLE'),
   		'has_dictionary_page': True,
   		'dictionary_page_offset': 4,
   		'data_page_offset': 46,
   		'total_compressed_size': 115,
   		'total_uncompressed_size': 117
   	}, {
   		'file_offset': 359,
   		'file_path': 'c569c5eaf90c4395885f31e012068b69-0.parquet',
   		'physical_type': 'BYTE_ARRAY',
   		'num_values': 6,
   		'path_in_schema': 'animal',
   		'is_stats_set': True,
   		'statistics': {
   			'has_min_max': True,
   			'min': 'Brittle stars',
   			'max': 'Parrot',
   			'null_count': 0,
   			'distinct_count': 0,
   			'num_values': 6,
   			'physical_type': 'BYTE_ARRAY'
   		},
   		'compression': 'SNAPPY',
   		'encodings': ('PLAIN_DICTIONARY', 'PLAIN', 'RLE'),
   		'has_dictionary_page': True,
   		'dictionary_page_offset': 215,
   		'data_page_offset': 302,
   		'total_compressed_size': 144,
   		'total_uncompressed_size': 139
   	}]
   }
   ```
   
   I think it is much better to retrieve the min/max from there. This is done by PyArrow (in C++) while writing, which is probably much faster than computing it in Python. I really would like to stay away from processing the data in Python as much as possible.
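   
   For example, something along these lines could aggregate the per-column min/max from the collected metadata, without touching the data itself (a minimal sketch; `collect_min_max` is just an illustrative name, not an existing API):
   
   ```python
   import pyarrow.parquet as pq
   
   def collect_min_max(metadata: pq.FileMetaData) -> dict:
       """Aggregate per-column min/max across all row groups of one file."""
       min_max: dict = {}
       for rg in range(metadata.num_row_groups):
           row_group = metadata.row_group(rg)
           for col in range(row_group.num_columns):
               column = row_group.column(col)
               stats = column.statistics
               # Statistics can be absent, or present without min/max
               if stats is None or not stats.has_min_max:
                   continue
               path = column.path_in_schema
               if path not in min_max:
                   min_max[path] = (stats.min, stats.max)
               else:
                   lo, hi = min_max[path]
                   min_max[path] = (min(lo, stats.min), max(hi, stats.max))
       return min_max
   ```
   
   For the file above, `collect_min_max(metadata_collector[0])` would give `{'n_legs': (2, 100), 'animal': ('Brittle stars', 'Parrot')}`.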
   
   I think the complexity here is mapping the statistics back to the original columns (and I think it gets tricky with nested columns).
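   
   A rough sketch of what that mapping step could look like (hedged: `iceberg_schema` and its `find_field` lookup by name are assumptions for illustration, not existing pyiceberg internals, and nested paths are deliberately left unhandled):
   
   ```python
   def map_stats_to_field_ids(iceberg_schema, min_max: dict) -> dict:
       """Map Parquet column paths back to Iceberg field IDs (sketch)."""
       by_field_id = {}
       for path, (lo, hi) in min_max.items():
           if "." in path:
               # Nested columns come back as dotted paths (e.g. "person.name",
               # and lists add synthetic levels), which is where it gets tricky.
               continue
           field = iceberg_schema.find_field(path)  # assumed name lookup
           by_field_id[field.field_id] = (lo, hi)
       return by_field_id
   ```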

