Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/26 21:40:11 UTC
[GitHub] [arrow] westonpace commented on issue #10803: Reading strings efficiently in C++
westonpace commented on issue #10803:
URL: https://github.com/apache/arrow/issues/10803#issuecomment-887046168
> The first thing I'm wondering is, is the output from parquet-meta below saying that this column is a string with PLAIN_DICTIONARY encoding?
I'm not familiar with `parquet-meta`, but yes, that would be my interpretation. I get similar output from pyarrow when looking at a file I know is dictionary encoded:
```
>>> import pyarrow
>>> import pyarrow.parquet as pq
>>> long_str = 'x' * 10000000
>>> arr = pyarrow.array([long_str, long_str, long_str])
>>> table = pyarrow.Table.from_arrays([arr], ["data"])
>>> pq.write_table(table, "/tmp/foo.parquet")
>>> parquet_file = pq.ParquetFile('/tmp/foo.parquet')
>>> parquet_file.metadata.row_group(0).column(0)
<pyarrow._parquet.ColumnChunkMetaData object at 0x7f55b7d6aa80>
file_offset: 469121
file_path:
physical_type: BYTE_ARRAY
num_values: 3
path_in_schema: data
is_stats_set: True
statistics:
<pyarrow._parquet.Statistics object at 0x7f55dfc66e40>
has_min_max: False
min: None
max: None
null_count: 0
distinct_count: 0
num_values: 3
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY
encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE', 'PLAIN')
has_dictionary_page: True
dictionary_page_offset: 4
data_page_offset: 469089
total_compressed_size: 469117
total_uncompressed_size: 10000053
```
Here I know the column is dictionary encoded because the total uncompressed size is about 10MB, even though my array holds 3 values that are each 10MB on their own. With plain encoding the uncompressed size would be closer to 30MB.
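As a rough sanity check on that reasoning, here is a back-of-the-envelope sketch in plain Python. It assumes the PLAIN encoding for `BYTE_ARRAY` stores a 4-byte length prefix followed by the bytes of each value, and it ignores page headers and the encoded indices, so the numbers are estimates only. The helper names `plain_size` and `dictionary_size` are made up for illustration and are not part of pyarrow.

```python
def plain_size(values):
    """Estimate the PLAIN-encoded size: a 4-byte length prefix plus the UTF-8 bytes of every value."""
    return sum(4 + len(v.encode("utf-8")) for v in values)


def dictionary_size(values):
    """Estimate the dictionary page size: each distinct value is stored once.

    The (small) index data and page headers are deliberately ignored here.
    """
    return plain_size(set(values))


long_str = "x" * 10_000_000
values = [long_str, long_str, long_str]

plain_estimate = plain_size(values)      # every value written out in full
dict_estimate = dictionary_size(values)  # only the single distinct value written

print("plain estimate:     ", plain_estimate)
print("dictionary estimate:", dict_estimate)
```

The dictionary estimate lands within a few dozen bytes of the `total_uncompressed_size` reported above, while the plain estimate is about three times larger. The small remainder is presumably page headers and index data, which this sketch does not model.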
> Because the code I've been using to access those strings doesn't return indexes, it returns actual strings:
I'm a little confused by this point. In your code you are creating two pointers: `index` should be a pointer to the indices and `view` should be a pointer to the data. This is how dictionary arrays are typically stored: one buffer (usually with lots of elements) for the indices and another buffer (usually with a small number of elements) for the values.
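To make the two-buffer layout concrete, here is a minimal pure-Python sketch of the idea. It is only an illustration of the concept, not Arrow's actual implementation, and `dictionary_encode` / `dictionary_decode` are hypothetical helper names.

```python
def dictionary_encode(values):
    """Split values into (indices, dictionary): one index per value, one entry per distinct value."""
    dictionary = []   # distinct values, in order of first appearance
    positions = {}    # value -> its position in `dictionary`
    indices = []      # one small integer per input value
    for v in values:
        if v not in positions:
            positions[v] = len(dictionary)
            dictionary.append(v)
        indices.append(positions[v])
    return indices, dictionary


def dictionary_decode(indices, dictionary):
    """Rebuild the original values by looking each index up in the dictionary."""
    return [dictionary[i] for i in indices]


values = ["apple", "banana", "apple", "apple", "banana"]
indices, dictionary = dictionary_encode(values)

print(indices)
print(dictionary)
print(dictionary_decode(indices, dictionary) == values)
```

Getting the actual strings back means looking each index up in the values buffer, which is what `dictionary_decode` does here. In pyarrow the same two pieces are exposed as the `indices` and `dictionary` attributes of a `DictionaryArray`.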