Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/26 21:40:11 UTC

[GitHub] [arrow] westonpace commented on issue #10803: Reading strings efficiently in C++

westonpace commented on issue #10803:
URL: https://github.com/apache/arrow/issues/10803#issuecomment-887046168


   > The first thing I'm wondering is, is the output from parquet-meta below saying that this column is a string with PLAIN_DICTIONARY encoding?
   
   I'm not familiar with `parquet-meta` but yes, that would be my interpretation.  I get similar output from pyarrow when looking at a file I know is dictionary encoded:
   
   ```
   >>> import pyarrow
   >>> import pyarrow.parquet as pq
   >>> long_str = 'x' * 10000000
   >>> arr = pyarrow.array([long_str, long_str, long_str])
   >>> table = pyarrow.Table.from_arrays([arr], ["data"])
   >>> pq.write_table(table, "/tmp/foo.parquet")
   >>> parquet_file = pq.ParquetFile('/tmp/foo.parquet')
   >>> parquet_file.metadata.row_group(0).column(0)
   <pyarrow._parquet.ColumnChunkMetaData object at 0x7f55b7d6aa80>
     file_offset: 469121
     file_path: 
     physical_type: BYTE_ARRAY
     num_values: 3
     path_in_schema: data
     is_stats_set: True
     statistics:
       <pyarrow._parquet.Statistics object at 0x7f55dfc66e40>
         has_min_max: False
         min: None
         max: None
         null_count: 0
         distinct_count: 0
         num_values: 3
         physical_type: BYTE_ARRAY
         logical_type: String
         converted_type (legacy): UTF8
     compression: SNAPPY
     encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE', 'PLAIN')
     has_dictionary_page: True
     dictionary_page_offset: 4
     data_page_offset: 469089
     total_compressed_size: 469117
     total_uncompressed_size: 10000053
   ```
   
   Here I know it is dictionary encoded because the total uncompressed size is about 10MB, even though the array holds 3 values that would each take 10MB on their own (roughly 30MB if every value were stored in full).
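To make that arithmetic concrete, here is a rough sketch of the size estimate. It assumes that PLAIN encoding of a BYTE_ARRAY stores each value as a 4-byte length prefix followed by its bytes, and that the dictionary page stores each distinct value once in the same way. The helper names are made up for illustration, and page headers and the encoded indices are ignored, so the numbers are only lower bounds.

```python
# Rough size estimate for a BYTE_ARRAY column: PLAIN vs. dictionary encoding.
# Assumption: each stored value costs a 4-byte length prefix plus its bytes.
# Page headers and the encoded indices are deliberately left out.

LENGTH_PREFIX = 4  # bytes of length prefix per stored value


def plain_size(values):
    """Bytes needed when every value is stored in full (hypothetical helper)."""
    return sum(LENGTH_PREFIX + len(v.encode("utf-8")) for v in values)


def dictionary_page_size(values):
    """Bytes needed when each distinct value is stored once (hypothetical helper)."""
    return sum(LENGTH_PREFIX + len(v.encode("utf-8")) for v in set(values))


long_str = "x" * 10000000
values = [long_str, long_str, long_str]

plain_estimate = plain_size(values)
dictionary_estimate = dictionary_page_size(values)

# total_uncompressed_size reported in the metadata output above
reported = 10000053

print("plain estimate:     ", plain_estimate)
print("dictionary estimate:", dictionary_estimate)
print("reported minus dictionary estimate:", reported - dictionary_estimate)
```

Under these assumptions the reported size is within a few dozen bytes of the dictionary estimate and about a third of the plain estimate, which is consistent with the string being stored only once.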
   
   > Because the code I've been using to access those strings doesn't return indexes, it returns actual strings:
   
   I'm a little confused by this point.  In the code below you are creating two pointers: `index` should be a pointer to the indices and `view` should be a pointer to the data.  This is how dictionary arrays are typically stored: one buffer (usually with many elements) for the indices and another buffer (usually with few elements) for the values.

