Posted to issues@arrow.apache.org by "Matthew Rocklin (JIRA)" <ji...@apache.org> on 2019/01/01 03:40:00 UTC

[jira] [Created] (ARROW-4139) Parquet Statistics on unicode text files have byte array type

Matthew Rocklin created ARROW-4139:
--------------------------------------

             Summary: Parquet Statistics on unicode text files have byte array type
                 Key: ARROW-4139
                 URL: https://issues.apache.org/jira/browse/ARROW-4139
             Project: Apache Arrow
          Issue Type: Bug
            Reporter: Matthew Rocklin


When writing Pandas data to Parquet format and reading it back again, I find that the statistics of text columns are stored as byte arrays rather than as unicode text.

I'm not sure if this is a bug in Arrow, PyArrow, or just in my understanding of how best to manage statistics.  (I'd be quite happy to learn that it was the latter).

Here is a minimal example:

{code:python}
import pandas as pd
import pyarrow.parquet as pq

# Write a single text column to Parquet.
df = pd.DataFrame({'x': ['a']})
df.to_parquet('df.parquet')

# Read back the column chunk metadata of the first row group.
pf = pq.ParquetDataset('df.parquet')
piece = pf.pieces[0]
md = piece.get_metadata(pq.ParquetFile)
rg = md.row_group(0)
c = rg.column(0)

>>> c
<pyarrow._parquet.ColumnChunkMetaData object at 0x7fd1a377c238>
  file_offset: 63
  file_path: 
  physical_type: BYTE_ARRAY
  num_values: 1
  path_in_schema: x
  is_stats_set: True
  statistics:
    <pyarrow._parquet.RowGroupStatistics object at 0x7fd1a37d4418>
      has_min_max: True
      min: b'a'
      max: b'a'
      null_count: 0
      distinct_count: 0
      num_values: 1
      physical_type: BYTE_ARRAY
  compression: SNAPPY
  encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
  has_dictionary_page: True
  dictionary_page_offset: 4
  data_page_offset: 25
  total_compressed_size: 59
  total_uncompressed_size: 55

>>> type(c.statistics.min)
bytes
{code}

My guess is that we would want the statistics to carry a logical type such as UNICODE, though I don't have enough experience with Parquet data types to know whether this is possible or a good idea.
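
In the meantime, a caller-side workaround might look roughly like the sketch below. It assumes the column holds UTF-8 encoded text, which the statistics themselves do not confirm. The helper name `decode_stat` is hypothetical and is not part of PyArrow.

```python
def decode_stat(value, encoding="utf-8"):
    """Return text for bytes-valued statistics; pass other values through.

    Hypothetical helper, not a PyArrow API. Assumes the bytes are
    text in the given encoding (UTF-8 by default).
    """
    if isinstance(value, (bytes, bytearray)):
        return bytes(value).decode(encoding)
    return value


# Values copied from the statistics output shown in the example above.
stat_min, stat_max = b'a', b'a'
print(decode_stat(stat_min), decode_stat(stat_max))
```

With the objects from the example above, this would be called as `decode_stat(c.statistics.min)`. Non-bytes statistics, such as integers, pass through unchanged, so the helper can be applied to every column without checking its physical type first.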


