Posted to issues@arrow.apache.org by "Matthew Rocklin (JIRA)" <ji...@apache.org> on 2019/01/01 03:40:00 UTC
[jira] [Created] (ARROW-4139) Parquet Statistics on unicode text files have byte array type
Matthew Rocklin created ARROW-4139:
--------------------------------------
Summary: Parquet Statistics on unicode text files have byte array type
Key: ARROW-4139
URL: https://issues.apache.org/jira/browse/ARROW-4139
Project: Apache Arrow
Issue Type: Bug
Reporter: Matthew Rocklin
When writing Pandas data to Parquet format and reading it back again, I find that the statistics of text columns are stored as byte arrays rather than as unicode text.
I'm not sure if this is a bug in Arrow, PyArrow, or just in my understanding of how best to manage statistics. (I'd be quite happy to learn that it was the latter).
Here is a minimal example:
{code:python}
import pandas as pd
df = pd.DataFrame({'x': ['a']})
df.to_parquet('df.parquet')
import pyarrow.parquet as pq
pf = pq.ParquetDataset('df.parquet')
piece = pf.pieces[0]
md = piece.get_metadata(pq.ParquetFile)
rg = md.row_group(0)
c = rg.column(0)
>>> c
<pyarrow._parquet.ColumnChunkMetaData object at 0x7fd1a377c238>
  file_offset: 63
  file_path:
  physical_type: BYTE_ARRAY
  num_values: 1
  path_in_schema: x
  is_stats_set: True
  statistics:
    <pyarrow._parquet.RowGroupStatistics object at 0x7fd1a37d4418>
      has_min_max: True
      min: b'a'
      max: b'a'
      null_count: 0
      distinct_count: 0
      num_values: 1
      physical_type: BYTE_ARRAY
  compression: SNAPPY
  encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
  has_dictionary_page: True
  dictionary_page_offset: 4
  data_page_offset: 25
  total_compressed_size: 59
  total_uncompressed_size: 55
>>> type(c.statistics.min)
bytes
{code}
My guess is that we would want to store a logical type, such as UNICODE, in the statistics, though I don't have enough experience with Parquet data types to know whether this is a good idea or even possible.
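Until the statistics carry a logical type, one possible workaround on the reader side is to decode the min/max values by hand. The following is only a minimal sketch and not part of PyArrow: `decode_stat` is a hypothetical helper name, and it assumes the caller already knows the column holds UTF-8 text.

```python
from typing import Optional, Union


def decode_stat(value: Optional[Union[bytes, str]],
                encoding: str = "utf-8") -> Optional[str]:
    """Hypothetical helper: turn a BYTE_ARRAY min/max statistic into text.

    Assumes the column is known to contain text in `encoding`.
    """
    # Missing statistics and already-decoded values pass through unchanged.
    if value is None or isinstance(value, str):
        return value
    # Bytes that are not valid in `encoding` become replacement characters
    # instead of raising UnicodeDecodeError.
    return value.decode(encoding, errors="replace")


# Values shaped like the ones in the report above.
print(decode_stat(b"a"))                      # a
print(type(decode_stat(b"a")).__name__)       # str
```

With the objects from the example above, this would be called as `decode_stat(c.statistics.min)` and `decode_stat(c.statistics.max)`.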