Posted to issues@arrow.apache.org by "Michael Eaton (JIRA)" <ji...@apache.org> on 2019/04/16 16:23:00 UTC

[jira] [Commented] (ARROW-4139) [Python] Cast Parquet column statistics to unicode if UTF8 ConvertedType is set

    [ https://issues.apache.org/jira/browse/ARROW-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16819225#comment-16819225 ] 

Michael Eaton commented on ARROW-4139:
--------------------------------------

Adding a decode("utf8") to the ParquetType_BYTE_ARRAY case seems simple enough and yields min/max values of type str.

I can only guess that this would break statistics for other logical types backed by ParquetType_BYTE_ARRAY.  What are those types, and how would one go about adding a logical accessor if necessary?
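
For reference, the guard I have in mind looks something like this -- a sketch in plain Python, not the actual Cython/C++ path in pyarrow, and `decode_stat` and its arguments are illustrative names:

{code:python}
def decode_stat(raw, physical_type, converted_type):
    """Decode a raw statistics value to str only when it is known UTF8 text.

    Illustrative helper: the real statistics live in pyarrow's Cython layer,
    but the decision is the same -- BYTE_ARRAY plus a UTF8 ConvertedType
    means the bytes are unicode text; anything else keeps the raw bytes.
    """
    if physical_type == "BYTE_ARRAY" and converted_type == "UTF8":
        return raw.decode("utf8")
    return raw
{code}

With that guard, a plain BYTE_ARRAY column with no converted type keeps its bytes min/max untouched, so only columns explicitly annotated as UTF8 change behavior.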

> [Python] Cast Parquet column statistics to unicode if UTF8 ConvertedType is set
> -------------------------------------------------------------------------------
>
>                 Key: ARROW-4139
>                 URL: https://issues.apache.org/jira/browse/ARROW-4139
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Matthew Rocklin
>            Priority: Minor
>              Labels: parquet, python
>             Fix For: 0.14.0
>
>
> When writing Pandas data to Parquet format and reading it back again, I find that the statistics of text columns are stored as byte arrays rather than as unicode text.
> I'm not sure if this is a bug in Arrow, PyArrow, or just in my understanding of how best to manage statistics.  (I'd be quite happy to learn that it was the latter).
> Here is a minimal example
> {code:python}
> import pandas as pd
> df = pd.DataFrame({'x': ['a']})
> df.to_parquet('df.parquet')
> import pyarrow.parquet as pq
> pf = pq.ParquetDataset('df.parquet')
> piece = pf.pieces[0]
> md = piece.get_metadata(pq.ParquetFile)
> rg = md.row_group(0)
> c = rg.column(0)
> >>> c
> <pyarrow._parquet.ColumnChunkMetaData object at 0x7fd1a377c238>
>   file_offset: 63
>   file_path: 
>   physical_type: BYTE_ARRAY
>   num_values: 1
>   path_in_schema: x
>   is_stats_set: True
>   statistics:
>     <pyarrow._parquet.RowGroupStatistics object at 0x7fd1a37d4418>
>       has_min_max: True
>       min: b'a'
>       max: b'a'
>       null_count: 0
>       distinct_count: 0
>       num_values: 1
>       physical_type: BYTE_ARRAY
>   compression: SNAPPY
>   encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
>   has_dictionary_page: True
>   dictionary_page_offset: 4
>   data_page_offset: 25
>   total_compressed_size: 59
>   total_uncompressed_size: 55
> >>> type(c.statistics.min)
> bytes
> {code}
> My guess is that we would want to store a logical type like UNICODE in the statistics, though I don't have enough experience with Parquet data types to know whether this is a good idea or even possible.
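
One way such a logical accessor could be wired up on the Python side -- a sketch only, with illustrative names, not pyarrow's actual API -- is to dispatch on the column's converted type when exposing min/max:

{code:python}
import datetime

def logical_min_max(raw_min, raw_max, physical_type, converted_type):
    """Cast raw statistics values according to the Parquet ConvertedType.

    Illustrative dispatch: UTF8-annotated BYTE_ARRAY decodes to str, and
    DATE-annotated INT32 (days since the Unix epoch, per the Parquet spec)
    converts to datetime.date. Other converted types (DECIMAL, BSON, ...)
    would need their own rules and are left as raw values here.
    """
    if physical_type == "BYTE_ARRAY" and converted_type == "UTF8":
        return raw_min.decode("utf8"), raw_max.decode("utf8")
    if physical_type == "INT32" and converted_type == "DATE":
        epoch = datetime.date(1970, 1, 1)
        to_date = lambda days: epoch + datetime.timedelta(days=days)
        return to_date(raw_min), to_date(raw_max)
    return raw_min, raw_max
{code}

For the example above, `logical_min_max(b'a', b'a', 'BYTE_ARRAY', 'UTF8')` would give back `('a', 'a')` as unicode rather than bytes.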



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)