
Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2021/10/21 19:28:00 UTC

[jira] [Commented] (ARROW-14422) [Python] Allow parquet::WriterProperties::created_by to be set via pyarrow.ParquetWriter for compatibility with older parquet-mr

    [ https://issues.apache.org/jira/browse/ARROW-14422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17432689#comment-17432689 ] 

Weston Pace commented on ARROW-14422:
-------------------------------------

The Python change should be pretty straightforward, although it will add yet another keyword option to an already long list.

[~emkornfield] do you know off the top of your head whether there are any other gotchas likely to come up when creating files for a parquet-mr version this old? Is there a compatibility table anywhere that lists minimum supported versions?

> [Python] Allow parquet::WriterProperties::created_by to be set via pyarrow.ParquetWriter for compatibility with older parquet-mr
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-14422
>                 URL: https://issues.apache.org/jira/browse/ARROW-14422
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Kevin
>            Priority: Major
>
> I have a couple of files (csv,..) and am using pandas and pyarrow.table (0.17)
> to save them as parquet on disk (parquet version 1.4)
> columns:
>  id : string
>  val : string
> table = pa.Table.from_pandas(df)
> pq.write_table(table, "df.parquet", version='1.0', flavor='spark', write_statistics=True)
> However, Hive and Spark do not recognize the parquet version:
> {{org.apache.parquet.VersionParser$VersionParseException: Could not parse created_by: parquet-cpp version 1.5.1-SNAPSHOT using format: (.+) version ((.*) )?\(build ?(.*)\)}}
> {{ at org.apache.parquet.VersionParser.parse(VersionParser.java:112)}}
> {{ at org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)}}
> {{ at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)}}
>  
> It seems related to this issue:
>  
> It appears you've encountered PARQUET-349, which was fixed in 2015 before Arrow was even started. The underlying C++ code does allow this {{created_by}} field to be customized [source|https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/cpp/src/parquet/properties.h#L249] but the Python wrapper does not expose this [source|https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/python/pyarrow/_parquet.pxd#L360].
>  
> It would be nice if pyarrow exposed this feature.
>  
> SO Question here:
> [https://stackoverflow.com/questions/69658140/how-to-save-a-parquet-with-pandas-using-same-header-than-hadoop-spark-parquet?noredirect=1#comment123131862_69658140]
>  
>  
>  
>  
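
For reference, the parse failure in the quoted stack trace can be reproduced outside of Java. The sketch below is only an illustration: the pattern is copied from the exception message, the helper name parse_created_by is made up for this example, and it assumes the old parquet-mr parser requires the whole string to match (hence fullmatch). It shows that a created_by value without a "(build ...)" suffix is rejected, while a hypothetical value with one is accepted.

```python
import re

# Pattern copied verbatim from the VersionParseException message in the issue.
CREATED_BY_PATTERN = re.compile(r"(.+) version ((.*) )?\(build ?(.*)\)")


def parse_created_by(created_by):
    """Return (application, version, build), or None if the string does not match.

    Hypothetical helper for illustration; it is not part of pyarrow or parquet-mr.
    Assumes whole-string matching, which is why fullmatch is used.
    """
    match = CREATED_BY_PATTERN.fullmatch(created_by)
    if match is None:
        return None
    return match.group(1), match.group(3), match.group(4)


# The string from the stack trace has no "(build ...)" suffix, so it is rejected.
print(parse_created_by("parquet-cpp version 1.5.1-SNAPSHOT"))

# Hypothetical string with a build suffix, shown for comparison only.
print(parse_created_by("parquet-cpp version 1.5.1-SNAPSHOT (build abc123)"))
```

This only demonstrates why the old pattern rejects the string; whether a customized created_by would avoid other incompatibilities with a parquet-mr version this old is the open question in the comment above.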



--
This message was sent by Atlassian Jira
(v8.3.4#803005)