Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2022/10/12 13:14:00 UTC

[jira] [Commented] (ARROW-18001) [Python] parquet.write_table/parquet.ParquetWriter should accept a subset of columns

    [ https://issues.apache.org/jira/browse/ARROW-18001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17616413#comment-17616413 ] 

Joris Van den Bossche commented on ARROW-18001:
-----------------------------------------------

Some background: in this specific case, when using the _pandas_ (or dask) {{to_parquet}} method, the {{schema}} keyword gets passed to {{Table.from_pandas}}, not to the actual Parquet write methods.
In general, type inference happens when converting your Python object (e.g. a pandas DataFrame, a dict, ...) to an Arrow Table. Once you have such a table with a fixed schema, writing to Parquet no longer does any type inference (since Arrow types map to Parquet types).

So I think we should reframe the issue as providing a way to specify the type of a subset of columns for {{from_pandas}}.

Doing a small search for other JIRAs, I noticed that at some point in the past we actually did support a partial schema. This was accidentally broken at some point and then fixed again in ARROW-1125, although in the PR it was already noted that we might prefer doing this in another way: https://github.com/apache/arrow/pull/790#discussion_r124543809
Afterwards, the behaviour was changed again (intentionally) in ARROW-3766, which now honors the exact schema as passed: https://github.com/apache/arrow/pull/2979#discussion_r234010810

> [Python] parquet.write_table/parquet.ParquetWriter should accept a subset of columns
> ------------------------------------------------------------------------------------
>
>                 Key: ARROW-18001
>                 URL: https://issues.apache.org/jira/browse/ARROW-18001
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Alenka Frim
>            Priority: Major
>
> This question came up in the GitHub issue [https://github.com/apache/arrow/issues/14025], and it would be a good improvement to the Parquet part of PyArrow. I haven't found an existing issue, so I created a new one.
> h6. Description:
> If a user wants to change the type of a single column when using {{parquet.write_table}}/{{parquet.ParquetWriter}}, they currently need to specify a schema that includes all columns. If a column is not specified in the schema, it will not be included in the Parquet file.
> h6. Proposal
> It should be possible for {{parquet.ParquetWriter}} to accept a schema covering only a subset of columns and to infer the types of the remaining columns.


