Posted to issues@arrow.apache.org by "Christian Thiel (JIRA)" <ji...@apache.org> on 2019/06/13 05:56:00 UTC

[jira] [Comment Edited] (ARROW-3861) [Python] ParquetDataset().read columns argument always returns partition column

    [ https://issues.apache.org/jira/browse/ARROW-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862735#comment-16862735 ] 

Christian Thiel edited comment on ARROW-3861 at 6/13/19 5:55 AM:
-----------------------------------------------------------------

[~jorisvandenbossche] thanks for the info.

Yes, my intention for "new_column" is that it gets added. That is, however, not primarily related to this issue; the code example above is just my usual test case for my own code, which modifies the DataFrame to match a schema beforehand.

In my opinion the schema should be the single source of truth. Columns of the DataFrame which are not part of the schema should therefore be dropped (or raise an error), and columns which are in the schema but missing from the DataFrame should be added, filled with the null value corresponding to the schema dtype (or, again, raise an error).
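
To make this concrete, the following is roughly the behaviour I have in mind (just a sketch of my own helper, not an existing Arrow API; whether to silently drop/add or to raise instead is a policy choice):

{code}
import pandas as pd
import pyarrow as pa

def align_to_schema(df, schema):
    # Sketch: keep only the columns defined in the pyarrow schema and
    # add any missing ones as all-null columns.
    out = pd.DataFrame(index=df.index)
    for field in schema:                # a pa.Schema iterates over its fields
        if field.name in df.columns:
            out[field.name] = df[field.name]
        else:
            out[field.name] = None      # all-null column for the missing field
    return out

# aligned = align_to_schema(df, my_schema)
# table = pa.Table.from_pandas(aligned, schema=my_schema)
{code}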

I am not sure how the index should be handled. I really do not like that we cannot specify its dtype. I believe this is because the index is stored in the Parquet pandas metadata, which also means that the information in the index, presumably the most important column, is not as easily accessible across platforms as a regular column. For all my applications I have stopped writing the index to the Parquet file and use a regular Parquet column instead. If you make sure that column is the first column, the performance impact when reading from S3 is minimal, as no extra seek needs to be performed. This is also supported by the fact that `write_to_dataset` no longer supports index preservation.
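
For illustration, this is roughly the pattern I use instead (a sketch; `write_index_as_column` is my own helper, not a pyarrow function):

{code}
import pyarrow as pa
import pyarrow.parquet as pq

def write_index_as_column(df, root_path, partition_cols=None):
    # Materialise the index as an ordinary column; reset_index() makes it the
    # first column, so reading it from S3 costs no extra seek compared to a
    # metadata-preserved index.
    df = df.reset_index()
    table = pa.Table.from_pandas(df, preserve_index=False)
    pq.write_to_dataset(table, root_path=root_path, partition_cols=partition_cols)
{code}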

The only other major thing bothering me is that integer columns cannot hold NaN. I really like the pandas nullable Int64 columns, but as far as I know they are not yet supported by Parquet, so that is a problem for another day.
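
For reference, this is the pandas dtype I mean; it behaves well in pandas, but as far as I know it does not yet round-trip through Parquet as a nullable integer:

{code}
import pandas as pd

s = pd.Series([1, 2, None], dtype='Int64')  # pandas nullable integer dtype
print(s.dtype)    # Int64
print(s.isna())   # the missing entry stays missing instead of forcing float64
{code}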


> [Python] ParquetDataset().read columns argument always returns partition column
> -------------------------------------------------------------------------------
>
>                 Key: ARROW-3861
>                 URL: https://issues.apache.org/jira/browse/ARROW-3861
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Christian Thiel
>            Priority: Major
>              Labels: dataset, parquet, python
>             Fix For: 0.15.0
>
>
> I just noticed that no matter which columns are specified when loading a dataset, the partition column is always returned. This can lead to surprising behaviour, as the resulting DataFrame has more columns than expected:
> {code}
> import os
> import shutil
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'
> if os.path.exists(PATH_PYARROW_MANUAL):
>     shutil.rmtree(PATH_PYARROW_MANUAL)
> os.mkdir(PATH_PYARROW_MANUAL)
> arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan])
> strings = np.array([np.nan, np.nan, 'a', 'b'])
> df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
> df.index.name = 'DPRD_ID'
> df['arrays'] = pd.Series(arrays)
> df['strings'] = pd.Series(strings)
> my_schema = pa.schema([('DPRD_ID', pa.int64()),
>                        ('partition_column', pa.int32()),
>                        ('arrays', pa.list_(pa.int32())),
>                        ('strings', pa.string()),
>                        ('new_column', pa.string())])
> table = pa.Table.from_pandas(df, schema=my_schema)
> pq.write_to_dataset(table, root_path=PATH_PYARROW_MANUAL, partition_cols=['partition_column'])
> df_pq = pq.ParquetDataset(PATH_PYARROW_MANUAL).read(columns=['DPRD_ID', 'strings']).to_pandas()
> # pd.read_parquet(PATH_PYARROW_MANUAL, columns=['DPRD_ID', 'strings'], engine='pyarrow')
> df_pq
> {code}
> `df_pq` has the column `partition_column` even though it was not among the requested columns.


