Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/04/09 14:29:00 UTC
[jira] [Assigned] (ARROW-12314) [Python] pq.read_pandas with use_legacy_dataset=False does not accept columns as a set (kartothek integration failure)

     [ https://issues.apache.org/jira/browse/ARROW-12314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche reassigned ARROW-12314:
---------------------------------------------

    Assignee: Joris Van den Bossche

> [Python] pq.read_pandas with use_legacy_dataset=False does not accept columns as a set (kartothek integration failure)
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-12314
>                 URL: https://issues.apache.org/jira/browse/ARROW-12314
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Joris Van den Bossche
>            Assignee: Joris Van den Bossche
>            Priority: Major
>             Fix For: 4.0.0
>
>
> The kartothek nightly integration builds started to fail (https://github.com/ursacomputing/crossbow/runs/2303373464), presumably because of ARROW-11464 (https://github.com/apache/arrow/pull/9910).
> It seems that the new ParquetDatasetV2 (what you get with {{use_legacy_dataset=False}}) handles the columns argument slightly differently: as the traceback below shows, it concatenates columns with a list, which fails when columns is passed as a set.
> Example failure:
> {code}
> _____________________ test_add_column_to_existing_index[4] _____________________
> store_factory = functools.partial(<function get_store_from_url at 0x7faf12e9d0e0>, 'hfs:///tmp/pytest-of-root/pytest-0/test_add_column_to_existing_in1/store')
> metadata_version = 4
> bound_build_dataset_indices = <function build_dataset_indices at 0x7faf0c509830>
>     def test_add_column_to_existing_index(
>         store_factory, metadata_version, bound_build_dataset_indices
>     ):
>         dataset_uuid = "dataset_uuid"
>         partitions = [
>             pd.DataFrame({"p": [1, 2], "x": [100, 4500]}),
>             pd.DataFrame({"p": [4, 3], "x": [500, 10]}),
>         ]
>     
>         dataset = store_dataframes_as_dataset(
>             dfs=partitions,
>             store=store_factory,
>             dataset_uuid=dataset_uuid,
>             metadata_version=metadata_version,
>             secondary_indices="p",
>         )
>         assert dataset.load_all_indices(store=store_factory()).indices.keys() == {"p"}
>     
>         # Create indices
> >       bound_build_dataset_indices(store_factory, dataset_uuid, columns=["x"])
> kartothek/io/testing/index.py:88: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> /opt/conda/envs/arrow/lib/python3.7/site-packages/decorator.py:231: in fun
>     return caller(func, *(extras + args), **kw)
> kartothek/io_components/utils.py:277: in normalize_args
>     return _wrapper(*args, **kwargs)
> kartothek/io_components/utils.py:275: in _wrapper
>     return function(*args, **kwargs)
> kartothek/io/eager.py:706: in build_dataset_indices
>     mp = mp.load_dataframes(store=ds_factory.store, columns=cols_to_load,)
> kartothek/io_components/metapartition.py:150: in _impl
>     method_return = method(mp, *method_args, **method_kwargs)
> kartothek/io_components/metapartition.py:696: in load_dataframes
>     date_as_object=dates_as_object,
> kartothek/serialization/_generic.py:122: in restore_dataframe
>     date_as_object=date_as_object,
> kartothek/serialization/_parquet.py:302: in restore_dataframe
>     date_as_object=date_as_object,
> kartothek/serialization/_parquet.py:249: in _restore_dataframe
>     table = pq.read_pandas(reader, columns=columns)
> /opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/parquet.py:1768: in read_pandas
>     source, columns=columns, use_pandas_metadata=True, **kwargs
> /opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/parquet.py:1730: in read_table
>     use_pandas_metadata=use_pandas_metadata)
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> self = <pyarrow.parquet._ParquetDatasetV2 object at 0x7faee1ed9550>
> columns = {'x'}, use_threads = True, use_pandas_metadata = True
>     def read(self, columns=None, use_threads=True, use_pandas_metadata=False):
>         """
>         Read (multiple) Parquet files as a single pyarrow.Table.
>     
>         Parameters
>         ----------
>         columns : List[str]
>             Names of columns to read from the dataset. The partition fields
>             are not automatically included (in contrast to when setting
>             ``use_legacy_dataset=True``).
>         use_threads : bool, default True
>             Perform multi-threaded column reads.
>         use_pandas_metadata : bool, default False
>             If True and file has custom pandas schema metadata, ensure that
>             index columns are also loaded.
>     
>         Returns
>         -------
>         pyarrow.Table
>             Content of the file as a table (of columns).
>         """
>         # if use_pandas_metadata, we need to include index columns in the
>         # column selection, to be able to restore those in the pandas DataFrame
>         metadata = self.schema.metadata
>         if columns is not None and use_pandas_metadata:
>             if metadata and b'pandas' in metadata:
>                 # RangeIndex can be represented as dict instead of column name
>                 index_columns = [
>                     col for col in _get_pandas_index_columns(metadata)
>                     if not isinstance(col, dict)
>                 ]
> >               columns = columns + list(set(index_columns) - set(columns))
> E               TypeError: unsupported operand type(s) for +: 'set' and 'list'
> /opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/parquet.py:1598: TypeError
> {code}
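
The TypeError in the traceback above can be reproduced without pyarrow, since it comes from adding a set and a list. The sketch below shows the failing expression and one possible way to avoid it, assuming the remedy is to coerce columns to a list before concatenating. The helper name add_index_columns is hypothetical and is not part of pyarrow; this is not necessarily how the issue was resolved upstream.

```python
def add_index_columns(columns, index_columns):
    """Return `columns` as a list, extended with any missing index columns.

    Hypothetical helper (not pyarrow API). It mirrors the failing line from
    the traceback, but converts `columns` to a list first so that sets and
    tuples are accepted as well.
    """
    columns = list(columns)
    # Keep the order of index_columns instead of relying on set ordering.
    missing = [col for col in index_columns if col not in columns]
    return columns + missing


# The failing expression from the traceback, with plain Python objects:
# the left operand is a set, the right operand is a list.
try:
    {"x"} + list(set(["p"]) - {"x"})
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# With the coercion in place, a set of column names is accepted.
print(add_index_columns({"x"}, ["p"]))
```

Until such a change is available, a caller such as kartothek could presumably work around the failure by passing columns=list(columns) to pq.read_pandas.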



--
This message was sent by Atlassian Jira
(v8.3.4#803005)