Posted to jira@arrow.apache.org by "Krisztian Szucs (Jira)" <ji...@apache.org> on 2021/04/18 23:41:00 UTC
[jira] [Resolved] (ARROW-12420) [C++/Dataset] Reading null columns as dictionary no longer possible
[ https://issues.apache.org/jira/browse/ARROW-12420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Krisztian Szucs resolved ARROW-12420.
-------------------------------------
Resolution: Fixed
Issue resolved by pull request 10093
[https://github.com/apache/arrow/pull/10093]
> [C++/Dataset] Reading null columns as dictionary no longer possible
> -------------------------------------------------------------------
>
> Key: ARROW-12420
> URL: https://issues.apache.org/jira/browse/ARROW-12420
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 4.0.0
> Reporter: Uwe Korn
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> Reading a dataset with a dictionary column, where some of the files contain no data for that column (and are therefore typed as null), broke with https://github.com/apache/arrow/pull/9532. It worked in the 3.0 release, so I would consider this a regression.
> This can be reproduced using the following Python snippet:
> {code}
> import pyarrow as pa
> import pyarrow.dataset as ds
> import pyarrow.fs  # make pa.fs available explicitly
> import pyarrow.parquet as pq
>
> # Write a file whose only column is all-null; its type is inferred as null.
> table = pa.table({"a": [None, None]})
> pq.write_table(table, "test.parquet")
>
> # Read it back through a dataset whose schema declares the column as a dictionary.
> schema = pa.schema([pa.field("a", pa.dictionary(pa.int32(), pa.string()))])
> fsds = ds.FileSystemDataset.from_paths(
>     paths=["test.parquet"],
>     schema=schema,
>     format=ds.ParquetFileFormat(),
>     filesystem=pa.fs.LocalFileSystem(),
> )
> fsds.to_table()
> {code}
> The exception on master is currently:
> {code}
> ---------------------------------------------------------------------------
> ArrowNotImplementedError Traceback (most recent call last)
> <ipython-input-14-5f0bc602f16b> in <module>
> 6 filesystem=pa.fs.LocalFileSystem(),
> 7 )
> ----> 8 fsds.to_table()
> ~/Development/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table()
> 456 table : Table instance
> 457 """
> --> 458 return self._scanner(**kwargs).to_table()
> 459
> 460 def head(self, int num_rows, **kwargs):
> ~/Development/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.to_table()
> 2887 result = self.scanner.ToTable()
> 2888
> -> 2889 return pyarrow_wrap_table(GetResultValue(result))
> 2890
> 2891 def take(self, object indices):
> ~/Development/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
> 139 cdef api int pyarrow_internal_check_status(const CStatus& status) \
> 140 nogil except -1:
> --> 141 return check_status(status)
> 142
> 143
> ~/Development/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
> 116 raise ArrowKeyError(message)
> 117 elif status.IsNotImplemented():
> --> 118 raise ArrowNotImplementedError(message)
> 119 elif status.IsTypeError():
> 120 raise ArrowTypeError(message)
> ArrowNotImplementedError: Unsupported cast from null to dictionary<values=string, indices=int32, ordered=0> (no available cast function for target type)
> {code}