Posted to issues@arrow.apache.org by "Uwe Korn (Jira)" <ji...@apache.org> on 2021/04/16 13:02:00 UTC

[jira] [Created] (ARROW-12420) [C++/Dataset] Reading null columns as dictionary no longer possible

Uwe Korn created ARROW-12420:
--------------------------------

             Summary: [C++/Dataset] Reading null columns as dictionary no longer possible
                 Key: ARROW-12420
                 URL: https://issues.apache.org/jira/browse/ARROW-12420
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
    Affects Versions: 4.0.0
            Reporter: Uwe Korn
             Fix For: 4.0.0


Reading a dataset whose schema declares a dictionary column fails when some of the files contain no data for that column (so the column is typed as null in those files). This broke with https://github.com/apache/arrow/pull/9532. It worked in the 3.0 release, so I would consider this a regression.

This can be reproduced using the following Python snippet:

{code}
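import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.fs  # imported explicitly so that pa.fs is guaranteed to exist
import pyarrow.parquet as pq

# Write a file whose only column holds no data, so it is stored with type null.
table = pa.table({"a": [None, None]})
pq.write_table(table, "test.parquet")

# Ask the dataset to read that column as a dictionary-encoded string column.
schema = pa.schema([pa.field("a", pa.dictionary(pa.int32(), pa.string()))])
fsds = ds.FileSystemDataset.from_paths(
    paths=["test.parquet"],
    schema=schema,
    format=ds.ParquetFileFormat(),
    filesystem=pa.fs.LocalFileSystem(),
)
fsds.to_table()  # raises ArrowNotImplementedError on master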
{code}

The exception on master is currently:

{code}
---------------------------------------------------------------------------
ArrowNotImplementedError                  Traceback (most recent call last)
<ipython-input-14-5f0bc602f16b> in <module>
      6     filesystem=pa.fs.LocalFileSystem(),
      7 )
----> 8 fsds.to_table()

~/Development/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table()
    456         table : Table instance
    457         """
--> 458         return self._scanner(**kwargs).to_table()
    459 
    460     def head(self, int num_rows, **kwargs):

~/Development/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.to_table()
   2887             result = self.scanner.ToTable()
   2888 
-> 2889         return pyarrow_wrap_table(GetResultValue(result))
   2890 
   2891     def take(self, object indices):

~/Development/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
    139 cdef api int pyarrow_internal_check_status(const CStatus& status) \
    140         nogil except -1:
--> 141     return check_status(status)
    142 
    143 

~/Development/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
    116             raise ArrowKeyError(message)
    117         elif status.IsNotImplemented():
--> 118             raise ArrowNotImplementedError(message)
    119         elif status.IsTypeError():
    120             raise ArrowTypeError(message)

ArrowNotImplementedError: Unsupported cast from null to dictionary<values=string, indices=int32, ordered=0> (no available cast function for target type)
{code}
