Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2022/01/12 10:51:00 UTC

[jira] [Created] (ARROW-15307) [C++][Dataset] Provide more context in error message if cast fails during scanning

Joris Van den Bossche created ARROW-15307:
---------------------------------------------

             Summary: [C++][Dataset] Provide more context in error message if cast fails during scanning
                 Key: ARROW-15307
                 URL: https://issues.apache.org/jira/browse/ARROW-15307
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Joris Van den Bossche


If you have a partitioned dataset, and one of the files contains a column with a mismatching type that cannot be safely cast to the dataset schema's type for that column, you get (as expected) an error about this cast. 

Small illustrative example code:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

import pathlib

## constructing a small dataset with two files

basedir = pathlib.Path(".") / "dataset_test_mismatched_schema"
basedir.mkdir(exist_ok=True)

table1 = pa.table({'a': [1, 2, 3], 'b': [1, 2, 3]})  # column "a" is int64
pq.write_table(table1, basedir / "data1.parquet")

table2 = pa.table({'a': [1.5, 2.0, 3.0], 'b': [1, 2, 3]})  # column "a" is float64 -> type mismatch
pq.write_table(table2, basedir / "data2.parquet")

## reading the dataset

dataset = ds.dataset(basedir)
# by default infer dataset schema from first file
dataset.schema
# actually reading gives expected error
dataset.to_table()
{code}

gives

{code:python}
>>> dataset.schema
a: int64
b: int64
>>> dataset.to_table()
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-1-a2d19a590e3b> in <module>
     22 dataset.schema
     23 # actually reading gives expected error
---> 24 dataset.to_table()

~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table()

~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.to_table()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Float value 1.5 was truncated converting to int64

../src/arrow/compute/kernels/scalar_cast_numeric.cc:177  CheckFloatToIntTruncation(batch[0], *out)
../src/arrow/compute/exec.cc:700  kernel_->exec(kernel_ctx_, batch, &out)
../src/arrow/compute/exec.cc:641  ExecuteBatch(batch, listener)
../src/arrow/compute/function.cc:248  executor->Execute(implicitly_cast_args, &listener)
../src/arrow/compute/exec/expression.cc:444  compute::Cast(column, field->type(), compute::CastOptions::Safe())
../src/arrow/dataset/scanner.cc:816  compute::MakeExecBatch(*scan_options->dataset_schema, partial.record_batch.value)
{code}

The actual error message (without the extra C++ context) is thus only *"ArrowInvalid: Float value 1.5 was truncated converting to int64"*.

This error message only mentions the two types and the first value that could not be cast. But if you have a large dataset with many fragments and/or many columns, it can be hard to know 1) for which column the cast is failing and 2) for which fragment it is failing.
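
Until the error itself carries that context, a manual workaround is to scan fragment by fragment, so that a failure can at least be tied to a file. A rough sketch, continuing the example above (it assumes the mismatching column keeps its name across files):

{code:python}
# Workaround sketch: scan each fragment separately; on failure, diff the
# fragment's physical schema against the dataset schema to find the
# offending column(s).
for fragment in dataset.get_fragments():
    try:
        fragment.to_table(schema=dataset.schema)
    except pa.ArrowInvalid as exc:
        print("Cast failed for fragment:", fragment.path)
        for phys_field in fragment.physical_schema:
            idx = dataset.schema.get_field_index(phys_field.name)
            if idx != -1 and phys_field.type != dataset.schema.field(idx).type:
                print(f"  column {phys_field.name!r}: file has {phys_field.type}, "
                      f"dataset schema expects {dataset.schema.field(idx).type}")
        print("  error:", exc)
{code}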

It would therefore be nice to add some extra context to the error message.  
The cast itself of course doesn't have this information, but at the point where the scanner code performs the cast we do know e.g. the physical schema and the dataset schema, so we could prepend or append something like "Casting from schema1 to schema2 failed with ..." to the error message. 
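
For illustration only (the actual change would live in the C++ scanner, around the {{scanner.cc}} frame in the traceback above), the wrapping could look roughly like this Python sketch; {{cast_with_context}} is a hypothetical helper, not an existing API:

{code:python}
import pyarrow as pa

# Hypothetical sketch of the proposed behaviour: re-raise the cast error
# with the context that is known at the call site.
def cast_with_context(column, physical_field, dataset_field):
    try:
        return column.cast(dataset_field.type)
    except pa.ArrowInvalid as exc:
        raise pa.ArrowInvalid(
            f"Casting column {dataset_field.name!r} from {physical_field.type} "
            f"to {dataset_field.type} failed: {exc}"
        ) from exc
{code}

If the fragment path is also available at that point in the scanner, including it would additionally pinpoint the failing file.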

cc [~alenkaf]


