Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2022/01/12 10:51:00 UTC
[jira] [Created] (ARROW-15307) [C++][Dataset] Provide more context in error message if cast fails during scanning
Joris Van den Bossche created ARROW-15307:
---------------------------------------------
Summary: [C++][Dataset] Provide more context in error message if cast fails during scanning
Key: ARROW-15307
URL: https://issues.apache.org/jira/browse/ARROW-15307
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Joris Van den Bossche
If you have a partitioned dataset, and one of the files contains a column whose type mismatches the dataset schema's type for that column and cannot be safely cast to it, you get (as expected) an error about this cast.
Small illustrative example code:
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
import pathlib
## constructing a small dataset with two files
basedir = pathlib.Path(".") / "dataset_test_mismatched_schema"
basedir.mkdir(exist_ok=True)
table1 = pa.table({'a': [1, 2, 3], 'b': [1, 2, 3]})
pq.write_table(table1, basedir / "data1.parquet")
table2 = pa.table({'a': [1.5, 2.0, 3.0], 'b': [1, 2, 3]})
pq.write_table(table2, basedir / "data2.parquet")
## reading the dataset
dataset = ds.dataset(basedir)
# by default infer dataset schema from first file
dataset.schema
# actually reading gives expected error
dataset.to_table()
{code}
gives
{code:python}
>>> dataset.schema
a: int64
b: int64
>>> dataset.to_table()
---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
<ipython-input-1-a2d19a590e3b> in <module>
22 dataset.schema
23 # actually reading gives expected error
---> 24 dataset.to_table()
~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table()
~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.to_table()
~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowInvalid: Float value 1.5 was truncated converting to int64
../src/arrow/compute/kernels/scalar_cast_numeric.cc:177 CheckFloatToIntTruncation(batch[0], *out)
../src/arrow/compute/exec.cc:700 kernel_->exec(kernel_ctx_, batch, &out)
../src/arrow/compute/exec.cc:641 ExecuteBatch(batch, listener)
../src/arrow/compute/function.cc:248 executor->Execute(implicitly_cast_args, &listener)
../src/arrow/compute/exec/expression.cc:444 compute::Cast(column, field->type(), compute::CastOptions::Safe())
../src/arrow/dataset/scanner.cc:816 compute::MakeExecBatch(*scan_options->dataset_schema, partial.record_batch.value)
{code}
So the actual error message (without the extra C++ context) is only *"ArrowInvalid: Float value 1.5 was truncated converting to int64"*.
This message only mentions the two types and the first value that cannot be cast. If you have a large dataset with many fragments and/or many columns, it can be hard to know 1) for which column the cast is failing and 2) for which fragment it is failing.
So it would be nice to add some extra context to the error message.
The cast kernel itself of course doesn't have this information, but when the cast is done in the scanner code, we do know e.g. the physical schema and the dataset schema, so we could append or prepend something like "Casting from schema1 to schema2 failed with ..." to the error message.
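To make the proposal concrete, here is a minimal sketch in plain Python (not the actual Arrow C++ code; all function names are hypothetical stand-ins) of the pattern being suggested: the low-level cast raises a bare error, and the scanner-level caller, which knows the column and fragment being processed, re-raises it with that context attached.

```python
# Hypothetical sketch of the proposed error-wrapping pattern.
# safe_cast_to_int stands in for Arrow's safe cast kernel;
# cast_column_with_context stands in for the scanner-level cast
# in scanner.cc, which knows the column/fragment context.

def safe_cast_to_int(value):
    # Stand-in for a safe cast kernel: refuses lossy float -> int casts,
    # mirroring "Float value 1.5 was truncated converting to int64".
    if isinstance(value, float) and not value.is_integer():
        raise ValueError(f"Float value {value} was truncated converting to int64")
    return int(value)

def cast_column_with_context(values, column_name, fragment_path):
    # The caller enriches the low-level error with the column name and
    # fragment path before propagating it.
    try:
        return [safe_cast_to_int(v) for v in values]
    except ValueError as exc:
        raise ValueError(
            f"Casting column '{column_name}' of fragment '{fragment_path}' "
            f"failed: {exc}"
        ) from exc
```

With this pattern, the failing call from the example above would report something like "Casting column 'a' of fragment 'data2.parquet' failed: Float value 1.5 was truncated converting to int64", which immediately answers both questions (which column, which fragment).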
cc [~alenkaf]
--
This message was sent by Atlassian Jira
(v8.20.1#820001)