Posted to jira@arrow.apache.org by "Caleb Hattingh (Jira)" <ji...@apache.org> on 2020/09/15 00:41:00 UTC

[jira] [Created] (ARROW-10008) pyarrow.parquet.read_table fails with predicate pushdown on categorical data with use_legacy_dataset=False

Caleb Hattingh created ARROW-10008:
--------------------------------------

             Summary: pyarrow.parquet.read_table fails with predicate pushdown on categorical data with use_legacy_dataset=False
                 Key: ARROW-10008
                 URL: https://issues.apache.org/jira/browse/ARROW-10008
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Python
    Affects Versions: 1.0.1, 0.17.1
         Environment: Platform: Linux-5.8.9-050809-generic-x86_64-with-glibc2.10
Python version: 3.8.5 (default, Aug  5 2020, 08:36:46) 
[GCC 7.3.0]
Pandas version: 1.1.2
pyarrow version: 1.0.1

            Reporter: Caleb Hattingh


I apologise if this is a known issue; I searched both this issue tracker and GitHub and didn't find it.

There seems to be a problem reading a dataset with predicate pushdown (filters) on columns with categorical data. The problem only occurs with `use_legacy_dataset=False` (with `use_legacy_dataset=True` there is no crash, but the filter has no effect if the column isn't a partition key).

Reproducer:
{code:python}
import shutil
import sys, platform
from pathlib import Path
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

# Settings
CATEGORICAL_DTYPE = True
USE_LEGACY_DATASET = False

print('Platform:', platform.platform())
print('Python version:', sys.version)
print('Pandas version:', pd.__version__)
print('pyarrow version:', pa.__version__)
print('categorical enabled:', CATEGORICAL_DTYPE)
print('use_legacy_dataset:', USE_LEGACY_DATASET)
print()

# Clean up test dataset if present
path = Path('blah.parquet')
if path.exists():
    shutil.rmtree(str(path))

# Simple data
d = dict(col1=['a', 'b'], col2=[1, 2])

# Either categorical or not
if CATEGORICAL_DTYPE:
    df = pd.DataFrame(data=d, dtype='category')
else:
    df = pd.DataFrame(data=d)

# Write dataset
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, str(path))

# Load dataset
table = pq.read_table(
    str(path),
    filters=[('col1', '=', 'a')],
    use_legacy_dataset=USE_LEGACY_DATASET,
)
df = table.to_pandas()
print(df.dtypes)
print(repr(df))

{code}
 Output:
{code:java}
$ python categorical_predicate_pushdown.py 
Platform: Linux-5.8.9-050809-generic-x86_64-with-glibc2.10
Python version: 3.8.5 (default, Aug  5 2020, 08:36:46) 
[GCC 7.3.0]
Pandas version: 1.1.2
pyarrow version: 1.0.1
categorical enabled: True
use_legacy_dataset: False

/arrow/cpp/src/arrow/result.cc:28: ValueOrDie called on an error: Type error: Cannot compare scalars of differing type: dictionary<values=string, indices=int32, ordered=0> vs string
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow.so.100(+0x4fc128)[0x7f50568c6128]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow4util8ArrowLogD1Ev+0xdd)[0x7f50568c693d]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow8internal14DieWithMessageERKSs+0x51)[0x7f50569757c1]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow8internal17InvalidValueOrDieERKNS_6StatusE+0x4c)[0x7f505697716c]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset20ComparisonExpression21AssumeGivenComparisonERKS1_+0x438)[0x7f5043334f18]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset20ComparisonExpression6AssumeERKNS0_10ExpressionE+0x34)[0x7f5043334fa4]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset20ComparisonExpression6AssumeERKNS0_10ExpressionE+0xce)[0x7f504333503e]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset20ComparisonExpression6AssumeERKNS0_10ExpressionE+0xce)[0x7f504333503e]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset12RowGroupInfo7SatisfyERKNS0_10ExpressionE+0x1c)[0x7f50433116ac]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZN5arrow7dataset19ParquetFileFragment15FilterRowGroupsERKNS0_10ExpressionE+0x563)[0x7f5043311cb3]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset17ParquetFileFormat8ScanFileESt10shared_ptrINS0_11ScanOptionsEES2_INS0_11ScanContextEEPNS0_12FileFragmentE+0x203)[0x7f50433168a3]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZN5arrow7dataset12FileFragment4ScanESt10shared_ptrINS0_11ScanOptionsEES2_INS0_11ScanContextEE+0x55)[0x7f5043329785]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZZN5arrow7dataset19GetScanTaskIteratorENS_8IteratorISt10shared_ptrINS0_8FragmentEEEES2_INS0_11ScanOptionsEES2_INS0_11ScanContextEEENKUlS4_E_clES4_+0x91)[0x7f50433485a1]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZN5arrow8IteratorINS0_ISt10shared_ptrINS_7dataset8ScanTaskEEEEE4NextINS_11MapIteratorIZNS2_19GetScanTaskIteratorENS0_IS1_INS2_8FragmentEEEES1_INS2_11ScanOptionsEES1_INS2_11ScanContextEEEUlSA_E_SA_S5_EEEENS_6ResultIS5_EEPv+0xde)[0x7f504334b55e]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZN5arrow15FlattenIteratorISt10shared_ptrINS_7dataset8ScanTaskEEE4NextEv+0x127)[0x7f50433616b7]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZN5arrow8IteratorISt10shared_ptrINS_7dataset8ScanTaskEEE4NextINS_15FlattenIteratorIS4_EEEENS_6ResultIS4_EEPv+0x14)[0x7f5043361874]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZN5arrow7dataset7Scanner7ToTableEv+0x611)[0x7f5043336691]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/_dataset.cpython-38-x86_64-linux-gnu.so(+0x3b150)[0x7f50435c9150]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/_dataset.cpython-38-x86_64-linux-gnu.so(+0x2c0eb)[0x7f50435ba0eb]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/_dataset.cpython-38-x86_64-linux-gnu.so(+0x2d9ab)[0x7f50435bb9ab]
python(PyCFunction_Call+0x56)[0x562843a6dce6]
python(_PyObject_MakeTpCall+0x22f)[0x562843a2b5cf]
python(_PyEval_EvalFrameDefault+0x11d7)[0x562843aaf727]
python(_PyEval_EvalCodeWithName+0x2d2)[0x562843a78802]
python(+0x18bb80)[0x562843a79b80]
python(+0x1001e3)[0x5628439ee1e3]
python(_PyEval_EvalCodeWithName+0x2d2)[0x562843a78802]
python(_PyFunction_Vectorcall+0x1e3)[0x562843a797a3]
python(+0x1001e3)[0x5628439ee1e3]
python(_PyEval_EvalCodeWithName+0x2d2)[0x562843a78802]
python(PyEval_EvalCodeEx+0x44)[0x562843a795b4]
python(PyEval_EvalCode+0x1c)[0x562843b07bdc]
python(+0x219c84)[0x562843b07c84]
python(+0x24be94)[0x562843b39e94]
python(PyRun_FileExFlags+0xa1)[0x562843a0279a]
python(PyRun_SimpleFileExFlags+0x3b4)[0x562843a02b7f]
python(+0x115a44)[0x562843a03a44]
python(Py_BytesMain+0x39)[0x562843b3c9b9]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f5058f2a0b3]
python(+0x1dea83)[0x562843acca83]
Aborted (core dumped)
{code}
With `CATEGORICAL_DTYPE = False`, it works as expected:
{code:java}
$ python categorical_predicate_pushdown.py 
Platform: Linux-5.8.9-050809-generic-x86_64-with-glibc2.10
Python version: 3.8.5 (default, Aug  5 2020, 08:36:46) 
[GCC 7.3.0]
Pandas version: 1.1.2
pyarrow version: 1.0.1
categorical enabled: False
use_legacy_dataset: False

col1    object
col2     int64
dtype: object
  col1  col2
0    a     1

{code}


--
This message was sent by Atlassian Jira
(v8.3.4#803005)