Posted to jira@arrow.apache.org by "Ben Kietzman (Jira)" <ji...@apache.org> on 2020/09/28 16:22:00 UTC

[jira] [Assigned] (ARROW-10008) [Python] pyarrow.parquet.read_table fails with predicate pushdown on categorical data with use_legacy_dataset=False

     [ https://issues.apache.org/jira/browse/ARROW-10008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ben Kietzman reassigned ARROW-10008:
------------------------------------

    Assignee: Ben Kietzman

> [Python] pyarrow.parquet.read_table fails with predicate pushdown on categorical data with use_legacy_dataset=False
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-10008
>                 URL: https://issues.apache.org/jira/browse/ARROW-10008
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.17.1, 1.0.1
>         Environment: Platform: Linux-5.8.9-050809-generic-x86_64-with-glibc2.10
> Python version: 3.8.5 (default, Aug  5 2020, 08:36:46) 
> [GCC 7.3.0]
> Pandas version: 1.1.2
> pyarrow version: 1.0.1
>            Reporter: Caleb Hattingh
>            Assignee: Ben Kietzman
>            Priority: Major
>              Labels: categorical, category, dataset, filters, parquet, predicate
>             Fix For: 2.0.0
>
>
> I apologise if this is a known issue; I looked both in this issue tracker and on github and I didn't find it.
> There seems to be a problem reading a dataset with predicate pushdown (filters) on columns with categorical data. The problem only occurs with `use_legacy_dataset=False` (with `use_legacy_dataset=True`, the filter has no effect anyway if the column isn't a partition key).
> Reproducer:
> {code:python}
> import shutil
> import sys, platform
> from pathlib import Path
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> # Settings
> CATEGORICAL_DTYPE = True
> USE_LEGACY_DATASET = False
> print('Platform:', platform.platform())
> print('Python version:', sys.version)
> print('Pandas version:', pd.__version__)
> print('pyarrow version:', pa.__version__)
> print('categorical enabled:', CATEGORICAL_DTYPE)
> print('use_legacy_dataset:', USE_LEGACY_DATASET)
> print()
> # Clean up test dataset if present
> path = Path('blah.parquet')
> if path.exists():
>     shutil.rmtree(str(path))
> # Simple data
> d = dict(col1=['a', 'b'], col2=[1, 2])
> # Either categorical or not
> if CATEGORICAL_DTYPE:
>     df = pd.DataFrame(data=d, dtype='category')
> else:
>     df = pd.DataFrame(data=d)
> # Write dataset
> table = pa.Table.from_pandas(df)
> pq.write_to_dataset(table, str(path))
> # Load dataset
> table = pq.read_table(
>     str(path),
>     filters=[('col1', '=', 'a')],
>     use_legacy_dataset=USE_LEGACY_DATASET,
> )
> df = table.to_pandas()
> print(df.dtypes)
> print(repr(df))
> {code}
>  Output:
> {code:java}
> $ python categorical_predicate_pushdown.py 
> Platform: Linux-5.8.9-050809-generic-x86_64-with-glibc2.10
> Python version: 3.8.5 (default, Aug  5 2020, 08:36:46) 
> [GCC 7.3.0]
> Pandas version: 1.1.2
> pyarrow version: 1.0.1
> categorical enabled: True
> use_legacy_dataset: False
> /arrow/cpp/src/arrow/result.cc:28: ValueOrDie called on an error: Type error: Cannot compare scalars of differing type: dictionary<values=string, indices=int32, ordered=0> vs string
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow.so.100(+0x4fc128)[0x7f50568c6128]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow4util8ArrowLogD1Ev+0xdd)[0x7f50568c693d]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow8internal14DieWithMessageERKSs+0x51)[0x7f50569757c1]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow8internal17InvalidValueOrDieERKNS_6StatusE+0x4c)[0x7f505697716c]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset20ComparisonExpression21AssumeGivenComparisonERKS1_+0x438)[0x7f5043334f18]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset20ComparisonExpression6AssumeERKNS0_10ExpressionE+0x34)[0x7f5043334fa4]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset20ComparisonExpression6AssumeERKNS0_10ExpressionE+0xce)[0x7f504333503e]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset20ComparisonExpression6AssumeERKNS0_10ExpressionE+0xce)[0x7f504333503e]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset12RowGroupInfo7SatisfyERKNS0_10ExpressionE+0x1c)[0x7f50433116ac]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZN5arrow7dataset19ParquetFileFragment15FilterRowGroupsERKNS0_10ExpressionE+0x563)[0x7f5043311cb3]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset17ParquetFileFormat8ScanFileESt10shared_ptrINS0_11ScanOptionsEES2_INS0_11ScanContextEEPNS0_12FileFragmentE+0x203)[0x7f50433168a3]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZN5arrow7dataset12FileFragment4ScanESt10shared_ptrINS0_11ScanOptionsEES2_INS0_11ScanContextEE+0x55)[0x7f5043329785]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZZN5arrow7dataset19GetScanTaskIteratorENS_8IteratorISt10shared_ptrINS0_8FragmentEEEES2_INS0_11ScanOptionsEES2_INS0_11ScanContextEEENKUlS4_E_clES4_+0x91)[0x7f50433485a1]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZN5arrow8IteratorINS0_ISt10shared_ptrINS_7dataset8ScanTaskEEEEE4NextINS_11MapIteratorIZNS2_19GetScanTaskIteratorENS0_IS1_INS2_8FragmentEEEES1_INS2_11ScanOptionsEES1_INS2_11ScanContextEEEUlSA_E_SA_S5_EEEENS_6ResultIS5_EEPv+0xde)[0x7f504334b55e]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZN5arrow15FlattenIteratorISt10shared_ptrINS_7dataset8ScanTaskEEE4NextEv+0x127)[0x7f50433616b7]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZN5arrow8IteratorISt10shared_ptrINS_7dataset8ScanTaskEEE4NextINS_15FlattenIteratorIS4_EEEENS_6ResultIS4_EEPv+0x14)[0x7f5043361874]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZN5arrow7dataset7Scanner7ToTableEv+0x611)[0x7f5043336691]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/_dataset.cpython-38-x86_64-linux-gnu.so(+0x3b150)[0x7f50435c9150]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/_dataset.cpython-38-x86_64-linux-gnu.so(+0x2c0eb)[0x7f50435ba0eb]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/_dataset.cpython-38-x86_64-linux-gnu.so(+0x2d9ab)[0x7f50435bb9ab]
> python(PyCFunction_Call+0x56)[0x562843a6dce6]
> python(_PyObject_MakeTpCall+0x22f)[0x562843a2b5cf]
> python(_PyEval_EvalFrameDefault+0x11d7)[0x562843aaf727]
> python(_PyEval_EvalCodeWithName+0x2d2)[0x562843a78802]
> python(+0x18bb80)[0x562843a79b80]
> python(+0x1001e3)[0x5628439ee1e3]
> python(_PyEval_EvalCodeWithName+0x2d2)[0x562843a78802]
> python(_PyFunction_Vectorcall+0x1e3)[0x562843a797a3]
> python(+0x1001e3)[0x5628439ee1e3]
> python(_PyEval_EvalCodeWithName+0x2d2)[0x562843a78802]
> python(PyEval_EvalCodeEx+0x44)[0x562843a795b4]
> python(PyEval_EvalCode+0x1c)[0x562843b07bdc]
> python(+0x219c84)[0x562843b07c84]
> python(+0x24be94)[0x562843b39e94]
> python(PyRun_FileExFlags+0xa1)[0x562843a0279a]
> python(PyRun_SimpleFileExFlags+0x3b4)[0x562843a02b7f]
> python(+0x115a44)[0x562843a03a44]
> python(Py_BytesMain+0x39)[0x562843b3c9b9]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f5058f2a0b3]
> python(+0x1dea83)[0x562843acca83]
> Aborted (core dumped)
> {code}
> With `CATEGORICAL_DTYPE = False`, it works as expected:
> {code:java}
> $ python categorical_predicate_pushdown.py 
> Platform: Linux-5.8.9-050809-generic-x86_64-with-glibc2.10
> Python version: 3.8.5 (default, Aug  5 2020, 08:36:46) 
> [GCC 7.3.0]
> Pandas version: 1.1.2
> pyarrow version: 1.0.1
> categorical enabled: False
> use_legacy_dataset: False
> col1    object
> col2     int64
> dtype: object
>   col1  col2
> 0    a     1
> {code}
>  
>  
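
The type error in the report compares `dictionary<values=string, indices=int32, ordered=0>` with `string`: the pandas categorical column is stored dictionary-encoded, while the filter literal `'a'` is a plain string. The following is a minimal pure-Python sketch of that mismatch and of one way such a comparison could be resolved, by looking the literal up in the dictionary first. It is only an illustration: `DictionaryColumn`, `encode` and `equals_mask` are hypothetical names, and this is neither Arrow's implementation nor the fix applied for ARROW-10008.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class DictionaryColumn:
    """Toy dictionary-encoded column (hypothetical, not an Arrow class).

    Each row stores an integer index into `values` instead of the string itself.
    """
    values: List[str]   # distinct categories, e.g. ['a', 'b']
    indices: List[int]  # one entry per row, pointing into `values`

    @classmethod
    def encode(cls, rows: Sequence[str]) -> "DictionaryColumn":
        values: List[str] = []
        positions: Dict[str, int] = {}
        indices: List[int] = []
        for item in rows:
            if item not in positions:
                positions[item] = len(values)
                values.append(item)
            indices.append(positions[item])
        return cls(values, indices)

    def equals_mask(self, literal: str) -> List[bool]:
        """Row mask for `column == literal`, where `literal` is a plain string.

        The literal is translated to its dictionary index once, and the
        comparison is then done on the integer indices.
        """
        try:
            code = self.values.index(literal)
        except ValueError:
            # The literal is not a known category, so no row can match.
            return [False] * len(self.indices)
        return [index == code for index in self.indices]


# Same data as the reproducer: col1=['a', 'b'], col2=[1, 2]
col1 = DictionaryColumn.encode(['a', 'b'])
col2 = [1, 2]

# Equivalent of filters=[('col1', '=', 'a')]
mask = col1.equals_mask('a')
filtered = [
    (col1.values[index], value)
    for index, value, keep in zip(col1.indices, col2, mask)
    if keep
]
print(filtered)  # [('a', 1)]
```

The resulting row matches the one shown in the non-categorical output above.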



--
This message was sent by Atlassian Jira
(v8.3.4#803005)