Posted to jira@arrow.apache.org by "David Li (Jira)" <ji...@apache.org> on 2021/09/13 14:35:00 UTC

[jira] [Comment Edited] (ARROW-13982) dataset scanner stalls when reading parquet with filtering.

    [ https://issues.apache.org/jira/browse/ARROW-13982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414214#comment-17414214 ] 

David Li edited comment on ARROW-13982 at 9/13/21, 2:34 PM:
------------------------------------------------------------

From digging around I don't think we can avoid sending an empty batch, given the pipeline doesn't really have any other way to carry information around. Unfortunately this will mean you might get the occasional empty batch, at least from the unordered scan (we could filter them out in the ordered scan if we really cared). This will also happen with empty CSV/Feather files (and presumably ORC as well).
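
For consumers that would rather not see these occasional empty batches, a minimal sketch of a filter on the reading side follows. This is not part of the fix itself; the helper name `non_empty_batches` is hypothetical, and the only assumption is that each batch exposes a `num_rows` attribute, as `pyarrow.RecordBatch` does.

```python
from typing import Iterable, Iterator


def non_empty_batches(batches: Iterable) -> Iterator:
    """Yield only the batches that contain at least one row.

    Hypothetical helper: assumes each item has a `num_rows`
    attribute, as pyarrow.RecordBatch does.
    """
    for batch in batches:
        # A zero-row batch carries no data, so it is safe to skip.
        if batch.num_rows > 0:
            yield batch
```

With pyarrow this could wrap the scanner output, e.g. `for batch in non_empty_batches(scanner.to_batches()): ...`. The helper is lazy, so it does not change how batches are pulled from the scan.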


was (Author: lidavidm):
From digging around I don't think we can avoid sending an empty batch, given the pipeline doesn't really have any other way to carry information around. Unfortunately this will mean you might get the occasional empty batch, at least from the unordered scan (we could filter them out in the ordered scan if we really cared). I think we should also check if this case is possible with empty CSV/Feather/ORC files.

> dataset scanner stalls when reading parquet with filtering.
> -----------------------------------------------------------
>
>                 Key: ARROW-13982
>                 URL: https://issues.apache.org/jira/browse/ARROW-13982
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 5.0.0
>         Environment: ubuntu 18.04 LTS
>            Reporter: Huxley Hu
>            Priority: Major
>              Labels: query-engine
>             Fix For: 6.0.0
>
>         Attachments: repro.py
>
>
> Reading Parquet files with the dataset scanner may stall because of a future that never finishes.
> To reproduce this case, one needs two Parquet files and a filter expression that filters out one file completely. After that, call `AsyncScanner::ToRecordBatchReader` and read data continuously.
> I have also dug into this bug a little. It's caused by the `MakeEmptyGenerator<std::shared_ptr<RecordBatch>>` when the filtered row groups are empty, which is ignored by `FragmentToBatches` and causes SequencingGenerator to stall.
> A quick fix is to return a record batch with 0 rows instead of returning a nullptr there.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)