Posted to jira@arrow.apache.org by "David Li (Jira)" <ji...@apache.org> on 2021/04/20 22:40:00 UTC

[jira] [Comment Edited] (ARROW-12487) [C++][Dataset] ScanBatches() hangs if there's an error during scanning

    [ https://issues.apache.org/jira/browse/ARROW-12487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326132#comment-17326132 ] 

David Li edited comment on ARROW-12487 at 4/20/21, 10:39 PM:
-------------------------------------------------------------

This is indeed a regression from 3.0 to 4.0. See the attached file and this script:

{code:python}
import pyarrow
import pyarrow.csv
import pyarrow.dataset

root = "test.csv"
ds = pyarrow.dataset.dataset(root, format="csv")
fragments = ds.get_fragments()
fragment = next(fragments)
# Immediately errors in 3.0, hangs forever in 4.0
print(list(fragment.to_batches()))
{code}
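
Because the 4.0 failure mode is a hang rather than an exception, a bounded wait can make the reproduction easier to run unattended. The following is a minimal sketch using only the standard library; `run_with_timeout` is a hypothetical helper introduced here for illustration and is not part of pyarrow.

```python
import threading


def run_with_timeout(fn, timeout_s):
    """Run fn() in a daemon thread and classify the outcome.

    Hypothetical helper (not a pyarrow API). Returns one of:
      ("ok", value)    fn returned normally
      ("error", exc)   fn raised an exception
      ("hang", None)   fn did not finish within timeout_s seconds
    """
    result = {}

    def target():
        try:
            result["value"] = fn()
        except Exception as exc:  # capture the failure for the caller
            result["error"] = exc

    # A daemon thread does not keep the interpreter alive if fn never returns.
    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_s)

    if worker.is_alive():
        return ("hang", None)
    if "error" in result:
        return ("error", result["error"])
    return ("ok", result["value"])
```

With the script above, one would call `run_with_timeout(lambda: list(fragment.to_batches()), 10.0)`; the 10-second limit is an arbitrary assumption. Note that a hung call is only detected, not cancelled: the worker thread stays blocked in the background until the process exits.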


was (Author: lidavidm):
This is indeed a regression from 3.0 to 4.0. See the attached file and this script:

{code:python}
import pyarrow
import pyarrow.csv
import pyarrow.dataset

root = " [^test.csv] test.csv"
ds = pyarrow.dataset.dataset(root, format="csv")
fragments = ds.get_fragments()
fragment = next(fragments)
# Immediately errors in 3.0, hangs forever in 4.0
print(list(fragment.to_batches()))
{code}

> [C++][Dataset] ScanBatches() hangs if there's an error during scanning
> ----------------------------------------------------------------------
>
>                 Key: ARROW-12487
>                 URL: https://issues.apache.org/jira/browse/ARROW-12487
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 4.0.0
>            Reporter: David Li
>            Assignee: David Li
>            Priority: Major
>              Labels: dataset, datasets, pull-request-available
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Errors during scanning aren't properly reported, causing the iterator to hang forever.
> This affects ScanBatches() and anything built on top of it (Python to_batches, TakeRows, etc.).
> Verified on the 4.0.0 RC.
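
To illustrate the general class of bug described here, the sketch below shows a background producer feeding a blocking iterator. It is a generic Python illustration under stated assumptions, not Arrow's actual C++ implementation; the names `scan`, `produce`, `propagate_errors`, and `_DONE` are all hypothetical. If the producer fails without telling the consumer, the consumer waits on the queue forever; forwarding the exception through the queue lets the iterator raise instead.

```python
import queue
import threading

_DONE = object()  # sentinel marking a normal end of stream


def scan(produce, propagate_errors):
    """Yield items generated by produce() on a background thread.

    Hypothetical illustration only. With propagate_errors=False, a failure
    in produce() is swallowed and the consumer blocks forever on q.get().
    """
    q = queue.Queue()

    def worker():
        try:
            for item in produce():
                q.put(item)
            q.put(_DONE)
        except Exception as exc:
            if propagate_errors:
                q.put(exc)  # hand the failure to the consumer
            # Otherwise nothing more is enqueued: neither an error nor _DONE.

    threading.Thread(target=worker, daemon=True).start()

    while True:
        item = q.get()  # blocks indefinitely if the worker died silently
        if item is _DONE:
            return
        if isinstance(item, Exception):
            raise item
        yield item
```

Calling `list(scan(producer, propagate_errors=True))` with a producer that raises partway through surfaces the exception to the caller, whereas `propagate_errors=False` never returns in that case.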


