Posted to jira@arrow.apache.org by "Nicola Crane (Jira)" <ji...@apache.org> on 2021/11/12 18:10:00 UTC

[jira] [Comment Edited] (ARROW-14653) [R] head() hangs on CSV datasets > 600MB

    [ https://issues.apache.org/jira/browse/ARROW-14653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17442890#comment-17442890 ] 

Nicola Crane edited comment on ARROW-14653 at 11/12/21, 6:09 PM:
-----------------------------------------------------------------

[~westonpace] Given it was me that found this when playing around with demo examples, and what you've said above about it likely getting resolved anyway, how about we just leave this as it is unless we find we have actual users affected by it? (Or if the proposed update to use the asynchronous scanner makes sense, do that.  Sorry - I don't fully understand!)


was (Author: thisisnic):
[~westonpace] Given it was me that found this when playing around with demo examples, and what you've said above about it likely getting resolved anyway, how about we just leave this as it is unless we find we have actual users affected by it?

> [R] head() hangs on CSV datasets > 600MB
> ----------------------------------------
>
>                 Key: ARROW-14653
>                 URL: https://issues.apache.org/jira/browse/ARROW-14653
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Nicola Crane
>            Priority: Major
>
> I'm calling {{head()}} on a dataset made up of CSV files.  I'm doing this because I want to preview my dataset before I try anything with it that's going to be more computationally expensive.
> {code:r}
> open_dataset("../../data/nyc-raw/", format = "csv") %>%
>   head(1) %>%
>   collect()
> {code}
> I have experimented with different combinations of files in the dataset folder, and it seems to work fine when the total file size is below roughly 600MB but hangs if it's above that.  This might not even be what the actual issue is, but I'm struggling to narrow it down beyond adding extra files to the equation.
> I've tried running with the C++ debugger attached, but again, it just hangs.
> The files I'm using are the 2020-2021 Yellow Taxi trip records available from: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
> A bit of investigation has shown me that I can load different subsets of the files fine, but when using all of them, the session hangs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)