Posted to jira@arrow.apache.org by "Nicola Crane (Jira)" <ji...@apache.org> on 2021/12/20 10:32:00 UTC

[jira] [Commented] (ARROW-14653) [R] head() hangs on CSV datasets > 600MB

    [ https://issues.apache.org/jira/browse/ARROW-14653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462507#comment-17462507 ] 

Nicola Crane commented on ARROW-14653:
--------------------------------------

It looks like the async scanner is going to become the default in the C++ library soon anyway (see ARROW-13338), so I'm assuming it's the best option and have submitted a PR.

> [R] head() hangs on CSV datasets > 600MB
> ----------------------------------------
>
>                 Key: ARROW-14653
>                 URL: https://issues.apache.org/jira/browse/ARROW-14653
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Nicola Crane
>            Assignee: Nicola Crane
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 7.0.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> I'm calling {{head()}} on a dataset made up of CSV files.  I'm doing this because I want to preview my dataset before I try anything more computationally expensive with it.
> {code:r}
> library(arrow)
> library(dplyr)
> open_dataset("../../data/nyc-raw/", format = "csv") %>%
>   head(1) %>%
>   collect()
> {code}
> I have experimented with different combinations of files in the dataset folder, and it seems to work fine when the total file size is below roughly 600MB but hangs when it's above that.  This might not even be the actual issue, but I'm struggling to narrow it down beyond adding extra files to the equation.
> I've tried running with the C++ debugger attached, but again, it just hangs.
> The files I'm using are the 2020-2021 Yellow Taxi trip records available from: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
> A bit of investigation has shown me that I can load different subsets of the files fine, but when using all of them, the session hangs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)