Posted to issues@arrow.apache.org by "Nicola Crane (Jira)" <ji...@apache.org> on 2021/11/10 10:35:00 UTC

[jira] [Created] (ARROW-14653) [R] head() hangs on CSV datasets > 600MB

Nicola Crane created ARROW-14653:
------------------------------------

             Summary: [R] head() hangs on CSV datasets > 600MB
                 Key: ARROW-14653
                 URL: https://issues.apache.org/jira/browse/ARROW-14653
             Project: Apache Arrow
          Issue Type: Improvement
          Components: R
            Reporter: Nicola Crane


I'm calling {{head()}} on a dataset made up of CSV files.  I'm doing this because I want to preview the dataset before doing anything more computationally expensive with it.

{code:r}
open_dataset("../../data/nyc-raw/", format = "csv") %>%
  head(1) %>%
  collect()
{code}

I have experimented with different combinations of files in the dataset folder. It seems to work fine when the total file size is below roughly 600MB but hangs when it's above that.  This might not be the actual issue, but I'm struggling to narrow it down beyond adding extra files to the equation.

I've tried running with the C++ debugger attached, but again, it just hangs.

The files I'm using are the 2020-2021 Yellow Taxi trip records available from: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page

A bit of investigation has shown that I can load different subsets of the files fine, but when I use all of them, the session hangs.
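
One way the subset testing could be made systematic is sketched below. This is only an illustration, not the exact steps used above: {{csv_sizes()}} is a hypothetical helper, and the sketch assumes the files end in {{.csv}} and that {{open_dataset()}} is given an explicit vector of file paths rather than the directory. Files are added one at a time, smallest first, and the running total is printed before each call, so the last message shown identifies the combination that hangs.

{code:r}
library(arrow)
library(dplyr)

# Hypothetical helper (not part of arrow): list the CSV files in a
# directory, smallest first, with each size and the running total in MB.
csv_sizes <- function(dir) {
  files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  sizes <- file.size(files)
  ord <- order(sizes)
  data.frame(
    file = files[ord],
    size_mb = sizes[ord] / 1e6,
    cumulative_mb = cumsum(sizes[ord]) / 1e6,
    stringsAsFactors = FALSE
  )
}

sizes <- csv_sizes("../../data/nyc-raw/")

# Grow the dataset one file at a time. The message is emitted before
# the call, so if the session hangs, the last line printed shows how
# many files and how many MB were involved.
for (i in seq_len(nrow(sizes))) {
  message(i, " file(s), ", round(sizes$cumulative_mb[i]), " MB total")
  open_dataset(sizes$file[seq_len(i)], format = "csv") %>%
    head(1) %>%
    collect()
}
{code}

If the hang always appears at about the same cumulative size regardless of file order, that would point to total size; if it follows one particular file, that would point to the file's contents instead.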



--
This message was sent by Atlassian Jira
(v8.20.1#820001)