Posted to issues@arrow.apache.org by "thisisnic (via GitHub)" <gi...@apache.org> on 2023/09/01 11:08:52 UTC

[GitHub] [arrow] thisisnic opened a new issue, #37513: [R] DecodeRowGroups segfault calling dplyr::glimpse() on a large dataset

thisisnic opened a new issue, #37513:
URL: https://github.com/apache/arrow/issues/37513

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   ```
   > library(arrow)
   > library(dplyr)
   > nyc_taxi <- open_dataset("~/data/nyc-taxi")
   > glimpse(nyc_taxi)
   FileSystemDataset with 122 Parquet files
   [New Thread 0x7fffe2ffd6c0 (LWP 214018)]
   [New Thread 0x7fffe21ff6c0 (LWP 214019)]
   1,155,795,912 rows x 24 columns
   [New Thread 0x7fff707ff6c0 (LWP 214020)]
   [New Thread 0x7fff67fff6c0 (LWP 214021)]
   [New Thread 0x7fff6d3ff6c0 (LWP 214022)]
   [New Thread 0x7fff6cbfe6c0 (LWP 214023)]
   [New Thread 0x7fff677fe6c0 (LWP 214024)]
   [New Thread 0x7fff66ffd6c0 (LWP 214025)]
   [New Thread 0x7fff667fc6c0 (LWP 214026)]
   [New Thread 0x7fff65ffb6c0 (LWP 214027)]
   [New Thread 0x7fff657fa6c0 (LWP 214028)]
   [New Thread 0x7fff64ff96c0 (LWP 214029)]
   [New Thread 0x7fff3ffff6c0 (LWP 214030)]
   [New Thread 0x7fff377fe6c0 (LWP 214031)]
   [New Thread 0x7fff3f7fe6c0 (LWP 214032)]
   [New Thread 0x7fff3effd6c0 (LWP 214033)]
   [New Thread 0x7fff3e7fc6c0 (LWP 214034)]
   [New Thread 0x7fff37fff6c0 (LWP 214035)]
   [New Thread 0x7fff18fff6c0 (LWP 214036)]
   [New Thread 0x7fff07fff6c0 (LWP 214037)]
   $ vendor_name             <string> "CMT", "CMT", "CMT", "CMT", "CMT", "CMT", "CM…
   $ pickup_datetime  <timestamp[ms]> 2012-01-20 14:09:36, 2012-01-20 14:54:10, 201…
   $ dropoff_datetime <timestamp[ms]> 2012-01-20 14:42:25, 2012-01-20 15:06:55, 201…
   $ passenger_count          <int64> 1, 1, 1, 1, 1, 2, 2, 3, 2, 1, 3, 1, 1, 2, 1, …
   $ trip_distance           <double> 11.10, 2.40, 0.50, 0.70, 0.70, 0.70, 2.60, 4.…
   $ pickup_longitude        <double> -74.00585, -73.98633, -73.98338, -73.98165, -…
   $ pickup_latitude         <double> 40.72679, 40.75760, 40.76664, 40.74691, 40.74…
   $ rate_code               <string> "Standard rate", "Standard rate", "Standard r…
   $ store_and_fwd           <string> "No", "No", "No", "No", "No", "No", "No", "No…
   $ dropoff_longitude       <double> -73.86435, -74.00577, -73.99028, -73.98935, -…
   $ dropoff_latitude        <double> 40.77001, 40.72636, 40.76496, 40.73708, 40.73…
   $ payment_type            <string> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
   $ fare_amount             <double> 29.7, 9.3, 4.1, 4.5, 4.5, 4.1, 9.7, 12.1, 5.3…
   $ extra                   <double> 0.0, 0.0, 0.0, 0.0, 0.5, 1.0, 0.5, 0.5, 0.5, …
   $ mta_tax                 <double> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, …
   $ tip_amount              <double> 6.04, 0.00, 1.38, 1.00, 0.00, 0.00, 0.00, 0.0…
   $ tolls_amount            <double> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
   $ total_amount            <double> 36.24, 9.80, 5.98, 6.00, 5.50, 5.60, 10.70, 1…
   $ improvement_surcharge   <double> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
   $ congestion_surcharge    <double> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
   $ pickup_location_id       <int64> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
   $ dropoff_location_id      <int64> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
   $ year                     <int32> 2012, 2012, 2012, 2012, 2012, 2012, 2012, 201…
   $ month                    <int32> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
   > 
   Thread 15 "R" received signal SIGSEGV, Segmentation fault.
   [Switching to Thread 0x7fff667fc6c0 (LWP 214026)]
   0x00007fffef85bd41 in parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr<parquet::arrow::(anonymous namespace)::FileReaderImpl>, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, arrow::internal::Executor*) ()
   
   ```
   
   ### Component(s)
   
   R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] paleolimbot commented on issue #37513: [R] DecodeRowGroups segfault calling dplyr::glimpse() on a large dataset

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.
paleolimbot commented on issue #37513:
URL: https://github.com/apache/arrow/issues/37513#issuecomment-1702687132

   I don't think it's related: to my knowledge, none of that code gets called when going through dplyr (it only applies to `read_parquet()` and variants).
   
   I don't have an initial thought on why this would happen. If you remind me where the instructions for getting the taxi dataset are, I can try to reproduce.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] thisisnic commented on issue #37513: [R] DecodeRowGroups segfault calling dplyr::glimpse() on a large dataset

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on issue #37513:
URL: https://github.com/apache/arrow/issues/37513#issuecomment-1702579656

   @paleolimbot Reckon this is related to the issues in #37274




[GitHub] [arrow] paleolimbot commented on issue #37513: [R][C++] DecodeRowGroups segfault calling dplyr::glimpse() or head() on a large dataset

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.
paleolimbot commented on issue #37513:
URL: https://github.com/apache/arrow/issues/37513#issuecomment-1706567978

   Thanks!
   
   For me this fails with:
   
   ```
   > library(arrow)
   Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
   
   Attaching package: ‘arrow’
   
   The following object is masked from ‘package:utils’:
   
       timestamp
   
   > for(i in 1:100){
   +   open_dataset("~/Desktop/nyc-taxi/") |>
   +     head()
   + }
   
    *** caught bus error ***
   address 0x910043fda9017bfd, cause 'invalid alignment'
   
    *** caught bus error ***
   address 0x910043fda9017bfd, cause 'invalid alignment'
   
   Traceback:
    1: dataset___Scanner__head(x, floor(class = n))
    2: head.Scanner(Scanner$create(x), n)
    3: head(Scanner$create(x), n)
    4: head.Dataset(open_dataset("~/Desktop/nyc-taxi/"))
    5: pairlist(srcfile = <environment>, class = "srcref")
   
   Possible actions:
   1: abort (with core dump, if enabled)
   2: normal R exit
   3: exit R without saving workspace
   4: exit R saving workspace
   
   Traceback:
    1: dataset___Scanner__head(x, floor(class = n))
    2: head.Scanner(Scanner$create(x), n)
    3: head(Scanner$create(x), n)
    4: head.Dataset(open_dataset("~/Desktop/nyc-taxi/"))
    5: 
    *** caught bus error ***
   address 0x910043fda9017bfd, cause 'invalid alignment'
   
    *** caught bus error ***
   
    *** caught bus error ***
   address 0x910043fda9017bfd, cause 'invalid alignment'
   
   Traceback:
   
    *** caught bus error ***
   address 0x910043fda9017bfd, cause 'invalid alignment'
   
   Traceback:
    1: dataset___Scanner__head(x, floor(n))
    2: head.Scanner(Scanner$create(x), n)
    3: head(Scanner$create(x), n)
    4: head.Dataset(open_dataset("~/Desktop/nyc-taxi/"))
   
   Traceback:
    1: head(open_dataset("~/Desktop/nyc-taxi/"))
   
   Possible actions:
   1: abort (with core dump, if enabled)
   2: normal R exit
   3: exit R without saving workspace
   4: exit R saving workspace
   
    *** caught bus error ***
   Selection: address 0x910043fda9017bfd, cause 'invalid alignment'
   Selection:  1: dataset___Scanner__head(x, floor(n))
    2: head.Scanner(Scanner$create(x), n)
    3: head(Scanner$create(x), n)
    4: head.Dataset(open_dataset("~/Desktop/nyc-taxi/"))
    5: dataset___Scanner__head(x, floor(n))
   Traceback:
   head(open_dataset("~/Desktop/nyc-taxi/"))
   
   Possible actions:
   1: abort (with core dump, if enabled)
   2: normal R exit
   3: exit R without saving workspace
   4: exit R saving workspace
   
   head(open_dataset("~/Desktop/nyc-taxi/"))
   
   Possible actions:
   1: abort (with core dump, if enabled)
   2: normal R exit
   3: exit R without saving workspace
   4: exit R saving workspace
    1: address 0x910043fda9017bfd, cause 'invalid alignment'
   dataset___Scanner__head(x, floor(n))Selection: 
    2: head.Scanner(Scanner$create(x), n)
    3: head(Scanner$create(x), n)
    4: head.Dataset(open_dataset("~/Desktop/nyc-taxi/"))
    5: head(open_dataset("~/Desktop/nyc-taxi/"))
   
   Possible actions:
   1: abort (with core dump, if enabled)
   2: normal R exit
   3: exit R without saving workspace
   4: exit R saving workspace
   Selection: 
   Traceback:
    1: dataset___Scanner__head(x, floor(n))
    2: head.Scanner(Scanner$create(x), n)
    3: head(Scanner$create(x), n)
    4: head.Dataset(open_dataset("~/Desktop/nyc-taxi/"))
    5: head(open_dataset("~/Desktop/nyc-taxi/"))
   
   Possible actions:
   1: abort (with core dump, if enabled)
   2: normal R exit
   3: exit R without saving workspace
   4: exit R saving workspace
   Selection:  2: head.Scanner(Scanner$create(x), n)
    3: head(Scanner$create(x), n)
    4: head.Dataset(open_dataset("~/Desktop/nyc-taxi/"))
    5: head(open_dataset("~/Desktop/nyc-taxi/"))
   
   Possible actions:
   1: abort (with core dump, if enabled)
   2: normal R exit
   3: exit R without saving workspace
   4: exit R saving workspace
   Selection: zsh: trace trap  R
   ```
   
   I think I know what might be happening (PR shortly!)




[GitHub] [arrow] thisisnic commented on issue #37513: [R] DecodeRowGroups segfault calling dplyr::glimpse() on a large dataset

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on issue #37513:
URL: https://github.com/apache/arrow/issues/37513#issuecomment-1702756063

   Thanks!  Something like this will do it:
   ```
   open_dataset("s3://voltrondata-labs-datasets/nyc-taxi") |>
     filter(year %in% 2012:2021) |>
     write_dataset("nyc-taxi", partitioning = c("year", "month"))
   ```




[GitHub] [arrow] thisisnic commented on issue #37513: [R] DecodeRowGroups segfault calling dplyr::glimpse() on a large dataset

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on issue #37513:
URL: https://github.com/apache/arrow/issues/37513#issuecomment-1705387285

   I've got a super-simple reprex now, though unfortunately it requires a large dataset to reproduce it:
   
   ```
   library(arrow)
   for(i in 1:100){
     open_dataset("~/data/nyc-taxi/") |>
       head()
   }
   ```
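   
   Because a crash like this takes down the whole R session, one option is to run the loop in a separate process and look only at its exit status. The following is a minimal sketch, not something from the original report: `run_in_child()` is a hypothetical helper built on base R's `system2()`, and it assumes an `Rscript` binary is available under `R.home("bin")`. It only reports whether the child exited cleanly; it does nothing to diagnose the crash itself.
   
   ```r
   # Hypothetical helper (not part of arrow): run R code in a fresh Rscript
   # process and return its exit status, so a crash cannot kill this session.
   run_in_child <- function(code) {
     rscript <- file.path(R.home("bin"), "Rscript")
     # With stdout/stderr discarded, system2() returns the child's exit status.
     system2(rscript, c("-e", shQuote(code)), stdout = FALSE, stderr = FALSE)
   }
   
   # The loop above as a one-line string; the dataset path is the one used in
   # this thread and must point at a local copy of the data.
   reprex <- 'library(arrow); for (i in 1:100) head(open_dataset("~/data/nyc-taxi/"))'
   
   # Left commented out because it needs the large local dataset:
   # status <- run_in_child(reprex)
   # A non-zero status would mean the child process did not exit cleanly.
   ```
   
   A non-zero status alone does not distinguish a segfault from an ordinary R error (for example, a missing dataset path), so the child's output would still need to be inspected in that case.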




[GitHub] [arrow] thisisnic commented on issue #37513: [R] DecodeRowGroups segfault calling dplyr::glimpse() on a large dataset

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on issue #37513:
URL: https://github.com/apache/arrow/issues/37513#issuecomment-1702593572

   I've tried to run it on that branch and thought it was all good, but after running `dplyr::glimpse()` and `head()` alternately a few times, I got another segfault:
   
   ```
   Thread 17 "R" received signal SIGSEGV, Segmentation fault.
   [Switching to Thread 0x7fff43fff6c0 (LWP 217159)]
   0x00007fffe949863b in arrow::internal::Executor::Submit<parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr<parquet::arrow::(anonymous namespace)::FileReaderImpl>, const std::vector<int>&, const std::vector<int>&, arrow::internal::Executor*)::<lambda(size_t, std::shared_ptr<parquet::arrow::ColumnReaderImpl>)>&, long unsigned int&, std::shared_ptr<parquet::arrow::ColumnReaderImpl> >(arrow::internal::TaskHints, arrow::StopToken, struct {...} &) (this=0x7fffe99ca988 <vtable for parquet::DataPage+16>, hints=..., stop_token=..., func=...)
       at /home/nic/arrow/cpp/src/arrow/util/thread_pool.h:161
   161	    ARROW_RETURN_NOT_OK(SpawnReal(hints, std::move(task), std::move(stop_token),
   (gdb) 
   
   ```




[GitHub] [arrow] paleolimbot commented on issue #37513: [R][C++] DecodeRowGroups segfault calling dplyr::glimpse() or head() on a large dataset

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.
paleolimbot commented on issue #37513:
URL: https://github.com/apache/arrow/issues/37513#issuecomment-1706971749

   (That PR didn't fix it!)




Re: [I] [R][C++] DecodeRowGroups segfault calling dplyr::glimpse() or head() on a large dataset [arrow]

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on issue #37513:
URL: https://github.com/apache/arrow/issues/37513#issuecomment-1752698746

   @paleolimbot @amoeba I've downgraded this from blocker to critical, as it only occurs when things are run in quick succession. That's pretty bad, but probably not enough to hold up the release.

