You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@arrow.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2020/02/18 17:39:00 UTC
[jira] [Updated] (ARROW-7809) [R] vignette does not run on Win 10
nor ubuntu
[ https://issues.apache.org/jira/browse/ARROW-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Antoine Pitrou updated ARROW-7809:
----------------------------------
Component/s: R
> [R] vignette does not run on Win 10 nor ubuntu
> ----------------------------------------------
>
> Key: ARROW-7809
> URL: https://issues.apache.org/jira/browse/ARROW-7809
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Reporter: Zhuo Jia Dai
> Priority: Major
>
> On Win10
> {code:java}
> bucket <- "https://ursa-labs-taxi-data.s3.us-east-2.amazonaws.com"
> dir.create("nyc-taxi")
> for (year in 2018:2018) {
> if(!dir.exists(glue::glue("nyc-taxi/
> {year}/"))) {
> dir.create(glue::glue("nyc-taxi/{year}
> /"))
> }
> for (month in 1:12) {
> if (month < 10)
> { month <- paste0("0", month) }
> if(!dir.exists(glue::glue("nyc-taxi/
> {year}/{month}"))) {
> dir.create(glue::glue("nyc-taxi/{year}
> /
> {month}
> "))
> }
> try(download.file(
> paste(bucket, year, month, "data.parquet", sep = "/"),
> file.path("nyc-taxi", year, month, "data.parquet")
> ))
> }
> }
> aa = arrow::open_dataset("nyc-taxi", partitioning = c("year", "month"))
> {code}
> gives error
>
> {code:java}
> Error in dataset___FSSFactory__Make3(filesystem, selector, format, partitioning) :
> IOError: Could not open parquet input source 'nyc-taxi/2018/01/data.parquet': Couldn't deserialize thrift: TProtocolException: Invalid data
> In addition: Warning message:
> {code}
> On Ubuntu, running
> {code:java}
> library(dplyr)ds = arrow::open_dataset("nyc-taxi", partitioning = c("year", "month"))
> system.time(ds %>%
> filter(total_amount > 100, year == 2015) %>%
> select(tip_amount, total_amount, passenger_count) %>%
> group_by(passenger_count) %>%
> collect() %>%
> summarize(
> tip_pct = median(100 * tip_amount / total_amount),
> n = n()
> ) %>%
> print())
> {code}
> gives the following segfault
> {code:java}
> *** caught segfault ***
> address (nil), cause 'memory not mapped'Traceback:
> 1: Table__to_dataframe(x, use_threads = option_use_threads())
> 2: as.data.frame.Table(scanner_builder$Finish()$ToTable())
> 3: as.data.frame(scanner_builder$Finish()$ToTable())
> 4: collect.arrow_dplyr_query(.)
> 5: collect(.)
> 6: function_list[[i]](value)
> 7: freduce(value, `_function_list`)
> 8: `_fseq`(`_lhs`)
> 9: eval(quote(`_fseq`(`_lhs`)), env, env)
> 10: eval(quote(`_fseq`(`_lhs`)), env, env)
> 11: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
> 12: ds %>% filter(total_amount > 100, year == 2015) %>% select(tip_amount, total_amount, passenger_count) %>% group_by(passenger_count) %>% collect() %>% summarize(tip_pct = median(100 * tip_amount/total_amount), n = n()) %>% print()
> 13: system.time(ds %>% filter(total_amount > 100, year == 2015) %>% select(tip_amount, total_amount, passenger_count) %>% group_by(passenger_count) %>% collect() %>% summarize(tip_pct = median(100 * tip_amount/total_amount), n = n()) %>% print())
> {code}
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)