Posted to jira@arrow.apache.org by "Ian Cook (Jira)" <ji...@apache.org> on 2021/09/02 20:11:00 UTC
[jira] [Commented] (ARROW-13860) [R] arrow 5.0.0 write_parquet throws error writing grouped data.frame
[ https://issues.apache.org/jira/browse/ARROW-13860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17409102#comment-17409102 ]
Ian Cook commented on ARROW-13860:
----------------------------------
Thanks for the report!
I dug into this and observed that it is happening because
{code:java}
write_parquet(x, ...) {code}
calls
{code:java}
x <- Table$create(x){code}
which changes {{x}} into an {{arrow_dplyr_query}} because {{x}} has groups.
Then it calls
{code:java}
is_writable_table(x){code}
which triggers an error because {{x}} does not inherit from {{data.frame}} or {{ArrowTabular}}.
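To illustrate the kind of class check involved, here is a minimal base R sketch. The function {{check_writable()}} and the mock object are hypothetical stand-ins written for this illustration, not the arrow package's actual source; only the class names and the error wording are taken from this issue.

```r
# Hypothetical stand-in for the class check described above;
# this is NOT the arrow package's actual implementation.
check_writable <- function(x) {
  if (!inherits(x, c("data.frame", "ArrowTabular"))) {
    stop(
      "x must be an object of class 'data.frame', 'RecordBatch', or 'Table', not '",
      class(x)[[1]], "'",
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# Mock object that carries only the class name of the query object.
mock_query <- structure(list(), class = "arrow_dplyr_query")

# A plain data.frame passes the check.
ok <- check_writable(mtcars)

# The mock query object fails it; capture the message instead of aborting.
msg <- tryCatch(
  check_writable(mock_query),
  error = function(e) conditionMessage(e)
)
print(msg)
```

The point is only that an object whose class vector contains neither {{data.frame}} nor {{ArrowTabular}} cannot pass an {{inherits()}}-based check, whatever data it holds.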
In version 4.0.0 of the arrow package, this did not trigger an error because the {{is_writable_table(x)}} function did not exist. It was introduced in #10387: [https://github.com/apache/arrow/commit/2e3a25e5f1329929e0fdb88ecc76bf404a5ccf57#diff-f6235d4767fc4a7ee1bb726d816b9742ef0bc07503dceb678fd3bc55ee15454b]
But I am confused: before ARROW-11769, I thought groups were lost when a grouped R data.frame was converted to a {{Table}}. So how is it that in the example above, the groups were seemingly written to the Parquet file and read back in? Didn't we always call {{Table$create()}} on the input to {{write_parquet()}}, so shouldn't the groups have been lost?
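One way to probe this on a given installation is to look at what {{Table$create()}} returns for a grouped data.frame. The sketch below assumes the arrow and dplyr packages are installed; the helper name {{inspect_grouped_conversion()}} is made up for this illustration. The result depends on the installed arrow version, so no expected output is shown.

```r
# Sketch: inspect what Table$create() returns for a grouped data.frame.
# inspect_grouped_conversion() is a hypothetical helper, not part of arrow.
inspect_grouped_conversion <- function() {
  # Skip quietly when the required packages are not available.
  if (!requireNamespace("arrow", quietly = TRUE) ||
      !requireNamespace("dplyr", quietly = TRUE)) {
    message("arrow and dplyr are needed for this check")
    return(invisible(NULL))
  }
  grouped <- dplyr::group_by(mtcars, am)
  converted <- arrow::Table$create(grouped)
  list(
    input_groups = dplyr::group_vars(grouped),  # grouping columns of the input
    converted_class = class(converted)          # class of whatever came back
  )
}

res <- inspect_grouped_conversion()
print(res)
```

Running this under both 4.0.1 and 5.0.0 and comparing {{converted_class}} would show whether the conversion itself changed between the two versions.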
cc [~jonkeane] [~thisisnic]
> [R] arrow 5.0.0 write_parquet throws error writing grouped data.frame
> ---------------------------------------------------------------------
>
> Key: ARROW-13860
> URL: https://issues.apache.org/jira/browse/ARROW-13860
> Project: Apache Arrow
> Issue Type: Bug
> Environment: macOS 11.1 Big Sur
> Reporter: Hideaki Hayashi
> Priority: Major
>
> arrow 5.0.0 write_parquet throws error writing grouped data.frame.
> Here is how to reproduce it.
> {{library(dplyr)}}
> {{arrow::write_parquet(mtcars %>% group_by(am),"/tmp/mtcars_test.parquet")}}
> {{# Error: x must be an object of class 'data.frame', 'RecordBatch', or 'Table', not 'arrow_dplyr_query'.}}
>
> With arrow 4.0.1, this used to work fine.
> {{library(dplyr)}}
> {{arrow::write_parquet(mtcars %>% group_by(am),"/tmp/mtcars_test.parquet")}}
> {{x <- arrow::read_parquet("/tmp/mtcars_test.parquet")}}
> {{x}}
> {{# A tibble: 32 x 11}}
> {{# Groups: am [2]}}
> {{#      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb}}
> {{#  * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>}}
> {{#  1    21     6   160   110   3.9  2.62  16.5     0     1     4     4}}
> {{#  2    21     6   160   110   3.9  2.88  17.0     0     1     4     4}}
> {{#  3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1}}
> {{#  4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1}}
> {{#  5  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2}}
> {{#  6  18.1     6   225   105  2.76  3.46  20.2     1     0     3     1}}
> {{#  7  14.3     8   360   245  3.21  3.57  15.8     0     0     3     4}}
> {{# …}}
>
>