Posted to jira@arrow.apache.org by "David Li (Jira)" <ji...@apache.org> on 2021/12/20 13:20:00 UTC

[jira] [Comment Edited] (ARROW-15151) write_dataset() never increments {i} in partitions part-{i}

    [ https://issues.apache.org/jira/browse/ARROW-15151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462604#comment-17462604 ] 

David Li edited comment on ARROW-15151 at 12/20/21, 1:19 PM:
-------------------------------------------------------------

CC [~westonpace] in case he has something to add, but I think there are two things going on here:
 * 6.0.x changed the behavior so that the {i} counter is reset for each partition directory; you will still see the counter increment if a single partition produces multiple files. So the documentation there needs updating.
 * The new dataset writer in 6.0.x has options to limit the maximum file size (which would get you multiple files per partition), but those options are not yet exposed to R. ARROW-13703 and its child tasks should fix that.

So what you want should be possible, it just needs some plumbing to get everything exposed (and sorry for the misleading docs).
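To make the first point concrete, here is a small R sketch (using the same mtcars example as the docs, against arrow 6.0.x) of the per-partition counter reset described above:

```r
library(arrow)

# Partition mtcars by two columns. In 6.0.x the {i} counter in
# part-{i}.parquet is reset per partition directory, so for data this
# small every leaf directory contains a single part-0.parquet; {i}
# would only increment if one partition produced multiple files.
two_levels_tree <- tempfile()
write_dataset(mtcars, two_levels_tree, partitioning = c("cyl", "gear"))
list.files(two_levels_tree, recursive = TRUE)
# e.g. "cyl=4/gear=3/part-0.parquet", "cyl=4/gear=4/part-0.parquet", ...
```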


>  write_dataset() never increments {i} in partitions part-{i}
> ------------------------------------------------------------
>
>                 Key: ARROW-15151
>                 URL: https://issues.apache.org/jira/browse/ARROW-15151
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 6.0.1
>         Environment: Ubuntu 21.04
>            Reporter: Carl Boettiger
>            Priority: Major
>
> Introducing partitioning in write_dataset() creates subfolders just fine, but the lowest-level subfolder only ever contains a part-0.parquet.  I don't see how to get write_dataset() to generate multiple part filenames in a single directory (part-0.parquet, part-1.parquet, etc.).  For example, the documentation for open_dataset() implies we should get three `Z`-level parts:
> {code:r}
> # You can also partition by the values in multiple columns
> # (here: "cyl" and "gear").
> # This creates a structure of the form cyl=X/gear=Y/part-Z.parquet.
> two_levels_tree <- tempfile()
> write_dataset(mtcars, two_levels_tree, partitioning = c("cyl", "gear"))
> list.files(two_levels_tree, recursive = TRUE)
> # In the two previous examples we would have:
> # X = {4,6,8}, the number of cylinders.
> # Y = {3,4,5}, the number of forward gears.
> # Z = {0,1,2}, the number of saved parts, starting from 0.
> {code}
> But I only get the expected structure with part-0.parquet files.
>
> Context: I frequently need to partition large files that lack any natural grouping variable; I merely want a bunch of small parts of equal size.  It would be great if there were an automatic way of doing this.  Currently I can hack around it by creating a partition column with integers 1...n, where n is my desired number of partitions, and partitioning on that.  I'd then like to write these to a flat structure with part-0.parquet, part-1.parquet, etc., not a nested folder structure, if possible.
> (Or better yet, it would be amazing if write_dataset() simply let us set a maximum partition file size and automated the sharding into parts, while preserving the existing behavior for actually semantically meaningful groups.  Maybe that is already the intent, but I cannot see how to activate it!)
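The manual-sharding workaround described in the report can be sketched in R as follows (a sketch only: the shard column name is invented for illustration, and the result is still a nested directory per shard rather than a flat set of part files):

```r
library(arrow)

# Workaround sketch: add an integer shard column and partition on it
# to split the data into n roughly equal parts.
n <- 4
df <- mtcars
df$shard <- rep_len(seq_len(n) - 1, nrow(df))  # 0, 1, 2, 3, 0, 1, ...

out <- tempfile()
write_dataset(df, out, partitioning = "shard")
list.files(out, recursive = TRUE)
# e.g. "shard=0/part-0.parquet", "shard=1/part-0.parquet", ...
```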



--
This message was sent by Atlassian Jira
(v8.20.1#820001)