Posted to dev@parquet.apache.org by "Renato Javier Marroquín Mogrovejo (JIRA)" <ji...@apache.org> on 2018/08/07 18:43:00 UTC

[jira] [Commented] (PARQUET-1372) [C++] Add an API to allow writing RowGroups based on their size rather than num_rows

    [ https://issues.apache.org/jira/browse/PARQUET-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16572142#comment-16572142 ] 

Renato Javier Marroquín Mogrovejo commented on PARQUET-1372:
------------------------------------------------------------

This would be a very useful feature, [~anatoli.shein]!
One quick clarifying question: are you planning to implement fixed row group sizes, i.e., all of them being, for example, 8MB? If so, how do you plan to deal with record boundaries: by cutting records off, or by making each row group approximately the configured size? Thanks!

> [C++] Add an API to allow writing RowGroups based on their size rather than num_rows
> ------------------------------------------------------------------------------------
>
>                 Key: PARQUET-1372
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1372
>             Project: Parquet
>          Issue Type: Task
>            Reporter: Anatoli Shein
>            Assignee: Anatoli Shein
>            Priority: Major
>             Fix For: 1.5.0
>
>
> The current API allows writing RowGroups with a specified number of rows, but it does not allow writing RowGroups of a specified size. To write RowGroups of a specified size, we need to write rows in chunks and check total_bytes_written after each chunk is written. This is currently impossible because the call to NextColumn() closes the current column writer.
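
For illustration only, here is a minimal sketch of the chunked approach the description asks for: write rows in chunks, check the running byte count after each chunk, and close the row group once a target size is reached. MockRowGroupWriter, WriteBySize, and all of their parameters are hypothetical stand-ins invented for this sketch; they are not the parquet-cpp API, and a fixed bytes-per-row figure is assumed to keep the arithmetic simple. Under these assumptions rows are never split, so each row group ends up approximately (not exactly) the configured size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a row group writer that can report how many
// bytes it has accumulated so far. This is NOT the parquet-cpp API.
class MockRowGroupWriter {
 public:
  // Append num_rows rows, each assumed to occupy bytes_per_row bytes.
  void WriteRows(int64_t num_rows, int64_t bytes_per_row) {
    num_rows_ += num_rows;
    total_bytes_written_ += num_rows * bytes_per_row;
  }
  int64_t total_bytes_written() const { return total_bytes_written_; }
  int64_t num_rows() const { return num_rows_; }

 private:
  int64_t num_rows_ = 0;
  int64_t total_bytes_written_ = 0;
};

// Splits total_rows rows into row groups, closing a group as soon as its
// byte count reaches target_bytes. Rows are written in chunks of
// chunk_rows, so a group may overshoot the target by less than one chunk,
// and no row is ever split across groups.
// Returns the number of rows placed in each row group.
std::vector<int64_t> WriteBySize(int64_t total_rows, int64_t bytes_per_row,
                                 int64_t chunk_rows, int64_t target_bytes) {
  assert(chunk_rows > 0 && bytes_per_row > 0 && target_bytes > 0);
  std::vector<int64_t> rows_per_group;
  MockRowGroupWriter group;
  int64_t remaining = total_rows;
  while (remaining > 0) {
    const int64_t n = std::min(chunk_rows, remaining);
    group.WriteRows(n, bytes_per_row);
    remaining -= n;
    // Check the size after every chunk and start a new group once the
    // target has been reached.
    if (group.total_bytes_written() >= target_bytes) {
      rows_per_group.push_back(group.num_rows());
      group = MockRowGroupWriter();
    }
  }
  // Flush the final, possibly undersized, row group.
  if (group.num_rows() > 0) rows_per_group.push_back(group.num_rows());
  return rows_per_group;
}
```

In this sketch the chunk size bounds how far a row group can exceed the target: smaller chunks give sizes closer to the configured value at the cost of more frequent size checks.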


