You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@arrow.apache.org by "Tanya Schlusser (JIRA)" <ji...@apache.org> on 2018/12/21 00:31:00 UTC
[jira] [Commented] (ARROW-3020) [Python] Addition of option to allow empty Parquet row groups

    [ https://issues.apache.org/jira/browse/ARROW-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16726349#comment-16726349 ] 

Tanya Schlusser commented on ARROW-3020:
----------------------------------------

I looked into this and do not believe the Parquet code permits this at the moment despite the comment in the OP's hyperlink saying they thought it did. pyarrow's {{ParquetWriter}} eventually uses this {{FileWriter}} class, and here's the current code (also [linked here|https://github.com/apache/parquet-cpp/blob/642da055adf009652689b20e68a198cffb857651/src/parquet/arrow/writer.cc#L1110-L1129]). 

{code:title=from "parquet/arrow/writer.h"}
Status FileWriter::WriteTable(const Table& table, int64_t chunk_size) {
  if (chunk_size <= 0) {
    return Status::Invalid("chunk size per row_group must be greater than 0");
  } else if (chunk_size > impl_->properties().max_row_group_length()) {
    chunk_size = impl_->properties().max_row_group_length();
  }


  for (int chunk = 0; chunk * chunk_size < table.num_rows(); chunk++) {
    int64_t offset = chunk * chunk_size;
    int64_t size = std::min(chunk_size, table.num_rows() - offset);


    RETURN_NOT_OK_ELSE(NewRowGroup(size), PARQUET_IGNORE_NOT_OK(Close()));
    for (int i = 0; i < table.num_columns(); i++) {
      auto chunked_data = table.column(i)->data();
      RETURN_NOT_OK_ELSE(WriteColumnChunk(chunked_data, offset, size),
                         PARQUET_IGNORE_NOT_OK(Close()));
    }
  }
  return Status::OK();
}
{code}

> [Python] Addition of option to allow empty Parquet row groups
> -------------------------------------------------------------
>
>                 Key: ARROW-3020
>                 URL: https://issues.apache.org/jira/browse/ARROW-3020
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++, Python
>            Reporter: Alex Mendelson
>            Priority: Major
>              Labels: parquet
>             Fix For: 0.12.0
>
>
> While our use case is not common, I was able to find one related request from roughly a year ago. Could this be added as a feature?
> https://issues.apache.org/jira/browse/PARQUET-1047
> *Motivation*
> We have an application where each row is associated with one of N contexts, though a minority of contexts may have no associated rows. When encountering the Nth context, we will wish to retrieve all the associated rows. Row groups would provide a natural way to index the data, as the nth context could naturally relate to the nth row group.
> Unfortunately, this is not possible at the present time, as pyarrow does not support writing empty row groups. If one writes a pyarrow.Table containing zero rows using pyarrow.parquet.ParquetWriter, it is omitted from the final file, and this distorts the indexing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)