Posted to github@arrow.apache.org by "mapleFU (via GitHub)" <gi...@apache.org> on 2023/06/25 09:03:21 UTC

[GitHub] [arrow] mapleFU opened a new pull request, #36286: GH-36280: [Python][Parquet] Export C++ WriteRecordBatch in Python API

mapleFU opened a new pull request, #36286:
URL: https://github.com/apache/arrow/pull/36286

   <!--
   Thanks for opening a pull request!
   If this is your first pull request you can find detailed information on how 
   to contribute here:
     * [New Contributor's Guide](https://arrow.apache.org/docs/dev/developers/guide/step_by_step/pr_lifecycle.html#reviews-and-merge-of-the-pull-request)
     * [Contributing Overview](https://arrow.apache.org/docs/dev/developers/overview.html)
   
   
   If this is not a [minor PR](https://github.com/apache/arrow/blob/main/CONTRIBUTING.md#Minor-Fixes), could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose
   
   Opening GitHub issues ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
   
   Then could you also rename the pull request title in the following format?
   
       GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   or
   
       MINOR: [${COMPONENT}] ${SUMMARY}
   
   In the case of PARQUET issues on JIRA the title also supports:
   
       PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   -->
   
   ### Rationale for this change
   
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   Export the C++ `WriteRecordBatch` in the Python API to allow memory savings
   
   ### What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   * [x] Add `write_record_batch` in pyx
   * [ ] Export it in the `ParquetWriter` API
   
   ### Are these changes tested?
   
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   
   * [ ] Will do
   
   ### Are there any user-facing changes?
   
   Yes: users will be able to call the new API.
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please uncomment the line below and explain which changes are breaking.
   -->
   <!-- **This PR includes breaking changes to public APIs.** -->
   
   <!--
   Please uncomment the line below (and provide explanation) if the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld). We use this to highlight fixes to issues that may affect users without their knowledge. For this reason, fixing bugs that cause errors don't count, since those are usually obvious.
   -->
   <!-- **This PR contains a "Critical Fix".** -->




[GitHub] [arrow] mapleFU commented on pull request #36286: GH-36280: [Python][Parquet] Export C++ WriteRecordBatch in Python API

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #36286:
URL: https://github.com/apache/arrow/pull/36286#issuecomment-1609882455

   > Can you explain what memory savings you're talking about? Can you show an example?
   
   When a user writes rows to a Parquet table one by one, we usually batch the rows, convert each batch to Arrow, and write the Arrow batch to Parquet; when the file is large enough, we close the writer and upload the file to OSS. When writing to Parquet this way, `parquet::arrow::WriteTable` is ultimately called, and it finishes a RowGroup on every call. So our files end up with hundreds or even thousands of RowGroups, which makes the metadata grow huge.
   
   This patch exports `parquet::arrow::FileWriter::WriteRecordBatch`, which buffers the input within one RowGroup. Compared to buffering the records directly in memory or in an Arrow file, buffering into Parquet might be more straightforward @pitrou 
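   
   For context, a minimal pyarrow sketch of the write pattern described above (file name, schema, and batch size are illustrative), showing how one `write_table` call per batch yields one row group per batch:
   
   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   schema = pa.schema([("id", pa.int64()), ("value", pa.float64())])
   
   with pq.ParquetWriter("out.parquet", schema) as writer:
       for i in range(100):  # e.g. one small batch of 2000 rows at a time
           batch = pa.record_batch(
               [pa.array(range(i * 2000, (i + 1) * 2000)),
                pa.array([float(v) for v in range(2000)])],
               schema=schema,
           )
           # write_table ends up in parquet::arrow::WriteTable, which
           # finishes at least one row group per call, so this file
           # ends up with ~100 row groups.
           writer.write_table(pa.Table.from_batches([batch]))
   
   print(pq.ParquetFile("out.parquet").metadata.num_row_groups)  # 100
   ```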




[GitHub] [arrow] pitrou commented on pull request #36286: GH-36280: [Python][Parquet] Export C++ WriteRecordBatch in Python API

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #36286:
URL: https://github.com/apache/arrow/pull/36286#issuecomment-1609936769

   > So our files end up with hundreds or even thousands of RowGroups, which makes the metadata grow huge.
   
   Hundreds of row groups for which file size?
   
   > This patch exports `parquet::arrow::FileWriter::WriteRecordBatch`, which buffers the input within one RowGroup. Compared to buffering the records directly in memory or in an Arrow file, buffering into Parquet might be more straightforward @pitrou
   
   But does it increase performance? Decrease file size? By how much?




[GitHub] [arrow] mapleFU commented on pull request #36286: GH-36280: [Python][Parquet] Export C++ WriteRecordBatch in Python API

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #36286:
URL: https://github.com/apache/arrow/pull/36286#issuecomment-1611458030

   @jorisvandenbossche 
   
   ```c++
     Status NewRowGroup(int64_t chunk_size) override {
       // Close the previous row group, if any, before opening a new one.
       if (row_group_writer_ != nullptr) {
         PARQUET_CATCH_NOT_OK(row_group_writer_->Close());
       }
       // Append a fresh, non-buffered row group writer to the file.
       PARQUET_CATCH_NOT_OK(row_group_writer_ = writer_->AppendRowGroup());
       return Status::OK();
     }
   ```
   
   `parquet::arrow::FileWriterImpl::WriteTable` calls `NewRowGroup` for every chunk: it first closes the previous RowGroup, then splits the input table into chunks according to the user's row-group size, and calls `AppendRowGroup` to create a non-buffered row-group writer for each chunk.
   
   `parquet::arrow::FileWriterImpl::WriteRecordBatch` closes the previous row group only if it is not a "buffered" row group, then appends the `RecordBatch` to a buffered row group.
   
   So:
   
   1. If only `WriteTable` is called, every call creates **at least one** new row group.
   2. If only `WriteRecordBatch` is called, a call does not necessarily create a new row group.
   3. If `WriteRecordBatch` is called after `WriteTable`, it finds that there is no in-flight row group, so it creates a buffered row group and writes to it.
   4. If `WriteTable` is called after `WriteRecordBatch`, it finds that there is an in-flight row group, so it closes the buffered row group, creates a non-buffered one, and writes to it (see the sketch below).
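   
   To make these cases concrete, here is a hypothetical pyarrow sketch. It assumes this PR ends up exposing the C++ method as `ParquetWriter.write_record_batch` (that method name is an assumption; today only `write_table` exists on the Python side), and the file name is illustrative:
   
   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   schema = pa.schema([("x", pa.int64())])
   batch = pa.record_batch([pa.array([1, 2, 3])], schema=schema)
   table = pa.Table.from_batches([batch])
   
   with pq.ParquetWriter("mixed.parquet", schema) as writer:
       writer.write_table(table)         # case 1: closes its own row group
       writer.write_record_batch(batch)  # case 3 (hypothetical API): opens a buffered row group
       writer.write_record_batch(batch)  # case 2: appends to the same buffered row group
       writer.write_table(table)         # case 4: closes the buffered row group,
                                         # then writes into a fresh non-buffered one
   
   # Per the semantics above we would expect 3 row groups: one from the
   # first write_table, one buffered group holding both batches, and one
   # from the last write_table.
   print(pq.ParquetFile("mixed.parquet").metadata.num_row_groups)
   ```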




[GitHub] [arrow] mapleFU commented on pull request #36286: GH-36280: [Python][Parquet] Export C++ WriteRecordBatch in Python API

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #36286:
URL: https://github.com/apache/arrow/pull/36286#issuecomment-1607482752

   Hi, I'd like to add an interface in Python; previously the Python code just used `write_table`:
   
   ```python
       def write_batch(self, batch, row_group_size=None):
           """
           Write RecordBatch to the Parquet file.
   
           Parameters
           ----------
           batch : RecordBatch
           row_group_size : int, default None
               Maximum number of rows in written row group. If None, the
               row group size will be the minimum of the RecordBatch
               size and 1024 * 1024.  If set larger than 64Mi then 64Mi
               will be used instead.
           """
           table = pa.Table.from_batches([batch], batch.schema)
           self.write_table(table, row_group_size)
   ```
   
   Should I just change it to `write_record_batch`, or add an argument to control this and avoid breaking the previous behavior? @pitrou @jorisvandenbossche 
   
   




[GitHub] [arrow] jorisvandenbossche commented on pull request #36286: GH-36280: [Python][Parquet] Export C++ WriteRecordBatch in Python API

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on PR #36286:
URL: https://github.com/apache/arrow/pull/36286#issuecomment-1611332862

   > Should I just change it to `write_record_batch`, or add an argument to control this and avoid breaking the previous behavior?
   
   I think we could also consider just "breaking" the current behaviour. It would still honor the max `row_group_size`; it just would no longer create one row group per batch that you write.
   
   Or only do this if `row_group_size` keyword is None.
   
   What happens if you do a `write_table` after a `write_batch` if the `write_batch` didn't yet finalize the current row group?
   




[GitHub] [arrow] pitrou commented on pull request #36286: GH-36280: [Python][Parquet] Export C++ WriteRecordBatch in Python API

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #36286:
URL: https://github.com/apache/arrow/pull/36286#issuecomment-1611501785

   Thanks for the explanation @mapleFU . Perhaps we can add an optional argument to `parquet::arrow::FileWriterImpl::WriteTable` to tell it whether to act more like `WriteRecordBatch`?
   
   Something like:
   ```c++
     /// \brief Write a Table to Parquet.
     ///
     /// If `use_buffering` is false, then any pending row group is closed
     /// at the beginning and at the end of this call.
     /// If `use_buffering` is true, this function reuses an existing
     /// buffered row group until the chunk size is met, and leaves
     /// the last row group open for further writes.
     /// It is recommended to set `use_buffering` to true to minimize
     /// the number of row groups, especially when calling `WriteTable`
     /// with small tables.
     ///
     /// \param table Arrow table to write.
     /// \param chunk_size maximum number of rows to write per row group.
     /// \param use_buffering Whether to potentially buffer data.
     virtual ::arrow::Status WriteTable(
         const ::arrow::Table& table, int64_t chunk_size = DEFAULT_MAX_ROW_GROUP_LENGTH,
         bool use_buffering = false) = 0;
   ```
   
   




[GitHub] [arrow] mapleFU commented on pull request #36286: GH-36280: [Python][Parquet] Export C++ WriteRecordBatch in Python API

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #36286:
URL: https://github.com/apache/arrow/pull/36286#issuecomment-1610598237

   I have a case: for example, we batch every 2000 rows and write a file with about 200,000 rows. The metadata ends up like:
   
   ```
     "TotalRows": "205434",
     "NumberOfRowGroups": "103",
     "NumberOfRealColumns": "97",
     "NumberOfColumns": "97",
   ```
   
   The file is 3.5 MiB; after changing to one row group, the file can be just 450 KB. The large number of row groups also makes reading the file slower.
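   
   For reference, these numbers can be checked from Python via the file metadata (the path is illustrative):
   
   ```python
   import pyarrow.parquet as pq
   
   md = pq.ParquetFile("data.parquet").metadata
   print(md.num_rows)         # e.g. 205434
   print(md.num_row_groups)   # e.g. 103
   print(md.num_columns)      # e.g. 97
   print(md.serialized_size)  # size of the Thrift-encoded footer, in bytes
   ```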




[GitHub] [arrow] mapleFU commented on pull request #36286: GH-36280: [Python][Parquet] Export C++ WriteRecordBatch in Python API

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #36286:
URL: https://github.com/apache/arrow/pull/36286#issuecomment-1610573613

   @pitrou When the RowGroup count grows, there are two problems:
   1. The size of the metadata in a file grows large, especially when there are lots of columns.
   2. During reading, `RowGroup(idx)` is called many times, which makes reading the file slow (see the sketch below).
   
   At the very least, I just want a `write_record_batch_without_flush_rowgroup` to make writing to Parquet more lightweight.
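   
   To illustrate that read-side cost: reading such a file row group by row group means one call per group, so the per-row-group overhead is paid ~100 times for the file above. A small pyarrow sketch (the path is illustrative):
   
   ```python
   import pyarrow.parquet as pq
   
   pf = pq.ParquetFile("data.parquet")
   # One read per row group; with 103 row groups this loop (and the
   # underlying RowGroup(idx) lookup) runs 103 times.
   for i in range(pf.metadata.num_row_groups):
       table = pf.read_row_group(i)
       # ... process `table` ...
   ```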




[GitHub] [arrow] jorisvandenbossche commented on pull request #36286: GH-36280: [Python][Parquet] Export C++ WriteRecordBatch in Python API

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on PR #36286:
URL: https://github.com/apache/arrow/pull/36286#issuecomment-1611713374

   Yes, that is what I suggested above as well: just change the behaviour, at least for the current `write_batch`.




[GitHub] [arrow] pitrou commented on pull request #36286: GH-36280: [Python][Parquet] Export C++ WriteRecordBatch in Python API

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #36286:
URL: https://github.com/apache/arrow/pull/36286#issuecomment-1609871912

   > Export the C++ `WriteRecordBatch` in the Python API to allow memory savings
   
   Can you explain what memory savings you're talking about? Can you show an example?
   




[GitHub] [arrow] mapleFU commented on pull request #36286: GH-36280: [Python][Parquet] Export C++ WriteRecordBatch in Python API

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #36286:
URL: https://github.com/apache/arrow/pull/36286#issuecomment-1611774887

   I'll add a `WriteRecordBatch` with a row-group size hint, and then expose it on the Python side.




[GitHub] [arrow] mapleFU commented on pull request #36286: GH-36280: [Python][Parquet] Export C++ WriteRecordBatch in Python API

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #36286:
URL: https://github.com/apache/arrow/pull/36286#issuecomment-1611512743

   I guess we can, but what would the Python interface look like? Previously I wanted to add `write_record_batch` into `write_batch`, or to add a new interface here.




[GitHub] [arrow] pitrou commented on pull request #36286: GH-36280: [Python][Parquet] Export C++ WriteRecordBatch in Python API

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #36286:
URL: https://github.com/apache/arrow/pull/36286#issuecomment-1611515736

   Perhaps we can simply change the semantics on the Python side? I don't think Python users expect as much fine-grained control as C++ users. @jorisvandenbossche What do you think?




[GitHub] [arrow] github-actions[bot] commented on pull request #36286: GH-36280: [Python][Parquet] Export C++ WriteRecordBatch in Python API

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #36286:
URL: https://github.com/apache/arrow/pull/36286#issuecomment-1605956447

   :warning: GitHub issue #36280 **has been automatically assigned in GitHub** to PR creator.

