You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "candiduslynx (via GitHub)" <gi...@apache.org> on 2023/06/19 11:00:47 UTC

[GitHub] [arrow] candiduslynx opened a new pull request, #36163: GH-36095: [Go] Add doc for `pqarrow.FileWriter.WriteBuffered`

candiduslynx opened a new pull request, #36163:
URL: https://github.com/apache/arrow/pull/36163

   ### Rationale for this change
   
   Docs to help people decide on the best-performing option.
   
   ### What changes are included in this PR?
   
   Doc change only.
   
   ### Are these changes tested?
   
   N/A
   
   ### Are there any user-facing changes?
   
   Doc
   
   <!--
   If there are any breaking changes to public APIs, please uncomment the line below and explain which changes are breaking.
   -->
   <!-- **This PR includes breaking changes to public APIs.** -->
   
   <!--
   Please uncomment the line below (and provide explanation) if the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld). We use this to highlight fixes to issues that may affect users without their knowledge. For this reason, fixing bugs that cause errors don't count, since those are usually obvious.
   -->
   <!-- **This PR contains a "Critical Fix".** -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] candiduslynx commented on a diff in pull request #36163: GH-36095: [Go] Add doc for `pqarrow.FileWriter.WriteBuffered`

Posted by "candiduslynx (via GitHub)" <gi...@apache.org>.
candiduslynx commented on code in PR #36163:
URL: https://github.com/apache/arrow/pull/36163#discussion_r1234761572


##########
go/parquet/pqarrow/file_writer.go:
##########
@@ -134,6 +134,13 @@ func (fw *FileWriter) RowGroupTotalBytesWritten() int64 {
 	return 0
 }
 
+// WriteBuffered allows to write records and decide where to break your row group
+// based on the TotalBytesWritten rather than on the max row group len.

Review Comment:
   Done in 9c4ca94137a143986e69feb07a0ab94baee7e334



##########
go/parquet/pqarrow/file_writer.go:
##########
@@ -181,7 +188,10 @@ func (fw *FileWriter) WriteBuffered(rec arrow.Record) error {
 }
 
 // Write an arrow Record Batch to the file, respecting the MaxRowGroupLength in the writer
-// properties to determine whether or not a new row group is created while writing.
+// properties to determine whether a new row group is created or not while writing.

Review Comment:
   Done in 9c4ca94137a143986e69feb07a0ab94baee7e334



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] zeroshade merged pull request #36163: GH-36095: [Go] Add doc for `pqarrow.FileWriter.WriteBuffered`

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.
zeroshade merged PR #36163:
URL: https://github.com/apache/arrow/pull/36163


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] candiduslynx commented on a diff in pull request #36163: GH-36095: [Go] Add doc for `pqarrow.FileWriter.WriteBuffered`

Posted by "candiduslynx (via GitHub)" <gi...@apache.org>.
candiduslynx commented on code in PR #36163:
URL: https://github.com/apache/arrow/pull/36163#discussion_r1234761708


##########
go/parquet/pqarrow/file_writer.go:
##########
@@ -181,7 +188,10 @@ func (fw *FileWriter) WriteBuffered(rec arrow.Record) error {
 }
 
 // Write an arrow Record Batch to the file, respecting the MaxRowGroupLength in the writer
-// properties to determine whether or not a new row group is created while writing.
+// properties to determine whether a new row group is created or not while writing.
+//
+// Performance-wise Write might be more favorable than WriteBuffered
+// especially if dealing with records that have a lot of rows.

Review Comment:
   Done in 9c4ca94137a143986e69feb07a0ab94baee7e334



##########
go/parquet/pqarrow/file_writer.go:
##########
@@ -134,6 +134,13 @@ func (fw *FileWriter) RowGroupTotalBytesWritten() int64 {
 	return 0
 }
 
+// WriteBuffered allows to write records and decide where to break your row group
+// based on the TotalBytesWritten rather than on the max row group len.
+// If using Records, this should be paired with NewBufferedRowGroup,
+// while Write will always write a new record as a row group in and of itself.
+//
+// Performance-wise WriteBuffered might be more favorable than Write
+// especially if dealing with lots of records that have only a small amount of rows.

Review Comment:
   Done in 9c4ca94137a143986e69feb07a0ab94baee7e334



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] zeroshade commented on a diff in pull request #36163: GH-36095: [Go] Add doc for `pqarrow.FileWriter.WriteBuffered`

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.
zeroshade commented on code in PR #36163:
URL: https://github.com/apache/arrow/pull/36163#discussion_r1234553891


##########
go/parquet/pqarrow/file_writer.go:
##########
@@ -181,7 +188,10 @@ func (fw *FileWriter) WriteBuffered(rec arrow.Record) error {
 }
 
 // Write an arrow Record Batch to the file, respecting the MaxRowGroupLength in the writer
-// properties to determine whether or not a new row group is created while writing.
+// properties to determine whether a new row group is created or not while writing.

Review Comment:
   Let's update this to say `to determine whether the record is broken up into more than one row group` and add a statement that it will always create at least one row group for the record, or something like that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] zeroshade commented on a diff in pull request #36163: GH-36095: [Go] Add doc for `pqarrow.FileWriter.WriteBuffered`

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.
zeroshade commented on code in PR #36163:
URL: https://github.com/apache/arrow/pull/36163#discussion_r1234553594


##########
go/parquet/pqarrow/file_writer.go:
##########
@@ -134,6 +134,13 @@ func (fw *FileWriter) RowGroupTotalBytesWritten() int64 {
 	return 0
 }
 
+// WriteBuffered allows to write records and decide where to break your row group
+// based on the TotalBytesWritten rather than on the max row group len.
+// If using Records, this should be paired with NewBufferedRowGroup,
+// while Write will always write a new record as a row group in and of itself.
+//
+// Performance-wise WriteBuffered might be more favorable than Write
+// especially if dealing with lots of records that have only a small amount of rows.

Review Comment:
    Mention that the tradeoff is that more memory will be utilized to keep the whole row group buffered in memory before it starts writing (since Parquet files must write an entire column before writing the next column).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] zeroshade commented on a diff in pull request #36163: GH-36095: [Go] Add doc for `pqarrow.FileWriter.WriteBuffered`

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.
zeroshade commented on code in PR #36163:
URL: https://github.com/apache/arrow/pull/36163#discussion_r1235490902


##########
go/parquet/pqarrow/file_writer.go:
##########
@@ -134,13 +134,19 @@ func (fw *FileWriter) RowGroupTotalBytesWritten() int64 {
 	return 0
 }
 
-// WriteBuffered allows to write records and decide where to break your row group
-// based on the TotalBytesWritten rather than on the max row group len.
-// If using Records, this should be paired with NewBufferedRowGroup,
-// while Write will always write a new record as a row group in and of itself.
+// WriteBuffered will either append to an existing row group or create a new one
+// based on the record length and RowGroupTotalBytesWritten.

Review Comment:
   `WriteBuffered` itself never checks `RowGroupTotalBytesWritten`, unless i'm missing something.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] candiduslynx commented on a diff in pull request #36163: GH-36095: [Go] Add doc for `pqarrow.FileWriter.WriteBuffered`

Posted by "candiduslynx (via GitHub)" <gi...@apache.org>.
candiduslynx commented on code in PR #36163:
URL: https://github.com/apache/arrow/pull/36163#discussion_r1235497017


##########
go/parquet/pqarrow/file_writer.go:
##########
@@ -134,13 +134,19 @@ func (fw *FileWriter) RowGroupTotalBytesWritten() int64 {
 	return 0
 }
 
-// WriteBuffered allows to write records and decide where to break your row group
-// based on the TotalBytesWritten rather than on the max row group len.
-// If using Records, this should be paired with NewBufferedRowGroup,
-// while Write will always write a new record as a row group in and of itself.
+// WriteBuffered will either append to an existing row group or create a new one
+// based on the record length and RowGroupTotalBytesWritten.

Review Comment:
   fixed a typo in 25d51d5a58e86ca527d31547ec21e6878e7a852f



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] conbench-apache-arrow[bot] commented on pull request #36163: GH-36095: [Go] Add doc for `pqarrow.FileWriter.WriteBuffered`

Posted by "conbench-apache-arrow[bot] (via GitHub)" <gi...@apache.org>.
conbench-apache-arrow[bot] commented on PR #36163:
URL: https://github.com/apache/arrow/pull/36163#issuecomment-1599946410

   Conbench analyzed the 6 benchmark runs on commit `94af6c3c`.
   
   There were 3 benchmark results indicating a performance regression:
   
   - Commit Run on `ursa-thinkcentre-m75q` at [2023-06-20 23:29:50Z](http://conbench.ursa.dev/compare/runs/4685a9c01c4c4edba209c332bf7b137d...1e934c9afe7d481088ba400f4707cd44/)
     - [params=1048576/10000, source=cpp-micro, suite=arrow-acero-aggregate-benchmark](http://conbench.ursa.dev/compare/benchmarks/064920e31d3073d38000d858a75b9902...06492369806f75028000b1fa78f837c9)
     - [params=DecodeArrow_Dense/32768, source=cpp-micro, suite=parquet-encoding-benchmark](http://conbench.ursa.dev/compare/benchmarks/064920e702db72ad80005c72c0b0e9ad...0649236d4a7272478000837101d72282)
   - and 1 more (see the report linked below)
   
   The [full Conbench report](https://github.com/apache/arrow/runs/14423282661) has more details.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] zeroshade commented on a diff in pull request #36163: GH-36095: [Go] Add doc for `pqarrow.FileWriter.WriteBuffered`

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.
zeroshade commented on code in PR #36163:
URL: https://github.com/apache/arrow/pull/36163#discussion_r1234554181


##########
go/parquet/pqarrow/file_writer.go:
##########
@@ -181,7 +188,10 @@ func (fw *FileWriter) WriteBuffered(rec arrow.Record) error {
 }
 
 // Write an arrow Record Batch to the file, respecting the MaxRowGroupLength in the writer
-// properties to determine whether or not a new row group is created while writing.
+// properties to determine whether a new row group is created or not while writing.
+//
+// Performance-wise Write might be more favorable than WriteBuffered
+// especially if dealing with records that have a lot of rows.

Review Comment:
   Let's mention that this would be more performant in the case of a highly restricted memory environment, or if you have very large records with lots of rows, potentially close to the `max row group length`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] zeroshade commented on a diff in pull request #36163: GH-36095: [Go] Add doc for `pqarrow.FileWriter.WriteBuffered`

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.
zeroshade commented on code in PR #36163:
URL: https://github.com/apache/arrow/pull/36163#discussion_r1234553152


##########
go/parquet/pqarrow/file_writer.go:
##########
@@ -134,6 +134,13 @@ func (fw *FileWriter) RowGroupTotalBytesWritten() int64 {
 	return 0
 }
 
+// WriteBuffered allows to write records and decide where to break your row group
+// based on the TotalBytesWritten rather than on the max row group len.

Review Comment:
   TotalBytesWritten will still break row groups on the max row group len, not TotalBytesWritten. It does, however, allow users to use TotalBytesWritten to decide when to break row groups themselves by checking `TotalBytesWritten` and calling `NewBuffereredRowGroup` themselves.



##########
go/parquet/pqarrow/file_writer.go:
##########
@@ -134,6 +134,13 @@ func (fw *FileWriter) RowGroupTotalBytesWritten() int64 {
 	return 0
 }
 
+// WriteBuffered allows to write records and decide where to break your row group
+// based on the TotalBytesWritten rather than on the max row group len.

Review Comment:
   `WriteBuffered` will still break row groups on the max row group len, not TotalBytesWritten. It does, however, allow users to use TotalBytesWritten to decide when to break row groups themselves by checking `TotalBytesWritten` and calling `NewBuffereredRowGroup` themselves.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #36163: GH-36095: [Go] Add doc for `pqarrow.FileWriter.WriteBuffered`

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #36163:
URL: https://github.com/apache/arrow/pull/36163#issuecomment-1596974472

   :warning: GitHub issue #36095 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org