You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@arrow.apache.org by "candiduslynx (via GitHub)" <gi...@apache.org> on 2023/06/15 13:08:41 UTC

[GitHub] [arrow] candiduslynx opened a new issue, #36095: Make `WriteBuffered` a default

candiduslynx opened a new issue, #36095:
URL: https://github.com/apache/arrow/issues/36095

   ### Describe the enhancement requested
   
   A follow-up to https://github.com/cloudquery/filetypes/pull/203.
   
   Basically, we saw a drastic improvement after switching to a buffered mode, so maybe making `Write` be buffered and introducing a `WriteUnbuffered` func to perform as current `Write` is a better way?
   
   ### Component(s)
   
   Go


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] candiduslynx commented on issue #36095: [Go] Make `WriteBuffered` a default

Posted by "candiduslynx (via GitHub)" <gi...@apache.org>.
candiduslynx commented on issue #36095:
URL: https://github.com/apache/arrow/issues/36095#issuecomment-1595134347

   🤷‍♂️ 
   Do you have specific contents you want to put there?
   Should it be for both `Write` & `WriteBuffered` methods?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] zeroshade commented on issue #36095: [Go] Make `WriteBuffered` a default

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.
zeroshade commented on issue #36095:
URL: https://github.com/apache/arrow/issues/36095#issuecomment-1594852098

   Would you be up for adding a PR for this? :smile: 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] candiduslynx commented on issue #36095: [Go] Make `WriteBuffered` a default

Posted by "candiduslynx (via GitHub)" <gi...@apache.org>.
candiduslynx commented on issue #36095:
URL: https://github.com/apache/arrow/issues/36095#issuecomment-1595718765

   I think for small records (with small amount of rows) the buffered version will have less memory footprint, too, as shown by our tests.
   I'll try to perform some additional testing regarding the amount of rows it takes to match the performance, so that's included in the docs, too.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] candiduslynx commented on issue #36095: Make `WriteBuffered` a default

Posted by "candiduslynx (via GitHub)" <gi...@apache.org>.
candiduslynx commented on issue #36095:
URL: https://github.com/apache/arrow/issues/36095#issuecomment-1593205439

   @zeroshade we already used bufio.Writer in the benchmark: https://github.com/cloudquery/filetypes/blob/main/parquet/write_read_test.go#L86
   
   I think the main issue is that each record has its own row group that adds quite an overhead for a large amount of records.
   In out benchmark tests (see PR description) we generated 10K records and then wrote them


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] zeroshade closed issue #36095: [Go] Document `pqarrow.FileWriter.WriteBuffered` & `pqarrow.FileWriter.Write` perfromance-wise

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.
zeroshade closed issue #36095: [Go] Document `pqarrow.FileWriter.WriteBuffered` & `pqarrow.FileWriter.Write` perfromance-wise
URL: https://github.com/apache/arrow/issues/36095


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] zeroshade commented on issue #36095: [Go] Make `WriteBuffered` a default

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.
zeroshade commented on issue #36095:
URL: https://github.com/apache/arrow/issues/36095#issuecomment-1595265705

   We should definitely include updated docs for both Write and WriteBuffered (I believe WriteBuffered currently doesn't actually have a doc string associated with it).
   
   As for specific contents, I say go for mentioning the pros and cons for one over the other when dealing with record batches. If your record batches are significantly large (i.e. you want row groups to be roughly the same layout as your records) then `Write` makes sense, alternately if you are getting lots of smaller records that you want to ensure get aggregated into a larger row group, then you'll need to use `WriteBuffered` which will have higher memory usage but better write performance (and likely better read performance too).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] candiduslynx commented on issue #36095: [Go] Make `WriteBuffered` a default

Posted by "candiduslynx (via GitHub)" <gi...@apache.org>.
candiduslynx commented on issue #36095:
URL: https://github.com/apache/arrow/issues/36095#issuecomment-1594438434

   I think the doc change would be fine, so that the users will know about different modes & will choose the write mode with enough info in mind.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] zeroshade commented on issue #36095: Make `WriteBuffered` a default

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.
zeroshade commented on issue #36095:
URL: https://github.com/apache/arrow/issues/36095#issuecomment-1593194307

   I'd be curious *why* the `WriteBuffered` was more performant than unbuffered, and if you'd be better served by just using  https://pkg.go.dev/bufio#Writer as the underlying io.Writer rather than using the WriteBuffered for parquet (or using larger / different record sizes).
   
   In general, the unbuffered writing is going to be significantly more memory efficient unless you're trying to essentially collect multiple records into a single row-group or otherwise need to write to multiple columns before actually writing it to the parquet file.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] candiduslynx commented on issue #36095: Make `WriteBuffered` a default

Posted by "candiduslynx (via GitHub)" <gi...@apache.org>.
candiduslynx commented on issue #36095:
URL: https://github.com/apache/arrow/issues/36095#issuecomment-1593017570

   cc: @zeroshade 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] candiduslynx commented on issue #36095: [Go] Document `pqarrow.FileWriter.WriteBuffered` & `pqarrow.FileWriter.Write` perfromance-wise

Posted by "candiduslynx (via GitHub)" <gi...@apache.org>.
candiduslynx commented on issue #36095:
URL: https://github.com/apache/arrow/issues/36095#issuecomment-1596976928

   I've checked that even for 100 rows/record the buffered mode behaves better than the simple write, I wonder if the scale is 1000s to match them


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] candiduslynx commented on issue #36095: Make `WriteBuffered` a default

Posted by "candiduslynx (via GitHub)" <gi...@apache.org>.
candiduslynx commented on issue #36095:
URL: https://github.com/apache/arrow/issues/36095#issuecomment-1593262301

   > That makes sense to me, you'd want more than 10k rows per row-group. Personally I don't want to change the default out from under people given the semantic difference that would happen by changing this default.
   
   @zeroshade I tried with 100 records (each with a single row), the difference still persists:
   ```sh
    go test \
     -test.run=BenchmarkWrite \
     -test.bench=BenchmarkWrite \
   -test.count 10 -test.benchmem -test.benchtime 100x
   ```
   
   <details><summary>Write</summary>
   
   ```
   goos: darwin
   goarch: arm64
   pkg: github.com/cloudquery/filetypes/v3/parquet
   BenchmarkWrite-10            100           3579741 ns/op         5923408 B/op      45296 allocs/op
   BenchmarkWrite-10            100           3579718 ns/op         5923500 B/op      45296 allocs/op
   BenchmarkWrite-10            100           3542423 ns/op         5923485 B/op      45296 allocs/op
   BenchmarkWrite-10            100           3539510 ns/op         5923228 B/op      45294 allocs/op
   BenchmarkWrite-10            100           3582708 ns/op         5923466 B/op      45295 allocs/op
   BenchmarkWrite-10            100           3631385 ns/op         5923999 B/op      45297 allocs/op
   BenchmarkWrite-10            100           3615708 ns/op         5923475 B/op      45296 allocs/op
   BenchmarkWrite-10            100           3573869 ns/op         5923664 B/op      45297 allocs/op
   BenchmarkWrite-10            100           3625930 ns/op         5923538 B/op      45296 allocs/op
   BenchmarkWrite-10            100           3643602 ns/op         5923328 B/op      45294 allocs/op
   PASS
   ok      github.com/cloudquery/filetypes/v3/parquet      4.667s
   ```
   </details>
   
   <details><summary>WriteBuffered</summary>
   
   ```
   goos: darwin
   goarch: arm64
   pkg: github.com/cloudquery/filetypes/v3/parquet
   BenchmarkWrite-10            100           1063981 ns/op         1288338 B/op      17082 allocs/op
   BenchmarkWrite-10            100           1067547 ns/op         1288197 B/op      17083 allocs/op
   BenchmarkWrite-10            100           1034933 ns/op         1288312 B/op      17082 allocs/op
   BenchmarkWrite-10            100           1044867 ns/op         1288250 B/op      17081 allocs/op
   BenchmarkWrite-10            100           1020637 ns/op         1288325 B/op      17081 allocs/op
   BenchmarkWrite-10            100           1013818 ns/op         1288170 B/op      17083 allocs/op
   BenchmarkWrite-10            100           1028132 ns/op         1288359 B/op      17083 allocs/op
   BenchmarkWrite-10            100           1012284 ns/op         1288004 B/op      17082 allocs/op
   BenchmarkWrite-10            100           1009228 ns/op         1288231 B/op      17082 allocs/op
   BenchmarkWrite-10            100           1038570 ns/op         1288419 B/op      17082 allocs/op
   PASS
   ok      github.com/cloudquery/filetypes/v3/parquet      2.183s
   ```
   </details>
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] zeroshade commented on issue #36095: Make `WriteBuffered` a default

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.
zeroshade commented on issue #36095:
URL: https://github.com/apache/arrow/issues/36095#issuecomment-1593315358

   The difference makes sense to me, as you pointed out, it's likely due to the overhead of having many separate small row-groups rather than one big row group. One suggestion I'd make is to gather the records to create an `arrow.Table` and try using `WriteTable` and see what the performance is like (with some large chunk size) as it will use buffered writing internally to do that.
   
   My personal opinion is still that I wouldn't necessarily want to change the default behavior out from under people. We could, however, improve the documentation and comments on the respective functions though.
   
   Alternately, another solution might be to have a Write method which can choose to use buffered/non-buffered based on whether a consumer calls `NewRowGroup` or `NewBufferedRowGroup` themselves etc. and do the writes up until the max row group length and return how many rows were written / how many rows were left over / some other indication. And give consumers more granular control over the writing in that respect.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] zeroshade commented on issue #36095: Make `WriteBuffered` a default

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.
zeroshade commented on issue #36095:
URL: https://github.com/apache/arrow/issues/36095#issuecomment-1593246808

   That makes sense to me, you'd want more than 10k rows per row-group. Personally I don't want to change the default out from under people given the semantic difference that would happen by changing this default.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] zeroshade commented on issue #36095: [Go] Document `pqarrow.FileWriter.WriteBuffered` & `pqarrow.FileWriter.Write` perfromance-wise

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.
zeroshade commented on issue #36095:
URL: https://github.com/apache/arrow/issues/36095#issuecomment-1598014309

   I'd expect it to need to be thousands or even tens of thousands for non-buffered to be better most likely.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org