Posted to issues@arrow.apache.org by "JkSelf (via GitHub)" <gi...@apache.org> on 2023/05/05 07:49:40 UTC

[GitHub] [arrow] JkSelf opened a new issue, #35444: What is the different between FileWriter#WriteTableAPI() and FileWriter#WriteRecordBatch()

JkSelf opened a new issue, #35444:
URL: https://github.com/apache/arrow/issues/35444

   ### Describe the usage question you have. Please include as many useful details as  possible.
   
   
   We are using Arrow's Parquet write API to write Parquet files, and we found that different APIs give very different performance. We wrote a simple benchmark that reads 1 GB of TPC-DS `store_sales` data ([here](https://github.com/JkSelf/gluten/blob/1015756c07d326dc301d2b1824c8c40dc59b021b/cpp/velox/benchmarks/ParquetWriteBenchmark.cc#L220)) and then writes it with `FileWriter#WriteTableAPI()` and with `FileWriter#WriteRecordBatch()`. `FileWriter#WriteTableAPI()` takes **800s**, while `FileWriter#WriteRecordBatch()` takes only **4.9s**. Why is `WriteTableAPI` so much slower? Am I using it incorrectly?
   
   ### Component(s)
   
   C++, Parquet


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on issue #35444: What is the different between FileWriter#WriteTableAPI() and FileWriter#WriteRecordBatch()

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35444:
URL: https://github.com/apache/arrow/issues/35444#issuecomment-1535879643

   I'm not sure this is the real cause, but `WriteTable` appears to force the non-buffered RowGroup writer. Would you mind profiling or benchmarking the two methods and telling us where the extra time goes? `src/parquet/arrow/reader_writer_benchmark.cc` may help.




[GitHub] [arrow] mapleFU commented on issue #35444: What is the different between FileWriter#WriteTableAPI() and FileWriter#WriteRecordBatch()

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35444:
URL: https://github.com/apache/arrow/issues/35444#issuecomment-1537059497

   These two APIs may take different write paths (buffered vs. non-buffered). You could try configuring them, or avoid shrink-to-fit operations where possible.




[GitHub] [arrow] JkSelf closed issue #35444: What is the different between FileWriter#WriteTableAPI() and FileWriter#WriteRecordBatch()

Posted by "JkSelf (via GitHub)" <gi...@apache.org>.
JkSelf closed issue #35444: What is the different between FileWriter#WriteTableAPI() and FileWriter#WriteRecordBatch()
URL: https://github.com/apache/arrow/issues/35444




[GitHub] [arrow] JkSelf commented on issue #35444: What is the different between FileWriter#WriteTableAPI() and FileWriter#WriteRecordBatch()

Posted by "JkSelf (via GitHub)" <gi...@apache.org>.
JkSelf commented on issue #35444:
URL: https://github.com/apache/arrow/issues/35444#issuecomment-1535870738

   @[lidavidm](https://github.com/lidavidm) Do you have any input?




[GitHub] [arrow] JkSelf commented on issue #35444: What is the different between FileWriter#WriteTableAPI() and FileWriter#WriteRecordBatch()

Posted by "JkSelf (via GitHub)" <gi...@apache.org>.
JkSelf commented on issue #35444:
URL: https://github.com/apache/arrow/issues/35444#issuecomment-1537060016

   Can you elaborate on how to configure it? 




[GitHub] [arrow] mapleFU commented on issue #35444: What is the different between FileWriter#WriteTableAPI() and FileWriter#WriteRecordBatch()

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35444:
URL: https://github.com/apache/arrow/issues/35444#issuecomment-1620046830

   @JkSelf 
   
   ```c++
     Status NewRowGroup(int64_t chunk_size) override {
       if (row_group_writer_ != nullptr) {
         PARQUET_CATCH_NOT_OK(row_group_writer_->Close());
       }
       PARQUET_CATCH_NOT_OK(row_group_writer_ = writer_->AppendRowGroup());
       return Status::OK();
     }
   ```
   
   `parquet::arrow::FileWriterImpl::WriteTable` splits the input table into chunks according to the user's row group size and calls `NewRowGroup` for every chunk. Each call first closes the previous row group, then calls `AppendRowGroup` to create a non-buffered row-group writer for that chunk.
   
   `parquet::arrow::FileWriterImpl::WriteRecordBatch` closes the previous row group if it is not a "buffered" row group, and appends the `RecordBatch` to a buffered row group.
   
   I'm also working on a buffered `WriteTable`, which may help.
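   
   To make the difference between the two paths concrete, here is a minimal toy model of the behavior described above. It is only a sketch: `ToyFileWriter` and all of its members are hypothetical names, it does not use or reproduce the real `parquet::arrow::FileWriter`, and it ignores any row-group length limits the real writer may apply.
   
   ```cpp
   #include <algorithm>
   #include <cassert>
   #include <cstdint>
   
   // Toy stand-in for the writer described in this comment.
   // NOT the real parquet::arrow::FileWriter; every name here is hypothetical.
   class ToyFileWriter {
    public:
     // Mirrors the quoted NewRowGroup: close the open row group (if any),
     // then start a new non-buffered one.
     void NewRowGroup() {
       CloseOpenRowGroup();
       open_ = true;
       buffered_ = false;
       ++row_groups_started_;
     }
   
     // Splits `num_rows` into chunks of at most `chunk_size` rows and starts
     // a fresh non-buffered row group for every chunk.
     void WriteTable(std::int64_t num_rows, std::int64_t chunk_size) {
       assert(num_rows >= 0 && chunk_size > 0);
       for (std::int64_t offset = 0; offset < num_rows; offset += chunk_size) {
         NewRowGroup();
         rows_in_open_group_ = std::min(chunk_size, num_rows - offset);
       }
     }
   
     // Appends to the open buffered row group. If no row group is open, or the
     // open one is not buffered, it is closed and a buffered one is started.
     void WriteRecordBatch(std::int64_t num_rows) {
       assert(num_rows >= 0);
       if (!open_ || !buffered_) {
         CloseOpenRowGroup();
         open_ = true;
         buffered_ = true;
         ++row_groups_started_;
       }
       rows_in_open_group_ += num_rows;
     }
   
     std::int64_t row_groups_started() const { return row_groups_started_; }
     std::int64_t rows_in_open_group() const { return rows_in_open_group_; }
   
    private:
     void CloseOpenRowGroup() {
       open_ = false;
       rows_in_open_group_ = 0;
     }
   
     bool open_ = false;
     bool buffered_ = false;
     std::int64_t row_groups_started_ = 0;
     std::int64_t rows_in_open_group_ = 0;
   };
   ```
   
   In this model, `WriteTable(10, 4)` starts three row groups (4, 4, and 2 rows), while three `WriteRecordBatch` calls of 4, 4, and 2 rows all land in a single buffered row group.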




[GitHub] [arrow] mapleFU commented on issue #35444: What is the different between FileWriter#WriteTableAPI() and FileWriter#WriteRecordBatch()

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35444:
URL: https://github.com/apache/arrow/issues/35444#issuecomment-1536472154

   I've profiled `WriteTable` in `reader_writer_benchmark.cc` and didn't find any operation that would make it slower. Would you mind running a profiler in your test environment?




[GitHub] [arrow] JkSelf commented on issue #35444: What is the different between FileWriter#WriteTableAPI() and FileWriter#WriteRecordBatch()

Posted by "JkSelf (via GitHub)" <gi...@apache.org>.
JkSelf commented on issue #35444:
URL: https://github.com/apache/arrow/issues/35444#issuecomment-1537058403

   The reason is that the `DataBuffer` used by my stream is reallocated directly in its `reserve()` method, so a memcpy of the existing data happens on every write, which degrades performance. After changing the reallocation to grow the capacity by 2x, the write takes about 4s. I'm closing this issue. Thanks again @mapleFU.
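   
   The effect of the growth policy can be illustrated with a small simulation. This is a sketch under stated assumptions: `Growth`, `CopyStats`, and `SimulateAppends` are hypothetical names, the model is an idealized append-only buffer rather than the actual `DataBuffer` or any Arrow class, and it only counts reallocations and copied bytes instead of measuring time.
   
   ```cpp
   #include <cassert>
   #include <cstddef>
   #include <cstdint>
   
   // Hypothetical growth policies for an append-only byte buffer.
   enum class Growth { kExactFit, kDoubling };
   
   // Work counters for the toy buffer (not a real library type).
   struct CopyStats {
     std::uint64_t reallocations = 0;
     std::uint64_t bytes_copied = 0;
   };
   
   // Simulates `num_writes` appends of `chunk_bytes` each and returns how many
   // reallocations and copied bytes the chosen growth policy would incur.
   CopyStats SimulateAppends(std::size_t num_writes, std::size_t chunk_bytes,
                             Growth policy) {
     assert(chunk_bytes > 0);
     CopyStats stats;
     std::size_t size = 0;
     std::size_t capacity = 0;
     for (std::size_t i = 0; i < num_writes; ++i) {
       const std::size_t needed = size + chunk_bytes;
       if (needed > capacity) {
         // Exact fit: grow to precisely what this write needs.
         std::size_t new_capacity = needed;
         if (policy == Growth::kDoubling && capacity > 0) {
           // Doubling: keep doubling the old capacity until the write fits.
           new_capacity = capacity;
           while (new_capacity < needed) new_capacity *= 2;
         }
         stats.reallocations += 1;
         stats.bytes_copied += size;  // existing contents move to the new block
         capacity = new_capacity;
       }
       size = needed;
     }
     return stats;
   }
   ```
   
   For 1024 one-byte appends, the exact-fit policy in this model reallocates on every write and copies 523,776 bytes in total, while the doubling policy reallocates 11 times and copies 1,023 bytes. The copied volume grows quadratically with the number of writes in the first case and linearly in the second.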




[GitHub] [arrow] JkSelf commented on issue #35444: What is the different between FileWriter#WriteTableAPI() and FileWriter#WriteRecordBatch()

Posted by "JkSelf (via GitHub)" <gi...@apache.org>.
JkSelf commented on issue #35444:
URL: https://github.com/apache/arrow/issues/35444#issuecomment-1536950532

   @mapleFU
   Thanks for your response. I profiled my benchmark and found that the [stream](https://github.com/JkSelf/gluten/blob/1015756c07d326dc301d2b1824c8c40dc59b021b/cpp/velox/benchmarks/ParquetWriteBenchmark.cc#L208) I defined keeps reallocating. The slowdown may be caused by my own incorrect usage. I will post an update with further findings. Thanks again for your help.

