Posted to issues@orc.apache.org by "XinyuZeng (via GitHub)" <gi...@apache.org> on 2023/03/15 12:19:25 UTC

[GitHub] [orc] XinyuZeng commented on issue #1430: [C++] DataBuffer Constructor (or a non default constructor) uses reserve instead of resize?

XinyuZeng commented on issue #1430:
URL: https://github.com/apache/orc/issues/1430#issuecomment-1469904603

   We can refer to the FileScan utility: https://github.com/apache/orc/blob/main/tools/src/FileScan.cc#L32. When `batch_size` is set to a large value (e.g., the number of rows in the whole file), the scan time increases from ~0.5 to ~0.7. The additional time is spent creating the ColumnVectorBatch, specifically in the `new (buf + i) T()` operation, which is not necessary here.
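
To illustrate the cost being described, here is a minimal sketch contrasting a per-element `new (buf + i) T()` loop with an allocation that skips construction for trivially constructible types. The helper names `allocateValueInit` and `allocateNoInit` are hypothetical and are not ORC's DataBuffer API. The sketch assumes plain `malloc`-backed storage and element types with trivial destructors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>
#include <type_traits>

// Hypothetical helper (not ORC code): allocate n elements and run the
// `new (buf + i) T()` loop. For primitive T this value-initializes, i.e.
// writes a zero into every element, so the cost grows with n.
template <typename T>
T* allocateValueInit(std::size_t n) {
  static_assert(std::is_trivially_destructible<T>::value,
                "sketch only frees memory; it never runs destructors");
  T* buf = static_cast<T*>(std::malloc(n * sizeof(T)));
  assert(buf != nullptr || n == 0);
  for (std::size_t i = 0; i < n; ++i) {
    new (buf + i) T();  // value-initialization
  }
  return buf;
}

// Hypothetical alternative (not ORC code): skip the loop entirely when T
// has a trivial default constructor. The contents are then indeterminate
// and must be written before they are read, which a reader that fills the
// whole batch does anyway.
template <typename T>
T* allocateNoInit(std::size_t n) {
  static_assert(std::is_trivially_destructible<T>::value,
                "sketch only frees memory; it never runs destructors");
  T* buf = static_cast<T*>(std::malloc(n * sizeof(T)));
  assert(buf != nullptr || n == 0);
  if (!std::is_trivially_default_constructible<T>::value) {
    for (std::size_t i = 0; i < n; ++i) {
      new (buf + i) T;  // default-initialization for non-trivial types
    }
  }
  return buf;
}
```

With `T = int64_t`, `allocateValueInit` touches every element once before the reader overwrites it, while `allocateNoInit` only pays for the allocation itself. Both buffers are released with `std::free`.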
   
   I am doing this because scanning into a ColumnVectorBatch first and then converting it to another in-memory format (e.g., Arrow) is not zero-copy. There is an opportunity to transfer memory ownership from the ColumnVectorBatch to Arrow ([link](https://github.com/apache/arrow/issues/21238)), although that is hard given the ColumnVectorBatch API, and it requires the ColumnVectorBatch allocation step to be efficient.

