You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@orc.apache.org by "XinyuZeng (via GitHub)" <gi...@apache.org> on 2023/03/07 14:35:17 UTC

[GitHub] [orc] XinyuZeng opened a new issue, #1430: DataBuffer Constructor (or a non default constructor) uses reserve instead of resize?

XinyuZeng opened a new issue, #1430:
URL: https://github.com/apache/orc/issues/1430

   The default constructor of DataBuffer uses resize:
   
   https://github.com/apache/orc/blob/764257a7fb7ade4de9fed3813c67cb36629d5b00/c%2B%2B/src/MemoryPool.cc#L54-L58
   
   which has extra initialization and free operations than reserve:
   
   https://github.com/apache/orc/blob/764257a7fb7ade4de9fed3813c67cb36629d5b00/c%2B%2B/src/MemoryPool.cc#L81-L94
   
   However, in some use cases, for example, create a row batch (ColumnVectorBatch) and read data from ORC file, the resize operation is redundant and reserve should be enough. Not sure this is a problem but I found a significant amount of time will be spent on creating batch if the batch size is large.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] XinyuZeng commented on issue #1430: [C++] DataBuffer Constructor (or a non default constructor) uses reserve instead of resize?

Posted by "XinyuZeng (via GitHub)" <gi...@apache.org>.
XinyuZeng commented on issue #1430:
URL: https://github.com/apache/orc/issues/1430#issuecomment-1469904603

   We can refer to the FileScan utility: https://github.com/apache/orc/blob/main/tools/src/FileScan.cc#L32. When the batch_size is set to large (e.g., the number of rows in the whole file), the scan time increases from ~0.5 to ~0.7, and the additional time is on the ColumnVectorBatch creation, specifically `new (buf + i) T()` operation, but this is not necessary.
   
   I am doing this because the method of scanning to ColumnVectorBatch first and then transforming to another in-memory format (e.g., arrow) is not zero-copy. There is an opportunity to transfer the memory ownership of ColumnVectorBatch to Arrow ([link](https://github.com/apache/arrow/issues/21238)) (although it is hard given ORC's ColumnVectorBatch's API), but that requires the allocation step of ColumnVectorBatch to be efficient. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] dongjoon-hyun commented on issue #1430: [C++] DataBuffer Constructor (or a non default constructor) uses reserve instead of resize?

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on issue #1430:
URL: https://github.com/apache/orc/issues/1430#issuecomment-1460681586

   cc @wgtmac , @stiga-huang , @coderex2522


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] XinyuZeng commented on issue #1430: [C++] DataBuffer Constructor (or a non default constructor) uses reserve instead of resize?

Posted by "XinyuZeng (via GitHub)" <gi...@apache.org>.
XinyuZeng commented on issue #1430:
URL: https://github.com/apache/orc/issues/1430#issuecomment-1471606047

   I am happy to work on this and probably also the zero-copy to arrow in April if this is not in a hurry. Sorry but I got busy this month.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] wgtmac commented on issue #1430: [C++] DataBuffer Constructor (or a non default constructor) uses reserve instead of resize?

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on issue #1430:
URL: https://github.com/apache/orc/issues/1430#issuecomment-1471528974

   Would you like to submit a PR to address the allocation issue? @XinyuZeng 
   
   In addition, some of the primitive types have same layout with arrow, we have the chance of zero-copy transfer by wrapping the orc::DataBuffer into arrow::Buffer.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] wgtmac commented on issue #1430: [C++] DataBuffer Constructor (or a non default constructor) uses reserve instead of resize?

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on issue #1430:
URL: https://github.com/apache/orc/issues/1430#issuecomment-1471611252

   No problem. Just take your time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C++] DataBuffer Constructor (or a non default constructor) uses reserve instead of resize? [orc]

Posted by "XinyuZeng (via GitHub)" <gi...@apache.org>.
XinyuZeng commented on issue #1430:
URL: https://github.com/apache/orc/issues/1430#issuecomment-1884630951

   Sorry for the delaying on this. I filed #1739 to address this issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C++] DataBuffer Constructor (or a non default constructor) uses reserve instead of resize? [orc]

Posted by "Smith-Cruise (via GitHub)" <gi...@apache.org>.
Smith-Cruise commented on issue #1430:
URL: https://github.com/apache/orc/issues/1430#issuecomment-1884198165

   > > I faced the same problem, DataBuffer's constructor is too heavy. ![kernel2](https://private-user-images.githubusercontent.com/18729228/293842708-a7ca0ad7-7230-43ee-8912-dc2877b0f664.svg?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MDQ2OTMxMDYsIm5iZiI6MTcwNDY5MjgwNiwicGF0aCI6Ii8xODcyOTIyOC8yOTM4NDI3MDgtYTdjYTBhZDctNzIzMC00M2VlLTg5MTItZGMyODc3YjBmNjY0LnN2Zz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDAxMDglMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwMTA4VDA1NDY0NlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWNkODZlOTE5NWVmMDYxYmQzMjk0MzY4YWQ2MjA1MmNmMWQ5ZGIxMGNjZmU5YmQ1NzRhMjYyY2M0ZjZiNTE0OGEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.U334OIzSsoyWyCawpBQeBteuUS5TlevEIKMUnVF0xkA)
   > > Lot's of time waste in memset()
   > 
   > Do you want to submit a fix? @Smith-Cruise
   
   Actually, i didn't know how to fix it. I've tried using `reserve()` instead of `resize()`, but it will sometimes cause decompression error. Just looks like that `memset()` is necessary.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C++] DataBuffer Constructor (or a non default constructor) uses reserve instead of resize? [orc]

Posted by "Smith-Cruise (via GitHub)" <gi...@apache.org>.
Smith-Cruise commented on issue #1430:
URL: https://github.com/apache/orc/issues/1430#issuecomment-1874884151

   I faced the same problem, DataBuffer's constructor is too heavy.
   ![kernel2](https://github.com/apache/orc/assets/18729228/a7ca0ad7-7230-43ee-8912-dc2877b0f664)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] wgtmac commented on issue #1430: [C++] DataBuffer Constructor (or a non default constructor) uses reserve instead of resize?

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on issue #1430:
URL: https://github.com/apache/orc/issues/1430#issuecomment-1464919055

   `reserve` only allocates the memory but does not call constructor of the objects of the buffer.  This is what `resize` does and will be undefined behavior if not doing so. Could you share your use case or some profiling data (e.g. flame graph) so we can discuss optimization opportunities in detail? @XinyuZeng  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C++] DataBuffer Constructor (or a non default constructor) uses reserve instead of resize? [orc]

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on issue #1430:
URL: https://github.com/apache/orc/issues/1430#issuecomment-1880417264

   > I faced the same problem, DataBuffer's constructor is too heavy. ![kernel2](https://private-user-images.githubusercontent.com/18729228/293842708-a7ca0ad7-7230-43ee-8912-dc2877b0f664.svg?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MDQ2OTMxMDYsIm5iZiI6MTcwNDY5MjgwNiwicGF0aCI6Ii8xODcyOTIyOC8yOTM4NDI3MDgtYTdjYTBhZDctNzIzMC00M2VlLTg5MTItZGMyODc3YjBmNjY0LnN2Zz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDAxMDglMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwMTA4VDA1NDY0NlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWNkODZlOTE5NWVmMDYxYmQzMjk0MzY4YWQ2MjA1MmNmMWQ5ZGIxMGNjZmU5YmQ1NzRhMjYyY2M0ZjZiNTE0OGEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.U334OIzSsoyWyCawpBQeBteuUS5TlevEIKMUnVF0xkA)
   > 
   > Lot's of time waste in memset()
   
   Do you want to submit a fix? @Smith-Cruise 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C++] DataBuffer Constructor (or a non default constructor) uses reserve instead of resize? [orc]

Posted by "XinyuZeng (via GitHub)" <gi...@apache.org>.
XinyuZeng closed issue #1430: [C++] DataBuffer Constructor (or a non default constructor) uses reserve instead of resize?
URL: https://github.com/apache/orc/issues/1430


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org