You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@orc.apache.org by "XinyuZeng (via GitHub)" <gi...@apache.org> on 2023/03/07 14:35:17 UTC
[GitHub] [orc] XinyuZeng opened a new issue, #1430: DataBuffer Constructor (or a non default constructor) uses reserve instead of resize?
XinyuZeng opened a new issue, #1430:
URL: https://github.com/apache/orc/issues/1430
The default constructor of DataBuffer uses resize:
https://github.com/apache/orc/blob/764257a7fb7ade4de9fed3813c67cb36629d5b00/c%2B%2B/src/MemoryPool.cc#L54-L58
which has extra initialization and free operations than reserve:
https://github.com/apache/orc/blob/764257a7fb7ade4de9fed3813c67cb36629d5b00/c%2B%2B/src/MemoryPool.cc#L81-L94
However, in some use cases, for example, create a row batch (ColumnVectorBatch) and read data from ORC file, the resize operation is redundant and reserve should be enough. Not sure this is a problem but I found a significant amount of time will be spent on creating batch if the batch size is large.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [orc] XinyuZeng commented on issue #1430: [C++] DataBuffer Constructor (or a non default constructor) uses reserve instead of resize?
Posted by "XinyuZeng (via GitHub)" <gi...@apache.org>.
XinyuZeng commented on issue #1430:
URL: https://github.com/apache/orc/issues/1430#issuecomment-1469904603
We can refer to the FileScan utility: https://github.com/apache/orc/blob/main/tools/src/FileScan.cc#L32. When the batch_size is set to large (e.g., the number of rows in the whole file), the scan time increases from ~0.5 to ~0.7, and the additional time is on the ColumnVectorBatch creation, specifically `new (buf + i) T()` operation, but this is not necessary.
I am doing this because the method of scanning to ColumnVectorBatch first and then transforming to another in-memory format (e.g., arrow) is not zero-copy. There is an opportunity to transfer the memory ownership of ColumnVectorBatch to Arrow ([link](https://github.com/apache/arrow/issues/21238)) (although it is hard given ORC's ColumnVectorBatch's API), but that requires the allocation step of ColumnVectorBatch to be efficient.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [orc] dongjoon-hyun commented on issue #1430: [C++] DataBuffer Constructor (or a non default constructor) uses reserve instead of resize?
Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on issue #1430:
URL: https://github.com/apache/orc/issues/1430#issuecomment-1460681586
cc @wgtmac , @stiga-huang , @coderex2522
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [orc] XinyuZeng commented on issue #1430: [C++] DataBuffer Constructor (or a non default constructor) uses reserve instead of resize?
Posted by "XinyuZeng (via GitHub)" <gi...@apache.org>.
XinyuZeng commented on issue #1430:
URL: https://github.com/apache/orc/issues/1430#issuecomment-1471606047
I am happy to work on this and probably also the zero-copy to arrow in April if this is not in a hurry. Sorry but I got busy this month.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [orc] wgtmac commented on issue #1430: [C++] DataBuffer Constructor (or a non default constructor) uses reserve instead of resize?
Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on issue #1430:
URL: https://github.com/apache/orc/issues/1430#issuecomment-1471528974
Would you like to submit a PR to address the allocation issue? @XinyuZeng
In addition, some of the primitive types have same layout with arrow, we have the chance of zero-copy transfer by wrapping the orc::DataBuffer into arrow::Buffer.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [orc] wgtmac commented on issue #1430: [C++] DataBuffer Constructor (or a non default constructor) uses reserve instead of resize?
Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on issue #1430:
URL: https://github.com/apache/orc/issues/1430#issuecomment-1471611252
No problem. Just take your time.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
Re: [I] [C++] DataBuffer Constructor (or a non default constructor) uses reserve instead of resize? [orc]
Posted by "XinyuZeng (via GitHub)" <gi...@apache.org>.
XinyuZeng commented on issue #1430:
URL: https://github.com/apache/orc/issues/1430#issuecomment-1884630951
Sorry for the delaying on this. I filed #1739 to address this issue.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
Re: [I] [C++] DataBuffer Constructor (or a non default constructor) uses reserve instead of resize? [orc]
Posted by "Smith-Cruise (via GitHub)" <gi...@apache.org>.
Smith-Cruise commented on issue #1430:
URL: https://github.com/apache/orc/issues/1430#issuecomment-1884198165
> > I faced the same problem, DataBuffer's constructor is too heavy. ![kernel2](https://private-user-images.githubusercontent.com/18729228/293842708-a7ca0ad7-7230-43ee-8912-dc2877b0f664.svg?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MDQ2OTMxMDYsIm5iZiI6MTcwNDY5MjgwNiwicGF0aCI6Ii8xODcyOTIyOC8yOTM4NDI3MDgtYTdjYTBhZDctNzIzMC00M2VlLTg5MTItZGMyODc3YjBmNjY0LnN2Zz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDAxMDglMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwMTA4VDA1NDY0NlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWNkODZlOTE5NWVmMDYxYmQzMjk0MzY4YWQ2MjA1MmNmMWQ5ZGIxMGNjZmU5YmQ1NzRhMjYyY2M0ZjZiNTE0OGEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.U334OIzSsoyWyCawpBQeBteuUS5TlevEIKMUnVF0xkA)
> > Lot's of time waste in memset()
>
> Do you want to submit a fix? @Smith-Cruise
Actually, i didn't know how to fix it. I've tried using `reserve()` instead of `resize()`, but it will sometimes cause decompression error. Just looks like that `memset()` is necessary.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
Re: [I] [C++] DataBuffer Constructor (or a non default constructor) uses reserve instead of resize? [orc]
Posted by "Smith-Cruise (via GitHub)" <gi...@apache.org>.
Smith-Cruise commented on issue #1430:
URL: https://github.com/apache/orc/issues/1430#issuecomment-1874884151
I faced the same problem, DataBuffer's constructor is too heavy.
![kernel2](https://github.com/apache/orc/assets/18729228/a7ca0ad7-7230-43ee-8912-dc2877b0f664)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [orc] wgtmac commented on issue #1430: [C++] DataBuffer Constructor (or a non default constructor) uses reserve instead of resize?
Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on issue #1430:
URL: https://github.com/apache/orc/issues/1430#issuecomment-1464919055
`reserve` only allocates the memory but does not call constructor of the objects of the buffer. This is what `resize` does and will be undefined behavior if not doing so. Could you share your use case or some profiling data (e.g. flame graph) so we can discuss optimization opportunities in detail? @XinyuZeng
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
Re: [I] [C++] DataBuffer Constructor (or a non default constructor) uses reserve instead of resize? [orc]
Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on issue #1430:
URL: https://github.com/apache/orc/issues/1430#issuecomment-1880417264
> I faced the same problem, DataBuffer's constructor is too heavy. ![kernel2](https://private-user-images.githubusercontent.com/18729228/293842708-a7ca0ad7-7230-43ee-8912-dc2877b0f664.svg?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MDQ2OTMxMDYsIm5iZiI6MTcwNDY5MjgwNiwicGF0aCI6Ii8xODcyOTIyOC8yOTM4NDI3MDgtYTdjYTBhZDctNzIzMC00M2VlLTg5MTItZGMyODc3YjBmNjY0LnN2Zz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDAxMDglMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwMTA4VDA1NDY0NlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWNkODZlOTE5NWVmMDYxYmQzMjk0MzY4YWQ2MjA1MmNmMWQ5ZGIxMGNjZmU5YmQ1NzRhMjYyY2M0ZjZiNTE0OGEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.U334OIzSsoyWyCawpBQeBteuUS5TlevEIKMUnVF0xkA)
>
> Lot's of time waste in memset()
Do you want to submit a fix? @Smith-Cruise
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
Re: [I] [C++] DataBuffer Constructor (or a non default constructor) uses reserve instead of resize? [orc]
Posted by "XinyuZeng (via GitHub)" <gi...@apache.org>.
XinyuZeng closed issue #1430: [C++] DataBuffer Constructor (or a non default constructor) uses reserve instead of resize?
URL: https://github.com/apache/orc/issues/1430
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org