Posted to issues@orc.apache.org by GitBox <gi...@apache.org> on 2022/08/30 15:15:22 UTC

[GitHub] [orc] LouisClt opened a new issue, #1240: Huge memory taken for each field when exporting

LouisClt opened a new issue, #1240:
URL: https://github.com/apache/orc/issues/1240

Hello,
While using the Arrow adapter, I noticed that the memory (RAM) footprint of exporting an ORC file is very large for each field. For instance, exporting a table with 10000 fields can take up to 30 GB, even if there are only 10 records.
Even for 100 fields, it can take more than 100 MB.
The issue seems to come from here:
https://github.com/apache/orc/blob/432a7aade9ea8d3cd705d315da21c2c859bce9ef/c%2B%2B/src/ColumnWriter.cc#L59

When we create a writer with `createWriter` (https://github.com/apache/orc/blob/432a7aade9ea8d3cd705d315da21c2c859bce9ef/c%2B%2B/src/Writer.cc#L681-L684 ), a stream (compressor) is created for each field. Since each stream allocates a buffer of 1 * 1024 * 1024 bytes, every field takes at least 1 MB of additional memory.
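
As a rough check on the arithmetic above, here is a minimal C++ sketch (not ORC code) of the lower bound on the memory reserved up front when every stream pre-allocates a fixed-size buffer. The helper name `minReservedBytes` and the `streamsPerField` parameter are hypothetical; the 1 MiB capacity is the value from the linked line.

```cpp
#include <cstdint>

// Initial capacity of each stream buffer, as in the linked ColumnWriter.cc line.
constexpr std::uint64_t kInitialCapacity = 1 * 1024 * 1024;  // 1 MiB

// Hypothetical helper (not part of the ORC API): lower bound on the bytes
// reserved before any row is written, if every stream pre-allocates
// `capacity` bytes. `streamsPerField` is an illustrative assumption.
constexpr std::uint64_t minReservedBytes(std::uint64_t fields,
                                         std::uint64_t streamsPerField = 1,
                                         std::uint64_t capacity = kInitialCapacity) {
  return fields * streamsPerField * capacity;
}
```

With one stream per field, 100 fields reserve 100 MiB and 10000 fields reserve about 9.8 GiB before any row is written. A column usually owns more than one stream, so the real total can be a multiple of this; three streams per field (an illustrative assumption) gives about 29 GiB for 10000 fields.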

Is there a reason the BufferedOutputStream initial capacity is that high? I worked around my problem by lowering it to 1 KB (it did not change performance much in my testing, but that may depend on the use case). Would it be possible to add a global (or static) variable so that this hard-coded value can be configured?
Thanks

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [orc] dongjoon-hyun commented on issue #1240: Huge memory taken for each field when exporting

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on issue #1240:
URL: https://github.com/apache/orc/issues/1240#issuecomment-1232430909

   cc @wgtmac , @stiga-huang , @coderex2522 



[GitHub] [orc] dongjoon-hyun commented on issue #1240: Huge memory taken for each field when exporting

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on issue #1240:
URL: https://github.com/apache/orc/issues/1240#issuecomment-1418927085

   Hi, @LouisClt. FYI, according to the Apache ORC release cycle, newly developed features will be delivered via v1.9.0 in September 2023 (if they are merged into Apache ORC before then).
   - https://github.com/apache/orc/milestones



[GitHub] [orc] wgtmac commented on issue #1240: Huge memory taken for each field when exporting

Posted by GitBox <gi...@apache.org>.

wgtmac commented on issue #1240:
URL: https://github.com/apache/orc/issues/1240#issuecomment-1258867909

   I have created a JIRA to track the progress: https://issues.apache.org/jira/browse/ORC-1280



[GitHub] [orc] coderex2522 commented on issue #1240: Huge memory taken for each field when exporting

Posted by GitBox <gi...@apache.org>.

coderex2522 commented on issue #1240:
URL: https://github.com/apache/orc/issues/1240#issuecomment-1258886800

   @dongjoon-hyun @wgtmac @LouisClt I will follow up on this issue (ORC-1280) and implement much smarter memory management.
   



[GitHub] [orc] wgtmac commented on issue #1240: Huge memory taken for each field when exporting

Posted by GitBox <gi...@apache.org>.

wgtmac commented on issue #1240:
URL: https://github.com/apache/orc/issues/1240#issuecomment-1244805812

   We could replace the DataBuffer with a new buffer implementation that has much smarter memory management and automatically grows and shrinks according to actual usage. This management can happen on a per-column basis.



[GitHub] [orc] dongjoon-hyun commented on issue #1240: Huge memory taken for each field when exporting

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on issue #1240:
URL: https://github.com/apache/orc/issues/1240#issuecomment-1260182779

   Thank you, @coderex2522 .



[GitHub] [orc] LouisClt commented on issue #1240: Huge memory taken for each field when exporting

Posted by "LouisClt (via GitHub)" <gi...@apache.org>.

LouisClt commented on issue #1240:
URL: https://github.com/apache/orc/issues/1240#issuecomment-1418798429

   Thanks for your reply @wgtmac and the implementation of the `BlockBuffer`.
   I'll wait for the replacement of the `rawInputBuffer` by the `BlockBuffer` in every compression stream then. Do you think it will take long ?



[GitHub] [orc] wgtmac commented on issue #1240: Huge memory taken for each field when exporting

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.

wgtmac commented on issue #1240:
URL: https://github.com/apache/orc/issues/1240#issuecomment-1401766927

   > Hello, it seems there were commits referencing this issue. Is this issue now fixed ?
   
   @LouisClt Thanks for your follow-up.
   
   We have implemented a block-based buffer called `BlockBuffer` (by @coderex2522) and used it to replace the output buffer in the `CompressionStream`. It can decrease the memory footprint to some extent. 
   
   IMO, the next step is to use it to replace the input buffer of the `CompressionStream` which has the size of `compressionBlockSize` per stream.
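
To illustrate the idea of a block-based buffer described above, here is a minimal sketch. It is not ORC's `BlockBuffer`: the class name `SketchBlockBuffer`, its methods, and the block size used below are all hypothetical. The point it shows is that memory is reserved one fixed-size block at a time as data arrives, so an almost empty stream holds at most one block instead of a full pre-allocated buffer, and growing never copies existing data.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <vector>

// Hypothetical sketch of a block-based buffer; NOT the ORC BlockBuffer API.
class SketchBlockBuffer {
 public:
  explicit SketchBlockBuffer(std::size_t blockSize) : blockSize_(blockSize) {
    assert(blockSize > 0);
  }

  // Append bytes, allocating a new block only when the last one is full.
  // Existing blocks are never moved or copied.
  void append(const char* data, std::size_t length) {
    while (length > 0) {
      if (blocks_.empty() || usedInLast_ == blockSize_) {
        blocks_.push_back(std::make_unique<char[]>(blockSize_));
        usedInLast_ = 0;
      }
      const std::size_t n = std::min(length, blockSize_ - usedInLast_);
      std::memcpy(blocks_.back().get() + usedInLast_, data, n);
      usedInLast_ += n;
      size_ += n;
      data += n;
      length -= n;
    }
  }

  // Bytes of payload currently stored.
  std::size_t size() const { return size_; }

  // Bytes actually reserved from the allocator.
  std::size_t reservedBytes() const { return blocks_.size() * blockSize_; }

  // Drop the contents and release every block (the "shrink" side).
  void clear() {
    blocks_.clear();
    usedInLast_ = 0;
    size_ = 0;
  }

 private:
  std::size_t blockSize_;
  std::size_t usedInLast_ = 0;
  std::size_t size_ = 0;
  std::vector<std::unique_ptr<char[]>> blocks_;
};
```

For example, with a 4096-byte block size, appending 5 bytes reserves a single block, and appending 10000 more bytes brings the total to three blocks (12288 bytes) rather than a fixed 1 MB.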



[GitHub] [orc] LouisClt commented on issue #1240: Huge memory taken for each field when exporting

Posted by "LouisClt (via GitHub)" <gi...@apache.org>.

LouisClt commented on issue #1240:
URL: https://github.com/apache/orc/issues/1240#issuecomment-1419097729

   Understood, and thanks for your answer !



[GitHub] [orc] LouisClt commented on issue #1240: Huge memory taken for each field when exporting

Posted by GitBox <gi...@apache.org>.

LouisClt commented on issue #1240:
URL: https://github.com/apache/orc/issues/1240#issuecomment-1255074200

   Thanks everyone for your answers. I understand the possible performance issues of lowering the buffer size too much (although it was fine in my own testing).
   I think the solution proposed by @wgtmac would be fine for me, and better than going through global variables, if it is feasible.



[GitHub] [orc] wgtmac commented on issue #1240: Huge memory taken for each field when exporting

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.

wgtmac commented on issue #1240:
URL: https://github.com/apache/orc/issues/1240#issuecomment-1415187527

   > > Hello, it seems there were commits referencing this issue. Is this issue now fixed ?
   > 
   > @LouisClt Thanks for your follow-up.
   > 
   > We have implemented a block-based buffer called `BlockBuffer` (by @coderex2522) and used it to replace the output buffer in the `CompressionStream`. It can decrease the memory footprint to some extent.
   > 
   > IMO, the next step is to use it to replace the input buffer of the `CompressionStream` which has the size of `compressionBlockSize` per stream.
   
   To be precise, the `rawInputBuffer` of every `CompressionStream` is fixed to the compression block size, which is 1M by default. A writer with many columns will suffer from a large memory footprint, and nothing can be done to alleviate it.
   
   I have created a JIRA to track it: https://issues.apache.org/jira/browse/ORC-1365
   
   cc @coderex2522 



[GitHub] [orc] luffy-zh commented on issue #1240: Huge memory taken for each field when exporting

Posted by "luffy-zh (via GitHub)" <gi...@apache.org>.

luffy-zh commented on issue #1240:
URL: https://github.com/apache/orc/issues/1240#issuecomment-1420047447

   I will work on it.



[GitHub] [orc] LouisClt commented on issue #1240: Huge memory taken for each field when exporting

Posted by "LouisClt (via GitHub)" <gi...@apache.org>.

LouisClt commented on issue #1240:
URL: https://github.com/apache/orc/issues/1240#issuecomment-1400662629

   Hello, it seems there were commits referencing this issue. Is this issue now fixed ?
   



[GitHub] [orc] coderex2522 commented on issue #1240: Huge memory taken for each field when exporting

Posted by GitBox <gi...@apache.org>.

coderex2522 commented on issue #1240:
URL: https://github.com/apache/orc/issues/1240#issuecomment-1233150605

   @LouisClt To support the zero-copy mechanism, the BufferedOutputStream class has an internal data buffer whose default capacity is 1 MB. This default capacity should be configurable, but note that if the buffer capacity is set too small, the buffer may have to expand often, triggering frequent memcpy calls.
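
The trade-off mentioned above can be sketched with a toy cost model. This is not ORC code: `GrowthCost` and `growthCost` are hypothetical names, and the capacity-doubling policy is an assumption (the actual growth policy of the internal data buffer may differ). The model counts how many times a contiguous buffer must be reallocated, and how many bytes are copied in total, before it can hold a given amount of data.

```cpp
#include <cstdint>

// Toy model (an assumption, not ORC's policy): a contiguous buffer that
// doubles its capacity whenever it is full, copying its contents each time.
struct GrowthCost {
  std::uint64_t reallocations;
  std::uint64_t bytesCopied;
};

// Cost of growing from `initialCapacity` until `totalBytes` fit.
// `initialCapacity` must be non-zero, otherwise doubling never makes progress.
constexpr GrowthCost growthCost(std::uint64_t initialCapacity,
                                std::uint64_t totalBytes) {
  GrowthCost cost{0, 0};
  std::uint64_t capacity = initialCapacity;
  while (capacity < totalBytes) {
    cost.bytesCopied += capacity;  // the buffer is full when it has to grow
    capacity *= 2;
    ++cost.reallocations;
  }
  return cost;
}
```

Under this model, a stream that starts at 1 KiB and ends up holding 1 MiB goes through 10 reallocations and copies 1047552 bytes in total, while a 1 MiB initial capacity needs none. Streams that stay small never pay this cost, which is why the right default depends on the workload.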

