You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@orc.apache.org by GitBox <gi...@apache.org> on 2021/12/14 08:02:18 UTC

[GitHub] [orc] expxiaoli opened a new pull request #971: ORC-1060: reduce memory usage when vectorized reading dictionary stri…

expxiaoli opened a new pull request #971:
URL: https://github.com/apache/orc/pull/971


   …ng encoding columns
   
   In old code, when dictionary string encoding columns are read by vectorized reading, 2 copy of current stripe's dictionary data and 1 copy of next stripe's dictionary data are hold in memory when reading across different stripes. That could make vectorized reading's memory usage is larger than row reading. This patch fixes this issue, and only hold 1 copy of current stripe's dictionary data.
   
   This patch logic has 3 parts:
   
   1) Directly read data to primitive byte array, rather than using DynamicByteArray as intermediate variable. Using DynamicByteArray as intermediate variable causes 2 copy of current stripe's dictionary data are hold in memory.
   
   2) Lazy read dictionary data until read current batch data. In previous code, RecordReaderImpl class's nextBatch method reads dictionary data of next stripe through advanceToNextRow method, then memory will hold two stripe's dictionary data. Through lazy read logic, only one stripe's dictionary data is hold in memory when reading across different stripes.
   
   3) Before lazy read dictionary data from current stripe, remove batch data's reference to dictionary data from previous stripe. This could allow GC to clean previous stripe's dictionary data memory.
   
   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. File a JIRA issue first and use it as a prefix of your PR title, e.g., `ORC-001: Fix ABC`.
     2. Use your PR title to summarize what this PR proposes instead of describing the problem.
     3. Make PR title and description complete because these will be the permanent commit log.
     4. If possible, provide a concise and reproducible example to reproduce the issue for a faster review.
     5. If the PR is unfinished, use GitHub PR Draft feature.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If there is a discussion in the mailing list, please add the link.
   -->
   
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on pull request #971: ORC-1060: Reduce memory usage when vectorized reading dictionary string encoding columns

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #971:
URL: https://github.com/apache/orc/pull/971#issuecomment-1002765694


   I cherry-picked this to branch-1.7 for next Apache ORC 1.7.3 release.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] expxiaoli commented on pull request #971: ORC-1060: reduce memory usage when vectorized reading dictionary string encoding columns

Posted by GitBox <gi...@apache.org>.
expxiaoli commented on pull request #971:
URL: https://github.com/apache/orc/pull/971#issuecomment-993264138


   see background info here: https://issues.apache.org/jira/browse/ORC-1060


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] expxiaoli commented on pull request #971: ORC-1060: reduce memory usage when vectorized reading dictionary string encoding columns

Posted by GitBox <gi...@apache.org>.
expxiaoli commented on pull request #971:
URL: https://github.com/apache/orc/pull/971#issuecomment-997689468


   @pgaref @wgtmac Could you review and verify it? 
   For reading string dictionary encoding column, this PR could reduce batch reading's memory usage to nearly row reading's memory usage, which could solve OOM issue in migration work from row reading to batch reading.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun merged pull request #971: ORC-1060: Reduce memory usage when vectorized reading dictionary string encoding columns

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun merged pull request #971:
URL: https://github.com/apache/orc/pull/971


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] expxiaoli edited a comment on pull request #971: ORC-1060: reduce memory usage when vectorized reading dictionary string encoding columns

Posted by GitBox <gi...@apache.org>.
expxiaoli edited a comment on pull request #971:
URL: https://github.com/apache/orc/pull/971#issuecomment-1000203266


   @dongjoon-hyun In my perf test, speed is no regression for this patch. Here is scan time result for spark run query "insert overwrite table res_table select mapping['tag_a'] from src_table"
   
   "scan time" is spark metric in DataSourceScanExec class's doExecuteColumnar method , which only metric time for scan operator
   
   executor-memory | new ORC with this patch | old ORC
   2500M                   | 37.2s                                 | 34.6s
   2250M                   | 36.5s                                 | OOM
   1100M                    | 30.2s                                 | OOM
   1000M                    | OOM                                 | OOM
   
   Besides, this patch removes memory allocation for DynamicByteArray as well as memory copy from DynamicByteArray to primitive byte array, and do not add other time consuming logic. I think there is no potential regression for speed.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] guiyanakuang commented on pull request #971: ORC-1060: reduce memory usage when vectorized reading dictionary string encoding columns

Posted by GitBox <gi...@apache.org>.
guiyanakuang commented on pull request #971:
URL: https://github.com/apache/orc/pull/971#issuecomment-1000219651


   Before
   `InStream` ----> `DynamicByteArray(data[][])` -----> `byte[]`
   After
   `InStream` ----> `byte[]`
   
   `DynamicByteArray` plays no other role in the context and seems completely redundant, I approve of this pr


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] expxiaoli edited a comment on pull request #971: ORC-1060: reduce memory usage when vectorized reading dictionary string encoding columns

Posted by GitBox <gi...@apache.org>.
expxiaoli edited a comment on pull request #971:
URL: https://github.com/apache/orc/pull/971#issuecomment-1000203266


   @dongjoon-hyun In my perf test, speed is no regression for this patch. Here is scan time result for spark run query "insert overwrite table res_table select mapping['tag_a'] from src_table"
   
   "scan time" is spark metric in DataSourceScanExec class's doExecuteColumnar method , which only metric time for scan operator
   
   executor-memory | new ORC with this patch | old ORC
   2500M                   | 37.2S                                 | 34.6S
   2250M                   | 36.5S                                 | OOM
   1100M                    | 30.2S                                 | OOM
   1000M                    | OOM                                 | OOM
   
   Besides, this patch removes memory allocation for DynamicByteArray as well as memory copy from DynamicByteArray to primitive byte array, and do not add other time consuming logic. I think there is no potential regression for speed.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] expxiaoli commented on pull request #971: ORC-1060: reduce memory usage when vectorized reading dictionary string encoding columns

Posted by GitBox <gi...@apache.org>.
expxiaoli commented on pull request #971:
URL: https://github.com/apache/orc/pull/971#issuecomment-997666048


   Here is perf test.
   I create a orc table named src_table with map<string, string> column named mapping, which stripe's string dictionary could occupy 466M memory.
   Then I run a spark query to read this column:
   insert overwrite table res_table select mapping['tag_a'] from src_table;
   
   With old orc lib, only if I set executor-memory to equal or larger than 2500M, the query could run successfully. Otherwise the query will fail with OOM exception. Here is perf result with MAT tool when I run query with 2500M executor-memory
   ![内存占用情况_DynamicByteArray](https://user-images.githubusercontent.com/2948397/146728253-95e6530f-58d0-4ee0-97b3-e9d252cf2084.png)
   
   With orc lib with this new patch, the query could run successfully when executor-memory is decreased to 1200M. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on pull request #971: ORC-1060: Reduce memory usage when vectorized reading dictionary string encoding columns

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #971:
URL: https://github.com/apache/orc/pull/971#issuecomment-1000969511


   @expxiaoli . I added you to the Apache ORC contributor group and assigned ORC-1060 to you.
   Welcome to the Apache ORC community.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun edited a comment on pull request #971: ORC-1060: Reduce memory usage when vectorized reading dictionary string encoding columns

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #971:
URL: https://github.com/apache/orc/pull/971#issuecomment-1002765694


   I cherry-picked this to branch-1.7 for next Apache ORC 1.7.3 release.
   We will test the performance impact with the downstream projects before releasing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on pull request #971: ORC-1060: reduce memory usage when vectorized reading dictionary string encoding columns

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #971:
URL: https://github.com/apache/orc/pull/971#issuecomment-993949868


   Thank you for making a PR, @expxiaoli .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] expxiaoli edited a comment on pull request #971: ORC-1060: reduce memory usage when vectorized reading dictionary string encoding columns

Posted by GitBox <gi...@apache.org>.
expxiaoli edited a comment on pull request #971:
URL: https://github.com/apache/orc/pull/971#issuecomment-1000203266


   @dongjoon-hyun In my perf test, speed is no regression for this patch. Here is scan time result for spark run query "insert overwrite table res_table select mapping['tag_a'] from src_table"
   
   "scan time" is spark metric in FileSourceScanExec class's doExecuteColumnar method , which only metric time for scan operator
   
   executor-memory | new ORC with this patch | old ORC
   2500M                   | 37.2s                                 | 34.6s
   2250M                   | 36.5s                                 | OOM
   1100M                    | 30.2s                                 | OOM
   1000M                   | OOM                                 | OOM
   
   Besides, this patch removes memory allocation for DynamicByteArray as well as memory copy from DynamicByteArray to primitive byte array, and do not add other time consuming logic. I think there is no potential regression for speed.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on pull request #971: ORC-1060: reduce memory usage when vectorized reading dictionary string encoding columns

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #971:
URL: https://github.com/apache/orc/pull/971#issuecomment-994396748


   cc @pgaref 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] expxiaoli commented on pull request #971: ORC-1060: reduce memory usage when vectorized reading dictionary string encoding columns

Posted by GitBox <gi...@apache.org>.
expxiaoli commented on pull request #971:
URL: https://github.com/apache/orc/pull/971#issuecomment-1000203266


   @dongjoon-hyun In my perf test, speed is no regression for this patch. Here is scan time result for spark run query "insert overwrite table res_table select mapping['tag_a'] from src_table"
   
   "scan time" is spark metric in DataSourceScanExec class's doExecuteColumnar method , which only metric time for scan operator
   
   executor-memory | new ORC with this patch | old ORC
   2500M                   | 37.2S                                 | 34.6S
   2250M                   | 36.5S                                 | OOM
   1100M                    | 30.2S                                 | OOM
   1000M                    | OOM                                 | OOM
   
   Besides, this patch removes memory allocation for DynamicByteArray as well as memory copy from DynamicByteArray to primitive byte array, and do NOT add other time consuming logic. I think there is NO potential regression for speed.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] expxiaoli commented on pull request #971: ORC-1060: Reduce memory usage when vectorized reading dictionary string encoding columns

Posted by GitBox <gi...@apache.org>.
expxiaoli commented on pull request #971:
URL: https://github.com/apache/orc/pull/971#issuecomment-1003254041


   Thanks @dongjoon-hyun


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org