Posted to issues@hive.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/04/05 15:27:00 UTC

[jira] [Work logged] (HIVE-25967) Prevent residual expressions from getting serialized in Iceberg splits

     [ https://issues.apache.org/jira/browse/HIVE-25967?focusedWorklogId=752945&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-752945 ]

ASF GitHub Bot logged work on HIVE-25967:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 05/Apr/22 15:26
            Start Date: 05/Apr/22 15:26
    Worklog Time Spent: 10m 
      Work Description: szlta opened a new pull request, #3178:
URL: https://github.com/apache/hive/pull/3178

   I originally thought that we only need the hack whenever residuals are present, so I added this condition:
   
   https://github.com/apache/hive/commit/1aa6ce800004798e78ea53c3bec2beedb5f55b6c#diff-9487d7073613adf5132783cf905ea72164eb4c19461c50e5ce3cd735bb5704a3R127
   
   What I didn't know is that in some cases the residuals() invocation may return True while the stored expression is still a longer construct. The residuals() invocation actually evaluates the expression against the partition information found in the base file scan task. Because of this, the condition is not met, the residuals are left untouched, and they can still cause an OOM.
   
   This addendum removes the aforementioned unnecessary condition.
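
To illustrate why the guard can misfire, here is a minimal, self-contained sketch. It is a toy model, not the Iceberg or Hive code: `ToyScanTask`, `InFilter`, `ALWAYS_TRUE` and `guardedHackWouldRun` are all hypothetical names, and the evaluation logic is a simplification of what is described above. It only shows that an evaluated residual can collapse to "true" for a given partition while the long expression stored on the task, which is what gets serialized, stays as it was.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Toy model only: every type here is a hypothetical stand-in, not an Iceberg or Hive class.
final class ResidualConditionSketch {

  interface Expr { }

  /** Singleton meaning "nothing left to filter". */
  static final Expr ALWAYS_TRUE = new Expr() {
    @Override public String toString() { return "true"; }
  };

  /** Filter of the form "column IN (values...)"; the value list can be very long. */
  static final class InFilter implements Expr {
    final String column;
    final List<String> values;

    InFilter(String column, List<String> values) {
      this.column = column;
      this.values = values;
    }
  }

  static final class ToyScanTask {
    final InFilter storedFilter;          // kept on the task, so it travels with the split
    final Map<String, String> partition;  // partition values of the data file

    ToyScanTask(InFilter storedFilter, Map<String, String> partition) {
      this.storedFilter = storedFilter;
      this.partition = partition;
    }

    /** Evaluates the stored filter against the partition values; the stored filter is unchanged. */
    Expr residual() {
      String value = partition.get(storedFilter.column);
      if (value != null && storedFilter.values.contains(value)) {
        return ALWAYS_TRUE; // every row in this file already satisfies the filter
      }
      // Simplification: any other case just hands back the stored filter.
      return storedFilter;
    }
  }

  /** Toy version of the guard that the addendum removes. */
  static boolean guardedHackWouldRun(ToyScanTask task) {
    return task.residual() != ALWAYS_TRUE;
  }

  /** Builds a task whose long IN-list is satisfied by the file's partition value. */
  static ToyScanTask sampleTask(int valueCount) {
    List<String> values = new ArrayList<>(valueCount);
    for (int i = 0; i < valueCount; i++) {
      values.add("region-" + i);
    }
    InFilter filter = new InFilter("region", values);
    return new ToyScanTask(filter, Collections.singletonMap("region", "region-0"));
  }

  public static void main(String[] args) {
    ToyScanTask task = sampleTask(10_000);
    System.out.println("residual() = " + task.residual());
    System.out.println("stored filter values = " + task.storedFilter.values.size());
    System.out.println("guarded hack would run = " + guardedHackWouldRun(task));
  }
}
```

In this model the guard skips exactly the tasks whose evaluated residual is "true", even though they still carry the full filter. Dropping the guard and always stripping avoids that gap.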




Issue Time Tracking
-------------------

    Worklog Id:     (was: 752945)
    Time Spent: 50m  (was: 40m)

> Prevent residual expressions from getting serialized in Iceberg splits
> ----------------------------------------------------------------------
>
>                 Key: HIVE-25967
>                 URL: https://issues.apache.org/jira/browse/HIVE-25967
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Ádám Szita
>            Assignee: Ádám Szita
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> This hack removes residual expressions from the file scan task just before split serialization.
> Residuals can sometimes take up too much space in the payload, causing the Tez AM to OOM.
> Unfortunately, the Tez AM doesn't distribute splits in a streamed way; that is, it serializes all splits for a job before sending them out to executors. Some residuals may take ~1 MB in memory, which, multiplied by thousands of splits, could kill the Tez AM JVM.
> Until streamed split distribution is implemented, we will kick residuals out of the split.
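
The size argument in the description can be sketched with plain Java serialization. This is only an illustration under stated assumptions: `ToySplit`, `withoutResidual`, `largeResidual` and the split count of 5,000 are hypothetical, and the real change operates on Iceberg file scan tasks inside Hive's split serialization, which this message does not show. The sketch compares the serialized size of a split that carries a roughly 1 MB residual with the same split after the residual is replaced by a trivial one.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Toy model only: ToySplit is a hypothetical stand-in, not a Hive or Iceberg class.
final class SplitPayloadSketch {

  static final class ToySplit implements Serializable {
    private static final long serialVersionUID = 1L;
    final String dataFilePath;
    final String residual; // textual stand-in for a residual expression

    ToySplit(String dataFilePath, String residual) {
      this.dataFilePath = dataFilePath;
      this.residual = residual;
    }

    /** Returns a copy that keeps the file reference but drops the large residual. */
    ToySplit withoutResidual() {
      return new ToySplit(dataFilePath, "true");
    }
  }

  /** Serializes the split into memory and returns the number of bytes produced. */
  static int serializedSize(ToySplit split) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(split);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.size();
  }

  /** Builds a long "id = 0 or id = 1 or ..." string of at least approxChars characters. */
  static String largeResidual(int approxChars) {
    StringBuilder sb = new StringBuilder(approxChars + 32);
    int i = 0;
    while (sb.length() < approxChars) {
      sb.append("id = ").append(i++).append(" or ");
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    ToySplit full = new ToySplit("/warehouse/t/data-00001.parquet", largeResidual(1 << 20));
    ToySplit stripped = full.withoutResidual();

    int splitCount = 5_000; // assumption: "thousands of splits"
    long fullTotal = (long) serializedSize(full) * splitCount;
    long strippedTotal = (long) serializedSize(stripped) * splitCount;

    System.out.println("bytes per split with residual:    " + serializedSize(full));
    System.out.println("bytes per split without residual: " + serializedSize(stripped));
    System.out.println("total MiB with residuals:    " + fullTotal / (1024 * 1024));
    System.out.println("total MiB without residuals: " + strippedTotal / (1024 * 1024));
  }
}
```

Because all splits are held in memory before any are sent out, the per-split size is multiplied by the number of splits, which is why removing the residual from each split matters more than its individual size suggests.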



--
This message was sent by Atlassian Jira
(v8.20.1#820001)