Posted to issues@tez.apache.org by "Ashutosh Chauhan (Jira)" <ji...@apache.org> on 2020/08/06 23:57:00 UTC

[jira] [Commented] (TEZ-4207) Provide approximate number of input records to be processed in UnorderedKVInput

    [ https://issues.apache.org/jira/browse/TEZ-4207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17172732#comment-17172732 ] 

Ashutosh Chauhan commented on TEZ-4207:
---------------------------------------

+1

> Provide approximate number of input records to be processed in UnorderedKVInput
> -------------------------------------------------------------------------------
>
>                 Key: TEZ-4207
>                 URL: https://issues.apache.org/jira/browse/TEZ-4207
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Priority: Major
>         Attachments: TEZ-4207.1.patch, TEZ-4207.wip.patch
>
>
> There are cases where broadcast data is loaded into a hashtable in upstream applications (e.g., Hive). Apps tend to estimate the number of hashtable entries carefully, but these estimates can be very hard to compute at compile time.
>  
> Tez can help in such cases by providing an "approximate number of input records" counter for the data to be processed in UnorderedKVInput. This avoids an expensive rehash when hashtable sizes are not estimated correctly. It would be good to start with the broadcast case and move on to the unordered partitioned case later.
>  
> This would help in predicting the number of entries at runtime and in getting better size estimates for the hashtable.
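
For illustration only, here is a minimal sketch of how a consumer might use such an approximate record count to presize a hashtable and avoid a rehash. It does not use any Tez API: the class HashTableSizing, its methods, and the approxRecords parameter are hypothetical names, and the count is assumed to have been obtained from the input by whatever means the patch provides.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Tez or Hive.
public class HashTableSizing {
    private static final float LOAD_FACTOR = 0.75f;     // java.util.HashMap default
    private static final int DEFAULT_CAPACITY = 16;     // used when no estimate is available
    private static final int MAX_CAPACITY = 1 << 30;    // largest table size HashMap will use

    // Smallest initial capacity that holds approxRecords entries without a resize.
    static int initialCapacityFor(long approxRecords, float loadFactor) {
        if (approxRecords <= 0) {
            return DEFAULT_CAPACITY;
        }
        long needed = (long) Math.ceil(approxRecords / (double) loadFactor);
        return (int) Math.min(needed, MAX_CAPACITY);
    }

    // Builds a map sized from the (assumed) approximate input record count.
    static <K, V> Map<K, V> newPresizedMap(long approxRecords) {
        return new HashMap<>(initialCapacityFor(approxRecords, LOAD_FACTOR), LOAD_FACTOR);
    }

    public static void main(String[] args) {
        long approxRecords = 750L; // assumed value reported by the input
        Map<String, String> table = newPresizedMap(approxRecords);
        table.put("key", "value");
        System.out.println(initialCapacityFor(approxRecords, LOAD_FACTOR));
    }
}
```

Because the count is only approximate, an underestimate still works: the map simply resizes as usual, so the estimate affects cost rather than correctness.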



--
This message was sent by Atlassian Jira
(v8.3.4#803005)