Posted to issues@tez.apache.org by "Authur Wang (Jira)" <ji...@apache.org> on 2022/09/21 02:43:00 UTC

[jira] [Commented] (TEZ-4442) tez unable to control the memory size when UDF occupies 100MB memory

    [ https://issues.apache.org/jira/browse/TEZ-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607466#comment-17607466 ] 

Authur Wang commented on TEZ-4442:
----------------------------------

We finally solved the problem. The cause was as follows: the HDFS file required by the UDF was loaded in the initialization phase of GenericUDF, i.e. in its initialize() method. By observing the Tez execution, we found that the master loaded the data first, and when the workers requested the HDFS data it was transferred to them over RPC. With many workers there were correspondingly many RPC requests, and for each request the Tez master copied the ByteBuffer once, each ByteBuffer[] being about 100MB. This made it very easy for the master to OOM, which in turn caused the HDFS data loading in the workers to fail.
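For illustration, a minimal sketch of the original pattern, assuming the lookup data is a tab-separated key/value file on HDFS; the class name, file path and file format here are hypothetical and not taken from the attached spark-udf jar. The point is only that the table is materialized inside initialize(), before any row is processed:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class EagerLookupUDF extends GenericUDF {

  private Map<String, String> lookup;   // ~100MB of reference data held per UDF instance

  @Override
  public ObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
    // Anti-pattern: the HDFS file is read here, during initialization.
    try {
      lookup = loadLookupTable("hdfs:///user/lib/lookup.dat");   // hypothetical path
    } catch (IOException e) {
      throw new UDFArgumentException("failed to load lookup table: " + e.getMessage());
    }
    return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
  }

  @Override
  public Object evaluate(DeferredObject[] args) throws HiveException {
    Object key = args[0].get();
    return key == null ? null : lookup.get(key.toString());
  }

  @Override
  public String getDisplayString(String[] children) {
    return "eager_lookup(" + children[0] + ")";
  }

  // Reads a tab-separated key/value file from HDFS into a map.
  private static Map<String, String> loadLookupTable(String uri) throws IOException {
    Map<String, String> table = new HashMap<>();
    FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(new Path(uri)), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        String[] parts = line.split("\t", 2);
        if (parts.length == 2) {
          table.put(parts[0], parts[1]);
        }
      }
    }
    return table;
  }
}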

The fix was to load the HDFS files required by the UDF in the evaluation phase, in the evaluate() method, so that each JVM loads them once and only once. This way the HDFS files are loaded by the workers themselves, which avoids the RPC data transfer between the master and the workers.
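A sketch of the adjusted pattern, under the same assumptions as above (hypothetical class name, path and file format): nothing is loaded in initialize(), and the table is loaded lazily on the first call to evaluate(), guarded by double-checked locking on a static field so that each worker JVM reads the HDFS file once and only once:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class LazyLookupUDF extends GenericUDF {

  // Shared across all UDF instances in the worker JVM; loaded at most once.
  private static volatile Map<String, String> lookup;

  @Override
  public ObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
    // No data loading here: initialize() only declares the return type.
    return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
  }

  @Override
  public Object evaluate(DeferredObject[] args) throws HiveException {
    if (lookup == null) {
      synchronized (LazyLookupUDF.class) {
        if (lookup == null) {
          try {
            // First row processed in this JVM: the worker reads the file from HDFS itself.
            lookup = loadLookupTable("hdfs:///user/lib/lookup.dat");   // hypothetical path
          } catch (IOException e) {
            throw new HiveException("failed to load lookup table", e);
          }
        }
      }
    }
    Object key = args[0].get();
    return key == null ? null : lookup.get(key.toString());
  }

  @Override
  public String getDisplayString(String[] children) {
    return "lazy_lookup(" + children[0] + ")";
  }

  // Reads a tab-separated key/value file from HDFS into a map.
  private static Map<String, String> loadLookupTable(String uri) throws IOException {
    Map<String, String> table = new HashMap<>();
    FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(new Path(uri)), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        String[] parts = line.split("\t", 2);
        if (parts.length == 2) {
          table.put(parts[0], parts[1]);
        }
      }
    }
    return table;
  }
}

The static volatile field with the class-level lock is what keeps the 100MB map down to a single copy per JVM, even when Tez reuses a container for several task attempts.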

> tez unable to control the memory size when UDF occupies 100MB memory 
> ---------------------------------------------------------------------
>
>                 Key: TEZ-4442
>                 URL: https://issues.apache.org/jira/browse/TEZ-4442
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.9.1
>         Environment: CDP7.1.7SP1
> tez 0.9.1
> hive 3.1.3
>  
>            Reporter: Authur Wang
>            Priority: Critical
>         Attachments: app.log, application_1659706606596_0047.log.gz, hiveserver2.out, java heap1.png, java heap2.png, spark-udf-0.0.1-SNAPSHOT.jar, spark-udf-src.zip
>
>
>           We have a UDF which loads about 5 million records into memory, matches the in-memory data against the user's input, and returns the result. Each input record to the UDF produces one output record.
>           Based on heap-dump analysis, this UDF occupies about 100MB of memory. The UDF runs stably on Hive on MR, Hive on Spark, and native Spark, and only needs about 4GB of memory in those cases. However, with the Tez engine the task fails even after we increase the memory from 4GB to 8GB, and with 12GB it still fails with high probability. Why does the Tez engine need so much more memory than MR and Spark? Is there a good tuning method to control the amount of memory?
>  
>  
> command is as follows:
> beeline -u 'jdbc:hive2://bg21146.hadoop.com:10000/default;principal=hive/bg21146.hadoop.com@BG.COM' --hiveconf tez.queue.name=root.000kjb.bdhmgmas_bas -e "
>  
> create temporary function get_card_rank as 'com.unionpay.spark.udf.GenericUDFCupsCardMediaProc' using jar 'hdfs:///user/lib/spark-udf-0.0.1-SNAPSHOT.jar';
>  
> set tez.am.log.level=debug;
> set tez.am.resource.memory.mb=8192;
> set hive.tez.container.size=8192;
> set tez.task.resource.memory.mb=2048;
> set tez.runtime.io.sort.mb=1200;
> set hive.auto.convert.join.noconditionaltask.size=500000000;
> set tez.runtime.unordered.output.buffer.size-mb=800;
> set tez.grouping.min-size=33554432;
> set tez.grouping.max-size=536870912;
> set hive.tez.auto.reducer.parallelism=true;
> set hive.tez.min.partition.factor=0.25;
> set hive.tez.max.partition.factor=2.0;
> set hive.exec.reducers.bytes.per.reducer=268435456;
> set mapreduce.map.memory.mb=4096;
> set ipc.maximum.response.length=1536000000;
>  
>  
> select
>  get_card_rank(ext_pri_acct_no) as ext_card_media_proc_md,
>  count(*)
> from bs_comdb.tmp_bscom_glhis_ct_settle_dtl_bas_swt a
> where a.hp_settle_dt = '20200910'
> group by get_card_rank(ext_pri_acct_no)
> ;
> "
>  


