Posted to user@tez.apache.org by 娄东风 <25...@qq.com> on 2020/06/01 03:36:33 UTC

tez shuffle fetch phase has long pause

Hi
  I'm using Tez 0.9.1 and Hive 2.3.3, running TPC-DS query15 at 1TB scale.
In Reduce5, I occasionally see a long pause during fetch, so it's hard to capture a jstack.
Reduce5 depends on Map4 and Reduce2; both vertices finished before 16:33:00.
So the fetch should not be waiting on the upstream vertices.

How do I find out what causes this long pause?

  Thanks.

2020-05-30 16:36:29,531 [INFO] [Fetcher_B {Map_4} #12] |ShuffleManager.fetch|: Completed fetch for attempt: {5, 0, attempt_1590728138875_0202_1_01_000005_0_10003} to MEMORY, csize=24509, dsize=66456, EndTime=1590827789531, TimeTaken=1, Rate=23.37 MB/s
2020-05-30 16:37:41,368 [INFO] [Fetcher_B {Map_4} #2] |HttpConnection.url|: for url=http://node-ana-coreLKpD0001:13562/mapOutput?job=job_1590728138875_0202&dag=1&reduce=245&map=attempt_1590728138875_0202_1_01_000027_0_10002,attempt_1590728138875_0202_1_01_000023_0_10003,attempt_1590728138875_0202_1_01_000030_0_10002,attempt_1590728138875_0202_1_01_000025_0_10003,attempt_1590728138875_0202_1_01_000029_0_10003 sent hash and receievd reply 0 ms

Re: tez shuffle fetch phase has long pause

Posted by Kuhu Shukla <ks...@ncsu.edu>.
Thank you for reporting this.

You might want to check the logs and other metrics of the upstream task's node to see
whether it is healthy. The download rate seems quite low to me; depending on the fetch
size, we can verify whether this is caused by poor network bandwidth and/or bad node health.
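For example (a rough sketch, assuming YARN log aggregation is enabled for the job), you
can pull the per-fetch rates straight out of the attempt logs and look at the slowest ones:

# yarn logs -applicationId application_1590728138875_0202 | grep -o 'Rate=[0-9.]* MB/s' | sort -t= -k2 -n | head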

Let us know what you think and what you find out.
Regards,
Kuhu


Re: tez shuffle fetch phase has long pause

Posted by 娄东风 <25...@qq.com>.
Correct! It is the TCP listen overflow issue!
Thanks for your help, Gopal V and Kuhu!



Using the command below, I can see many overflows.


# netstat -s | grep -i overflow


[root@node-ana-coreLKpD0001 ~]# netstat -s | grep -i overflow
    96282 times the listen queue of a socket overflowed
    TCPTimeWaitOverflow: 2499680
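

A quick way to see whether that counter is still climbing while the queries run:

# watch -n 5 'netstat -s | grep -i "listen queue"'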



The net.core.somaxconn default is 128 on my cluster, which is too small.
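
The current value is easy to check before and after changing it:

# sysctl net.core.somaxconn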


I ran about 10 queries concurrently with the following setting:
set hive.tez.auto.reducer.parallelism=true;
which made the situation worse.



After following Gopal V's instructions, the long pause in the fetch phase disappeared.



Re: tez shuffle fetch phase has long pause

Posted by Gopal V <go...@apache.org>.

> In Reduce5, I occasionally see a long pause during fetch

This is likely the TCP listen overflow issue, but just doesn't get 
reported as a packet loss issue because the retry works okay.

https://issues.apache.org/jira/browse/MAPREDUCE-6763

That's the fix to be applied on the YARN Shuffle handler.
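
If I remember the knob it adds correctly (please double-check against your Hadoop
version), the shuffle handler's accept backlog becomes configurable in mapred-site.xml
on the NodeManagers, roughly:

  <property>
    <name>mapreduce.shuffle.listen.queue.size</name>
    <value>16384</value>
  </property>

followed by a NodeManager restart so the shuffle port rebinds with the larger backlog.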

You can confirm the change by running

# ss -tln
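
On the shuffle port from your log (13562 here), the Send-Q column of the LISTEN socket
is its accept backlog limit (still capped by net.core.somaxconn), so after the fix it
should show the configured value instead of 128:

# ss -tln | grep 13562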

I usually diagnose it by checking for TCP SYN cookies in dmesg or by looking at the
SNMP counters.

# netstat -s | grep -i overflow
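
On the dmesg side, the tell-tale is usually a SYN-cookie warning, something like
"possible SYN flooding on port 13562. Sending cookies." (exact wording varies by kernel):

# dmesg | grep -i -e "syn flooding" -e cookies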


This issue also affects HDFS namenode, which is also usually unreported 
by users.

https://issues.apache.org/jira/browse/HADOOP-16504

The delay is usually 2 * the TCP maximum segment lifetime (MSL), and is usually reduced
by increasing the OS half-open connection count.

I end up doing

# sysctl -w net.core.somaxconn=16384
# sysctl -w net.ipv4.tcp_fin_timeout=2

to speed up the retries, and then restarting the daemons.
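
To make those survive a reboot (the file name here is just an example):

# echo "net.core.somaxconn = 16384" >> /etc/sysctl.d/99-shuffle-backlog.conf
# echo "net.ipv4.tcp_fin_timeout = 2" >> /etc/sysctl.d/99-shuffle-backlog.conf
# sysctl --system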

This affects Tez a little worse than MRv2, because the same JVM runs 
multiple instances of the same vertex sequentially, instead of a new JVM 
for every task (which runs way slower, reducing the concurrency of 
connections).

Cheers,
Gopal