You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@tez.apache.org by Johannes Zillmann <jz...@googlemail.com> on 2015/08/11 10:18:10 UTC

Hanging Container on two independent graphs

Hey Tez Community,

a customer of ours has following situation. He sets up a computation pipeline in a way that 2 independent graphs are running in an Tez DAG.
Result: the job is still running after a half day with only alive task running, seems to running forever. 

If he splits the pipeline that each of the independent graph are executed in their own DAG, then each of the DAG’s are completed in one or two hours.

Here is some initial logging which might be of interest of that one never-ending task:
2015-08-03 13:20:12,352 INFO [Initializer 1] orderedgrouped.Shuffle: Shuffle assigned with 55 inputs, codec: NoneifileReadAhead: true
2015-08-03 13:20:12,353 INFO [Initializer 1] orderedgrouped.ShuffleScheduler: ShuffleScheduler running for sourceVertex: MapChainVertex:Workbook job (184): ClickPairs#ClikDataWithPlayerSession(Disconnected record stream) (4ae490d9-9d88-4ccd-a236-1d4dddb56060) with configuration: maxFetchFailuresBeforeReporting=5, reportReadErrorImmediately=true, maxFailedUniqueFetches=5, abortFailureLimit=30, maxMapRuntime=0
2015-08-03 13:20:12,353 INFO [Initializer 1] orderedgrouped.MergeManager: InitialRequest: ShuffleMem=1932735232, postMergeMem=0, RuntimeTotalAvailable=1932735232. Updated to: ShuffleMem=1932735232, postMergeMem: 0
2015-08-03 13:20:12,353 INFO [Initializer 1] orderedgrouped.MergeManager: MergerManager: memoryLimit=1932735232, maxSingleShuffleLimit=483183808, mergeThreshold=1739461632, ioSortFactor=100, memToMemMergeOutputsThreshold=100


And then there are hours full of those things:
2015-08-03 13:33:59,911 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned Status.WAIT ...
2015-08-03 13:33:59,911 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 3ms
2015-08-03 13:33:59,913 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006 sent hash and receievd reply 0 ms
2015-08-03 13:33:59,913 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned Status.WAIT ...
2015-08-03 13:33:59,913 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 2ms
2015-08-03 13:33:59,914 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015 sent hash and receievd reply 0 ms
2015-08-03 13:33:59,914 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned Status.WAIT ...
2015-08-03 13:33:59,914 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 1ms
2015-08-03 13:33:59,916 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006 sent hash and receievd reply 0 ms
2015-08-03 13:33:59,916 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned Status.WAIT ...
2015-08-03 13:33:59,916 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 2ms
2015-08-03 13:33:59,917 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012 sent hash and receievd reply 0 ms
2015-08-03 13:33:59,917 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned Status.WAIT ...
2015-08-03 13:33:59,917 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 1ms
2015-08-03 13:33:59,919 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006 sent hash and receievd reply 0 ms
2015-08-03 13:33:59,919 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned Status.WAIT ...
2015-08-03 13:33:59,919 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 2ms
2015-08-03 13:33:59,920 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015 sent hash and receievd reply 0 ms
2015-08-03 13:33:59,920 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned Status.WAIT ...
2015-08-03 13:33:59,920 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 1ms
2015-08-03 13:33:59,922 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #25] shuffle.HttpConnection: for url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006 sent hash and receievd reply 0 ms
2015-08-03 13:33:59,922 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #25] orderedgrouped.FetcherOrderedGrouped: fetcher#25 - MergerManager returned Status.WAIT ...
2015-08-03 13:33:59,922 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #25] orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by fetcher [MapChainVertex_Workbook_job__184___Click] #25 in 2ms
2015-08-03 13:34:06,125 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #33] shuffle.HttpConnection: for url=http://instance-8865.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000029_0_10039,attempt_1438596625066_0001_2_00_000052_0_10051,attempt_1438596625066_0001_2_00_000045_0_10015,attempt_1438596625066_0001_2_00_000049_0_10045,attempt_1438596625066_0001_2_00_000017_0_10009 sent hash and receievd reply 0 ms

Focusing on an individual fetcher it looks like this over time:
2015-08-03 13:46:45,977 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #31] orderedgrouped.FetcherOrderedGrouped: fetcher#31 - MergerManager returned Status.WAIT …
…
2015-08-03 14:39:44,974 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #31] orderedgrouped.FetcherOrderedGrouped: fetcher#31 - MergerManager returned Status.WAIT …
…
2015-08-03 15:27:32,739 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #31] orderedgrouped.FetcherOrderedGrouped: fetcher#31 - MergerManager returned Status.WAIT …
…
2015-08-03 17:42:28,354 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #31] orderedgrouped.FetcherOrderedGrouped: fetcher#31 - MergerManager returned Status.WAIT …


That behaviour is consistent for the customer.
Using Tez-0.6…
Any ideas what could cause this ?

Johannes

Re: Hanging Container on two independent graphs

Posted by Johannes Zillmann <jz...@googlemail.com>.
Thanks,

it’s 0.6.0 and it looks like it is in that state forever. Job got killed after 7 hours.

Johannes

> On 11 Aug 2015, at 11:21, Rajesh Balamohan <rb...@apache.org> wrote:
> 
> Is it 0.6.0 or 0.6.1? Does that fetcher get locked up in Status.WAIT forever?
> 
> This could be related to https://issues.apache.org/jira/browse/TEZ-1923 <https://issues.apache.org/jira/browse/TEZ-1923> (fix available in 0.6.1).
> 
> ~Rajesh.B
> 
> On Tue, Aug 11, 2015 at 1:48 PM, Johannes Zillmann <jzillmann@googlemail.com <ma...@googlemail.com>> wrote:
> Hey Tez Community,
> 
> a customer of ours has following situation. He sets up a computation pipeline in a way that 2 independent graphs are running in an Tez DAG.
> Result: the job is still running after a half day with only alive task running, seems to running forever. 
> 
> If he splits the pipeline that each of the independent graph are executed in their own DAG, then each of the DAG’s are completed in one or two hours.
> 
> Here is some initial logging which might be of interest of that one never-ending task:
> 2015-08-03 13:20:12,352 INFO [Initializer 1] orderedgrouped.Shuffle: Shuffle assigned with 55 inputs, codec: NoneifileReadAhead: true
> 2015-08-03 13:20:12,353 INFO [Initializer 1] orderedgrouped.ShuffleScheduler: ShuffleScheduler running for sourceVertex: MapChainVertex:Workbook job (184): ClickPairs#ClikDataWithPlayerSession(Disconnected record stream) (4ae490d9-9d88-4ccd-a236-1d4dddb56060) with configuration: maxFetchFailuresBeforeReporting=5, reportReadErrorImmediately=true, maxFailedUniqueFetches=5, abortFailureLimit=30, maxMapRuntime=0
> 2015-08-03 13:20:12,353 INFO [Initializer 1] orderedgrouped.MergeManager: InitialRequest: ShuffleMem=1932735232 <tel:1932735232>, postMergeMem=0, RuntimeTotalAvailable=1932735232 <tel:1932735232>. Updated to: ShuffleMem=1932735232 <tel:1932735232>, postMergeMem: 0
> 2015-08-03 13:20:12,353 INFO [Initializer 1] orderedgrouped.MergeManager: MergerManager: memoryLimit=1932735232 <tel:1932735232>, maxSingleShuffleLimit=483183808, mergeThreshold=1739461632 <tel:1739461632>, ioSortFactor=100, memToMemMergeOutputsThreshold=100
> 
> 
> And then there are hours full of those things:
> 2015-08-03 13:33:59,911 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned Status.WAIT ...
> 2015-08-03 13:33:59,911 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 3ms
> 2015-08-03 13:33:59,913 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006 <http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006> sent hash and receievd reply 0 ms
> 2015-08-03 13:33:59,913 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned Status.WAIT ...
> 2015-08-03 13:33:59,913 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 2ms
> 2015-08-03 13:33:59,914 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015 <http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015> sent hash and receievd reply 0 ms
> 2015-08-03 13:33:59,914 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned Status.WAIT ...
> 2015-08-03 13:33:59,914 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 1ms
> 2015-08-03 13:33:59,916 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006 <http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006> sent hash and receievd reply 0 ms
> 2015-08-03 13:33:59,916 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned Status.WAIT ...
> 2015-08-03 13:33:59,916 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 2ms
> 2015-08-03 13:33:59,917 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012 <http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012> sent hash and receievd reply 0 ms
> 2015-08-03 13:33:59,917 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned Status.WAIT ...
> 2015-08-03 13:33:59,917 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 1ms
> 2015-08-03 13:33:59,919 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006 <http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006> sent hash and receievd reply 0 ms
> 2015-08-03 13:33:59,919 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned Status.WAIT ...
> 2015-08-03 13:33:59,919 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 2ms
> 2015-08-03 13:33:59,920 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015 <http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015> sent hash and receievd reply 0 ms
> 2015-08-03 13:33:59,920 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned Status.WAIT ...
> 2015-08-03 13:33:59,920 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #36] orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 1ms
> 2015-08-03 13:33:59,922 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #25] shuffle.HttpConnection: for url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006 <http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006> sent hash and receievd reply 0 ms
> 2015-08-03 13:33:59,922 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #25] orderedgrouped.FetcherOrderedGrouped: fetcher#25 - MergerManager returned Status.WAIT ...
> 2015-08-03 13:33:59,922 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #25] orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by fetcher [MapChainVertex_Workbook_job__184___Click] #25 in 2ms
> 2015-08-03 13:34:06,125 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #33] shuffle.HttpConnection: for url=http://instance-8865.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000029_0_10039,attempt_1438596625066_0001_2_00_000052_0_10051,attempt_1438596625066_0001_2_00_000045_0_10015,attempt_1438596625066_0001_2_00_000049_0_10045,attempt_1438596625066_0001_2_00_000017_0_10009 <http://instance-8865.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000029_0_10039,attempt_1438596625066_0001_2_00_000052_0_10051,attempt_1438596625066_0001_2_00_000045_0_10015,attempt_1438596625066_0001_2_00_000049_0_10045,attempt_1438596625066_0001_2_00_000017_0_10009> sent hash and receievd reply 0 ms
> 
> Focusing on an individual fetcher it looks like this over time:
> 2015-08-03 13:46:45,977 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #31] orderedgrouped.FetcherOrderedGrouped: fetcher#31 - MergerManager returned Status.WAIT …
> …
> 2015-08-03 14:39:44,974 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #31] orderedgrouped.FetcherOrderedGrouped: fetcher#31 - MergerManager returned Status.WAIT …
> …
> 2015-08-03 15:27:32,739 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #31] orderedgrouped.FetcherOrderedGrouped: fetcher#31 - MergerManager returned Status.WAIT …
> …
> 2015-08-03 17:42:28,354 INFO [fetcher [MapChainVertex_Workbook_job__184___Click] #31] orderedgrouped.FetcherOrderedGrouped: fetcher#31 - MergerManager returned Status.WAIT …
> 
> 
> That behaviour is consistent for the customer.
> Using Tez-0.6…
> Any ideas what could cause this ?
> 
> Johannes
> 


Re: Hanging Container on two independent graphs

Posted by Rajesh Balamohan <rb...@apache.org>.
Is it 0.6.0 or 0.6.1? Does that fetcher get locked up in Status.WAIT
forever?

This could be related to https://issues.apache.org/jira/browse/TEZ-1923
(fix available in 0.6.1).

~Rajesh.B

On Tue, Aug 11, 2015 at 1:48 PM, Johannes Zillmann <jzillmann@googlemail.com
> wrote:

> Hey Tez Community,
>
> a customer of ours has following situation. He sets up a computation
> pipeline in a way that 2 independent graphs are running in an Tez DAG.
> Result: the job is still running after a half day with only alive task
> running, seems to running forever.
>
> If he splits the pipeline that each of the independent graph are executed
> in their own DAG, then each of the DAG’s are completed in one or two hours.
>
> Here is some initial logging which might be of interest of that one
> never-ending task:
> 2015-08-03 13:20:12,352 INFO [Initializer 1] orderedgrouped.Shuffle:
> Shuffle assigned with 55 inputs, codec: NoneifileReadAhead: true
> 2015-08-03 13:20:12,353 INFO [Initializer 1]
> orderedgrouped.ShuffleScheduler: ShuffleScheduler running for sourceVertex:
> MapChainVertex:Workbook job (184):
> ClickPairs#ClikDataWithPlayerSession(Disconnected record stream)
> (4ae490d9-9d88-4ccd-a236-1d4dddb56060) with configuration:
> maxFetchFailuresBeforeReporting=5, reportReadErrorImmediately=true,
> maxFailedUniqueFetches=5, abortFailureLimit=30, maxMapRuntime=0
> 2015-08-03 13:20:12,353 INFO [Initializer 1] orderedgrouped.MergeManager:
> InitialRequest: ShuffleMem=1932735232, postMergeMem=0,
> RuntimeTotalAvailable=1932735232. Updated to: ShuffleMem=1932735232,
> postMergeMem: 0
> 2015-08-03 13:20:12,353 INFO [Initializer 1] orderedgrouped.MergeManager:
> MergerManager: memoryLimit=1932735232, maxSingleShuffleLimit=483183808,
> mergeThreshold=1739461632, ioSortFactor=100,
> memToMemMergeOutputsThreshold=100
>
>
> And then there are hours full of those things:
> 2015-08-03 13:33:59,911 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36]
> orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned
> Status.WAIT ...
> 2015-08-03 13:33:59,911 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36]
> orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by
> fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 3ms
> 2015-08-03 13:33:59,913 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for
> url=
> http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006
> sent hash and receievd reply 0 ms
> 2015-08-03 13:33:59,913 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36]
> orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned
> Status.WAIT ...
> 2015-08-03 13:33:59,913 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36]
> orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by
> fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 2ms
> 2015-08-03 13:33:59,914 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for
> url=
> http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015
> sent hash and receievd reply 0 ms
> 2015-08-03 13:33:59,914 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36]
> orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned
> Status.WAIT ...
> 2015-08-03 13:33:59,914 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36]
> orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by
> fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 1ms
> 2015-08-03 13:33:59,916 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for
> url=
> http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006
> sent hash and receievd reply 0 ms
> 2015-08-03 13:33:59,916 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36]
> orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned
> Status.WAIT ...
> 2015-08-03 13:33:59,916 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36]
> orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by
> fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 2ms
> 2015-08-03 13:33:59,917 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for
> url=
> http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012
> sent hash and receievd reply 0 ms
> 2015-08-03 13:33:59,917 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36]
> orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned
> Status.WAIT ...
> 2015-08-03 13:33:59,917 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36]
> orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by
> fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 1ms
> 2015-08-03 13:33:59,919 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for
> url=
> http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006
> sent hash and receievd reply 0 ms
> 2015-08-03 13:33:59,919 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36]
> orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned
> Status.WAIT ...
> 2015-08-03 13:33:59,919 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36]
> orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by
> fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 2ms
> 2015-08-03 13:33:59,920 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for
> url=
> http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015
> sent hash and receievd reply 0 ms
> 2015-08-03 13:33:59,920 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36]
> orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned
> Status.WAIT ...
> 2015-08-03 13:33:59,920 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36]
> orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by
> fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 1ms
> 2015-08-03 13:33:59,922 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #25] shuffle.HttpConnection: for
> url=
> http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006
> sent hash and receievd reply 0 ms
> 2015-08-03 13:33:59,922 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #25]
> orderedgrouped.FetcherOrderedGrouped: fetcher#25 - MergerManager returned
> Status.WAIT ...
> 2015-08-03 13:33:59,922 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #25]
> orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by
> fetcher [MapChainVertex_Workbook_job__184___Click] #25 in 2ms
> 2015-08-03 13:34:06,125 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #33] shuffle.HttpConnection: for
> url=
> http://instance-8865.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000029_0_10039,attempt_1438596625066_0001_2_00_000052_0_10051,attempt_1438596625066_0001_2_00_000045_0_10015,attempt_1438596625066_0001_2_00_000049_0_10045,attempt_1438596625066_0001_2_00_000017_0_10009
> sent hash and receievd reply 0 ms
>
> Focusing on an individual fetcher it looks like this over time:
> 2015-08-03 13:46:45,977 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #31]
> orderedgrouped.FetcherOrderedGrouped: fetcher#31 - MergerManager returned
> Status.WAIT …
> …
> 2015-08-03 14:39:44,974 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #31]
> orderedgrouped.FetcherOrderedGrouped: fetcher#31 - MergerManager returned
> Status.WAIT …
> …
> 2015-08-03 15:27:32,739 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #31]
> orderedgrouped.FetcherOrderedGrouped: fetcher#31 - MergerManager returned
> Status.WAIT …
> …
> 2015-08-03 17:42:28,354 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #31]
> orderedgrouped.FetcherOrderedGrouped: fetcher#31 - MergerManager returned
> Status.WAIT …
>
>
> That behaviour is consistent for the customer.
> Using Tez-0.6…
> Any ideas what could cause this ?
>
> Johannes
>