Posted to user@kylin.apache.org by 不清 <45...@qq.com> on 2017/02/13 16:19:41 UTC

Re: kylin job stop accidentally and can resume success!

How can I get this heap size?




------------------ Original Message ------------------
From: "Alberto Ramón" <a....@gmail.com>
Sent: Tuesday, February 14, 2017, 00:17
To: "user" <us...@kylin.apache.org>
Subject: Re: kylin job stop accidentally and can resume success!



Sounds like a problem with the YARN Resource Manager (RM); check the heap size of the RM.

Kylin loses connectivity with the RM.
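For example, a minimal sketch of one way to read the RM's heap limit, via Hadoop's JMX servlet on the RM web UI port (8088 by default; "rm-host" below is a placeholder, and plain HTTP without security is assumed):

# Minimal sketch: read the ResourceManager JVM heap limit from Hadoop's JMX servlet.
# Assumptions: default RM web port 8088, plain HTTP, "rm-host" is a placeholder hostname.
import json
import urllib.request

url = "http://rm-host:8088/jmx?qry=java.lang:type=Memory"
with urllib.request.urlopen(url) as resp:
    beans = json.load(resp)["beans"]

heap = beans[0]["HeapMemoryUsage"]  # keys: init, used, committed, max (bytes)
print("RM heap max : %d MiB" % (heap["max"] // (1024 * 1024)))
print("RM heap used: %d MiB" % (heap["used"] // (1024 * 1024)))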


2017-02-13 17:00 GMT+01:00 不清 <45...@qq.com>:
Hello, Kylin community!

Sometimes my jobs stop unexpectedly. They can stop at any step.

The Kylin log looks like:
2017-02-13 23:27:01,549 DEBUG [pool-8-thread-8] hbase.HBaseResourceStore:262 : Update row /execute_output/48dee96e-10fd-472b-b466-39505b6e57c0-02 from oldTs: 1486999611524, to newTs: 1486999621545, operation result: true
2017-02-13 23:27:13,384 INFO  [pool-8-thread-8] ipc.Client:842 : Retrying connect to server: jxhdp1datanode29/10.180.212.61:50504. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2017-02-13 23:27:14,387 INFO  [pool-8-thread-8] ipc.Client:842 : Retrying connect to server: jxhdp1datanode29/10.180.212.61:50504. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2017-02-13 23:27:15,388 INFO  [pool-8-thread-8] ipc.Client:842 : Retrying connect to server: jxhdp1datanode29/10.180.212.61:50504. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2017-02-13 23:27:15,495 INFO  [pool-8-thread-8] mapred.ClientServiceDelegate:273 : Application state is completed. FinalApplicationStatus=KILLED. Redirecting to job history server
2017-02-13 23:27:15,539 DEBUG [pool-8-thread-8] dao.ExecutableDao:210 : updating job output, id: 48dee96e-10fd-472b-b466-39505b6e57c0-02



The CM log shows:
Job Name:	Kylin_Cube_Builder_user_all_cube_2_only_msisdn
User Name:	tmn
Queue:	root.tmn
State:	KILLED
Uberized:	false
Submitted:	Sun Feb 12 19:19:24 CST 2017
Started:	Sun Feb 12 19:19:38 CST 2017
Finished:	Sun Feb 12 20:30:13 CST 2017
Elapsed:	1hrs, 10mins, 35sec
Diagnostics:	
Kill job job_1486825738076_4205 received from tmn (auth:SIMPLE) at 10.180.212.38
Job received Kill while in RUNNING state.
Average Map Time	24mins, 48sec



The MapReduce job log shows:
Task KILL is received. Killing attempt!


And when this happens, resuming the job makes it succeed! I mean, it did not stop because of an error!

What's the problem?


My Hadoop cluster is very busy, and this situation happens very often.
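A minimal sketch of one way to see how loaded the cluster is at that moment, via the ResourceManager REST API (again assuming the default RM web port 8088 and a placeholder "rm-host"):

# Minimal sketch: check current YARN cluster load via the RM REST API.
# Assumptions: default RM web port 8088, plain HTTP, "rm-host" is a placeholder hostname.
import json
import urllib.request

url = "http://rm-host:8088/ws/v1/cluster/metrics"
with urllib.request.urlopen(url) as resp:
    metrics = json.load(resp)["clusterMetrics"]

print("memory : %d / %d MB allocated" % (metrics["allocatedMB"], metrics["totalMB"]))
print("vcores : %d / %d allocated" % (metrics["allocatedVirtualCores"], metrics["totalVirtualCores"]))
print("apps   : %d running, %d pending" % (metrics["appsRunning"], metrics["appsPending"]))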


Can I set the number of retries and the retry interval?

Re: kylin job stop accidentally and can resume success!

Posted by 不清 <45...@qq.com>.
No, the Resource Manager is on the master node.

------------------ Original Message ------------------
From: "Alberto Ramón" <a....@gmail.com>
Sent: Tuesday, February 14, 2017, 06:05
To: "user" <us...@kylin.apache.org>
Subject: Re: kylin job stop accidentally and can resume success!

Do you have the Resource Manager on a dedicated node (without containers or a Node Manager)?



Re: kylin job stop accidentally and can resume success!

Posted by Alberto Ramón <a....@gmail.com>.
Do you have the Resource Manager on a dedicated node (without containers or a Node Manager)?


Re: kylin job stop accidentally and can resume success!

Posted by 不清 <45...@qq.com>.
I checked the configuration in CM:

Java Heap Size of ResourceManager in Bytes = 1536 MiB
Container Memory Minimum = 1 GiB
Container Memory Increment = 512 MiB
Container Memory Maximum = 8 GiB



Re: kylin job stop accidentally and can resume success!

Posted by Alberto Ramón <a....@gmail.com>.
Check this: <https://www.mapr.com/blog/best-practices-yarn-resource-management>
"Basically, it means RM can only allocate memory to containers in increments of . . ."

TIP: is your RM on a worker node? If so, that can be the problem.
(It's a good idea to put the YARN master, the RM, on a dedicated node.)
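As a rough illustration of that rounding, a minimal sketch using the values reported earlier in the thread (minimum 1 GiB, increment 512 MiB, maximum 8 GiB); the yarn-site.xml property names in the comments are what I believe those CM fields correspond to:

# Minimal sketch of how a container memory request gets rounded, using the values
# reported earlier in this thread. I believe the CM fields correspond to
# yarn.scheduler.minimum-allocation-mb, yarn.scheduler.increment-allocation-mb
# and yarn.scheduler.maximum-allocation-mb, but check your own yarn-site.xml.
import math

MIN_MB = 1024   # Container Memory Minimum   = 1 GiB
INC_MB = 512    # Container Memory Increment = 512 MiB
MAX_MB = 8192   # Container Memory Maximum   = 8 GiB

def granted_memory(request_mb):
    """Round a request up to the next increment, clamped to [MIN_MB, MAX_MB]."""
    mb = max(request_mb, MIN_MB)
    mb = int(math.ceil(mb / float(INC_MB))) * INC_MB
    return min(mb, MAX_MB)

for req in (700, 2500, 4096, 9000):
    print("request %5d MB -> container %5d MB" % (req, granted_memory(req)))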

