Posted to user@hadoop.apache.org by Krzysztof Zarzycki <k....@gmail.com> on 2015/09/30 15:50:55 UTC

YARN timelineserver process taking 600% CPU

Hi there Hadoopers,
I have a serious issue with my installation of Hadoop & YARN in version
2.7.1 (HDP 2.3).
The timelineserver process (more precisely, the
org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
class) takes over 600% of CPU, generating enormous load on my master node.
I can't figure out why this happens.

At first I ran the timelineserver on Java 8 and thought that was the issue.
But no: I have now started the timelineserver on Java 7 and the problem is
still the same.

My cluster is tiny; it consists of:
- 2 HDFS nodes
- 2 HBase RegionServers
- 2 Kafkas
- 2 Spark nodes
- 8 Spark Streaming jobs, processing around 100 messages/second TOTAL.

I'll be very grateful for your help here. If you need any more info, please
write.
I also attach my yarn-site.xml, grepped down to the options related to the timeline server.

And here is the timelineserver command line that I see from ps:
/usr/java/jdk1.7.0_79/bin/java -Dproc_timelineserver -Xmx1024m
-Dhdp.version=2.3.0.0-2557 -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn
-Dyarn.log.dir=/var/log/hadoop-yarn/yarn
-Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log
-Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.home.dir=
-Dyarn.id.str=yarn -Dhadoop.root.logger=INFO,EWMA,RFA
-Dyarn.root.logger=INFO,EWMA,RFA
-Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native
-Dyarn.policy.file=hadoop-policy.xml
-Dhadoop.log.dir=/var/log/hadoop-yarn/yarn
-Dyarn.log.dir=/var/log/hadoop-yarn/yarn
-Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log
-Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log
-Dyarn.home.dir=/usr/hdp/current/hadoop-yarn-timelineserver
-Dhadoop.home.dir=/usr/hdp/2.3.0.0-2557/hadoop
-Dhadoop.root.logger=INFO,EWMA,RFA -Dyarn.root.logger=INFO,EWMA,RFA
-Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native
-classpath
/usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/2.3.0.0-2557/hadoop/lib/*:/usr/hdp/2.3.0.0-2557/hadoop/.//*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/./:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/.//*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/.//*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/.//*:::/usr/share/java/mysql-connector-java.jar::/usr/share/java/mysql-connector-java.jar:/usr/hdp/current/hadoop-yarn-timelineserver/.//*:/usr/hdp/current/hadoop-yarn-timelineserver/lib/*:/usr/hdp/current/hadoop-client/conf/timelineserver-config/log4j.properties
org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer


Thanks!
Krzysztof

Re: YARN timelineserver process taking 600% CPU

Posted by Krzysztof Zarzycki <k....@gmail.com>.
Hi, thanks for joining in. I was running with the defaults, which in HDP
2.3 means:
yarn.timeline-service.ttl-ms: 2678400000  (31 days)
yarn.timeline-service.client.retry-interval-ms: 300000
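
For reference, those defaults correspond to yarn-site.xml entries like the
following (a sketch only; the property names are the real YARN ones, the
values are the HDP 2.3 defaults quoted above, and lowering ttl-ms, e.g. to
7 days, is one way to keep the leveldb store small):

```xml
<!-- Sketch of the defaults quoted above; lowering ttl-ms makes the ATS
     discard entities sooner and keeps the leveldb store smaller. -->
<property>
  <name>yarn.timeline-service.ttl-ms</name>
  <value>2678400000</value> <!-- 31 days, in milliseconds -->
</property>
<property>
  <name>yarn.timeline-service.client.retry-interval-ms</name>
  <value>300000</value> <!-- 5 minutes, in milliseconds -->
</property>
```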

But, I worked around the problem!
I suspected that Hortonworks had decided to include this not-yet-finished
patch, https://issues.apache.org/jira/browse/SPARK-1537, in their HDP 2.3
distribution. My suspicion came from this thread:
http://markmail.org/message/w2z2foygzizlvnm4, as well as from this setting
in Spark in my HDP distribution:
spark.yarn.services: org.apache.spark.deploy.yarn.history.YarnHistoryService

I decided to disable the service. I didn't see a way to disable it, so I
just pointed it at some nonexistent class and ignore the warning on Spark
job start. Then I restarted my Spark jobs, and the load is gone! Well,
Spark history in YARN is gone too :/ But it seems that this feature is not
yet production-ready, or at least badly configured or something.
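
Concretely, the workaround amounts to something like this in
spark-defaults.conf (a sketch; the class name on the right is a deliberate
placeholder, not a real class, and that is exactly the point: Spark fails
to load the service and only logs a warning):

```properties
# Workaround sketch: point spark.yarn.services at a class that does not
# exist, so YarnHistoryService never starts and never floods ATS with puts.
# "NonexistentService" below is a deliberate placeholder class name.
spark.yarn.services    org.apache.spark.deploy.yarn.history.NonexistentService
```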

Anyway, this is a workaround I found for people who encounter similar
problems.

Thanks to all of you for trying to help me. If you have any opinions about
it, please share.
Cheers,
Krzysztof







2015-12-06 8:14 GMT+01:00 郭士伟 <gu...@gmail.com>:

> It seems that it's the large leveldb size that causes the problem. What is
> the value of the 'yarn.timeline-service.ttl-ms' config? Maybe it's not
> short enough, so there are too many entities in the timeline store.
> By the way, it takes a long time (hours) when the ATS discards old
> entities, and that also blocks the other operations. The patch
> https://issues.apache.org/jira/browse/YARN-3448 is a great performance
> improvement. We just backported it and it works well.
>
> 2015-11-06 13:07 GMT+08:00 Naganarasimha G R (Naga) <
> garlanaganarasimha@huawei.com>:
>
>> Hi Krzysiek,
>>
>>
>>
>> *There are currently 8 Spark Streaming jobs constantly running, 3 of them
>> with a 1-second batch interval and 5 with a 10-second one. I believe these
>> are the jobs that publish to ATS. How I could check what precisely is
>> doing what, or how to get some logs about it, I don't know...*
>>
>> I'm not sure about the applications being run, but if you have already
>> tried disabling the Spark History Server's puts to ATS and that didn't
>> help, then it must be the apps that are sending the data. AFAIK the Spark
>> history server had not integrated with ATS (SPARK-1537), so most probably
>> it's the applications that are pumping in the data. I think you need to
>> check with the applications themselves.
>>
>>
>> *2. Are 8 concurrent Spark Streaming jobs really that high a load for the
>> timelineserver? I have just a small cluster; how are other, larger
>> companies handling much larger loads?*
>>
>> It has not been used at large scale by us, but according to YARN-2556
>> (ATS Performance Test Tool), "On a 36 node cluster, this results in ~830
>> concurrent containers (e.g. maps), each firing 10KB of payload, 20
>> times." The only difference is that the data in your system is already
>> overloaded, hence the cost of querying (which currently happens during
>> each insertion) is very high.
>>
>> Maybe folks from other companies who have used or supported ATSv1 might
>> be able to speak better to ATSv1's scale!
>>
>>
>> Regards,
>>
>> + Naga
>> ------------------------------
>> *From:* Krzysztof Zarzycki [k.zarzycki@gmail.com]
>> *Sent:* Thursday, November 05, 2015 19:51
>> *To:* user@hadoop.apache.org
>> *Subject:* Re: YARN timelineserver process taking 600% CPU
>>
>> Thanks Naga for your input. (I'm sorry for the late response, I was out
>> for some time.)
>>
>> So you believe that Spark is actually doing the PUTs? There are currently
>> 8 Spark Streaming jobs constantly running, 3 of them with a 1-second batch
>> interval and 5 with a 10-second one. I believe these are the jobs that
>> publish to ATS. How I could check what precisely is doing what, or how to
>> get some logs about it, I don't know...
>> I thought maybe it was the Spark History Server doing the puts, but it
>> seems it is not, as I disabled it and the load hasn't gone down. So it
>> seems it is indeed the jobs themselves.
>>
>> Now I have the following problems:
>> 1. The most important: how can I at least *work around* this issue? Maybe
>> I can somehow disable Spark's usage of the YARN timelineserver? What are
>> the consequences? Is it only the history of finished Spark jobs not being
>> saved? If so, that doesn't hurt that much. This is probably a question for
>> the Spark group...
>> 2. Are 8 concurrent Spark Streaming jobs really that high a load for the
>> timelineserver? I have just a small cluster; how are other, larger
>> companies handling much larger loads?
>>
>> Thanks for helping me with this!
>> Krzysiek
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> 2015-10-05 20:45 GMT+02:00 Naganarasimha Garla <
>> naganarasimha.gr@gmail.com>:
>>
>>> Hi Krzysiek,
>>> Oops, my mistake, 3 GB does seem to be on the higher side.
>>> And from the jstack it seems there was no major activity other than
>>> puts: around 16 concurrent puts were happening, each of which tries to
>>> get the timeline entity and hence hits the native call.
>>>
>>> From the logs it seems a lot of ACL validations are happening, and from
>>> the URL it seems they are for put-entity requests. Approximately from
>>> 09:30:16 to 09:44:26 about 9213 checks happened, and if all of these are
>>> for puts then roughly 10 put calls/s are coming from the *spark* side.
>>> This, I feel, is not the right usage of ATS; can you check what is being
>>> published from Spark to ATS at this high rate?
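
The rate estimate above can be checked with quick back-of-the-envelope
arithmetic (assuming all 9213 ACL checks correspond to put calls):

```python
from datetime import datetime

# Back-of-the-envelope check of the put rate estimated in the thread.
checks = 9213
start = datetime.strptime("09:30:16", "%H:%M:%S")
end = datetime.strptime("09:44:26", "%H:%M:%S")
window_s = (end - start).total_seconds()  # 850-second observation window
rate = checks / window_s                  # ~10.8 put calls per second
print(f"{window_s:.0f} s window -> {rate:.1f} put calls/s")
```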
>>>
>>> Besides, some improvements to the timeline metrics are available in
>>> trunk as part of YARN-3360, which could have been useful in analyzing
>>> your issue.
>>>
>>> + Naga
>>>
>>>
>>> On Mon, Oct 5, 2015 at 1:19 PM, Krzysztof Zarzycki <k.zarzycki@gmail.com
>>> > wrote:
>>>
>>>> Hi Naga,
>>>> Sorry, but it's not 3 MB, it's 3 GB in leveldb-timeline-store (du shows
>>>> numbers in kB). Does that seem reasonable as well?
>>>> There are now 26850 files in the leveldb-timeline-store directory; new
>>>> .sst files are generated each minute, and some are also being deleted.
>>>>
>>>> I started the timeline server today to gather logs and jstack; it was
>>>> running for ~20 minutes. I attach a tar.bz2 archive with those logs.
>>>>
>>>> Thank you for helping me debug this.
>>>> Krzysiek
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> 2015-09-30 21:00 GMT+02:00 Naganarasimha Garla <
>>>> naganarasimha.gr@gmail.com>:
>>>>
>>>>> Hi Krzysiek,
>>>>> It seems the size is around 3 MB, which seems fine.
>>>>> Could you try enabling debug logging and sharing the ATS/AHS logs, and
>>>>> also, if possible, the jstack output of the AHS process?
>>>>>
>>>>> + Naga
>>>>>
>>>>> On Wed, Sep 30, 2015 at 10:27 PM, Krzysztof Zarzycki <
>>>>> k.zarzycki@gmail.com> wrote:
>>>>>
>>>>>> Hi Naga,
>>>>>> I see the following size:
>>>>>> $ sudo du --max=1 /var/lib/hadoop/yarn/timeline
>>>>>> 36      /var/lib/hadoop/yarn/timeline/timeline-state-store.ldb
>>>>>> 3307772 /var/lib/hadoop/yarn/timeline/leveldb-timeline-store.ldb
>>>>>> 3307812 /var/lib/hadoop/yarn/timeline
>>>>>>
>>>>>> The timeline service has been restarted multiple times while I was
>>>>>> looking for the issue, but it was installed about 2 months ago. Just a
>>>>>> few applications (1? 2?) have been started since its last restart. The
>>>>>> ResourceManager interface shows 261 entries.
>>>>>>
>>>>>> As in the yarn-site.xml that I attached, the property you're asking
>>>>>> about has the following value:
>>>>>>
>>>>>> <property>
>>>>>>
>>>>>>   <name>yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms</name>
>>>>>>       <value>300000</value>
>>>>>> </property>
>>>>>>
>>>>>>
>>>>>> Ah, one more thing: when I looked with jstack to see what the process
>>>>>> is doing, I saw threads spending time in NATIVE code in the leveldbjni
>>>>>> library. So I *think* it is related to the leveldb store.
>>>>>>
>>>>>> Please ask if any more information is needed.
>>>>>> Any help is appreciated! Thanks
>>>>>> Krzysiek
>>>>>>
>>>>>> 2015-09-30 16:23 GMT+02:00 Naganarasimha G R (Naga) <
>>>>>> garlanaganarasimha@huawei.com>:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> What's the size of the store files?
>>>>>>> Since when has it been running? How many applications have been run
>>>>>>> since it was started?
>>>>>>> What's the value of
>>>>>>> "yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms"?
>>>>>>>
>>>>>>> + Naga
>>>>>>> ------------------------------
>>>>>>> *From:* Krzysztof Zarzycki [k.zarzycki@gmail.com]
>>>>>>> *Sent:* Wednesday, September 30, 2015 19:20
>>>>>>> *To:* user@hadoop.apache.org
>>>>>>> *Subject:* YARN timelineserver process taking 600% CPU
>>>>>>>
>>>>>>> Hi there Hadoopers,
>>>>>>> I have a serious issue with my installation of Hadoop & YARN in
>>>>>>> version 2.7.1 (HDP 2.3).
>>>>>>> The timelineserver process (more precisely, the
>>>>>>> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>>>>>>> class) takes over 600% of CPU, generating enormous load on my master
>>>>>>> node. I can't figure out why this happens.
>>>>>>>
>>>>>>> At first I ran the timelineserver on Java 8 and thought that was the
>>>>>>> issue. But no: I have now started the timelineserver on Java 7 and
>>>>>>> the problem is still the same.
>>>>>>>
>>>>>>> My cluster is tiny; it consists of:
>>>>>>> - 2 HDFS nodes
>>>>>>> - 2 HBase RegionServers
>>>>>>> - 2 Kafkas
>>>>>>> - 2 Spark nodes
>>>>>>> - 8 Spark Streaming jobs, processing around 100 messages/second
>>>>>>> TOTAL.
>>>>>>>
>>>>>>> I'll be very grateful for your help here. If you need any more info,
>>>>>>> please write.
>>>>>>> I also attach my yarn-site.xml, grepped down to the options related
>>>>>>> to the timeline server.
>>>>>>>
>>>>>>> And here is the timelineserver command line that I see from ps:
>>>>>>> /usr/java/jdk1.7.0_79/bin/java -Dproc_timelineserver -Xmx1024m
>>>>>>> -Dhdp.version=2.3.0.0-2557 -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn
>>>>>>> -Dyarn.log.dir=/var/log/hadoop-yarn/yarn
>>>>>>> -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>>>>>>> -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.home.dir=
>>>>>>> -Dyarn.id.str=yarn -Dhadoop.root.logger=INFO,EWMA,RFA
>>>>>>> -Dyarn.root.logger=INFO,EWMA,RFA
>>>>>>> -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native
>>>>>>> -Dyarn.policy.file=hadoop-policy.xml
>>>>>>> -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn
>>>>>>> -Dyarn.log.dir=/var/log/hadoop-yarn/yarn
>>>>>>> -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>>>>>>> -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>>>>>>> -Dyarn.home.dir=/usr/hdp/current/hadoop-yarn-timelineserver
>>>>>>> -Dhadoop.home.dir=/usr/hdp/2.3.0.0-2557/hadoop
>>>>>>> -Dhadoop.root.logger=INFO,EWMA,RFA -Dyarn.root.logger=INFO,EWMA,RFA
>>>>>>> -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native
>>>>>>> -classpath
>>>>>>> /usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/2.3.0.0-2557/hadoop/lib/*:/usr/hdp/2.3.0.0-2557/hadoop/.//*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/./:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/.//*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/.//*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/.//*:::/usr/share/java/mysql-connector-java.jar::/usr/share/java/mysql-connector-java.jar:/usr/hdp/current/hadoop-yarn-timelineserver/.//*:/usr/hdp/current/hadoop-yarn-timelineserver/lib/*:/usr/hdp/current/hadoop-client/conf/timelineserver-config/log4j.properties
>>>>>>> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>>>>>>>
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Krzysztof
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

>>>>> if possible the jstack output for the AHS process
>>>>>
>>>>> + Naga
>>>>>
>>>>> On Wed, Sep 30, 2015 at 10:27 PM, Krzysztof Zarzycki <
>>>>> k.zarzycki@gmail.com> wrote:
>>>>>
>>>>>> Hi Naga,
>>>>>> I see the following size:
>>>>>> $ sudo du --max=1 /var/lib/hadoop/yarn/timeline
>>>>>> 36      /var/lib/hadoop/yarn/timeline/timeline-state-store.ldb
>>>>>> 3307772 /var/lib/hadoop/yarn/timeline/leveldb-timeline-store.ldb
>>>>>> 3307812 /var/lib/hadoop/yarn/timeline
>>>>>>
>>>>>> The timeline service has been multiple times restarted as I was
>>>>>> looking for issue with it. But it was installed about a 2 months ago. Just
>>>>>> few applications (1?2? ) has been started since its last start. The
>>>>>> ResourceManager interface has 261 entries.
>>>>>>
>>>>>> As in yarn-site.xml that I attached, the variable you're asking for
>>>>>> has the following value:
>>>>>>
>>>>>> <property>
>>>>>>
>>>>>>   <name>yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms</name>
>>>>>>       <value>300000</value>
>>>>>> </property>
>>>>>>
>>>>>>
>>>>>> Ah, One more thing: When I looked with jstack to see what the process
>>>>>> is doing, I saw threads spending time in NATIVE in leveldbjni library. So I
>>>>>> *think* it is related to leveldb store.
>>>>>>
>>>>>> Please ask if any more information is needed.
>>>>>> Any help is appreciated! Thanks
>>>>>> Krzysiek
>>>>>>
>>>>>> 2015-09-30 16:23 GMT+02:00 Naganarasimha G R (Naga) <
>>>>>> garlanaganarasimha@huawei.com>:
>>>>>>
>>>>>>> Hi ,
>>>>>>>
>>>>>>> Whats the size of Store Files?
>>>>>>> Since when is it running ? how many applications have been run since
>>>>>>> it has been started ?
>>>>>>> Whats the value of "
>>>>>>> yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms" ?
>>>>>>>
>>>>>>> + Naga
>>>>>>> ------------------------------
>>>>>>> *From:* Krzysztof Zarzycki [k.zarzycki@gmail.com]
>>>>>>> *Sent:* Wednesday, September 30, 2015 19:20
>>>>>>> *To:* user@hadoop.apache.org
>>>>>>> *Subject:* YARN timelineserver process taking 600% CPU
>>>>>>>
>>>>>>> Hi there Hadoopers,
>>>>>>> I have a serious issue with my installation of Hadoop & YARN in
>>>>>>> version 2.7.1 (HDP 2.3).
>>>>>>> The timelineserver process ( more
>>>>>>> precisely org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>>>>>>> class) takes over 600% of CPU, generating enormous load on my master node.
>>>>>>> I can't guess why it happens.
>>>>>>>
>>>>>>> First, I run the timelineserver using java 8, thought that this was
>>>>>>> an issue. But no, I started timelineserver now with use of java 7 and still
>>>>>>> the problem is the same.
>>>>>>>
>>>>>>> My cluster is tiny- it consists of:
>>>>>>> - 2 HDFS nodes
>>>>>>> - 2 HBase RegionServers
>>>>>>> - 2 Kafkas
>>>>>>> - 2 Spark nodes
>>>>>>> - 8 Spark Streaming jobs, processing around 100 messages/second
>>>>>>> TOTAL.
>>>>>>>
>>>>>>> I'll be very grateful for your help here. If you need any more info,
>>>>>>> please write.
>>>>>>> I also attach yarn-site.xml grepped to options related to timeline
>>>>>>> server.
>>>>>>>
>>>>>>> And here is a command of timeline that I see from ps :
>>>>>>> /usr/java/jdk1.7.0_79/bin/java -Dproc_timelineserver -Xmx1024m
>>>>>>> -Dhdp.version=2.3.0.0-2557 -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn
>>>>>>> -Dyarn.log.dir=/var/log/hadoop-yarn/yarn
>>>>>>> -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>>>>>>> -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.home.dir=
>>>>>>> -Dyarn.id.str=yarn -Dhadoop.root.logger=INFO,EWMA,RFA
>>>>>>> -Dyarn.root.logger=INFO,EWMA,RFA
>>>>>>> -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native
>>>>>>> -Dyarn.policy.file=hadoop-policy.xml
>>>>>>> -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn
>>>>>>> -Dyarn.log.dir=/var/log/hadoop-yarn/yarn
>>>>>>> -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>>>>>>> -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>>>>>>> -Dyarn.home.dir=/usr/hdp/current/hadoop-yarn-timelineserver
>>>>>>> -Dhadoop.home.dir=/usr/hdp/2.3.0.0-2557/hadoop
>>>>>>> -Dhadoop.root.logger=INFO,EWMA,RFA -Dyarn.root.logger=INFO,EWMA,RFA
>>>>>>> -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native
>>>>>>> -classpath
>>>>>>> /usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/2.3.0.0-2557/hadoop/lib/*:/usr/hdp/2.3.0.0-2557/hadoop/.//*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/./:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/.//*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/.//*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/.//*:::/usr/share/java/mysql-connector-java.jar::/usr/share/java/mysql-connector-java.jar:/usr/hdp/current/hadoop-yarn-timelineserver/.//*:/usr/hdp/current/hadoop-yarn-timelineserver/lib/*:/usr/hdp/current/hadoop-client/conf/timelineserver-config/log4j.properties
>>>>>>> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>>>>>>>
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Krzysztof
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: YARN timelineserver process taking 600% CPU

Posted by Krzysztof Zarzycki <k....@gmail.com>.
Hi, thanks for joining in. I was running with the defaults, which in HDP
2.3 means:
yarn.timeline-service.ttl-ms: 2678400000  (31 days)
yarn.timeline-service.client.retry-interval-ms: 300000

But I worked around the problem!
I suspected that Hortonworks had decided to include this not-yet-finished
patch, https://issues.apache.org/jira/browse/SPARK-1537, in their HDP 2.3
distribution. My suspicion came from this thread:
http://markmail.org/message/w2z2foygzizlvnm4, as well as from this setting
in Spark in my HDP distribution:
spark.yarn.services: org.apache.spark.deploy.yarn.history.YarnHistoryService

I decided to disable the service. I didn't see a supported way to do it, so
I just pointed the setting at a nonexistent class and ignored the warning on
Spark job start. Then I restarted my Spark jobs, and the load is gone! Well,
Spark history in YARN is gone too :/ But it seems this feature is not yet
production-ready, or at least misconfigured.
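For anyone who wants to try the same workaround, here is a sketch of the relevant spark-defaults.conf change. The file location and the dummy class name are placeholders I made up for illustration; only the `spark.yarn.services` property name and the YarnHistoryService class come from my actual setup:

```properties
# spark-defaults.conf (HDP 2.3)
# Original entry that registers the ATS publisher inside each Spark app:
#   spark.yarn.services  org.apache.spark.deploy.yarn.history.YarnHistoryService
# Workaround: point it at a class that does not exist (each job start then
# logs a warning that can be ignored), or remove the line entirely:
spark.yarn.services  com.example.DoesNotExist
```

Restart the Spark jobs afterwards so they pick up the changed config.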

Anyway, this is the workaround I found, for anyone who runs into a similar
problem.

Thanks to all of you for trying to help me. If you have any opinions about it,
please share.
Cheers,
Krzysztof







2015-12-06 8:14 GMT+01:00 郭士伟 <gu...@gmail.com>:

> It seems that it's the large leveldb size that cause the problem. What is
> the value of 'yarn.timeline-service.ttl-ms' config ? Maybe it's not short
> enough so we have too much entities in timeline store.
> And by the way, it will take a long time (hours) when the ATS do discard
> old entity operation, and it will also block the other operations. The
> patch https://issues.apache.org/jira/browse/YARN-3448 is a great
> performance improve. We just backport it and it works well.
>
> 2015-11-06 13:07 GMT+08:00 Naganarasimha G R (Naga) <
> garlanaganarasimha@huawei.com>:
>
>> Hi Krzysiek,
>>
>>
>>
>> *There are currently 8 Spark Streaming jobs constantly running, each 3
>> with 1 second batch, 5 x 10 s. I believe these are the jobs that publish to
>> ATS.  How could I check what precisely is doing what or how to get some
>> logs about it, I don't know...*
>>
>> Not sure about the applications being run and if you have already tried
>> disabling the "Spark History Server doing the puts ATS" then not sure if
>> the apps are sending it out . AFAIK Spark history server had not integrated
>> with ATS (SPARK-1537). So most propably its the applications which are
>> pumping in the data. I think you need to check with them itself.
>>
>>
>> *2. Is 8 concurrent Spark Streaming jobs really that high for
>> Timelineserver? I have just a small cluster, how other larger companies are
>> handling much larger load? *
>>
>> Its not been used in large scale by us but according YARN-2556 (ATS
>> Performance Test Tool), it states that "On a 36 node cluster, this
>> results in ~830 concurrent containers (e.g maps), each firing 10KB of
>> payload, 20 times." but only thing being different is, data in your
>> system is already overloaded hence cost of querying (which is currently
>> happening during each insertion) is very high.
>>
>> May be guys from other company who have used or supported ATSV1 might be
>> able to tell the ATSV1 scale better !
>>
>>
>> Regards,
>>
>> + Naga
>> ------------------------------
>> *From:* Krzysztof Zarzycki [k.zarzycki@gmail.com]
>> *Sent:* Thursday, November 05, 2015 19:51
>> *To:* user@hadoop.apache.org
>> *Subject:* Re: YARN timelineserver process taking 600% CPU
>>
>> Thanks Naga for your input,  (I'm sorry for a late response, I was out
>> for some time).
>>
>> So you believe that Spark is actually doing the PUTs? There are currently
>> 8 Spark Streaming jobs constantly running, each 3 with 1 second batch, 5 x
>> 10 s. I believe these are the jobs that publish to ATS.  How could I check
>> what precisely is doing what or how to get some logs about it, I don't
>> know...
>> I though maybe it is Spark History Server doing the puts, but it seems it
>> is not, as I disabled it and the load hasn't gone down. So it seems these
>> are the jobs itself indeed.
>>
>> Now I have the following problems:
>> 1. The most important: How can I at least *workaround* this issue? Maybe
>> I will somehow disable Spark usage of Yarn timelineserver ? What are the
>> consequences? Is it only history of Spark finished jobs not being saved? If
>> yes, that doesn't hurt that much. Probably this is a question to Spark
>> group...
>> 2. Is 8 concurrent Spark Streaming jobs really that high for
>> Timelineserver? I have just a small cluster, how other larger companies are
>> handling much larger load?
>>
>> Thanks for helping me with this!
>> Krzysiek
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> 2015-10-05 20:45 GMT+02:00 Naganarasimha Garla <
>> naganarasimha.gr@gmail.com>:
>>
>>> Hi Krzysiek,
>>> Oops My mistake, 3 Gb seems to be on little higher side.
>>> And from the jstack it seems like there were no major activity other
>>> than puts seems like around 16 concurrent puts were happening which tries
>>> to get the timeline Entity hence hitting the native call.
>>>
>>> From the logs it seems like lot of ACL validations are happening and
>>> from the URL it seems like its for PUTEntites.
>>> approximately from 09:30:16 to 09:44:26 about 9213 checks have happened
>>> and if all of these are for puts then roughly about 10 put calls/s is
>>> happening from *spark* side. This i feel is not right usage of ATS, can
>>> you check what is being published from the spark to ATS at this high rate ?
>>>
>>> Besides some improvements regarding the timeline metrics is available in
>>> trunk as part of YARN-3360 which could have been useful in analyzing your
>>> issue.
>>>
>>> + Naga
>>>
>>>
>>> On Mon, Oct 5, 2015 at 1:19 PM, Krzysztof Zarzycki <k.zarzycki@gmail.com
>>> > wrote:
>>>
>>>> Hi Naga,
>>>> Sorry, but it's not 3MB, but 3GB in leveldb-timeline-store (du shows
>>>> numbers in kB). Does that seems reasonable as well?
>>>> There are new .sst files generated each minute.
>>>> There are now 26850 files in leveldb-timeline-store directory. New
>>>> files are generated each minute. Some are also being deleted.
>>>>
>>>> I started timeline server today, to gather logs and jstack, it was
>>>> running for ~20 minutes. I attach the tar bz2 archive with those logs.
>>>>
>>>> Thank you for helping me debug this.
>>>> Krzysiek
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> 2015-09-30 21:00 GMT+02:00 Naganarasimha Garla <
>>>> naganarasimha.gr@gmail.com>:
>>>>
>>>>> Hi Krzysiek,
>>>>> seems like the size is around 3 MB which seems to be fine. ,
>>>>> Could you try enabling in debug and share the logs of ATS/AHS and also
>>>>> if possible the jstack output for the AHS process
>>>>>
>>>>> + Naga
>>>>>
>>>>> On Wed, Sep 30, 2015 at 10:27 PM, Krzysztof Zarzycki <
>>>>> k.zarzycki@gmail.com> wrote:
>>>>>
>>>>>> Hi Naga,
>>>>>> I see the following size:
>>>>>> $ sudo du --max=1 /var/lib/hadoop/yarn/timeline
>>>>>> 36      /var/lib/hadoop/yarn/timeline/timeline-state-store.ldb
>>>>>> 3307772 /var/lib/hadoop/yarn/timeline/leveldb-timeline-store.ldb
>>>>>> 3307812 /var/lib/hadoop/yarn/timeline
>>>>>>
>>>>>> The timeline service has been multiple times restarted as I was
>>>>>> looking for issue with it. But it was installed about a 2 months ago. Just
>>>>>> few applications (1?2? ) has been started since its last start. The
>>>>>> ResourceManager interface has 261 entries.
>>>>>>
>>>>>> As in yarn-site.xml that I attached, the variable you're asking for
>>>>>> has the following value:
>>>>>>
>>>>>> <property>
>>>>>>
>>>>>>   <name>yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms</name>
>>>>>>       <value>300000</value>
>>>>>> </property>
>>>>>>
>>>>>>
>>>>>> Ah, One more thing: When I looked with jstack to see what the process
>>>>>> is doing, I saw threads spending time in NATIVE in leveldbjni library. So I
>>>>>> *think* it is related to leveldb store.
>>>>>>
>>>>>> Please ask if any more information is needed.
>>>>>> Any help is appreciated! Thanks
>>>>>> Krzysiek
>>>>>>
>>>>>> 2015-09-30 16:23 GMT+02:00 Naganarasimha G R (Naga) <
>>>>>> garlanaganarasimha@huawei.com>:
>>>>>>
>>>>>>> Hi ,
>>>>>>>
>>>>>>> Whats the size of Store Files?
>>>>>>> Since when is it running ? how many applications have been run since
>>>>>>> it has been started ?
>>>>>>> Whats the value of "
>>>>>>> yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms" ?
>>>>>>>
>>>>>>> + Naga
>>>>>>> ------------------------------
>>>>>>> *From:* Krzysztof Zarzycki [k.zarzycki@gmail.com]
>>>>>>> *Sent:* Wednesday, September 30, 2015 19:20
>>>>>>> *To:* user@hadoop.apache.org
>>>>>>> *Subject:* YARN timelineserver process taking 600% CPU
>>>>>>>
>>>>>>> Hi there Hadoopers,
>>>>>>> I have a serious issue with my installation of Hadoop & YARN in
>>>>>>> version 2.7.1 (HDP 2.3).
>>>>>>> The timelineserver process ( more
>>>>>>> precisely org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>>>>>>> class) takes over 600% of CPU, generating enormous load on my master node.
>>>>>>> I can't guess why it happens.
>>>>>>>
>>>>>>> First, I run the timelineserver using java 8, thought that this was
>>>>>>> an issue. But no, I started timelineserver now with use of java 7 and still
>>>>>>> the problem is the same.
>>>>>>>
>>>>>>> My cluster is tiny- it consists of:
>>>>>>> - 2 HDFS nodes
>>>>>>> - 2 HBase RegionServers
>>>>>>> - 2 Kafkas
>>>>>>> - 2 Spark nodes
>>>>>>> - 8 Spark Streaming jobs, processing around 100 messages/second
>>>>>>> TOTAL.
>>>>>>>
>>>>>>> I'll be very grateful for your help here. If you need any more info,
>>>>>>> please write.
>>>>>>> I also attach yarn-site.xml grepped to options related to timeline
>>>>>>> server.
>>>>>>>
>>>>>>> And here is a command of timeline that I see from ps :
>>>>>>> /usr/java/jdk1.7.0_79/bin/java -Dproc_timelineserver -Xmx1024m
>>>>>>> -Dhdp.version=2.3.0.0-2557 -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn
>>>>>>> -Dyarn.log.dir=/var/log/hadoop-yarn/yarn
>>>>>>> -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>>>>>>> -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.home.dir=
>>>>>>> -Dyarn.id.str=yarn -Dhadoop.root.logger=INFO,EWMA,RFA
>>>>>>> -Dyarn.root.logger=INFO,EWMA,RFA
>>>>>>> -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native
>>>>>>> -Dyarn.policy.file=hadoop-policy.xml
>>>>>>> -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn
>>>>>>> -Dyarn.log.dir=/var/log/hadoop-yarn/yarn
>>>>>>> -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>>>>>>> -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>>>>>>> -Dyarn.home.dir=/usr/hdp/current/hadoop-yarn-timelineserver
>>>>>>> -Dhadoop.home.dir=/usr/hdp/2.3.0.0-2557/hadoop
>>>>>>> -Dhadoop.root.logger=INFO,EWMA,RFA -Dyarn.root.logger=INFO,EWMA,RFA
>>>>>>> -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native
>>>>>>> -classpath
>>>>>>> /usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/2.3.0.0-2557/hadoop/lib/*:/usr/hdp/2.3.0.0-2557/hadoop/.//*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/./:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/.//*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/.//*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/.//*:::/usr/share/java/mysql-connector-java.jar::/usr/share/java/mysql-connector-java.jar:/usr/hdp/current/hadoop-yarn-timelineserver/.//*:/usr/hdp/current/hadoop-yarn-timelineserver/lib/*:/usr/hdp/current/hadoop-client/conf/timelineserver-config/log4j.properties
>>>>>>> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>>>>>>>
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Krzysztof
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: YARN timelineserver process taking 600% CPU

Posted by 郭士伟 <gu...@gmail.com>.
It seems that the large leveldb size is what causes the problem. What is
the value of the 'yarn.timeline-service.ttl-ms' config? Maybe it's not short
enough, so there are too many entities in the timeline store.
By the way, the ATS operation that discards old entities can take a long
time (hours), and it blocks other operations while it runs. The patch
https://issues.apache.org/jira/browse/YARN-3448 is a great performance
improvement. We just backported it and it works well.
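For reference, the TTL mentioned above is set in yarn-site.xml. The 7-day value below is only an illustrative example, not a recommendation; pick a retention period that matches how long you need job history:

```xml
<!-- yarn-site.xml: shorten the retention of timeline entities so the
     leveldb store stays small. 604800000 ms = 7 days; the HDP 2.3
     default is 2678400000 ms (31 days). -->
<property>
  <name>yarn.timeline-service.ttl-ms</name>
  <value>604800000</value>
</property>
```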

2015-11-06 13:07 GMT+08:00 Naganarasimha G R (Naga) <
garlanaganarasimha@huawei.com>:

> Hi Krzysiek,
>
>
>
> *There are currently 8 Spark Streaming jobs constantly running, each 3
> with 1 second batch, 5 x 10 s. I believe these are the jobs that publish to
> ATS.  How could I check what precisely is doing what or how to get some
> logs about it, I don't know...*
>
> Not sure about the applications being run and if you have already tried
> disabling the "Spark History Server doing the puts ATS" then not sure if
> the apps are sending it out . AFAIK Spark history server had not integrated
> with ATS (SPARK-1537). So most propably its the applications which are
> pumping in the data. I think you need to check with them itself.
>
>
> *2. Is 8 concurrent Spark Streaming jobs really that high for
> Timelineserver? I have just a small cluster, how other larger companies are
> handling much larger load? *
>
> Its not been used in large scale by us but according YARN-2556 (ATS
> Performance Test Tool), it states that "On a 36 node cluster, this
> results in ~830 concurrent containers (e.g maps), each firing 10KB of
> payload, 20 times." but only thing being different is, data in your
> system is already overloaded hence cost of querying (which is currently
> happening during each insertion) is very high.
>
> May be guys from other company who have used or supported ATSV1 might be
> able to tell the ATSV1 scale better !
>
>
> Regards,
>
> + Naga
> ------------------------------
> *From:* Krzysztof Zarzycki [k.zarzycki@gmail.com]
> *Sent:* Thursday, November 05, 2015 19:51
> *To:* user@hadoop.apache.org
> *Subject:* Re: YARN timelineserver process taking 600% CPU
>
> Thanks Naga for your input,  (I'm sorry for a late response, I was out for
> some time).
>
> So you believe that Spark is actually doing the PUTs? There are currently
> 8 Spark Streaming jobs constantly running, each 3 with 1 second batch, 5 x
> 10 s. I believe these are the jobs that publish to ATS.  How could I check
> what precisely is doing what or how to get some logs about it, I don't
> know...
> I though maybe it is Spark History Server doing the puts, but it seems it
> is not, as I disabled it and the load hasn't gone down. So it seems these
> are the jobs itself indeed.
>
> Now I have the following problems:
> 1. The most important: How can I at least *workaround* this issue? Maybe
> I will somehow disable Spark usage of Yarn timelineserver ? What are the
> consequences? Is it only history of Spark finished jobs not being saved? If
> yes, that doesn't hurt that much. Probably this is a question to Spark
> group...
> 2. Is 8 concurrent Spark Streaming jobs really that high for
> Timelineserver? I have just a small cluster, how other larger companies are
> handling much larger load?
>
> Thanks for helping me with this!
> Krzysiek
>
>
>
>
>
>
>
>
>
>
> 2015-10-05 20:45 GMT+02:00 Naganarasimha Garla <naganarasimha.gr@gmail.com
> >:
>
>> Hi Krzysiek,
>> Oops My mistake, 3 Gb seems to be on little higher side.
>> And from the jstack it seems like there were no major activity other than
>> puts seems like around 16 concurrent puts were happening which tries to get
>> the timeline Entity hence hitting the native call.
>>
>> From the logs it seems like lot of ACL validations are happening and from
>> the URL it seems like its for PUTEntites.
>> approximately from 09:30:16 to 09:44:26 about 9213 checks have happened
>> and if all of these are for puts then roughly about 10 put calls/s is
>> happening from *spark* side. This i feel is not right usage of ATS, can
>> you check what is being published from the spark to ATS at this high rate ?
>>
>> Besides some improvements regarding the timeline metrics is available in
>> trunk as part of YARN-3360 which could have been useful in analyzing your
>> issue.
>>
>> + Naga
>>
>>
>> On Mon, Oct 5, 2015 at 1:19 PM, Krzysztof Zarzycki <k....@gmail.com>
>> wrote:
>>
>>> Hi Naga,
>>> Sorry, but it's not 3MB, but 3GB in leveldb-timeline-store (du shows
>>> numbers in kB). Does that seems reasonable as well?
>>> There are new .sst files generated each minute.
>>> There are now 26850 files in leveldb-timeline-store directory. New files
>>> are generated each minute. Some are also being deleted.
>>>
>>> I started timeline server today, to gather logs and jstack, it was
>>> running for ~20 minutes. I attach the tar bz2 archive with those logs.
>>>
>>> Thank you for helping me debug this.
>>> Krzysiek
>>>
>>>
>>>
>>>
>>>
>>> 2015-09-30 21:00 GMT+02:00 Naganarasimha Garla <
>>> naganarasimha.gr@gmail.com>:
>>>
>>>> Hi Krzysiek,
>>>> seems like the size is around 3 MB which seems to be fine. ,
>>>> Could you try enabling in debug and share the logs of ATS/AHS and also
>>>> if possible the jstack output for the AHS process
>>>>
>>>> + Naga
>>>>
>>>> On Wed, Sep 30, 2015 at 10:27 PM, Krzysztof Zarzycki <
>>>> k.zarzycki@gmail.com> wrote:
>>>>
>>>>> Hi Naga,
>>>>> I see the following size:
>>>>> $ sudo du --max=1 /var/lib/hadoop/yarn/timeline
>>>>> 36      /var/lib/hadoop/yarn/timeline/timeline-state-store.ldb
>>>>> 3307772 /var/lib/hadoop/yarn/timeline/leveldb-timeline-store.ldb
>>>>> 3307812 /var/lib/hadoop/yarn/timeline
>>>>>
>>>>> The timeline service has been multiple times restarted as I was
>>>>> looking for issue with it. But it was installed about a 2 months ago. Just
>>>>> few applications (1?2? ) has been started since its last start. The
>>>>> ResourceManager interface has 261 entries.
>>>>>
>>>>> As in yarn-site.xml that I attached, the variable you're asking for
>>>>> has the following value:
>>>>>
>>>>> <property>
>>>>>
>>>>>   <name>yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms</name>
>>>>>       <value>300000</value>
>>>>> </property>
>>>>>
>>>>>
>>>>> Ah, One more thing: When I looked with jstack to see what the process
>>>>> is doing, I saw threads spending time in NATIVE in leveldbjni library. So I
>>>>> *think* it is related to leveldb store.
>>>>>
>>>>> Please ask if any more information is needed.
>>>>> Any help is appreciated! Thanks
>>>>> Krzysiek
>>>>>
>>>>> 2015-09-30 16:23 GMT+02:00 Naganarasimha G R (Naga) <
>>>>> garlanaganarasimha@huawei.com>:
>>>>>
>>>>>> Hi ,
>>>>>>
>>>>>> Whats the size of Store Files?
>>>>>> Since when is it running ? how many applications have been run since
>>>>>> it has been started ?
>>>>>> Whats the value of "
>>>>>> yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms" ?
>>>>>>
>>>>>> + Naga
>>>>>> ------------------------------
>>>>>> *From:* Krzysztof Zarzycki [k.zarzycki@gmail.com]
>>>>>> *Sent:* Wednesday, September 30, 2015 19:20
>>>>>> *To:* user@hadoop.apache.org
>>>>>> *Subject:* YARN timelineserver process taking 600% CPU
>>>>>>
>>>>>> Hi there Hadoopers,
>>>>>> I have a serious issue with my installation of Hadoop & YARN in
>>>>>> version 2.7.1 (HDP 2.3).
>>>>>> The timelineserver process ( more
>>>>>> precisely org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>>>>>> class) takes over 600% of CPU, generating enormous load on my master node.
>>>>>> I can't guess why it happens.
>>>>>>
>>>>>> First, I run the timelineserver using java 8, thought that this was
>>>>>> an issue. But no, I started timelineserver now with use of java 7 and still
>>>>>> the problem is the same.
>>>>>>
>>>>>> My cluster is tiny- it consists of:
>>>>>> - 2 HDFS nodes
>>>>>> - 2 HBase RegionServers
>>>>>> - 2 Kafkas
>>>>>> - 2 Spark nodes
>>>>>> - 8 Spark Streaming jobs, processing around 100 messages/second
>>>>>> TOTAL.
>>>>>>
>>>>>> I'll be very grateful for your help here. If you need any more info,
>>>>>> please write.
>>>>>> I also attach yarn-site.xml grepped to options related to timeline
>>>>>> server.
>>>>>>
>>>>>> And here is a command of timeline that I see from ps :
>>>>>> /usr/java/jdk1.7.0_79/bin/java -Dproc_timelineserver -Xmx1024m
>>>>>> -Dhdp.version=2.3.0.0-2557 -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn
>>>>>> -Dyarn.log.dir=/var/log/hadoop-yarn/yarn
>>>>>> -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>>>>>> -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.home.dir=
>>>>>> -Dyarn.id.str=yarn -Dhadoop.root.logger=INFO,EWMA,RFA
>>>>>> -Dyarn.root.logger=INFO,EWMA,RFA
>>>>>> -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native
>>>>>> -Dyarn.policy.file=hadoop-policy.xml
>>>>>> -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn
>>>>>> -Dyarn.log.dir=/var/log/hadoop-yarn/yarn
>>>>>> -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>>>>>> -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>>>>>> -Dyarn.home.dir=/usr/hdp/current/hadoop-yarn-timelineserver
>>>>>> -Dhadoop.home.dir=/usr/hdp/2.3.0.0-2557/hadoop
>>>>>> -Dhadoop.root.logger=INFO,EWMA,RFA -Dyarn.root.logger=INFO,EWMA,RFA
>>>>>> -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native
>>>>>> -classpath
>>>>>> /usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/2.3.0.0-2557/hadoop/lib/*:/usr/hdp/2.3.0.0-2557/hadoop/.//*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/./:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/.//*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/.//*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/.//*:::/usr/share/java/mysql-connector-java.jar::/usr/share/java/mysql-connector-java.jar:/usr/hdp/current/hadoop-yarn-timelineserver/.//*:/usr/hdp/current/hadoop-yarn-timelineserver/lib/*:/usr/hdp/current/hadoop-client/conf/timelineserver-config/log4j.properties
>>>>>> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>>>>>>
>>>>>>
>>>>>> Thanks!
>>>>>> Krzysztof
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: YARN timelineserver process taking 600% CPU

Posted by 郭士伟 <gu...@gmail.com>.
It seems that it's the large leveldb size that cause the problem. What is
the value of 'yarn.timeline-service.ttl-ms' config ? Maybe it's not short
enough so we have too much entities in timeline store.
And by the way, it will take a long time (hours) when the ATS do discard
old entity operation, and it will also block the other operations. The
patch https://issues.apache.org/jira/browse/YARN-3448 is a great
performance improve. We just backport it and it works well.

2015-11-06 13:07 GMT+08:00 Naganarasimha G R (Naga) <
garlanaganarasimha@huawei.com>:

> Hi Krzysiek,
>
>
>
> *There are currently 8 Spark Streaming jobs constantly running, each 3
> with 1 second batch, 5 x 10 s. I believe these are the jobs that publish to
> ATS.  How could I check what precisely is doing what or how to get some
> logs about it, I don't know...*
>
> Not sure about the applications being run and if you have already tried
> disabling the "Spark History Server doing the puts ATS" then not sure if
> the apps are sending it out . AFAIK Spark history server had not integrated
> with ATS (SPARK-1537). So most propably its the applications which are
> pumping in the data. I think you need to check with them itself.
>
>
> *2. Is 8 concurrent Spark Streaming jobs really that high for
> Timelineserver? I have just a small cluster, how other larger companies are
> handling much larger load? *
>
> It has not been used at large scale by us, but according to YARN-2556 (ATS
> Performance Test Tool), "On a 36 node cluster, this results in ~830
> concurrent containers (e.g. maps), each firing 10KB of payload, 20 times."
> The one thing that is different is that the data in your system is already
> overloaded, hence the cost of querying (which currently happens during each
> insertion) is very high.
>
> Maybe people from other companies who have used or supported ATS v1 can
> speak better to the scale ATS v1 can handle!
>
>
> Regards,
>
> + Naga
> ------------------------------
> *From:* Krzysztof Zarzycki [k.zarzycki@gmail.com]
> *Sent:* Thursday, November 05, 2015 19:51
> *To:* user@hadoop.apache.org
> *Subject:* Re: YARN timelineserver process taking 600% CPU
>
> Thanks Naga for your input,  (I'm sorry for a late response, I was out for
> some time).
>
> So you believe that Spark is actually doing the PUTs? There are currently
> 8 Spark Streaming jobs constantly running: 3 with a 1-second batch interval
> and 5 with a 10-second one. I believe these are the jobs that publish to
> ATS. How I could check what exactly is doing the puts, or how to get some
> logs about it, I don't know...
> I thought maybe it was the Spark History Server doing the puts, but it
> seems it is not, as I disabled it and the load hasn't gone down. So it
> seems it is indeed the jobs themselves.
>
> Now I have the following problems:
> 1. The most important: how can I at least *work around* this issue? Maybe
> I can somehow disable Spark's usage of the YARN timelineserver? What are
> the consequences? Is it only that the history of finished Spark jobs won't
> be saved? If so, that doesn't hurt that much. Probably this is a question
> for the Spark group...
> 2. Are 8 concurrent Spark Streaming jobs really that high a load for the
> Timelineserver? I have just a small cluster; how are other, larger
> companies handling much larger loads?
>
> Thanks for helping me with this!
> Krzysiek
>
>
>
>
>
>
>
>
>
>
> 2015-10-05 20:45 GMT+02:00 Naganarasimha Garla <naganarasimha.gr@gmail.com
> >:
>
>> Hi Krzysiek,
>> Oops, my mistake; 3 GB does seem to be on the higher side.
>> From the jstack it looks like there was no major activity other than
>> puts: around 16 concurrent puts were happening, each of which tries to get
>> the timeline entity and hence hits the native call.
>>
>> From the logs it seems a lot of ACL validations are happening, and from
>> the URL it seems they are for PUT entities.
>> Approximately from 09:30:16 to 09:44:26 about 9213 checks happened; if
>> all of these are for puts, then roughly 10 put calls/s are coming from the
>> *spark* side. I feel this is not the right usage of ATS; can you check
>> what is being published from Spark to ATS at this high rate?
>>
>> Besides, some improvements to the timeline metrics are available in
>> trunk as part of YARN-3360, which could have been useful in analyzing your
>> issue.
>>
>> + Naga
>>
>>
>> On Mon, Oct 5, 2015 at 1:19 PM, Krzysztof Zarzycki <k....@gmail.com>
>> wrote:
>>
>>> Hi Naga,
>>> Sorry, but it's not 3 MB, but 3 GB in leveldb-timeline-store (du shows
>>> sizes in kB). Does that seem reasonable as well?
>>> There are now 26850 files in the leveldb-timeline-store directory; new
>>> .sst files are generated each minute, and some are also being deleted.
>>>
>>> I started the timeline server today to gather logs and jstack output;
>>> it ran for ~20 minutes. I attach a tar.bz2 archive with those logs.
>>>
>>> Thank you for helping me debug this.
>>> Krzysiek
>>>
>>>
>>>
>>>
>>>
>>> 2015-09-30 21:00 GMT+02:00 Naganarasimha Garla <
>>> naganarasimha.gr@gmail.com>:
>>>
>>>> Hi Krzysiek,
>>>> The size seems to be around 3 MB, which looks fine.
>>>> Could you try enabling debug logging and sharing the ATS/AHS logs, and
>>>> if possible the jstack output for the AHS process?
>>>>
>>>> + Naga
>>>>
>>>> On Wed, Sep 30, 2015 at 10:27 PM, Krzysztof Zarzycki <
>>>> k.zarzycki@gmail.com> wrote:
>>>>
>>>>> Hi Naga,
>>>>> I see the following size:
>>>>> $ sudo du --max=1 /var/lib/hadoop/yarn/timeline
>>>>> 36      /var/lib/hadoop/yarn/timeline/timeline-state-store.ldb
>>>>> 3307772 /var/lib/hadoop/yarn/timeline/leveldb-timeline-store.ldb
>>>>> 3307812 /var/lib/hadoop/yarn/timeline
>>>>>
>>>>> The timeline service has been restarted multiple times as I was
>>>>> looking into the issue with it, but it was installed about 2 months
>>>>> ago. Just a few applications (1? 2?) have been started since its last
>>>>> restart. The ResourceManager interface shows 261 entries.
>>>>>
>>>>> As in yarn-site.xml that I attached, the variable you're asking for
>>>>> has the following value:
>>>>>
>>>>> <property>
>>>>>
>>>>>   <name>yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms</name>
>>>>>       <value>300000</value>
>>>>> </property>
>>>>>
>>>>>
>>>>> Ah, one more thing: when I looked with jstack to see what the process
>>>>> is doing, I saw threads spending time in the NATIVE state in the
>>>>> leveldbjni library. So I *think* it is related to the leveldb store.
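
A quick way to quantify that observation from a saved jstack dump (the package name org.fusesource.leveldbjni is the one leveldbjni uses, to my knowledge; the dump file name is just an example):

```shell
# Count threads executing native leveldbjni code in a saved jstack dump.
# "ahs-jstack.txt" is a hypothetical file produced by:
#   jstack <AHS-pid> > ahs-jstack.txt
grep -c 'org\.fusesource\.leveldbjni' ahs-jstack.txt
```

Sampling this a few times over a minute gives a rough idea of how many threads are stuck in the native leveldb calls.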
>>>>>
>>>>> Please ask if any more information is needed.
>>>>> Any help is appreciated! Thanks
>>>>> Krzysiek
>>>>>
>>>>> 2015-09-30 16:23 GMT+02:00 Naganarasimha G R (Naga) <
>>>>> garlanaganarasimha@huawei.com>:
>>>>>
>>>>>> Hi ,
>>>>>>
>>>>>> What's the size of the store files?
>>>>>> Since when has it been running? How many applications have been run
>>>>>> since it started?
>>>>>> What's the value of
>>>>>> "yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms"?
>>>>>>
>>>>>> + Naga
>>>>>> ------------------------------
>>>>>> *From:* Krzysztof Zarzycki [k.zarzycki@gmail.com]
>>>>>> *Sent:* Wednesday, September 30, 2015 19:20
>>>>>> *To:* user@hadoop.apache.org
>>>>>> *Subject:* YARN timelineserver process taking 600% CPU
>>>>>>
>>>>>> Hi there Hadoopers,
>>>>>> I have a serious issue with my installation of Hadoop & YARN in
>>>>>> version 2.7.1 (HDP 2.3).
>>>>>> The timelineserver process ( more
>>>>>> precisely org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>>>>>> class) takes over 600% of CPU, generating enormous load on my master node.
>>>>>> I can't guess why it happens.
>>>>>>
>>>>>> First, I run the timelineserver using java 8, thought that this was
>>>>>> an issue. But no, I started timelineserver now with use of java 7 and still
>>>>>> the problem is the same.
>>>>>>
>>>>>> My cluster is tiny- it consists of:
>>>>>> - 2 HDFS nodes
>>>>>> - 2 HBase RegionServers
>>>>>> - 2 Kafkas
>>>>>> - 2 Spark nodes
>>>>>> - 8 Spark Streaming jobs, processing around 100 messages/second
>>>>>> TOTAL.
>>>>>>
>>>>>> I'll be very grateful for your help here. If you need any more info,
>>>>>> please write.
>>>>>> I also attach yarn-site.xml grepped to options related to timeline
>>>>>> server.
>>>>>>
>>>>>> And here is a command of timeline that I see from ps :
>>>>>> /usr/java/jdk1.7.0_79/bin/java -Dproc_timelineserver -Xmx1024m
>>>>>> -Dhdp.version=2.3.0.0-2557 -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn
>>>>>> -Dyarn.log.dir=/var/log/hadoop-yarn/yarn
>>>>>> -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>>>>>> -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.home.dir=
>>>>>> -Dyarn.id.str=yarn -Dhadoop.root.logger=INFO,EWMA,RFA
>>>>>> -Dyarn.root.logger=INFO,EWMA,RFA
>>>>>> -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native
>>>>>> -Dyarn.policy.file=hadoop-policy.xml
>>>>>> -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn
>>>>>> -Dyarn.log.dir=/var/log/hadoop-yarn/yarn
>>>>>> -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>>>>>> -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>>>>>> -Dyarn.home.dir=/usr/hdp/current/hadoop-yarn-timelineserver
>>>>>> -Dhadoop.home.dir=/usr/hdp/2.3.0.0-2557/hadoop
>>>>>> -Dhadoop.root.logger=INFO,EWMA,RFA -Dyarn.root.logger=INFO,EWMA,RFA
>>>>>> -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native
>>>>>> -classpath
>>>>>> /usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/2.3.0.0-2557/hadoop/lib/*:/usr/hdp/2.3.0.0-2557/hadoop/.//*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/./:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/.//*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/.//*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/.//*:::/usr/share/java/mysql-connector-java.jar::/usr/share/java/mysql-connector-java.jar:/usr/hdp/current/hadoop-yarn-timelineserver/.//*:/usr/hdp/current/hadoop-yarn-timelineserver/lib/*:/usr/hdp/current/hadoop-client/conf/timelineserver-config/log4j.properties
>>>>>> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>>>>>>
>>>>>>
>>>>>> Thanks!
>>>>>> Krzysztof
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

RE: YARN timelineserver process taking 600% CPU

Posted by "Naganarasimha G R (Naga)" <ga...@huawei.com>.
Hi Krzysiek,



There are currently 8 Spark Streaming jobs constantly running: 3 with a 1-second batch interval and 5 with a 10-second one. I believe these are the jobs that publish to ATS. How I could check what exactly is doing the puts, or how to get some logs about it, I don't know...

I'm not sure about the applications being run, but if you have already tried disabling the Spark History Server's puts to ATS and the load stayed the same, then it must be the apps sending them out. AFAIK the Spark History Server has not been integrated with ATS (SPARK-1537), so most probably it is the applications that are pumping in the data. I think you need to check with them directly.


2. Are 8 concurrent Spark Streaming jobs really that high a load for the Timelineserver? I have just a small cluster; how are other, larger companies handling much larger loads?

It has not been used at large scale by us, but according to YARN-2556 (ATS Performance Test Tool), "On a 36 node cluster, this results in ~830 concurrent containers (e.g. maps), each firing 10KB of payload, 20 times." The one thing that is different is that the data in your system is already overloaded, hence the cost of querying (which currently happens during each insertion) is very high.

Maybe people from other companies who have used or supported ATS v1 can speak better to the scale ATS v1 can handle!


Regards,

+ Naga

________________________________
From: Krzysztof Zarzycki [k.zarzycki@gmail.com]
Sent: Thursday, November 05, 2015 19:51
To: user@hadoop.apache.org
Subject: Re: YARN timelineserver process taking 600% CPU

Thanks Naga for your input,  (I'm sorry for a late response, I was out for some time).

So you believe that Spark is actually doing the PUTs? There are currently 8 Spark Streaming jobs constantly running: 3 with a 1-second batch interval and 5 with a 10-second one. I believe these are the jobs that publish to ATS. How I could check what exactly is doing the puts, or how to get some logs about it, I don't know...
I thought maybe it was the Spark History Server doing the puts, but it seems it is not, as I disabled it and the load hasn't gone down. So it seems it is indeed the jobs themselves.

Now I have the following problems:
1. The most important: how can I at least work around this issue? Maybe I can somehow disable Spark's usage of the YARN timeline server? What are the consequences? Is it only the history of finished Spark jobs that would not be saved? If so, that doesn't hurt much. This is probably a question for the Spark list...
2. Are 8 concurrent Spark Streaming jobs really that much load for the timeline server? I have just a small cluster; how do larger companies handle much bigger loads?
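[Editor's note on question 1: a commonly suggested client-side workaround, given here only as a sketch and not verified on this cluster, is to switch off the timeline client in the yarn-site.xml that the Spark client/AM reads, so those applications stop publishing entities. The cost is exactly what the question guesses: generic application history for those apps is no longer recorded in ATS.]

```xml
<!-- In the yarn-site.xml visible to the Spark client/ApplicationMaster.
     Disables the timeline client for those applications only;
     the timeline server itself keeps running for everything else. -->
<property>
  <name>yarn.timeline-service.enabled</name>
  <value>false</value>
</property>
```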

Thanks for helping me with this!
Krzysiek

2015-10-05 20:45 GMT+02:00 Naganarasimha Garla <na...@gmail.com>>:
Hi Krzysiek,
Oops, my mistake; 3 GB does seem to be on the higher side.
And from the jstack it seems there was no major activity other than puts: around 16 concurrent puts were happening, each of which tries to get the timeline entity and hence hits the native call.

From the logs it seems a lot of ACL validations are happening, and from the URL it seems they are for PutEntities.
Approximately from 09:30:16 to 09:44:26, about 9213 checks happened; if all of these are for puts, then roughly 10 put calls/s are coming from the Spark side. I don't feel this is the right usage of ATS; can you check what Spark is publishing to ATS at such a high rate?
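[Editor's note: the ~10 puts/s estimate above follows directly from the numbers quoted from the log; a quick check:]

```shell
# 9213 ACL checks between 09:30:16 and 09:44:26 -> rough put rate
checks=9213
start=$(( 9*3600 + 30*60 + 16 ))   # 09:30:16 as seconds since midnight
end=$(( 9*3600 + 44*60 + 26 ))     # 09:44:26 as seconds since midnight
elapsed=$(( end - start ))         # 850 seconds
rate=$(( checks / elapsed ))       # integer division: ~10 puts/s
echo "elapsed=${elapsed}s rate=${rate}/s"
```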

Besides, some improvements regarding the timeline metrics are available in trunk as part of YARN-3360, which could have been useful in analyzing your issue.

+ Naga


On Mon, Oct 5, 2015 at 1:19 PM, Krzysztof Zarzycki <k....@gmail.com>> wrote:
Hi Naga,
Sorry, but it's not 3 MB, it's 3 GB in leveldb-timeline-store (du shows numbers in kB). Does that still seem reasonable?
There are now 26850 files in the leveldb-timeline-store directory; new .sst files are generated each minute, and some are also being deleted.

I started the timeline server today to gather logs and jstack output; it ran for ~20 minutes. I attach a tar.bz2 archive with those logs.

Thank you for helping me debug this.
Krzysiek





2015-09-30 21:00 GMT+02:00 Naganarasimha Garla <na...@gmail.com>>:
Hi Krzysiek,
The size, around 3 MB, seems to be fine.
Could you try enabling debug logging and share the ATS/AHS logs, and if possible the jstack output for the AHS process?

+ Naga

On Wed, Sep 30, 2015 at 10:27 PM, Krzysztof Zarzycki <k....@gmail.com>> wrote:
Hi Naga,
I see the following size:
$ sudo du --max=1 /var/lib/hadoop/yarn/timeline
36      /var/lib/hadoop/yarn/timeline/timeline-state-store.ldb
3307772 /var/lib/hadoop/yarn/timeline/leveldb-timeline-store.ldb
3307812 /var/lib/hadoop/yarn/timeline

The timeline service has been restarted multiple times while I was investigating the issue, but it was installed about 2 months ago. Only a few applications (1? 2?) have been started since its last restart. The ResourceManager interface shows 261 entries.

As in the yarn-site.xml that I attached, the property you are asking about has the following value:

<property>
  <name>yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms</name>
  <value>300000</value>
</property>
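[Editor's note: the property above only controls how often the TTL eviction runs. Retention itself is governed by the TTL settings below; shortening the TTL is one way a 3 GB store can be made to shrink over time. A sketch only; the 2-day value is an example, not a recommendation for this cluster (the default retention is 7 days, i.e. 604800000 ms).]

```xml
<property>
  <name>yarn.timeline-service.ttl-enable</name>
  <value>true</value>
</property>
<property>
  <!-- example: keep entities for 2 days instead of the 7-day default -->
  <name>yarn.timeline-service.ttl-ms</name>
  <value>172800000</value>
</property>
```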

Ah, one more thing: when I looked with jstack to see what the process is doing, I saw threads spending time in native code in the leveldbjni library. So I *think* it is related to the leveldb store.

Please ask if any more information is needed.
Any help is appreciated! Thanks
Krzysiek

2015-09-30 16:23 GMT+02:00 Naganarasimha G R (Naga) <ga...@huawei.com>>:
Hi,

What's the size of the store files?
How long has it been running? How many applications have been run since it started?
What's the value of "yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms"?

+ Naga
________________________________
From: Krzysztof Zarzycki [k.zarzycki@gmail.com<ma...@gmail.com>]
Sent: Wednesday, September 30, 2015 19:20
To: user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: YARN timelineserver process taking 600% CPU

Hi there Hadoopers,
I have a serious issue with my installation of Hadoop & YARN, version 2.7.1 (HDP 2.3).
The timelineserver process (more precisely, the org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer class) takes over 600% of CPU, generating enormous load on my master node. I can't figure out why this happens.

First, I ran the timelineserver with Java 8 and thought that was the issue. But no: I have now started it with Java 7, and the problem is still the same.

My cluster is tiny; it consists of:
- 2 HDFS nodes
- 2 HBase RegionServers
- 2 Kafkas
- 2 Spark nodes
- 8 Spark Streaming jobs, processing around 100 messages/second TOTAL.

I'll be very grateful for your help here. If you need any more info, please write.
I also attach yarn-site.xml grepped to options related to timeline server.

And here is a command of timeline that I see from ps :
/usr/java/jdk1.7.0_79/bin/java -Dproc_timelineserver -Xmx1024m -Dhdp.version=2.3.0.0-2557 -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn -Dyarn.log.dir=/var/log/hadoop-yarn/yarn -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.home.dir= -Dyarn.id.str=yarn -Dhadoop.root.logger=INFO,EWMA,RFA -Dyarn.root.logger=INFO,EWMA,RFA -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native -Dyarn.policy.file=hadoop-policy.xml -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn -Dyarn.log.dir=/var/log/hadoop-yarn/yarn -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.home.dir=/usr/hdp/current/hadoop-yarn-timelineserver -Dhadoop.home.dir=/usr/hdp/2.3.0.0-2557/hadoop -Dhadoop.root.logger=INFO,EWMA,RFA -Dyarn.root.logger=INFO,EWMA,RFA -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native -classpath /usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/2.3.0.0-2557/hadoop/lib/*:/usr/hdp/2.3.0.0-2557/hadoop/.//*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/./:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/.//*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/.//*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/.//*:::/usr/share/java/mysql-connector-java.jar::/usr/share/java/mysql-connector-java.jar:/usr/hdp/current/hadoop-yarn-timelineserver/.//*:/usr/hdp/current/hadoop-yarn-timelineserver/lib/*:/usr/hdp/current/hadoop-client/conf/timelineserver-config/log4j.properties 
org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer


Thanks!
Krzysztof

RE: YARN timelineserver process taking 600% CPU

Posted by "Naganarasimha G R (Naga)" <ga...@huawei.com>.
Hi Krzysiek,



There are currently 8 Spark Streaming jobs constantly running, each 3 with 1 second batch, 5 x 10 s. I believe these are the jobs that publish to ATS.  How could I check what precisely is doing what or how to get some logs about it, I don't know...

Not sure about the applications being run and if you have already tried disabling the "Spark History Server doing the puts ATS" then not sure if the apps are sending it out . AFAIK Spark history server had not integrated with ATS (SPARK-1537). So most propably its the applications which are pumping in the data. I think you need to check with them itself.


2. Is 8 concurrent Spark Streaming jobs really that high for Timelineserver? I have just a small cluster, how other larger companies are handling much larger load?

Its not been used in large scale by us but according YARN-2556 (ATS Performance Test Tool), it states that "On a 36 node cluster, this results in ~830 concurrent containers (e.g maps), each firing 10KB of payload, 20 times." but only thing being different is, data in your system is already overloaded hence cost of querying (which is currently happening during each insertion) is very high.

May be guys from other company who have used or supported ATSV1 might be able to tell the ATSV1 scale better !


Regards,

+ Naga

________________________________
From: Krzysztof Zarzycki [k.zarzycki@gmail.com]
Sent: Thursday, November 05, 2015 19:51
To: user@hadoop.apache.org
Subject: Re: YARN timelineserver process taking 600% CPU

Thanks Naga for your input,  (I'm sorry for a late response, I was out for some time).

So you believe that Spark is actually doing the PUTs? There are currently 8 Spark Streaming jobs constantly running, each 3 with 1 second batch, 5 x 10 s. I believe these are the jobs that publish to ATS.  How could I check what precisely is doing what or how to get some logs about it, I don't know...
I though maybe it is Spark History Server doing the puts, but it seems it is not, as I disabled it and the load hasn't gone down. So it seems these are the jobs itself indeed.

Now I have the following problems:
1. The most important: How can I at least workaround this issue? Maybe I will somehow disable Spark usage of Yarn timelineserver ? What are the consequences? Is it only history of Spark finished jobs not being saved? If yes, that doesn't hurt that much. Probably this is a question to Spark group...
2. Is 8 concurrent Spark Streaming jobs really that high for Timelineserver? I have just a small cluster, how other larger companies are handling much larger load?

Thanks for helping me with this!
Krzysiek










2015-10-05 20:45 GMT+02:00 Naganarasimha Garla <na...@gmail.com>>:
Hi Krzysiek,
Oops My mistake, 3 Gb seems to be on little higher side.
And from the jstack it seems like there were no major activity other than puts seems like around 16 concurrent puts were happening which tries to get the timeline Entity hence hitting the native call.

>From the logs it seems like lot of ACL validations are happening and from the URL it seems like its for PUTEntites.
approximately from 09:30:16 to 09:44:26 about 9213 checks have happened and if all of these are for puts then roughly about 10 put calls/s is happening from spark side. This i feel is not right usage of ATS, can you check what is being published from the spark to ATS at this high rate ?

Besides some improvements regarding the timeline metrics is available in trunk as part of YARN-3360 which could have been useful in analyzing your issue.

+ Naga


On Mon, Oct 5, 2015 at 1:19 PM, Krzysztof Zarzycki <k....@gmail.com>> wrote:
Hi Naga,
Sorry, but it's not 3MB, but 3GB in leveldb-timeline-store (du shows numbers in kB). Does that seems reasonable as well?
There are new .sst files generated each minute.
There are now 26850 files in leveldb-timeline-store directory. New files are generated each minute. Some are also being deleted.

I started timeline server today, to gather logs and jstack, it was running for ~20 minutes. I attach the tar bz2 archive with those logs.

Thank you for helping me debug this.
Krzysiek





2015-09-30 21:00 GMT+02:00 Naganarasimha Garla <na...@gmail.com>>:
Hi Krzysiek,
seems like the size is around 3 MB which seems to be fine. ,
Could you try enabling in debug and share the logs of ATS/AHS and also if possible the jstack output for the AHS process

+ Naga

On Wed, Sep 30, 2015 at 10:27 PM, Krzysztof Zarzycki <k....@gmail.com>> wrote:
Hi Naga,
I see the following size:
$ sudo du --max=1 /var/lib/hadoop/yarn/timeline
36      /var/lib/hadoop/yarn/timeline/timeline-state-store.ldb
3307772 /var/lib/hadoop/yarn/timeline/leveldb-timeline-store.ldb
3307812 /var/lib/hadoop/yarn/timeline

The timeline service has been multiple times restarted as I was looking for issue with it. But it was installed about a 2 months ago. Just few applications (1?2? ) has been started since its last start. The ResourceManager interface has 261 entries.

As in yarn-site.xml that I attached, the variable you're asking for has the following value:

<property>

  <name>yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms</name>
      <value>300000</value>
</property>

Ah, One more thing: When I looked with jstack to see what the process is doing, I saw threads spending time in NATIVE in leveldbjni library. So I *think* it is related to leveldb store.

Please ask if any more information is needed.
Any help is appreciated! Thanks
Krzysiek

2015-09-30 16:23 GMT+02:00 Naganarasimha G R (Naga) <ga...@huawei.com>>:
Hi ,

Whats the size of Store Files?
Since when is it running ? how many applications have been run since it has been started ?
Whats the value of "yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms" ?

+ Naga
________________________________
From: Krzysztof Zarzycki [k.zarzycki@gmail.com<ma...@gmail.com>]
Sent: Wednesday, September 30, 2015 19:20
To: user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: YARN timelineserver process taking 600% CPU

Hi there Hadoopers,
I have a serious issue with my installation of Hadoop & YARN in version 2.7.1 (HDP 2.3).
The timelineserver process ( more precisely org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer class) takes over 600% of CPU, generating enormous load on my master node. I can't guess why it happens.

First, I run the timelineserver using java 8, thought that this was an issue. But no, I started timelineserver now with use of java 7 and still the problem is the same.

My cluster is tiny- it consists of:
- 2 HDFS nodes
- 2 HBase RegionServers
- 2 Kafkas
- 2 Spark nodes
- 8 Spark Streaming jobs, processing around 100 messages/second TOTAL.

I'll be very grateful for your help here. If you need any more info, please write.
I also attach yarn-site.xml grepped to options related to timeline server.

And here is a command of timeline that I see from ps :
/usr/java/jdk1.7.0_79/bin/java -Dproc_timelineserver -Xmx1024m -Dhdp.version=2.3.0.0-2557 -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn -Dyarn.log.dir=/var/log/hadoop-yarn/yarn -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.home.dir= -Dyarn.id.str=yarn -Dhadoop.root.logger=INFO,EWMA,RFA -Dyarn.root.logger=INFO,EWMA,RFA -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native -Dyarn.policy.file=hadoop-policy.xml -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn -Dyarn.log.dir=/var/log/hadoop-yarn/yarn -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.home.dir=/usr/hdp/current/hadoop-yarn-timelineserver -Dhadoop.home.dir=/usr/hdp/2.3.0.0-2557/hadoop -Dhadoop.root.logger=INFO,EWMA,RFA -Dyarn.root.logger=INFO,EWMA,RFA -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native -classpath /usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/2.3.0.0-2557/hadoop/lib/*:/usr/hdp/2.3.0.0-2557/hadoop/.//*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/./:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/.//*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/.//*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/.//*:::/usr/share/java/mysql-connector-java.jar::/usr/share/java/mysql-connector-java.jar:/usr/hdp/current/hadoop-yarn-timelineserver/.//*:/usr/hdp/current/hadoop-yarn-timelineserver/lib/*:/usr/hdp/current/hadoop-client/conf/timelineserver-config/log4j.properties 
org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer


Thanks!
Krzysztof







RE: YARN timelineserver process taking 600% CPU

Posted by "Naganarasimha G R (Naga)" <ga...@huawei.com>.
Hi Krzysiek,



There are currently 8 Spark Streaming jobs constantly running, each 3 with 1 second batch, 5 x 10 s. I believe these are the jobs that publish to ATS.  How could I check what precisely is doing what or how to get some logs about it, I don't know...

Not sure about the applications being run and if you have already tried disabling the "Spark History Server doing the puts ATS" then not sure if the apps are sending it out . AFAIK Spark history server had not integrated with ATS (SPARK-1537). So most propably its the applications which are pumping in the data. I think you need to check with them itself.


2. Is 8 concurrent Spark Streaming jobs really that high for Timelineserver? I have just a small cluster, how other larger companies are handling much larger load?

Its not been used in large scale by us but according YARN-2556 (ATS Performance Test Tool), it states that "On a 36 node cluster, this results in ~830 concurrent containers (e.g maps), each firing 10KB of payload, 20 times." but only thing being different is, data in your system is already overloaded hence cost of querying (which is currently happening during each insertion) is very high.

May be guys from other company who have used or supported ATSV1 might be able to tell the ATSV1 scale better !


Regards,

+ Naga

________________________________
From: Krzysztof Zarzycki [k.zarzycki@gmail.com]
Sent: Thursday, November 05, 2015 19:51
To: user@hadoop.apache.org
Subject: Re: YARN timelineserver process taking 600% CPU

Thanks Naga for your input,  (I'm sorry for a late response, I was out for some time).

So you believe that Spark is actually doing the PUTs? There are currently 8 Spark Streaming jobs constantly running, each 3 with 1 second batch, 5 x 10 s. I believe these are the jobs that publish to ATS.  How could I check what precisely is doing what or how to get some logs about it, I don't know...
I though maybe it is Spark History Server doing the puts, but it seems it is not, as I disabled it and the load hasn't gone down. So it seems these are the jobs itself indeed.

Now I have the following problems:
1. The most important: How can I at least workaround this issue? Maybe I will somehow disable Spark usage of Yarn timelineserver ? What are the consequences? Is it only history of Spark finished jobs not being saved? If yes, that doesn't hurt that much. Probably this is a question to Spark group...
2. Is 8 concurrent Spark Streaming jobs really that high for Timelineserver? I have just a small cluster, how other larger companies are handling much larger load?

Thanks for helping me with this!
Krzysiek










2015-10-05 20:45 GMT+02:00 Naganarasimha Garla <na...@gmail.com>>:
Hi Krzysiek,
Oops My mistake, 3 Gb seems to be on little higher side.
And from the jstack it seems like there were no major activity other than puts seems like around 16 concurrent puts were happening which tries to get the timeline Entity hence hitting the native call.

>From the logs it seems like lot of ACL validations are happening and from the URL it seems like its for PUTEntites.
approximately from 09:30:16 to 09:44:26 about 9213 checks have happened and if all of these are for puts then roughly about 10 put calls/s is happening from spark side. This i feel is not right usage of ATS, can you check what is being published from the spark to ATS at this high rate ?

Besides some improvements regarding the timeline metrics is available in trunk as part of YARN-3360 which could have been useful in analyzing your issue.

+ Naga


On Mon, Oct 5, 2015 at 1:19 PM, Krzysztof Zarzycki <k....@gmail.com>> wrote:
Hi Naga,
Sorry, but it's not 3MB, but 3GB in leveldb-timeline-store (du shows numbers in kB). Does that seems reasonable as well?
There are new .sst files generated each minute.
There are now 26850 files in leveldb-timeline-store directory. New files are generated each minute. Some are also being deleted.

I started timeline server today, to gather logs and jstack, it was running for ~20 minutes. I attach the tar bz2 archive with those logs.

Thank you for helping me debug this.
Krzysiek





2015-09-30 21:00 GMT+02:00 Naganarasimha Garla <na...@gmail.com>>:
Hi Krzysiek,
seems like the size is around 3 MB which seems to be fine. ,
Could you try enabling in debug and share the logs of ATS/AHS and also if possible the jstack output for the AHS process

+ Naga

On Wed, Sep 30, 2015 at 10:27 PM, Krzysztof Zarzycki <k....@gmail.com>> wrote:
Hi Naga,
I see the following size:
$ sudo du --max=1 /var/lib/hadoop/yarn/timeline
36      /var/lib/hadoop/yarn/timeline/timeline-state-store.ldb
3307772 /var/lib/hadoop/yarn/timeline/leveldb-timeline-store.ldb
3307812 /var/lib/hadoop/yarn/timeline

The timeline service has been multiple times restarted as I was looking for issue with it. But it was installed about a 2 months ago. Just few applications (1?2? ) has been started since its last start. The ResourceManager interface has 261 entries.

As in yarn-site.xml that I attached, the variable you're asking for has the following value:

<property>

  <name>yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms</name>
      <value>300000</value>
</property>

Ah, One more thing: When I looked with jstack to see what the process is doing, I saw threads spending time in NATIVE in leveldbjni library. So I *think* it is related to leveldb store.

Please ask if any more information is needed.
Any help is appreciated! Thanks
Krzysiek

2015-09-30 16:23 GMT+02:00 Naganarasimha G R (Naga) <ga...@huawei.com>>:
Hi ,

Whats the size of Store Files?
Since when is it running ? how many applications have been run since it has been started ?
Whats the value of "yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms" ?

+ Naga
________________________________
From: Krzysztof Zarzycki [k.zarzycki@gmail.com<ma...@gmail.com>]
Sent: Wednesday, September 30, 2015 19:20
To: user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: YARN timelineserver process taking 600% CPU

Hi there Hadoopers,
I have a serious issue with my installation of Hadoop & YARN in version 2.7.1 (HDP 2.3).
The timelineserver process ( more precisely org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer class) takes over 600% of CPU, generating enormous load on my master node. I can't guess why it happens.

First, I run the timelineserver using java 8, thought that this was an issue. But no, I started timelineserver now with use of java 7 and still the problem is the same.

My cluster is tiny- it consists of:
- 2 HDFS nodes
- 2 HBase RegionServers
- 2 Kafkas
- 2 Spark nodes
- 8 Spark Streaming jobs, processing around 100 messages/second TOTAL.

I'll be very grateful for your help here. If you need any more info, please write.
I also attach yarn-site.xml grepped to options related to timeline server.

And here is a command of timeline that I see from ps :
/usr/java/jdk1.7.0_79/bin/java -Dproc_timelineserver -Xmx1024m -Dhdp.version=2.3.0.0-2557 -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn -Dyarn.log.dir=/var/log/hadoop-yarn/yarn -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.home.dir= -Dyarn.id.str=yarn -Dhadoop.root.logger=INFO,EWMA,RFA -Dyarn.root.logger=INFO,EWMA,RFA -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native -Dyarn.policy.file=hadoop-policy.xml -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn -Dyarn.log.dir=/var/log/hadoop-yarn/yarn -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.home.dir=/usr/hdp/current/hadoop-yarn-timelineserver -Dhadoop.home.dir=/usr/hdp/2.3.0.0-2557/hadoop -Dhadoop.root.logger=INFO,EWMA,RFA -Dyarn.root.logger=INFO,EWMA,RFA -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native -classpath /usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/2.3.0.0-2557/hadoop/lib/*:/usr/hdp/2.3.0.0-2557/hadoop/.//*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/./:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/.//*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/.//*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/.//*:::/usr/share/java/mysql-connector-java.jar::/usr/share/java/mysql-connector-java.jar:/usr/hdp/current/hadoop-yarn-timelineserver/.//*:/usr/hdp/current/hadoop-yarn-timelineserver/lib/*:/usr/hdp/current/hadoop-client/conf/timelineserver-config/log4j.properties 
org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer


Thanks!
Krzysztof







RE: YARN timelineserver process taking 600% CPU

Posted by "Naganarasimha G R (Naga)" <ga...@huawei.com>.
Hi Krzysiek,



There are currently 8 Spark Streaming jobs constantly running, each 3 with 1 second batch, 5 x 10 s. I believe these are the jobs that publish to ATS.  How could I check what precisely is doing what or how to get some logs about it, I don't know...

Not sure about the applications being run and if you have already tried disabling the "Spark History Server doing the puts ATS" then not sure if the apps are sending it out . AFAIK Spark history server had not integrated with ATS (SPARK-1537). So most propably its the applications which are pumping in the data. I think you need to check with them itself.


2. Is 8 concurrent Spark Streaming jobs really that high for Timelineserver? I have just a small cluster, how other larger companies are handling much larger load?

Its not been used in large scale by us but according YARN-2556 (ATS Performance Test Tool), it states that "On a 36 node cluster, this results in ~830 concurrent containers (e.g maps), each firing 10KB of payload, 20 times." but only thing being different is, data in your system is already overloaded hence cost of querying (which is currently happening during each insertion) is very high.

May be guys from other company who have used or supported ATSV1 might be able to tell the ATSV1 scale better !


Regards,

+ Naga

________________________________
From: Krzysztof Zarzycki [k.zarzycki@gmail.com]
Sent: Thursday, November 05, 2015 19:51
To: user@hadoop.apache.org
Subject: Re: YARN timelineserver process taking 600% CPU

Thanks Naga for your input,  (I'm sorry for a late response, I was out for some time).

So you believe that Spark is actually doing the PUTs? There are currently 8 Spark Streaming jobs constantly running, each 3 with 1 second batch, 5 x 10 s. I believe these are the jobs that publish to ATS.  How could I check what precisely is doing what or how to get some logs about it, I don't know...
I though maybe it is Spark History Server doing the puts, but it seems it is not, as I disabled it and the load hasn't gone down. So it seems these are the jobs itself indeed.

Now I have the following problems:
1. The most important: How can I at least workaround this issue? Maybe I will somehow disable Spark usage of Yarn timelineserver ? What are the consequences? Is it only history of Spark finished jobs not being saved? If yes, that doesn't hurt that much. Probably this is a question to Spark group...
2. Is 8 concurrent Spark Streaming jobs really that high for Timelineserver? I have just a small cluster, how other larger companies are handling much larger load?

Thanks for helping me with this!
Krzysiek










2015-10-05 20:45 GMT+02:00 Naganarasimha Garla <na...@gmail.com>>:
Hi Krzysiek,
Oops My mistake, 3 Gb seems to be on little higher side.
And from the jstack it seems like there were no major activity other than puts seems like around 16 concurrent puts were happening which tries to get the timeline Entity hence hitting the native call.

>From the logs it seems like lot of ACL validations are happening and from the URL it seems like its for PUTEntites.
approximately from 09:30:16 to 09:44:26 about 9213 checks have happened and if all of these are for puts then roughly about 10 put calls/s is happening from spark side. This i feel is not right usage of ATS, can you check what is being published from the spark to ATS at this high rate ?

Besides some improvements regarding the timeline metrics is available in trunk as part of YARN-3360 which could have been useful in analyzing your issue.

+ Naga


On Mon, Oct 5, 2015 at 1:19 PM, Krzysztof Zarzycki <k....@gmail.com>> wrote:
Hi Naga,
Sorry, but it's not 3MB, but 3GB in leveldb-timeline-store (du shows numbers in kB). Does that seems reasonable as well?
There are new .sst files generated each minute.
There are now 26850 files in leveldb-timeline-store directory. New files are generated each minute. Some are also being deleted.

I started timeline server today, to gather logs and jstack, it was running for ~20 minutes. I attach the tar bz2 archive with those logs.

Thank you for helping me debug this.
Krzysiek





2015-09-30 21:00 GMT+02:00 Naganarasimha Garla <na...@gmail.com>>:
Hi Krzysiek,
seems like the size is around 3 MB which seems to be fine. ,
Could you try enabling in debug and share the logs of ATS/AHS and also if possible the jstack output for the AHS process

+ Naga

On Wed, Sep 30, 2015 at 10:27 PM, Krzysztof Zarzycki <k....@gmail.com>> wrote:
Hi Naga,
I see the following size:
$ sudo du --max=1 /var/lib/hadoop/yarn/timeline
36      /var/lib/hadoop/yarn/timeline/timeline-state-store.ldb
3307772 /var/lib/hadoop/yarn/timeline/leveldb-timeline-store.ldb
3307812 /var/lib/hadoop/yarn/timeline

The timeline service has been multiple times restarted as I was looking for issue with it. But it was installed about a 2 months ago. Just few applications (1?2? ) has been started since its last start. The ResourceManager interface has 261 entries.

As in yarn-site.xml that I attached, the variable you're asking for has the following value:

<property>

  <name>yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms</name>
      <value>300000</value>
</property>

Ah, One more thing: When I looked with jstack to see what the process is doing, I saw threads spending time in NATIVE in leveldbjni library. So I *think* it is related to leveldb store.

Please ask if any more information is needed.
Any help is appreciated! Thanks
Krzysiek

2015-09-30 16:23 GMT+02:00 Naganarasimha G R (Naga) <ga...@huawei.com>>:
Hi ,

Whats the size of Store Files?
Since when is it running ? how many applications have been run since it has been started ?
Whats the value of "yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms" ?

+ Naga
________________________________
From: Krzysztof Zarzycki [k.zarzycki@gmail.com<ma...@gmail.com>]
Sent: Wednesday, September 30, 2015 19:20
To: user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: YARN timelineserver process taking 600% CPU

Hi there Hadoopers,
I have a serious issue with my installation of Hadoop & YARN in version 2.7.1 (HDP 2.3).
The timelineserver process ( more precisely org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer class) takes over 600% of CPU, generating enormous load on my master node. I can't guess why it happens.

First, I run the timelineserver using java 8, thought that this was an issue. But no, I started timelineserver now with use of java 7 and still the problem is the same.

My cluster is tiny- it consists of:
- 2 HDFS nodes
- 2 HBase RegionServers
- 2 Kafkas
- 2 Spark nodes
- 8 Spark Streaming jobs, processing around 100 messages/second TOTAL.

I'll be very grateful for your help here. If you need any more info, please write.
I also attach yarn-site.xml, filtered down to the options related to the timeline server.

And here is the timeline server command line as I see it in ps:
/usr/java/jdk1.7.0_79/bin/java -Dproc_timelineserver -Xmx1024m -Dhdp.version=2.3.0.0-2557 -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn -Dyarn.log.dir=/var/log/hadoop-yarn/yarn -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.home.dir= -Dyarn.id.str=yarn -Dhadoop.root.logger=INFO,EWMA,RFA -Dyarn.root.logger=INFO,EWMA,RFA -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native -Dyarn.policy.file=hadoop-policy.xml -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn -Dyarn.log.dir=/var/log/hadoop-yarn/yarn -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.home.dir=/usr/hdp/current/hadoop-yarn-timelineserver -Dhadoop.home.dir=/usr/hdp/2.3.0.0-2557/hadoop -Dhadoop.root.logger=INFO,EWMA,RFA -Dyarn.root.logger=INFO,EWMA,RFA -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native -classpath /usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/2.3.0.0-2557/hadoop/lib/*:/usr/hdp/2.3.0.0-2557/hadoop/.//*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/./:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/.//*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/.//*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/.//*:::/usr/share/java/mysql-connector-java.jar::/usr/share/java/mysql-connector-java.jar:/usr/hdp/current/hadoop-yarn-timelineserver/.//*:/usr/hdp/current/hadoop-yarn-timelineserver/lib/*:/usr/hdp/current/hadoop-client/conf/timelineserver-config/log4j.properties 
org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer


Thanks!
Krzysztof







Re: YARN timelineserver process taking 600% CPU

Posted by Krzysztof Zarzycki <k....@gmail.com>.
Thanks Naga for your input (and sorry for the late response, I was out for
some time).

So you believe that Spark is actually doing the PUTs? There are currently 8
Spark Streaming jobs constantly running: 3 with a 1-second batch interval and
5 with a 10-second one. I believe these are the jobs that publish to ATS. I
don't know how to check precisely which component is doing what, or how to
get logs about it...
I thought maybe it was the Spark History Server doing the puts, but it seems
it is not, as I disabled it and the load hasn't gone down. So it does seem to
be the jobs themselves.

Now I have the following problems:
1. Most important: how can I at least *work around* this issue? Maybe I can
somehow disable Spark's use of the YARN timelineserver? What would the
consequences be? Is it only that the history of finished Spark jobs won't be
saved? If so, that doesn't hurt much. This is probably a question for the
Spark list...
2. Are 8 concurrent Spark Streaming jobs really that high a load for the
timelineserver? I have just a small cluster; how do other, larger companies
handle much bigger loads?
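Regarding (1), the client-side knob I have in mind is the one below. This is only a sketch: I haven't verified the consequences on HDP, and setting it in a shared client configuration would stop *every* YARN application from publishing to ATS, not only Spark.

```xml
<!-- Client-side yarn-site.xml: stop applications from publishing to the
     timeline server. Affects all apps using this configuration. -->
<property>
  <name>yarn.timeline-service.enabled</name>
  <value>false</value>
</property>
```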

Thanks for helping me with this!
Krzysiek










2015-10-05 20:45 GMT+02:00 Naganarasimha Garla <na...@gmail.com>:

> Hi Krzysiek,
> Oops, my mistake: 3 GB does seem to be on the higher side.
> And from the jstack it seems there was no major activity other than puts;
> around 16 concurrent puts were happening, each of which tries to get the
> timeline entity and hence hits the native call.
>
> From the logs it seems a lot of ACL validations are happening, and from
> the URLs it seems they are for entity PUTs.
> Approximately from 09:30:16 to 09:44:26 about 9213 checks happened; if all
> of these are for puts, then roughly 10 put calls/s are coming from the
> *Spark* side. I feel this is not the right usage of ATS. Can you check what
> is being published from Spark to ATS at this high rate?
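A quick sanity check of the rate estimate above, using the quoted timestamps:

```python
# Rough put rate: 9213 ACL checks between 09:30:16 and 09:44:26.
from datetime import datetime

start = datetime.strptime("09:30:16", "%H:%M:%S")
end = datetime.strptime("09:44:26", "%H:%M:%S")
window_s = (end - start).total_seconds()

print(int(window_s))              # -> 850 seconds
print(round(9213 / window_s, 1))  # -> 10.8 checks/s
```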
>
> Besides, some improvements regarding timeline metrics are available in
> trunk as part of YARN-3360, which could be useful in analyzing your
> issue.
>
> + Naga
>
>
> On Mon, Oct 5, 2015 at 1:19 PM, Krzysztof Zarzycki <k....@gmail.com>
> wrote:
>
>> Hi Naga,
>> Sorry, but it's not 3 MB, it's 3 GB in leveldb-timeline-store (du shows
>> numbers in kB). Does that seem reasonable as well?
>> There are now 26850 files in the leveldb-timeline-store directory. New
>> .sst files are generated every minute, and some are also being deleted.
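For anyone reproducing this, a sketch of how the .sst files can be counted (the live store path in my case is the leveldb-timeline-store.ldb directory reported earlier in the thread; the example below runs against a throwaway directory instead):

```python
# Count LevelDB table (.sst) files in a store directory.
import tempfile
from pathlib import Path

def count_sst_files(store_dir: str) -> int:
    """Return the number of .sst files directly inside store_dir."""
    return sum(1 for _ in Path(store_dir).glob("*.sst"))

# Demonstrate against a throwaway directory rather than the live store:
with tempfile.TemporaryDirectory() as d:
    for name in ("000001.sst", "000002.sst", "CURRENT", "LOG"):
        Path(d, name).touch()
    print(count_sst_files(d))  # -> 2
```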
>>
>> I started the timeline server today to gather logs and a jstack dump; it
>> was running for ~20 minutes. I attach a tar.bz2 archive with those logs.
>>
>> Thank you for helping me debug this.
>> Krzysiek
>>
>>
>>
>>
>>
>> 2015-09-30 21:00 GMT+02:00 Naganarasimha Garla <
>> naganarasimha.gr@gmail.com>:
>>
>>> Hi Krzysiek,
>>> It seems the size is around 3 MB, which seems fine.
>>> Could you try enabling debug logging and sharing the ATS/AHS logs, and
>>> also, if possible, the jstack output for the AHS process?
>>>
>>> + Naga
>>>
>>> On Wed, Sep 30, 2015 at 10:27 PM, Krzysztof Zarzycki <
>>> k.zarzycki@gmail.com> wrote:
>>>
>>>> Hi Naga,
>>>> I see the following size:
>>>> $ sudo du --max=1 /var/lib/hadoop/yarn/timeline
>>>> 36      /var/lib/hadoop/yarn/timeline/timeline-state-store.ldb
>>>> 3307772 /var/lib/hadoop/yarn/timeline/leveldb-timeline-store.ldb
>>>> 3307812 /var/lib/hadoop/yarn/timeline
>>>>
>>>> The timeline service has been multiple times restarted as I was looking
>>>> for issue with it. But it was installed about a 2 months ago. Just few
>>>> applications (1?2? ) has been started since its last start. The
>>>> ResourceManager interface has 261 entries.
>>>>
>>>> As in yarn-site.xml that I attached, the variable you're asking for has
>>>> the following value:
>>>>
>>>> <property>
>>>>
>>>>   <name>yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms</name>
>>>>       <value>300000</value>
>>>> </property>
>>>>
>>>>
>>>> Ah, One more thing: When I looked with jstack to see what the process
>>>> is doing, I saw threads spending time in NATIVE in leveldbjni library. So I
>>>> *think* it is related to leveldb store.
>>>>
>>>> Please ask if any more information is needed.
>>>> Any help is appreciated! Thanks
>>>> Krzysiek
>>>>
>>>> 2015-09-30 16:23 GMT+02:00 Naganarasimha G R (Naga) <
>>>> garlanaganarasimha@huawei.com>:
>>>>
>>>>> Hi ,
>>>>>
>>>>> Whats the size of Store Files?
>>>>> Since when is it running ? how many applications have been run since
>>>>> it has been started ?
>>>>> Whats the value of "
>>>>> yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms" ?
>>>>>
>>>>> + Naga
>>>>> ------------------------------
>>>>> *From:* Krzysztof Zarzycki [k.zarzycki@gmail.com]
>>>>> *Sent:* Wednesday, September 30, 2015 19:20
>>>>> *To:* user@hadoop.apache.org
>>>>> *Subject:* YARN timelineserver process taking 600% CPU
>>>>>
>>>>> Hi there Hadoopers,
>>>>> I have a serious issue with my installation of Hadoop & YARN in
>>>>> version 2.7.1 (HDP 2.3).
>>>>> The timelineserver process ( more
>>>>> precisely org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>>>>> class) takes over 600% of CPU, generating enormous load on my master node.
>>>>> I can't guess why it happens.
>>>>>
>>>>> First, I run the timelineserver using java 8, thought that this was an
>>>>> issue. But no, I started timelineserver now with use of java 7 and still
>>>>> the problem is the same.
>>>>>
>>>>> My cluster is tiny- it consists of:
>>>>> - 2 HDFS nodes
>>>>> - 2 HBase RegionServers
>>>>> - 2 Kafkas
>>>>> - 2 Spark nodes
>>>>> - 8 Spark Streaming jobs, processing around 100 messages/second TOTAL.
>>>>>
>>>>> I'll be very grateful for your help here. If you need any more info,
>>>>> please write.
>>>>> I also attach yarn-site.xml grepped to options related to timeline
>>>>> server.
>>>>>
>>>>> And here is a command of timeline that I see from ps :
>>>>> /usr/java/jdk1.7.0_79/bin/java -Dproc_timelineserver -Xmx1024m
>>>>> -Dhdp.version=2.3.0.0-2557 -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn
>>>>> -Dyarn.log.dir=/var/log/hadoop-yarn/yarn
>>>>> -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>>>>> -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.home.dir=
>>>>> -Dyarn.id.str=yarn -Dhadoop.root.logger=INFO,EWMA,RFA
>>>>> -Dyarn.root.logger=INFO,EWMA,RFA
>>>>> -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native
>>>>> -Dyarn.policy.file=hadoop-policy.xml
>>>>> -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn
>>>>> -Dyarn.log.dir=/var/log/hadoop-yarn/yarn
>>>>> -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>>>>> -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>>>>> -Dyarn.home.dir=/usr/hdp/current/hadoop-yarn-timelineserver
>>>>> -Dhadoop.home.dir=/usr/hdp/2.3.0.0-2557/hadoop
>>>>> -Dhadoop.root.logger=INFO,EWMA,RFA -Dyarn.root.logger=INFO,EWMA,RFA
>>>>> -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native
>>>>> -classpath
>>>>> /usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/2.3.0.0-2557/hadoop/lib/*:/usr/hdp/2.3.0.0-2557/hadoop/.//*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/./:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/.//*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/.//*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/.//*:::/usr/share/java/mysql-connector-java.jar::/usr/share/java/mysql-connector-java.jar:/usr/hdp/current/hadoop-yarn-timelineserver/.//*:/usr/hdp/current/hadoop-yarn-timelineserver/lib/*:/usr/hdp/current/hadoop-client/conf/timelineserver-config/log4j.properties
>>>>> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>>>>>
>>>>>
>>>>> Thanks!
>>>>> Krzysztof
>>>>>
>>>>>
>>>>
>>>
>>
>

>>>>> -Dhadoop.home.dir=/usr/hdp/2.3.0.0-2557/hadoop
>>>>> -Dhadoop.root.logger=INFO,EWMA,RFA -Dyarn.root.logger=INFO,EWMA,RFA
>>>>> -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native
>>>>> -classpath
>>>>> /usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/2.3.0.0-2557/hadoop/lib/*:/usr/hdp/2.3.0.0-2557/hadoop/.//*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/./:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/.//*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/.//*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/.//*:::/usr/share/java/mysql-connector-java.jar::/usr/share/java/mysql-connector-java.jar:/usr/hdp/current/hadoop-yarn-timelineserver/.//*:/usr/hdp/current/hadoop-yarn-timelineserver/lib/*:/usr/hdp/current/hadoop-client/conf/timelineserver-config/log4j.properties
>>>>> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>>>>>
>>>>>
>>>>> Thanks!
>>>>> Krzysztof
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: YARN timelineserver process taking 600% CPU

Posted by Naganarasimha Garla <na...@gmail.com>.
Hi Krzysiek,
Oops, my mistake: 3 GB is on the higher side.
From the jstack output there was no major activity other than puts: around
16 concurrent puts were in progress, each of which tries to get the
timeline entity and hence hits the native LevelDB call.

From the logs it seems a lot of ACL validations are happening, and judging
from the URL they are for entity PUTs. Approximately from 09:30:16 to
09:44:26, about 9213 checks happened; if all of these are for puts, that is
roughly 10 put calls/s coming from the *Spark* side. I don't think this is
the right usage of ATS. Can you check what Spark is publishing to ATS at
such a high rate?

Besides, some improvements to the timeline metrics are available in trunk
as part of YARN-3360, which could have been useful in analyzing your issue.
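A rough way to double-check that rate yourself, assuming the PUT requests
show up in the timeline server log (the `PUT /ws/v1/timeline` pattern and
the log path in the usage comment are guesses based on the ps output in this
thread; adjust both to your setup):

```shell
# Hypothetical sketch: estimate the timeline PUT rate from an ATS log.
put_rate() {
    log=$1; window_s=$2        # log file, observation window in seconds
    count=$(grep -c 'PUT /ws/v1/timeline' "$log")
    echo "$count PUTs in ${window_s}s (~$((count / window_s)) calls/s)"
}

# Usage (path is an assumption taken from the ps output above; 850 s is
# the 09:30:16-09:44:26 window):
#   put_rate /var/log/hadoop-yarn/yarn/yarn-yarn-timelineserver-hd-master-a01.log 850
```

With 9213 checks over that 850-second window this works out to ~10 calls/s,
matching the estimate above.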

+ Naga


On Mon, Oct 5, 2015 at 1:19 PM, Krzysztof Zarzycki <k....@gmail.com>
wrote:

> Hi Naga,
> Sorry, but it's not 3MB, but 3GB in leveldb-timeline-store (du shows
> numbers in kB). Does that seems reasonable as well?
> There are new .sst files generated each minute.
> There are now 26850 files in leveldb-timeline-store directory. New files
> are generated each minute. Some are also being deleted.
>
> I started timeline server today, to gather logs and jstack, it was running
> for ~20 minutes. I attach the tar bz2 archive with those logs.
>
> Thank you for helping me debug this.
> Krzysiek
>
>
>
>
>
> 2015-09-30 21:00 GMT+02:00 Naganarasimha Garla <naganarasimha.gr@gmail.com
> >:
>
>> Hi Krzysiek,
>> seems like the size is around 3 MB which seems to be fine. ,
>> Could you try enabling in debug and share the logs of ATS/AHS and also if
>> possible the jstack output for the AHS process
>>
>> + Naga
>>
>> On Wed, Sep 30, 2015 at 10:27 PM, Krzysztof Zarzycki <
>> k.zarzycki@gmail.com> wrote:
>>
>>> Hi Naga,
>>> I see the following size:
>>> $ sudo du --max=1 /var/lib/hadoop/yarn/timeline
>>> 36      /var/lib/hadoop/yarn/timeline/timeline-state-store.ldb
>>> 3307772 /var/lib/hadoop/yarn/timeline/leveldb-timeline-store.ldb
>>> 3307812 /var/lib/hadoop/yarn/timeline
>>>
>>> The timeline service has been multiple times restarted as I was looking
>>> for issue with it. But it was installed about a 2 months ago. Just few
>>> applications (1?2? ) has been started since its last start. The
>>> ResourceManager interface has 261 entries.
>>>
>>> As in yarn-site.xml that I attached, the variable you're asking for has
>>> the following value:
>>>
>>> <property>
>>>
>>>   <name>yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms</name>
>>>       <value>300000</value>
>>> </property>
>>>
>>>
>>> Ah, One more thing: When I looked with jstack to see what the process is
>>> doing, I saw threads spending time in NATIVE in leveldbjni library. So I
>>> *think* it is related to leveldb store.
>>>
>>> Please ask if any more information is needed.
>>> Any help is appreciated! Thanks
>>> Krzysiek
>>>
>>> 2015-09-30 16:23 GMT+02:00 Naganarasimha G R (Naga) <
>>> garlanaganarasimha@huawei.com>:
>>>
>>>> Hi ,
>>>>
>>>> Whats the size of Store Files?
>>>> Since when is it running ? how many applications have been run since it
>>>> has been started ?
>>>> Whats the value of "
>>>> yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms" ?
>>>>
>>>> + Naga
>>>> ------------------------------
>>>> *From:* Krzysztof Zarzycki [k.zarzycki@gmail.com]
>>>> *Sent:* Wednesday, September 30, 2015 19:20
>>>> *To:* user@hadoop.apache.org
>>>> *Subject:* YARN timelineserver process taking 600% CPU
>>>>
>>>> Hi there Hadoopers,
>>>> I have a serious issue with my installation of Hadoop & YARN in version
>>>> 2.7.1 (HDP 2.3).
>>>> The timelineserver process ( more
>>>> precisely org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>>>> class) takes over 600% of CPU, generating enormous load on my master node.
>>>> I can't guess why it happens.
>>>>
>>>> First, I run the timelineserver using java 8, thought that this was an
>>>> issue. But no, I started timelineserver now with use of java 7 and still
>>>> the problem is the same.
>>>>
>>>> My cluster is tiny- it consists of:
>>>> - 2 HDFS nodes
>>>> - 2 HBase RegionServers
>>>> - 2 Kafkas
>>>> - 2 Spark nodes
>>>> - 8 Spark Streaming jobs, processing around 100 messages/second TOTAL.
>>>>
>>>> I'll be very grateful for your help here. If you need any more info,
>>>> please write.
>>>> I also attach yarn-site.xml grepped to options related to timeline
>>>> server.
>>>>
>>>> And here is a command of timeline that I see from ps :
>>>> /usr/java/jdk1.7.0_79/bin/java -Dproc_timelineserver -Xmx1024m
>>>> -Dhdp.version=2.3.0.0-2557 -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn
>>>> -Dyarn.log.dir=/var/log/hadoop-yarn/yarn
>>>> -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>>>> -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.home.dir=
>>>> -Dyarn.id.str=yarn -Dhadoop.root.logger=INFO,EWMA,RFA
>>>> -Dyarn.root.logger=INFO,EWMA,RFA
>>>> -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native
>>>> -Dyarn.policy.file=hadoop-policy.xml
>>>> -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn
>>>> -Dyarn.log.dir=/var/log/hadoop-yarn/yarn
>>>> -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>>>> -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>>>> -Dyarn.home.dir=/usr/hdp/current/hadoop-yarn-timelineserver
>>>> -Dhadoop.home.dir=/usr/hdp/2.3.0.0-2557/hadoop
>>>> -Dhadoop.root.logger=INFO,EWMA,RFA -Dyarn.root.logger=INFO,EWMA,RFA
>>>> -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native
>>>> -classpath
>>>> /usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/2.3.0.0-2557/hadoop/lib/*:/usr/hdp/2.3.0.0-2557/hadoop/.//*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/./:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/.//*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/.//*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/.//*:::/usr/share/java/mysql-connector-java.jar::/usr/share/java/mysql-connector-java.jar:/usr/hdp/current/hadoop-yarn-timelineserver/.//*:/usr/hdp/current/hadoop-yarn-timelineserver/lib/*:/usr/hdp/current/hadoop-client/conf/timelineserver-config/log4j.properties
>>>> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>>>>
>>>>
>>>> Thanks!
>>>> Krzysztof
>>>>
>>>>
>>>
>>
>


Re: YARN timelineserver process taking 600% CPU

Posted by Krzysztof Zarzycki <k....@gmail.com>.
Hi Naga,
Sorry, but it's not 3 MB, but 3 GB in leveldb-timeline-store (du shows
sizes in kB). Does that seem reasonable as well?
There are now 26850 files in the leveldb-timeline-store directory; new
.sst files are generated each minute, and some are also being deleted.

I started the timeline server today to gather logs and jstack output; it
ran for ~20 minutes. I attach a tar.bz2 archive with those logs.
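For reference, the file count and size can be snapshotted with something
like the sketch below (the store path in the usage comment is taken from
the `du` output earlier in the thread; adjust it if your
`yarn.timeline-service.leveldb-timeline-store.path` differs):

```shell
# Sketch: report the .sst file count and total size of a LevelDB store dir.
store_stats() {
    dir=$1
    files=$(find "$dir" -name '*.sst' | wc -l | tr -d ' ')
    size_kb=$(du -s "$dir" | awk '{print $1}')
    echo "sst files: $files, total size: ${size_kb} kB"
}

# Usage:
#   store_stats /var/lib/hadoop/yarn/timeline/leveldb-timeline-store.ldb
```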

Thank you for helping me debug this.
Krzysiek





2015-09-30 21:00 GMT+02:00 Naganarasimha Garla <na...@gmail.com>:

> Hi Krzysiek,
> seems like the size is around 3 MB which seems to be fine. ,
> Could you try enabling in debug and share the logs of ATS/AHS and also if
> possible the jstack output for the AHS process
>
> + Naga
>
> On Wed, Sep 30, 2015 at 10:27 PM, Krzysztof Zarzycki <k.zarzycki@gmail.com
> > wrote:
>
>> Hi Naga,
>> I see the following size:
>> $ sudo du --max=1 /var/lib/hadoop/yarn/timeline
>> 36      /var/lib/hadoop/yarn/timeline/timeline-state-store.ldb
>> 3307772 /var/lib/hadoop/yarn/timeline/leveldb-timeline-store.ldb
>> 3307812 /var/lib/hadoop/yarn/timeline
>>
>> The timeline service has been restarted multiple times while I was looking
>> into this issue, but it was installed about 2 months ago. Only a few
>> applications (1? 2?) have been started since its last restart. The
>> ResourceManager interface shows 261 entries.
>>
>> As in yarn-site.xml that I attached, the variable you're asking for has
>> the following value:
>>
>> <property>
>>
>>   <name>yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms</name>
>>       <value>300000</value>
>> </property>
>>
>>
>> Ah, one more thing: when I looked with jstack to see what the process is
>> doing, I saw threads spending time in NATIVE code in the leveldbjni library,
>> so I *think* it is related to the leveldb store.
>>
>> Please ask if any more information is needed.
>> Any help is appreciated! Thanks
>> Krzysiek
>>
>> 2015-09-30 16:23 GMT+02:00 Naganarasimha G R (Naga) <
>> garlanaganarasimha@huawei.com>:
>>
>>> Hi,
>>>
>>> What's the size of the store files?
>>> Since when has it been running? How many applications have been run
>>> since it was started?
>>> What's the value of
>>> "yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms"?
>>>
>>> + Naga
>>> ------------------------------
>>> *From:* Krzysztof Zarzycki [k.zarzycki@gmail.com]
>>> *Sent:* Wednesday, September 30, 2015 19:20
>>> *To:* user@hadoop.apache.org
>>> *Subject:* YARN timelineserver process taking 600% CPU
>>>
>>> Hi there Hadoopers,
>>> I have a serious issue with my installation of Hadoop & YARN in version
>>> 2.7.1 (HDP 2.3).
>>> The timelineserver process (more precisely
>>> the org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>>> class) takes over 600% CPU, generating enormous load on my master node.
>>> I can't figure out why this happens.
>>>
>>> At first I ran the timelineserver on Java 8 and thought that was the
>>> issue, but no: I have since started the timelineserver with Java 7 and
>>> the problem is the same.
>>>
>>> My cluster is tiny; it consists of:
>>> - 2 HDFS nodes
>>> - 2 HBase RegionServers
>>> - 2 Kafkas
>>> - 2 Spark nodes
>>> - 8 Spark Streaming jobs, processing around 100 messages/second TOTAL.
>>>
>>> I'll be very grateful for your help here. If you need any more info,
>>> please write.
>>> I also attach yarn-site.xml, filtered to the options related to the
>>> timeline server.
>>>
>>> And here is a command of timeline that I see from ps :
>>> /usr/java/jdk1.7.0_79/bin/java -Dproc_timelineserver -Xmx1024m
>>> -Dhdp.version=2.3.0.0-2557 -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn
>>> -Dyarn.log.dir=/var/log/hadoop-yarn/yarn
>>> -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>>> -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.home.dir=
>>> -Dyarn.id.str=yarn -Dhadoop.root.logger=INFO,EWMA,RFA
>>> -Dyarn.root.logger=INFO,EWMA,RFA
>>> -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native
>>> -Dyarn.policy.file=hadoop-policy.xml
>>> -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn
>>> -Dyarn.log.dir=/var/log/hadoop-yarn/yarn
>>> -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>>> -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>>> -Dyarn.home.dir=/usr/hdp/current/hadoop-yarn-timelineserver
>>> -Dhadoop.home.dir=/usr/hdp/2.3.0.0-2557/hadoop
>>> -Dhadoop.root.logger=INFO,EWMA,RFA -Dyarn.root.logger=INFO,EWMA,RFA
>>> -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native
>>> -classpath
>>> /usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/2.3.0.0-2557/hadoop/lib/*:/usr/hdp/2.3.0.0-2557/hadoop/.//*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/./:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/.//*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/.//*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/.//*:::/usr/share/java/mysql-connector-java.jar::/usr/share/java/mysql-connector-java.jar:/usr/hdp/current/hadoop-yarn-timelineserver/.//*:/usr/hdp/current/hadoop-yarn-timelineserver/lib/*:/usr/hdp/current/hadoop-client/conf/timelineserver-config/log4j.properties
>>> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>>>
>>>
>>> Thanks!
>>> Krzysztof
>>>
>>>
>>
>

Re: YARN timelineserver process taking 600% CPU

Posted by Naganarasimha Garla <na...@gmail.com>.
Hi Krzysiek,
seems like the size is around 3 MB, which seems to be fine.
Could you try enabling debug logging and share the ATS/AHS logs, and if
possible the jstack output for the AHS process?

+ Naga
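One way to collect what is being asked for here is sketched below; the
process-matching pattern, sample count, and output file names are
illustrative assumptions, not taken from this thread:

```shell
#!/bin/sh
# Sketch: take several jstack samples of the ApplicationHistoryServer
# so hot threads (e.g. native leveldbjni frames) show up repeatedly.
# The pgrep pattern and file names below are assumptions.
PID=$(pgrep -f ApplicationHistoryServer | head -n1)
for i in 1 2 3; do
  jstack "$PID" > "ahs-jstack-$i.txt"
  sleep 10
done
# A thread that is busy in every sample is a likely hotspot; count
# leveldb-related frames as a rough signal:
grep -c 'leveldb' ahs-jstack-1.txt
```

Debug logging can typically be enabled by exporting
YARN_ROOT_LOGGER=DEBUG,RFA before restarting the daemon; verify against your
distribution's startup scripts.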

On Wed, Sep 30, 2015 at 10:27 PM, Krzysztof Zarzycki <k....@gmail.com>
wrote:

> Hi Naga,
> I see the following size:
> $ sudo du --max=1 /var/lib/hadoop/yarn/timeline
> 36      /var/lib/hadoop/yarn/timeline/timeline-state-store.ldb
> 3307772 /var/lib/hadoop/yarn/timeline/leveldb-timeline-store.ldb
> 3307812 /var/lib/hadoop/yarn/timeline
>
> The timeline service has been restarted multiple times while I was looking
> into this issue, but it was installed about 2 months ago. Only a few
> applications (1? 2?) have been started since its last restart. The
> ResourceManager interface shows 261 entries.
>
> As in yarn-site.xml that I attached, the variable you're asking for has
> the following value:
>
> <property>
>
>   <name>yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms</name>
>       <value>300000</value>
> </property>
>
>
> Ah, one more thing: when I looked with jstack to see what the process is
> doing, I saw threads spending time in NATIVE code in the leveldbjni library,
> so I *think* it is related to the leveldb store.
>
> Please ask if any more information is needed.
> Any help is appreciated! Thanks
> Krzysiek
>
> 2015-09-30 16:23 GMT+02:00 Naganarasimha G R (Naga) <
> garlanaganarasimha@huawei.com>:
>
>> Hi,
>>
>> What's the size of the store files?
>> Since when has it been running? How many applications have been run
>> since it was started?
>> What's the value of
>> "yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms"?
>>
>> + Naga
>> ------------------------------
>> *From:* Krzysztof Zarzycki [k.zarzycki@gmail.com]
>> *Sent:* Wednesday, September 30, 2015 19:20
>> *To:* user@hadoop.apache.org
>> *Subject:* YARN timelineserver process taking 600% CPU
>>
>> Hi there Hadoopers,
>> I have a serious issue with my installation of Hadoop & YARN in version
>> 2.7.1 (HDP 2.3).
>> The timelineserver process (more precisely
>> the org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>> class) takes over 600% CPU, generating enormous load on my master node.
>> I can't figure out why this happens.
>>
>> At first I ran the timelineserver on Java 8 and thought that was the
>> issue, but no: I have since started the timelineserver with Java 7 and
>> the problem is the same.
>>
>> My cluster is tiny; it consists of:
>> - 2 HDFS nodes
>> - 2 HBase RegionServers
>> - 2 Kafkas
>> - 2 Spark nodes
>> - 8 Spark Streaming jobs, processing around 100 messages/second TOTAL.
>>
>> I'll be very grateful for your help here. If you need any more info,
>> please write.
>> I also attach yarn-site.xml, filtered to the options related to the
>> timeline server.
>>
>> And here is a command of timeline that I see from ps :
>> /usr/java/jdk1.7.0_79/bin/java -Dproc_timelineserver -Xmx1024m
>> -Dhdp.version=2.3.0.0-2557 -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn
>> -Dyarn.log.dir=/var/log/hadoop-yarn/yarn
>> -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>> -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.home.dir=
>> -Dyarn.id.str=yarn -Dhadoop.root.logger=INFO,EWMA,RFA
>> -Dyarn.root.logger=INFO,EWMA,RFA
>> -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native
>> -Dyarn.policy.file=hadoop-policy.xml
>> -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn
>> -Dyarn.log.dir=/var/log/hadoop-yarn/yarn
>> -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>> -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>> -Dyarn.home.dir=/usr/hdp/current/hadoop-yarn-timelineserver
>> -Dhadoop.home.dir=/usr/hdp/2.3.0.0-2557/hadoop
>> -Dhadoop.root.logger=INFO,EWMA,RFA -Dyarn.root.logger=INFO,EWMA,RFA
>> -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native
>> -classpath
>> /usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/2.3.0.0-2557/hadoop/lib/*:/usr/hdp/2.3.0.0-2557/hadoop/.//*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/./:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/.//*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/.//*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/.//*:::/usr/share/java/mysql-connector-java.jar::/usr/share/java/mysql-connector-java.jar:/usr/hdp/current/hadoop-yarn-timelineserver/.//*:/usr/hdp/current/hadoop-yarn-timelineserver/lib/*:/usr/hdp/current/hadoop-client/conf/timelineserver-config/log4j.properties
>> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>>
>>
>> Thanks!
>> Krzysztof
>>
>>
>

Re: YARN timelineserver process taking 600% CPU

Posted by Naganarasimha Garla <na...@gmail.com>.
Hi Krzysiek,
seems like the size is around 3 MB which seems to be fine. ,
Could you try enabling in debug and share the logs of ATS/AHS and also if
possible the jstack output for the AHS process

+ Naga

On Wed, Sep 30, 2015 at 10:27 PM, Krzysztof Zarzycki <k....@gmail.com>
wrote:

> Hi Naga,
> I see the following size:
> $ sudo du --max=1 /var/lib/hadoop/yarn/timeline
> 36      /var/lib/hadoop/yarn/timeline/timeline-state-store.ldb
> 3307772 /var/lib/hadoop/yarn/timeline/leveldb-timeline-store.ldb
> 3307812 /var/lib/hadoop/yarn/timeline
>
> The timeline service has been multiple times restarted as I was looking
> for issue with it. But it was installed about a 2 months ago. Just few
> applications (1?2? ) has been started since its last start. The
> ResourceManager interface has 261 entries.
>
> As in yarn-site.xml that I attached, the variable you're asking for has
> the following value:
>
> <property>
>
>   <name>yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms</name>
>       <value>300000</value>
> </property>
>
>
> Ah, One more thing: When I looked with jstack to see what the process is
> doing, I saw threads spending time in NATIVE in leveldbjni library. So I
> *think* it is related to leveldb store.
>
> Please ask if any more information is needed.
> Any help is appreciated! Thanks
> Krzysiek
>
> 2015-09-30 16:23 GMT+02:00 Naganarasimha G R (Naga) <
> garlanaganarasimha@huawei.com>:
>
>> Hi ,
>>
>> Whats the size of Store Files?
>> Since when is it running ? how many applications have been run since it
>> has been started ?
>> Whats the value of "
>> yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms" ?
>>
>> + Naga
>> ------------------------------
>> *From:* Krzysztof Zarzycki [k.zarzycki@gmail.com]
>> *Sent:* Wednesday, September 30, 2015 19:20
>> *To:* user@hadoop.apache.org
>> *Subject:* YARN timelineserver process taking 600% CPU
>>
>> Hi there Hadoopers,
>> I have a serious issue with my installation of Hadoop & YARN in version
>> 2.7.1 (HDP 2.3).
>> The timelineserver process ( more
>> precisely org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>> class) takes over 600% of CPU, generating enormous load on my master node.
>> I can't guess why it happens.
>>
>> First, I run the timelineserver using java 8, thought that this was an
>> issue. But no, I started timelineserver now with use of java 7 and still
>> the problem is the same.
>>
>> My cluster is tiny- it consists of:
>> - 2 HDFS nodes
>> - 2 HBase RegionServers
>> - 2 Kafkas
>> - 2 Spark nodes
>> - 8 Spark Streaming jobs, processing around 100 messages/second TOTAL.
>>
>> I'll be very grateful for your help here. If you need any more info,
>> please write.
>> I also attach yarn-site.xml grepped to options related to timeline server.
>>
>> And here is a command of timeline that I see from ps :
>> /usr/java/jdk1.7.0_79/bin/java -Dproc_timelineserver -Xmx1024m
>> -Dhdp.version=2.3.0.0-2557 -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn
>> -Dyarn.log.dir=/var/log/hadoop-yarn/yarn
>> -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>> -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.home.dir=
>> -Dyarn.id.str=yarn -Dhadoop.root.logger=INFO,EWMA,RFA
>> -Dyarn.root.logger=INFO,EWMA,RFA
>> -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native
>> -Dyarn.policy.file=hadoop-policy.xml
>> -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn
>> -Dyarn.log.dir=/var/log/hadoop-yarn/yarn
>> -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>> -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>> -Dyarn.home.dir=/usr/hdp/current/hadoop-yarn-timelineserver
>> -Dhadoop.home.dir=/usr/hdp/2.3.0.0-2557/hadoop
>> -Dhadoop.root.logger=INFO,EWMA,RFA -Dyarn.root.logger=INFO,EWMA,RFA
>> -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native
>> -classpath
>> /usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/2.3.0.0-2557/hadoop/lib/*:/usr/hdp/2.3.0.0-2557/hadoop/.//*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/./:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/.//*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/.//*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/.//*:::/usr/share/java/mysql-connector-java.jar::/usr/share/java/mysql-connector-java.jar:/usr/hdp/current/hadoop-yarn-timelineserver/.//*:/usr/hdp/current/hadoop-yarn-timelineserver/lib/*:/usr/hdp/current/hadoop-client/conf/timelineserver-config/log4j.properties
>> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>>
>>
>> Thanks!
>> Krzysztof
>>
>>
>

Re: YARN timelineserver process taking 600% CPU

Posted by Naganarasimha Garla <na...@gmail.com>.
Hi Krzysiek,
seems like the size is around 3 GB (du reports 1 KiB blocks), which is quite
large. Could you try enabling debug logging and share the ATS/AHS logs, and
also, if possible, the jstack output for the AHS process?

+ Naga

On Wed, Sep 30, 2015 at 10:27 PM, Krzysztof Zarzycki <k....@gmail.com>
wrote:

> Hi Naga,
> I see the following size:
> $ sudo du --max=1 /var/lib/hadoop/yarn/timeline
> 36      /var/lib/hadoop/yarn/timeline/timeline-state-store.ldb
> 3307772 /var/lib/hadoop/yarn/timeline/leveldb-timeline-store.ldb
> 3307812 /var/lib/hadoop/yarn/timeline
>
> The timeline service has been restarted multiple times while I was
> investigating this issue, but it was installed about two months ago. Only a
> few applications (one or two?) have been started since its last restart.
> The ResourceManager interface shows 261 entries.
>
> As shown in the yarn-site.xml that I attached, the property you're asking
> about has the following value:
>
> <property>
>   <name>yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms</name>
>   <value>300000</value>
> </property>
>
>
> Ah, one more thing: when I inspected the process with jstack, I saw threads
> spending their time in NATIVE code in the leveldbjni library, so I *think*
> it is related to the leveldb store.
>
> Please ask if any more information is needed.
> Any help is appreciated! Thanks
> Krzysiek
>
> 2015-09-30 16:23 GMT+02:00 Naganarasimha G R (Naga) <
> garlanaganarasimha@huawei.com>:
>
>> Hi,
>>
>> What's the size of the store files?
>> Since when has it been running? How many applications have been run since
>> it started?
>> What's the value of
>> "yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms"?
>>
>> + Naga
>> ------------------------------
>> *From:* Krzysztof Zarzycki [k.zarzycki@gmail.com]
>> *Sent:* Wednesday, September 30, 2015 19:20
>> *To:* user@hadoop.apache.org
>> *Subject:* YARN timelineserver process taking 600% CPU
>>
>> Hi there Hadoopers,
>> I have a serious issue with my installation of Hadoop & YARN in version
>> 2.7.1 (HDP 2.3).
>> The timelineserver process (more precisely, the
>> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>> class) takes over 600% of the CPU, generating enormous load on my master
>> node. I can't work out why.
>>
>> At first I ran the timelineserver with Java 8 and thought that was the
>> issue, but after restarting it with Java 7 the problem is the same.
>>
>> My cluster is tiny; it consists of:
>> - 2 HDFS nodes
>> - 2 HBase RegionServers
>> - 2 Kafkas
>> - 2 Spark nodes
>> - 8 Spark Streaming jobs, processing around 100 messages/second TOTAL.
>>
>> I'll be very grateful for your help here. If you need any more info, please
>> write. I also attach my yarn-site.xml, filtered to the timeline-server
>> options.
>>
>> And here is the timelineserver command line as seen in ps:
>> /usr/java/jdk1.7.0_79/bin/java -Dproc_timelineserver -Xmx1024m
>> -Dhdp.version=2.3.0.0-2557 -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn
>> -Dyarn.log.dir=/var/log/hadoop-yarn/yarn
>> -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>> -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.home.dir=
>> -Dyarn.id.str=yarn -Dhadoop.root.logger=INFO,EWMA,RFA
>> -Dyarn.root.logger=INFO,EWMA,RFA
>> -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native
>> -Dyarn.policy.file=hadoop-policy.xml
>> -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn
>> -Dyarn.log.dir=/var/log/hadoop-yarn/yarn
>> -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>> -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>> -Dyarn.home.dir=/usr/hdp/current/hadoop-yarn-timelineserver
>> -Dhadoop.home.dir=/usr/hdp/2.3.0.0-2557/hadoop
>> -Dhadoop.root.logger=INFO,EWMA,RFA -Dyarn.root.logger=INFO,EWMA,RFA
>> -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native
>> -classpath
>> /usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/2.3.0.0-2557/hadoop/lib/*:/usr/hdp/2.3.0.0-2557/hadoop/.//*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/./:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/.//*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/.//*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/.//*:::/usr/share/java/mysql-connector-java.jar::/usr/share/java/mysql-connector-java.jar:/usr/hdp/current/hadoop-yarn-timelineserver/.//*:/usr/hdp/current/hadoop-yarn-timelineserver/lib/*:/usr/hdp/current/hadoop-client/conf/timelineserver-config/log4j.properties
>> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>>
>>
>> Thanks!
>> Krzysztof
>>
>>
>

Re: YARN timelineserver process taking 600% CPU

Posted by Krzysztof Zarzycki <k....@gmail.com>.
Hi Naga,
I see the following size:
$ sudo du --max=1 /var/lib/hadoop/yarn/timeline
36      /var/lib/hadoop/yarn/timeline/timeline-state-store.ldb
3307772 /var/lib/hadoop/yarn/timeline/leveldb-timeline-store.ldb
3307812 /var/lib/hadoop/yarn/timeline
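
(Note: du with no block-size flag reports 1 KiB blocks, so the store above is
roughly 3 GB, not 3 MB. A minimal shell check of that conversion, using the
3307812 figure from the du output above:)

```shell
# du prints 1 KiB blocks by default, so convert the reported figure to GiB.
kb=3307812                           # /var/lib/hadoop/yarn/timeline, from du above
echo "$(( kb / 1024 / 1024 )) GiB"   # integer division rounds down; prints "3 GiB"
```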

The timeline service has been restarted multiple times while I was
investigating this issue, but it was installed about two months ago. Only a
few applications (one or two?) have been started since its last restart. The
ResourceManager interface shows 261 entries.

As shown in the yarn-site.xml that I attached, the property you're asking
about has the following value:

<property>
  <name>yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms</name>
  <value>300000</value>
</property>
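
(For completeness: that interval only controls how often the leveldb store
runs its deletion scan; the retention itself is governed by
yarn.timeline-service.ttl-enable and yarn.timeline-service.ttl-ms, per Hadoop
2.7's yarn-default.xml. A minimal sketch for pulling such values out of a
yarn-site.xml, shown here against a made-up sample file rather than my real
config:)

```shell
# Hypothetical sketch: grep timeline-store retention settings out of a
# yarn-site.xml. The sample file below is fabricated for illustration.
conf=$(mktemp)
cat > "$conf" <<'EOF'
<property>
  <name>yarn.timeline-service.ttl-ms</name>
  <value>604800000</value>
</property>
<property>
  <name>yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms</name>
  <value>300000</value>
</property>
EOF
# Print each matching property name together with its <value> line.
grep -A2 '<name>yarn.timeline-service' "$conf" | grep -E 'name>|value>'
rm -f "$conf"
```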


Ah, one more thing: when I inspected the process with jstack, I saw threads
spending their time in NATIVE code in the leveldbjni library, so I *think*
the issue is related to the leveldb store.
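
(To quantify that, a minimal sketch of the kind of counting I did on a saved
jstack dump; the dump content below is a fabricated two-thread sample for
illustration, not real AHS output:)

```shell
# Hypothetical sketch: given a saved jstack dump, count how many threads
# show leveldbjni frames. The dump below is a made-up two-thread sample.
dump=$(mktemp)
cat > "$dump" <<'EOF'
"Timer-1" #12 daemon prio=5 tid=0x1 nid=0x2 runnable
   at org.fusesource.leveldbjni.internal.NativeDB$DBJNI.Write(Native Method)
"IPC Server handler 0" #20 prio=5 tid=0x3 nid=0x4 waiting on condition
   at sun.misc.Unsafe.park(Native Method)
EOF
total=$(grep -c '^"' "$dump")           # thread headers start with a quote
leveldb=$(grep -c 'leveldbjni' "$dump") # threads inside the native store
echo "$leveldb of $total threads in leveldbjni"
rm -f "$dump"
```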

Please ask if any more information is needed.
Any help is appreciated! Thanks
Krzysiek

2015-09-30 16:23 GMT+02:00 Naganarasimha G R (Naga) <
garlanaganarasimha@huawei.com>:

> Hi,
>
> What's the size of the store files?
> Since when has it been running? How many applications have been run since
> it started?
> What's the value of
> "yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms"?
>
> + Naga
> ------------------------------
> *From:* Krzysztof Zarzycki [k.zarzycki@gmail.com]
> *Sent:* Wednesday, September 30, 2015 19:20
> *To:* user@hadoop.apache.org
> *Subject:* YARN timelineserver process taking 600% CPU
>
> Hi there Hadoopers,
> I have a serious issue with my installation of Hadoop & YARN in version
> 2.7.1 (HDP 2.3).
> The timelineserver process (more precisely, the
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
> class) takes over 600% of the CPU, generating enormous load on my master
> node. I can't work out why.
>
> At first I ran the timelineserver with Java 8 and thought that was the
> issue, but after restarting it with Java 7 the problem is the same.
>
> My cluster is tiny; it consists of:
> - 2 HDFS nodes
> - 2 HBase RegionServers
> - 2 Kafkas
> - 2 Spark nodes
> - 8 Spark Streaming jobs, processing around 100 messages/second TOTAL.
>
> I'll be very grateful for your help here. If you need any more info, please
> write. I also attach my yarn-site.xml, filtered to the timeline-server
> options.
>
> And here is the timelineserver command line as seen in ps:
> /usr/java/jdk1.7.0_79/bin/java -Dproc_timelineserver -Xmx1024m
> -Dhdp.version=2.3.0.0-2557 -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn
> -Dyarn.log.dir=/var/log/hadoop-yarn/yarn
> -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log
> -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.home.dir=
> -Dyarn.id.str=yarn -Dhadoop.root.logger=INFO,EWMA,RFA
> -Dyarn.root.logger=INFO,EWMA,RFA
> -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native
> -Dyarn.policy.file=hadoop-policy.xml
> -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn
> -Dyarn.log.dir=/var/log/hadoop-yarn/yarn
> -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log
> -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log
> -Dyarn.home.dir=/usr/hdp/current/hadoop-yarn-timelineserver
> -Dhadoop.home.dir=/usr/hdp/2.3.0.0-2557/hadoop
> -Dhadoop.root.logger=INFO,EWMA,RFA -Dyarn.root.logger=INFO,EWMA,RFA
> -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native
> -classpath
> /usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/2.3.0.0-2557/hadoop/lib/*:/usr/hdp/2.3.0.0-2557/hadoop/.//*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/./:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/.//*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/.//*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/.//*:::/usr/share/java/mysql-connector-java.jar::/usr/share/java/mysql-connector-java.jar:/usr/hdp/current/hadoop-yarn-timelineserver/.//*:/usr/hdp/current/hadoop-yarn-timelineserver/lib/*:/usr/hdp/current/hadoop-client/conf/timelineserver-config/log4j.properties
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>
>
> Thanks!
> Krzysztof
>
>

RE: YARN timelineserver process taking 600% CPU

Posted by "Naganarasimha G R (Naga)" <ga...@huawei.com>.
Hi,

What's the size of the store files?
Since when has it been running? How many applications have been run since it started?
What's the value of "yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms"?

+ Naga
________________________________
From: Krzysztof Zarzycki [k.zarzycki@gmail.com]
Sent: Wednesday, September 30, 2015 19:20
To: user@hadoop.apache.org
Subject: YARN timelineserver process taking 600% CPU

Hi there Hadoopers,
I have a serious issue with my installation of Hadoop & YARN in version 2.7.1 (HDP 2.3).
The timelineserver process (more precisely, the org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer class) takes over 600% of the CPU, generating enormous load on my master node. I can't work out why.

At first I ran the timelineserver with Java 8 and thought that was the issue, but after restarting it with Java 7 the problem is the same.

My cluster is tiny; it consists of:
- 2 HDFS nodes
- 2 HBase RegionServers
- 2 Kafkas
- 2 Spark nodes
- 8 Spark Streaming jobs, processing around 100 messages/second TOTAL.

I'll be very grateful for your help here. If you need any more info, please write.
I also attach my yarn-site.xml, filtered to the timeline-server options.

And here is the timelineserver command line as seen in ps:
/usr/java/jdk1.7.0_79/bin/java -Dproc_timelineserver -Xmx1024m -Dhdp.version=2.3.0.0-2557 -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn -Dyarn.log.dir=/var/log/hadoop-yarn/yarn -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.home.dir= -Dyarn.id.str=yarn -Dhadoop.root.logger=INFO,EWMA,RFA -Dyarn.root.logger=INFO,EWMA,RFA -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native -Dyarn.policy.file=hadoop-policy.xml -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn -Dyarn.log.dir=/var/log/hadoop-yarn/yarn -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.home.dir=/usr/hdp/current/hadoop-yarn-timelineserver -Dhadoop.home.dir=/usr/hdp/2.3.0.0-2557/hadoop -Dhadoop.root.logger=INFO,EWMA,RFA -Dyarn.root.logger=INFO,EWMA,RFA -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native -classpath /usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/2.3.0.0-2557/hadoop/lib/*:/usr/hdp/2.3.0.0-2557/hadoop/.//*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/./:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/.//*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/.//*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/.//*:::/usr/share/java/mysql-connector-java.jar::/usr/share/java/mysql-connector-java.jar:/usr/hdp/current/hadoop-yarn-timelineserver/.//*:/usr/hdp/current/hadoop-yarn-timelineserver/lib/*:/usr/hdp/current/hadoop-client/conf/timelineserver-config/log4j.properties 
org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
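To see which threads inside this JVM are actually burning the CPU (a generic JVM technique, not specific to the timeline server), one can combine `top -H -p <pid>` with a `jstack <pid>` dump: top reports decimal thread ids, while jstack prints hexadecimal `nid` values, so a small conversion helper (illustrative only, names are my own) bridges the two:

```python
def tid_to_nid(tid: int) -> str:
    """Convert a decimal thread id (as shown by `top -H`) to the
    hexadecimal `nid=0x...` form used in jstack thread dumps."""
    return "0x%x" % tid

# Example: a thread that `top -H` shows at high CPU with TID 28125
# corresponds to the jstack entry containing nid=0x6ddd.
print(tid_to_nid(28125))  # prints 0x6ddd
```

Matching the hot TIDs against the stacks in the jstack output usually points straight at the busy component.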


Thanks!
Krzysztof


RE: YARN timelineserver process taking 600% CPU

Posted by "Naganarasimha G R (Naga)" <ga...@huawei.com>.
Hi,

What's the size of the store files?
Since when has it been running? How many applications have run since it was started?
What's the value of "yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms"?
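For reference, the relevant retention settings look like this in yarn-site.xml (values shown are the Hadoop 2.7 defaults to the best of my knowledge; verify against your own configuration):

```xml
<!-- Timeline store retention settings, shown with Hadoop 2.7 defaults -->
<property>
  <name>yarn.timeline-service.ttl-enable</name>
  <value>true</value>
</property>
<property>
  <!-- How long timeline data is retained: 7 days in milliseconds -->
  <name>yarn.timeline-service.ttl-ms</name>
  <value>604800000</value>
</property>
<property>
  <!-- How often the leveldb store runs its deletion cycle: 5 minutes -->
  <name>yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms</name>
  <value>300000</value>
</property>
```

If the store has grown large and the deletion cycle is frequent, the discard pass itself can keep the process busy.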

+ Naga
________________________________
From: Krzysztof Zarzycki [k.zarzycki@gmail.com]
Sent: Wednesday, September 30, 2015 19:20
To: user@hadoop.apache.org
Subject: YARN timelineserver process taking 600% CPU

Hi there Hadoopers,
I have a serious issue with my installation of Hadoop & YARN in version 2.7.1 (HDP 2.3).
The timelineserver process (more precisely, the org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer class) takes over 600% of CPU, generating enormous load on my master node. I can't guess why it happens.

At first I ran the timelineserver using Java 8 and thought that was the issue. But no: I have now started the timelineserver with Java 7, and the problem is still the same.

My cluster is tiny; it consists of:
- 2 HDFS nodes
- 2 HBase RegionServers
- 2 Kafkas
- 2 Spark nodes
- 8 Spark Streaming jobs, processing around 100 messages/second TOTAL.

I'll be very grateful for your help here. If you need any more info, please write.
I also attach my yarn-site.xml, grepped down to the options related to the timeline server.

And here is the timeline server command I see from ps:
/usr/java/jdk1.7.0_79/bin/java -Dproc_timelineserver -Xmx1024m -Dhdp.version=2.3.0.0-2557 -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn -Dyarn.log.dir=/var/log/hadoop-yarn/yarn -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.home.dir= -Dyarn.id.str=yarn -Dhadoop.root.logger=INFO,EWMA,RFA -Dyarn.root.logger=INFO,EWMA,RFA -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native -Dyarn.policy.file=hadoop-policy.xml -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn -Dyarn.log.dir=/var/log/hadoop-yarn/yarn -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.home.dir=/usr/hdp/current/hadoop-yarn-timelineserver -Dhadoop.home.dir=/usr/hdp/2.3.0.0-2557/hadoop -Dhadoop.root.logger=INFO,EWMA,RFA -Dyarn.root.logger=INFO,EWMA,RFA -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native -classpath /usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/2.3.0.0-2557/hadoop/lib/*:/usr/hdp/2.3.0.0-2557/hadoop/.//*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/./:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/.//*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/.//*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/.//*:::/usr/share/java/mysql-connector-java.jar::/usr/share/java/mysql-connector-java.jar:/usr/hdp/current/hadoop-yarn-timelineserver/.//*:/usr/hdp/current/hadoop-yarn-timelineserver/lib/*:/usr/hdp/current/hadoop-client/conf/timelineserver-config/log4j.properties 
org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer


Thanks!
Krzysztof

