Posted to common-user@hadoop.apache.org by 郭士伟 <gu...@gmail.com> on 2015/12/06 08:14:53 UTC

Re: YARN timelineserver process taking 600% CPU

It seems that it's the large leveldb size that causes the problem. What is
the value of the 'yarn.timeline-service.ttl-ms' config? Maybe it's not short
enough, so there are too many entities in the timeline store.
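
For example, something like the following in yarn-site.xml would keep only
about a week of entities (the value is just an illustration, 604800000 ms is
7 days; pick whatever retention you actually need):

<property>
  <name>yarn.timeline-service.ttl-ms</name>
  <value>604800000</value>
</property>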

And by the way, the discard-old-entities operation in the ATS can take a long
time (hours), and it also blocks the other operations. The patch
https://issues.apache.org/jira/browse/YARN-3448 is a great performance
improvement. We just backported it and it works well.

2015-11-06 13:07 GMT+08:00 Naganarasimha G R (Naga) <
garlanaganarasimha@huawei.com>:

> Hi Krzysiek,
>
>
>
> *There are currently 8 Spark Streaming jobs constantly running, 3 with a
> 1-second batch interval and 5 with a 10-second one. I believe these are the
> jobs that publish to ATS. How could I check what precisely is doing what,
> or how to get some logs about it? I don't know...*
>
> I am not sure about the applications being run, and if you have already
> tried disabling the "Spark History Server doing the puts to ATS", then I am
> not sure what else besides the apps could be sending it out. AFAIK the
> Spark History Server had not integrated with ATS (SPARK-1537). So most
> probably it is the applications themselves which are pumping in the data.
> I think you need to check with them directly.
>
>
> *2. Are 8 concurrent Spark Streaming jobs really that high a load for the
> timeline server? I have just a small cluster; how are other, larger
> companies handling a much larger load?*
>
> It has not been used at large scale by us, but YARN-2556 (ATS Performance
> Test Tool) states that "On a 36 node cluster, this results in ~830
> concurrent containers (e.g maps), each firing 10KB of payload, 20 times."
> The only difference is that the data in your store is already very large,
> hence the cost of querying (which currently happens during each insertion)
> is very high.
>
> Maybe people from other companies who have used or supported ATSv1 will be
> able to speak to the ATSv1 scale better!
>
>
> Regards,
>
> + Naga
> ------------------------------
> *From:* Krzysztof Zarzycki [k.zarzycki@gmail.com]
> *Sent:* Thursday, November 05, 2015 19:51
> *To:* user@hadoop.apache.org
> *Subject:* Re: YARN timelineserver process taking 600% CPU
>
> Thanks Naga for your input (I'm sorry for the late response, I was out for
> some time).
>
> So you believe that Spark is actually doing the PUTs? There are currently
> 8 Spark Streaming jobs constantly running, 3 with a 1-second batch interval
> and 5 with a 10-second one. I believe these are the jobs that publish to
> ATS. How could I check what precisely is doing what, or how to get some
> logs about it? I don't know...
> I thought maybe it was the Spark History Server doing the puts, but it
> seems it is not, as I disabled it and the load hasn't gone down. So it
> seems it is the jobs themselves indeed.
>
> Now I have the following problems:
> 1. The most important: How can I at least *work around* this issue? Maybe
> I can somehow disable Spark's usage of the YARN timeline server? What are
> the consequences? Is it only the history of finished Spark jobs that would
> not be saved? If so, that doesn't hurt that much. Probably this is a
> question for the Spark group...
> 2. Are 8 concurrent Spark Streaming jobs really that high a load for the
> timeline server? I have just a small cluster; how are other, larger
> companies handling a much larger load?
>
> Thanks for helping me with this!
> Krzysiek
>
>
>
>
>
>
>
>
>
>
> 2015-10-05 20:45 GMT+02:00 Naganarasimha Garla <naganarasimha.gr@gmail.com
> >:
>
>> Hi Krzysiek,
>> Oops, my mistake, 3 GB does seem to be on the higher side.
>> And from the jstack it seems there was no major activity other than puts;
>> it looks like around 16 concurrent puts were happening, each of which
>> tries to get the timeline entity first, hence hitting the native call.
>>
>> From the logs it seems like a lot of ACL validations are happening, and
>> from the URL it seems they are for PUT entities.
>> Approximately from 09:30:16 to 09:44:26 about 9213 checks have happened,
>> and if all of these are for puts then roughly 10 put calls/s are coming
>> from the *Spark* side. This, I feel, is not the right usage of ATS. Can
>> you check what is being published from Spark to ATS at this high rate?
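>>
>> For example, something along these lines should show what entities are
>> actually landing in the store (8188 is the default timeline web port, and
>> the entity type is just a placeholder here; use whatever entity types you
>> see in the AHS logs or responses):
>>
>> curl "http://<timeline-host>:8188/ws/v1/timeline/<ENTITY_TYPE>?limit=10"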
>>
>> Besides, some improvements regarding the timeline metrics are available
>> in trunk as part of YARN-3360, which could have been useful in analyzing
>> your issue.
>>
>> + Naga
>>
>>
>> On Mon, Oct 5, 2015 at 1:19 PM, Krzysztof Zarzycki <k....@gmail.com>
>> wrote:
>>
>>> Hi Naga,
>>> Sorry, but it's not 3 MB but 3 GB in leveldb-timeline-store (du shows
>>> numbers in kB). Does that seem reasonable as well?
>>> There are now 26850 files in the leveldb-timeline-store directory; new
>>> .sst files are generated each minute, and some are also being deleted.
>>>
>>> I started the timeline server today to gather logs and jstack; it was
>>> running for ~20 minutes. I attach the tar.bz2 archive with those logs.
>>>
>>> Thank you for helping me debug this.
>>> Krzysiek
>>>
>>>
>>>
>>>
>>>
>>> 2015-09-30 21:00 GMT+02:00 Naganarasimha Garla <
>>> naganarasimha.gr@gmail.com>:
>>>
>>>> Hi Krzysiek,
>>>> The size seems to be around 3 MB, which seems fine.
>>>> Could you try enabling debug logging and sharing the ATS/AHS logs, and
>>>> also, if possible, the jstack output for the AHS process?
>>>>
>>>> + Naga
>>>>
>>>> On Wed, Sep 30, 2015 at 10:27 PM, Krzysztof Zarzycki <
>>>> k.zarzycki@gmail.com> wrote:
>>>>
>>>>> Hi Naga,
>>>>> I see the following size:
>>>>> $ sudo du --max=1 /var/lib/hadoop/yarn/timeline
>>>>> 36      /var/lib/hadoop/yarn/timeline/timeline-state-store.ldb
>>>>> 3307772 /var/lib/hadoop/yarn/timeline/leveldb-timeline-store.ldb
>>>>> 3307812 /var/lib/hadoop/yarn/timeline
>>>>>
>>>>> The timeline service has been restarted multiple times as I was
>>>>> looking into the issue with it, but it was installed about 2 months
>>>>> ago. Just a few applications (1? 2?) have been started since its last
>>>>> restart. The ResourceManager interface shows 261 entries.
>>>>>
>>>>> As in the yarn-site.xml that I attached, the variable you're asking
>>>>> about has the following value:
>>>>>
>>>>> <property>
>>>>>   <name>yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms</name>
>>>>>   <value>300000</value>
>>>>> </property>
>>>>>
>>>>>
>>>>> Ah, one more thing: when I looked with jstack to see what the process
>>>>> is doing, I saw threads spending time in NATIVE code in the leveldbjni
>>>>> library. So I *think* it is related to the leveldb store.
>>>>>
>>>>> Please ask if any more information is needed.
>>>>> Any help is appreciated! Thanks
>>>>> Krzysiek
>>>>>
>>>>> 2015-09-30 16:23 GMT+02:00 Naganarasimha G R (Naga) <
>>>>> garlanaganarasimha@huawei.com>:
>>>>>
>>>>>> Hi ,
>>>>>>
>>>>>> What's the size of the store files?
>>>>>> Since when has it been running? How many applications have been run
>>>>>> since it was started?
>>>>>> What's the value of
>>>>>> "yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms"?
>>>>>>
>>>>>> + Naga
>>>>>> ------------------------------
>>>>>> *From:* Krzysztof Zarzycki [k.zarzycki@gmail.com]
>>>>>> *Sent:* Wednesday, September 30, 2015 19:20
>>>>>> *To:* user@hadoop.apache.org
>>>>>> *Subject:* YARN timelineserver process taking 600% CPU
>>>>>>
>>>>>> Hi there Hadoopers,
>>>>>> I have a serious issue with my installation of Hadoop & YARN in
>>>>>> version 2.7.1 (HDP 2.3).
>>>>>> The timelineserver process (more precisely, the
>>>>>> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>>>>>> class) takes over 600% of CPU, generating enormous load on my master
>>>>>> node. I can't figure out why it happens.
>>>>>>
>>>>>> At first I ran the timelineserver using Java 8 and thought that this
>>>>>> was the issue. But no, I have now started the timelineserver with
>>>>>> Java 7 and the problem is still the same.
>>>>>>
>>>>>> My cluster is tiny; it consists of:
>>>>>> - 2 HDFS nodes
>>>>>> - 2 HBase RegionServers
>>>>>> - 2 Kafkas
>>>>>> - 2 Spark nodes
>>>>>> - 8 Spark Streaming jobs, processing around 100 messages/second
>>>>>> TOTAL.
>>>>>>
>>>>>> I'll be very grateful for your help here. If you need any more info,
>>>>>> please write.
>>>>>> I also attach yarn-site.xml, grepped for the options related to the
>>>>>> timeline server.
>>>>>>
>>>>>> And here is the timelineserver command line that I see from ps:
>>>>>> /usr/java/jdk1.7.0_79/bin/java -Dproc_timelineserver -Xmx1024m
>>>>>> -Dhdp.version=2.3.0.0-2557 -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn
>>>>>> -Dyarn.log.dir=/var/log/hadoop-yarn/yarn
>>>>>> -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>>>>>> -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.home.dir=
>>>>>> -Dyarn.id.str=yarn -Dhadoop.root.logger=INFO,EWMA,RFA
>>>>>> -Dyarn.root.logger=INFO,EWMA,RFA
>>>>>> -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native
>>>>>> -Dyarn.policy.file=hadoop-policy.xml
>>>>>> -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn
>>>>>> -Dyarn.log.dir=/var/log/hadoop-yarn/yarn
>>>>>> -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>>>>>> -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>>>>>> -Dyarn.home.dir=/usr/hdp/current/hadoop-yarn-timelineserver
>>>>>> -Dhadoop.home.dir=/usr/hdp/2.3.0.0-2557/hadoop
>>>>>> -Dhadoop.root.logger=INFO,EWMA,RFA -Dyarn.root.logger=INFO,EWMA,RFA
>>>>>> -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native
>>>>>> -classpath
>>>>>> /usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/2.3.0.0-2557/hadoop/lib/*:/usr/hdp/2.3.0.0-2557/hadoop/.//*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/./:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/.//*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/.//*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/.//*:::/usr/share/java/mysql-connector-java.jar::/usr/share/java/mysql-connector-java.jar:/usr/hdp/current/hadoop-yarn-timelineserver/.//*:/usr/hdp/current/hadoop-yarn-timelineserver/lib/*:/usr/hdp/current/hadoop-client/conf/timelineserver-config/log4j.properties
>>>>>> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>>>>>>
>>>>>>
>>>>>> Thanks!
>>>>>> Krzysztof
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: YARN timelineserver process taking 600% CPU

Posted by Krzysztof Zarzycki <k....@gmail.com>.
Hi, thanks for joining in. I was running on the defaults, which in HDP 2.3
means:
yarn.timeline-service.ttl-ms: 2678400000  (31 days)
yarn.timeline-service.client.retry-interval-ms: 300000

But I worked around the problem!
I suspected that Hortonworks had decided to include this as-yet-unfinished
patch, https://issues.apache.org/jira/browse/SPARK-1537, in their HDP 2.3
distribution. My suspicion came from this thread:
http://markmail.org/message/w2z2foygzizlvnm4, as well as from this setting
in Spark in my HDP distribution:
spark.yarn.services: org.apache.spark.deploy.yarn.history.YarnHistoryService

I decided to disable the service. I didn't see a way to disable it, so I
just pointed the setting at a non-existent class and now ignore the warning
on Spark job start (sketched below). Then I restarted my Spark jobs, and the
load is gone! Well, the Spark history in YARN is gone too :/ But it just
seems that this feature is not yet production-ready, or at least badly
configured or something.
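
For reference, the override looks roughly like this, e.g. in
spark-defaults.conf (the class name is just a made-up placeholder that does
not exist, which is the whole point):

spark.yarn.services: org.apache.spark.deploy.yarn.history.SomeNonExistentService

Spark then prints a warning about the service at job start and carries on
without it, which is exactly the effect I wanted.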

Anyway, this is the workaround I found, for people who encounter similar
problems.

Thanks to all of you for trying to help me. If you have any opinions about it,
please share.
Cheers,
Krzysztof







2015-12-06 8:14 GMT+01:00 郭士伟 <gu...@gmail.com>:

> It seems that it's the large leveldb size that causes the problem. What is
> the value of the 'yarn.timeline-service.ttl-ms' config? Maybe it's not
> short enough, so there are too many entities in the timeline store.
> And by the way, the discard-old-entities operation in the ATS can take a
> long time (hours), and it also blocks the other operations. The patch
> https://issues.apache.org/jira/browse/YARN-3448 is a great performance
> improvement. We just backported it and it works well.
