Posted to issues@tez.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2015/06/04 21:24:38 UTC

[jira] [Updated] (TEZ-2485) Reduce the Resource Load on the Timeline Server

     [ https://issues.apache.org/jira/browse/TEZ-2485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated TEZ-2485:
----------------------------
    Attachment: ats-omit-dup-display-names-and-zero-counters.patch

Posting a prototype patch that does two main things to trim the amount of JSON being emitted by the AM for ATS events:

* Omits sending the display name for counters and counter groups if the display name is the same as the name.
* Omits sending counters that have a zero value.

These two changes cut the amount of JSON sent for the entire application roughly in half for large applications, since the ATS data is dominated by the counters in the task and task attempt finished events.
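
As a rough illustration (not the actual patch code; the JSON key names and the helper below are hypothetical), the trimming amounts to something like:

{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only -- the real patch changes Tez's ATS JSON conversion;
// this helper and the key names are hypothetical, not the patched code.
public class CounterJsonSketch {
  /** Returns null for zero-valued counters so the caller drops them entirely. */
  static Map<String, Object> counterToJson(String name, String displayName, long value) {
    if (value == 0) {
      return null;                                   // trim #2: skip zero-valued counters
    }
    Map<String, Object> json = new LinkedHashMap<>();
    json.put("counterName", name);
    if (displayName != null && !displayName.equals(name)) {
      json.put("counterDisplayName", displayName);   // trim #1: only send when it differs from the name
    }
    json.put("counterValue", value);
    return json;
  }
}
{code}

The same display-name rule applies to counter groups. The otherinfo breakdown in the description below shows why counters are the right target: the counters key alone accounts for roughly 646.7 MB of the ~726 MB of otherinfo value bytes, close to 90%.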

I had to make a couple of tweaks to the UI so it handles a missing display name.  I'm far from a UI expert, so apologies for any butchery there, but it seemed to work in practice when I tested it.

> Reduce the Resource Load on the Timeline Server
> -----------------------------------------------
>
>                 Key: TEZ-2485
>                 URL: https://issues.apache.org/jira/browse/TEZ-2485
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Jonathan Eagles
>         Attachments: TEZ-2485.REMOVE_TEZ_CONTAINER_ID.1.patch, TEZ-2485.SHORTER_ENTITIES.1.patch, ats-omit-dup-display-names-and-zero-counters.patch
>
>
> The disk, network, and memory resources needed by the timeline server are many times higher than those needed for the equivalent MapReduce job.
> Based on the storage improvements in YARN-3448, the timeline server may support up to 30,000 jobs / 10,000,000 tasks a day.
> While I understand there is community effort on timeline server v2, it would be good if Tez could reduce its pressure on the timeline server by auditing both the number and size of the events it publishes.
> Here are some observations based on my understanding of the design of the timeline stores:
> Each timeline entity pushed explodes into many records in the database:
> * 1 marker record
> * 1 domain record
> * 1 record per event
> * 2 records per related entity
> * 2 records per primary filter (2 records per primary filter in RollingLevelDBTimelineStore; the plain leveldb store rewrites the entire entity record per primary filter)
> * 1 record per other info entry
> For example:
> Task Attempt Start
> 1 marker
> 1 domain
> 1 task attempt start event
> 1 related entity X 2
> 7 other info entries
> 4 primary filters X 2
> 20 records written in the database for task attempt start
> Task Attempt Finish
> 1 marker
> 1 domain
> 1 task attempt finish event
> 1 related entity X 2
> 5 other info entries
> 5 primary filters X 2
> 20 records written in the database for task attempt finish
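> Putting the two breakdowns above into arithmetic (a rough sketch; the helper below is illustrative and not timeline-server code):
>
> {code:java}
> // Rough per-entity record count, following the fan-out listed above.
> public class TimelineRecordEstimate {
>   static int recordsPerEntity(int events, int relatedEntities,
>                               int primaryFilters, int otherInfoEntries) {
>     return 1                        // marker record
>          + 1                        // domain record
>          + events                   // 1 record per event
>          + 2 * relatedEntities      // 2 records per related entity
>          + 2 * primaryFilters       // 2 records per primary filter
>          + otherInfoEntries;        // 1 record per other info entry
>   }
>   public static void main(String[] args) {
>     System.out.println(recordsPerEntity(1, 1, 4, 7));  // 20 records for task attempt start
>     System.out.println(recordsPerEntity(1, 1, 5, 5));  // 20 records for task attempt finish
>   }
> }
> {code}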
> =====================================================
> QUESTION:
> =====================================================
> Is there any data we are publishing to the timeline server that is not
> in the UI?
> Do we use all the entities (TEZ_CONTAINER_ID, for example)?
> Do we use all the primary filters?
> Do we use all the related entities specified?
> Are there any fields we don't use?
> Are there other approaches to consider to reduce entity count/size?
> Is there a way to store the same information in less space?
> ===================
> Key Value Breakdown
> ||Count||Key Size||Value Size||
> |5642512|533690380|745454867|
> Entity Type Breakdown
> ||Type||Count||Key Size||Value Size||
> |TEZ_CONTAINER_ID|843850|86244392|5654341|
> |applicationAttemptId|544|53248|6174|
> |applicationId|544|44412|6174|
> |TEZ_TASK_ATTEMPT_ID|2471393|239523553|373637209|
> |TEZ_APPLICATION|1048|84312|13057630|
> |containerId|362443|37013813|4135845|
> |TEZ_VERTEX_ID|99239|10387114|1559948|
> |TEZ_DAG_ID|5402|387705|2910830|
> |TEZ_TASK_ID|1762211|146210017|344478400|
> |TEZ_APPLICATION_ATTEMPT|95838|13741814|8316|
> Column Breakdown
> ||Column||Count||Key Size||Value Size||
> |primarykeys|1092413|118768299|0|
> |marker|373515|25740507|2988120|
> |events|578196|55148482|1156392|
> |domain|373515|26114022|15314115|
> |reverserelated|587815|73721347|0|
> |otherinfo|2143751|170983893|725996240|
> |related|493307|63213830|0|
> Other Info Key Breakdown
> ||Key||Count||Key Size||Value Size||
> |appSubmitTime|126|11466|1638|
> |vertexName|349|23732|3081|
> |stats|349|21987|142938|
> |applicationId|163|10106|5705|
> |exitStatus|84337|7337319|84559|
> |endTime|288538|22354866|3750994|
> |counters|204201|15474759|646685059|
> |startTime|204201|15678960|2654613|
> |nodeId|106761|8540880|3950157|
> |initTime|512|32325|6656|
> |numKilledTasks|512|35397|517|
> |timeTaken|204201|15678960|1061085|
> |inProgressLogsURL|106761|9715251|11741572|
> |config|126|8820|13037092|
> |scheduledTime|96928|7172672|1260064|
> |dagPlan|163|9128|2074899|
> |completedLogsURL|106761|9608490|22703699|
> |taskAttemptErrorEnum|15808|1485952|331784|
> |initRequestedTime|349|26175|4537|
> |startRequestedTime|349|26524|4537|
> |numFailedTasks|512|35397|512|
> |vertexNameIdMapping|163|11084|16157|
> |numSucceededTasks|512|36933|1054|
> |numKilledTaskAttempts|512|38981|521|
> |status|204201|15066357|2198349|
> |processorClassName|349|26524|18690|
> |numFailedTaskAttempts|512|38981|512|
> |tezVersion|126|9324|14364|
> |numTasks|349|23034|665|
> |successfulAttemptId|96785|7742800|4355325|
> |nodeHttpAddress|106761|9501729|3950157|
> |numCompletedTasks|512|36933|1056|
> |diagnostics|204201|16087362|915925|
> |containerId|106761|9074685|5017767|



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)