Posted to issues@tez.apache.org by "Rajesh Balamohan (JIRA)" <ji...@apache.org> on 2015/05/04 17:55:08 UTC
[jira] [Updated] (TEZ-2076) Tez framework to extract/analyze data stored in ATS for specific dag

     [ https://issues.apache.org/jira/browse/TEZ-2076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated TEZ-2076:
----------------------------------
    Attachment: TEZ-2076.10.patch


ATSImportTool
	- Fixed docs
	- Fixed logging in case of exception
	- Fixed x.y.z for version info
	- Packaged the tool as a fat jar.  (--atsAddress=http://atsServer:port can be passed on the command line as an optional parameter if needed. Otherwise, it is picked up from the configuration in $HADOOP_CONF_DIR.)
	- Usage:
	{noformat}
usage: java -cp $HADOOP_CONF_DIR/:./target/tez-history-parser-x.y.z-SNAPSHOT-jar-with-dependencies.jar org.apache.tez.history.ATSImportTool
    --atsAddress <atsAddress>     Optional. ATS address (e.g http://clusterATSNode:8188)
    --dagId <dagId>               DagId that needs to be downloaded
    --downloadDir <downloadDir>   download directory where data needs to be downloaded
    --help                        print help
	{noformat}

>> What happens when some of the data is downloaded but some fails to?
- Detecting this would require parsing the downloaded data (e.g., if ATS goes down in the middle of a download).  Currently this is not checked and an exception would be thrown.  However, we would still get partial data (i.e., each batch is written to the zip file as soon as it is downloaded). Not sure if we need a feature to validate this.  I believe throwing an exception should be good enough for v1.

>> What happens if the tool is run when a dag is still in progress? Will it give invalid data back? Should that case be handled by throwing an error or just having the user warned as needed?
- Currently, if data is available (even partial data in the case of running jobs), it would be downloaded. Is the suggestion not to download if the job is in progress (e.g., RUNNING, INITING, SUBMITTED)?
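If in-progress DAGs should be rejected or flagged, the check could look roughly like the sketch below. The class, enum, and method names are hypothetical and not taken from the patch. Only RUNNING, INITING, and SUBMITTED come from the discussion; the terminal states are assumed for illustration.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch; these names are illustrative and not part of the patch.
class DagStatusCheck {
  // Only SUBMITTED, INITING and RUNNING are mentioned in the discussion;
  // the terminal states below are assumptions.
  enum DagState { SUBMITTED, INITING, RUNNING, SUCCEEDED, FAILED, KILLED }

  private static final Set<DagState> IN_PROGRESS =
      EnumSet.of(DagState.SUBMITTED, DagState.INITING, DagState.RUNNING);

  // True when the DAG has not finished, so downloaded data may be partial.
  static boolean isInProgress(DagState state) {
    return IN_PROGRESS.contains(state);
  }

  // Returns a warning for in-progress DAGs, or null when the DAG has finished.
  static String warningFor(String dagId, DagState state) {
    if (!isInProgress(state)) {
      return null;
    }
    return "DAG " + dagId + " is " + state + "; downloaded data may be incomplete";
  }

  public static void main(String[] args) {
    System.out.println(warningFor("dag_1", DagState.RUNNING));
    System.out.println(isInProgress(DagState.SUCCEEDED));
  }
}
```

Returning a warning string instead of throwing keeps the choice between "warn" and "fail" with the caller, which is the open question here.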

>> Maybe BaseInfo and then use abstract class?
- Fixed. Renamed AbstractInfo to BaseInfo

>> Should all info objects representing the data be moved to a package say parser.datamodel ?
- Moved all info objects to parser.datamodel.  Also created BaseParser, which can link task, vertex, dag etc. for reverse lookups.

>> How is versioning being handled in the serialized zip structure? Also, why json as compared to say a protobuf structure?
- No explicit version is maintained in the zip structure. Would adding tez-version be helpful?
- Moving back and forth from DAG-->TaskAttempt and TaskAttempt-->DAG can be complex in protobuf.  Hence the objects are maintained as in-memory POJOs after parsing the JSON.

>> What if there are 100,000 attempts? or more? Does this require a large memory footprint?
- No, the zip file can contain a large number of small part files. Each of them can contain some amount of task, attempt, vertex, dag information. As and when a part file is parsed, the JSON object pertaining to that part file is released, so there wouldn't be much memory pressure during parsing. However, the size of the DAG in-memory representation (POJO) can differ based on the size of the jobs. I will post the memory details soon.
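The part-file-at-a-time approach can be sketched roughly as follows, using only java.util.zip. This is not the parser from the patch: the class name, the PartHandler callback, and the sample entry names are all made up for illustration, and the JSON-to-POJO step is left to the handler.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch; not the parser from the patch.
class PartFileScan {
  // Callback that turns one part file's JSON text into POJOs (parsing omitted here).
  interface PartHandler {
    void handle(String entryName, String json);
  }

  // Streams the zip one entry at a time; each entry's text becomes
  // unreachable (eligible for GC) once the handler returns.
  static int scan(InputStream in, PartHandler handler) throws IOException {
    int parts = 0;
    byte[] buf = new byte[8192];
    try (ZipInputStream zip = new ZipInputStream(in)) {
      ZipEntry entry;
      while ((entry = zip.getNextEntry()) != null) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int n;
        while ((n = zip.read(buf, 0, buf.length)) != -1) {
          out.write(buf, 0, n);
        }
        handler.handle(entry.getName(), new String(out.toByteArray(), StandardCharsets.UTF_8));
        parts++;
      }
    }
    return parts;
  }

  // Builds a tiny in-memory zip with two made-up part files, then scans it.
  static List<String> scanSample() {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
        for (String name : new String[] {"part-0.json", "part-1.json"}) {
          zip.putNextEntry(new ZipEntry(name));
          zip.write("{}".getBytes(StandardCharsets.UTF_8));
          zip.closeEntry();
        }
      }
      List<String> seen = new ArrayList<>();
      scan(new ByteArrayInputStream(bytes.toByteArray()),
          (name, json) -> seen.add(name + "=" + json));
      return seen;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(scanSample());
  }
}
```

Only one part file's text is held at a time; what grows with job size is whatever the handler keeps, i.e. the POJO graph.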

>> Should serialized data be loaded on an demand basis? Or does the analyser always take an initial hit to load all data into memory?
- It might be memory efficient, but would make analysis hard. For analysis, we would like to move back and forth from DAG-->TaskAttempt and vice-versa.  This would call for all objects to be present in memory.

>> It seems like we have 2 data models. The runtime model and the analyser data model. It is going to be hard to keep them in sync. Any suggestions on how we can re-use a common model?
- No; ATS data is parsed and represented as in-memory POJOs via the parser. The analyzer would work on the in-memory (read-only) structures. Irrespective of any other changes in ATS, the in-memory representations of DAG, Vertex, Task, TaskAttempts should not change.

>> getAbsoluteSubmitTime() - is there a non-absolute timestamp elsewhere? Maybe simplify function names?
- Yes, getSubmitTime() would return the time relative to the DAG start time.  This would be useful when drawing swimlane diagrams, for instance. Renamed to getAbsStartTime() for now (any suggestions?)
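The difference between the absolute and the DAG-relative value can be illustrated with a small sketch. The class and field names are hypothetical, and epoch milliseconds are assumed as the unit; neither is confirmed by the patch.

```java
// Hypothetical sketch; class and field names are illustrative only.
class SubmitTimeSketch {
  private final long dagStartTimeMillis;   // absolute DAG start (assumed epoch millis)
  private final long absSubmitTimeMillis;  // absolute submit time (assumed epoch millis)

  SubmitTimeSketch(long dagStartTimeMillis, long absSubmitTimeMillis) {
    this.dagStartTimeMillis = dagStartTimeMillis;
    this.absSubmitTimeMillis = absSubmitTimeMillis;
  }

  // Wall-clock value as recorded.
  long getAbsoluteSubmitTime() {
    return absSubmitTimeMillis;
  }

  // Offset from the DAG start; convenient as an x-coordinate in a swimlane diagram.
  long getSubmitTime() {
    return absSubmitTimeMillis - dagStartTimeMillis;
  }

  public static void main(String[] args) {
    SubmitTimeSketch s = new SubmitTimeSketch(1000L, 1250L);
    System.out.println(s.getAbsoluteSubmitTime() + " " + s.getSubmitTime());
  }
}
```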

>> Could you clarify why most classes are marked public?
- All info objects would be public (evolving) as the analyzer code would rely on these in-memory objects.

>> void setTaskInfo(TaskInfo taskInfo)
- As mentioned earlier, the zip file can have an arbitrary number of part files. Each part file is parsed and an in-memory POJO is created. Before returning the final DAG (in-memory structure), we need to link task to attempts, vertex to DAG etc.  These links happen via these methods, which are not publicly exposed.
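A rough sketch of this linking step is below. The classes are simplified stand-ins rather than the info classes from the patch, and addAttempt() and link() are assumed names. The point is that the setters are package-private, so only parser code in the same package can wire the objects together, while analyzer code sees getters only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the real info classes in the patch may differ.
class LinkSketch {
  static class TaskInfo {
    private final String taskId;
    private final List<TaskAttemptInfo> attempts = new ArrayList<>();

    TaskInfo(String taskId) { this.taskId = taskId; }

    public String getTaskId() { return taskId; }

    // Read-only view for analyzer code.
    public List<TaskAttemptInfo> getAttempts() {
      return Collections.unmodifiableList(attempts);
    }

    // Package-private: only the parser adds links.
    void addAttempt(TaskAttemptInfo attempt) { attempts.add(attempt); }
  }

  static class TaskAttemptInfo {
    private final String attemptId;
    private final String taskId;
    private TaskInfo taskInfo;

    TaskAttemptInfo(String attemptId, String taskId) {
      this.attemptId = attemptId;
      this.taskId = taskId;
    }

    public String getAttemptId() { return attemptId; }

    public String getTaskId() { return taskId; }

    public TaskInfo getTaskInfo() { return taskInfo; }

    // Package-private: set once by the parser during linking.
    void setTaskInfo(TaskInfo taskInfo) { this.taskInfo = taskInfo; }
  }

  // Runs after all part files are parsed: attaches each attempt to its task.
  static void link(Map<String, TaskInfo> tasksById, List<TaskAttemptInfo> attempts) {
    for (TaskAttemptInfo attempt : attempts) {
      TaskInfo task = tasksById.get(attempt.getTaskId());
      if (task == null) {
        throw new IllegalStateException("No task for attempt " + attempt.getAttemptId());
      }
      attempt.setTaskInfo(task);
      task.addAttempt(attempt);
    }
  }

  public static void main(String[] args) {
    TaskInfo task = new TaskInfo("task_1");
    TaskAttemptInfo attempt = new TaskAttemptInfo("attempt_1", "task_1");
    Map<String, TaskInfo> tasks = new HashMap<>();
    tasks.put(task.getTaskId(), task);
    link(tasks, Collections.singletonList(attempt));
    System.out.println(attempt.getTaskInfo().getTaskId() + " " + task.getAttempts().size());
  }
}
```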

>> it would be good to try the tool with invalid data, corrupt zip files, etc to ensure that there is useful error messages.
- In case of a corrupt file, it would throw an exception. E.g.
{noformat}
Exception in thread "main" org.apache.tez.dag.api.TezException: java.util.zip.ZipException: error in opening zip file
{noformat}
Haven't added many constraints for data validation yet. Will add that.

>> the import tool should be run against an invalid timeline server for e.g make it hit the RM on port 8088 so that there is a valid webserver serving but returns back 404s, etc. , a server which times out, etc.
- Covered in test case.

>> invalid arguments are needed. A downloadDir named " " would be problematic for the tool.
- Preconditions are added.
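As an illustration of the kind of argument check meant here, a plain-Java sketch follows. It does not reproduce the checks in the patch; the class and method names are hypothetical.

```java
// Hypothetical sketch in plain Java; the patch's actual checks may differ.
class ArgChecks {
  // A value is usable when it is non-null and not just whitespace.
  static boolean isUsable(String value) {
    return value != null && !value.trim().isEmpty();
  }

  // Returns the value unchanged, or fails fast with a message naming the option.
  static String requireNonBlank(String optionName, String value) {
    if (!isUsable(value)) {
      throw new IllegalArgumentException("--" + optionName + " must not be null or blank");
    }
    return value;
  }

  public static void main(String[] args) {
    System.out.println(requireNonBlank("downloadDir", "/tmp/dag"));
    try {
      requireNonBlank("downloadDir", " ");
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```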

>> pom file needs 2 space tabs not 4.
- Fixed.

>> pom file contains too many dependencies. Not sure why there is a dependency on mr and hdfs jars for example.
- Removed unwanted dependencies.  mr/hdfs are mainly for testing.

> Tez framework to extract/analyze data stored in ATS for specific dag
> --------------------------------------------------------------------
>
>                 Key: TEZ-2076
>                 URL: https://issues.apache.org/jira/browse/TEZ-2076
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-2076.1.patch, TEZ-2076.10.patch, TEZ-2076.2.patch, TEZ-2076.3.patch, TEZ-2076.4.patch, TEZ-2076.5.patch, TEZ-2076.6.patch, TEZ-2076.7.patch, TEZ-2076.8.patch, TEZ-2076.9.patch, TEZ-2076.WIP.2.patch, TEZ-2076.WIP.3.patch, TEZ-2076.WIP.patch
>
>
> - Users should be able to download ATS data pertaining to a DAG from Tez-UI (more like a zip file containing DAG/Vertex/Task/TaskAttempt info).
> - This can be plugged to an analyzer which parses the data, adds semantics and provides an in-memory representation for further analysis.
> - This will enable to write different analyzer rules, which can be run on top of this in-memory representation to come up with analysis on the DAG.
> - Results of this analyzer rules can be rendered on to UI (standalone webapp) later point in time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)