Posted to user@tez.apache.org by Hitesh Shah <hi...@apache.org> on 2014/01/13 23:02:52 UTC

Fwd: Tez compatibility with MR

Forwarding to the user@ list as there was some interest in the MR-Tez compatibility story. 	

Begin forwarded message:

> From: Hitesh Shah <hi...@apache.org>
> Date: January 10, 2014 3:45:33 PM PST
> To: dev@tez.incubator.apache.org
> Subject: Re: Tez compatibility with MR
> 
> Hi Jonathan 
> 
> Most of the points below summarize the views of the current Tez devs:
> 
> We have tried to get many aspects of MR working on Tez, but not to full completion, as most of the current contributors have moved on to focus on other aspects of Tez.
> With respect to MR, there are some features we have not gotten around to, or may not even be aware of. There are also some minor things that may not make sense for Tez to support.
> 
> Broad categories/missing features:
> 
> i) Job History: The plan is to use YARN Application History/Timeline to create Tez-specific history. There is no history/UI support at the moment, either for a running AM or after job completion.
> ii) Recovery: In the works, in conjunction with the above history implementation.
> iii) Configuration knobs: We have run a set of MR system tests and fixed a number of the compatibility issues seen. I am sure that, as more folks try MR on Tez, we will discover more gaps.
> iv) Task run-time: There are still minor issues which probably need to be addressed. For example, TEZ-637, to set all the bits required by MR components. Progress support is not fully functional as of now.
> v) Speculation: No one has started work on this yet.
> vi) Command-line tools:
> 
> Taking a look at bin/mapred job:
> 
> 	[-submit <job-file>]
> 	[-status <job-id>]
> 	[-counter <job-id> <group-name> <counter-name>]
> 	[-kill <job-id>]
> 	[-set-priority <job-id> <priority>]. Valid values for priorities are: VERY_HIGH HIGH NORMAL LOW VERY_LOW
> 	[-events <job-id> <from-event-#> <#-of-events>]
> 	[-history <jobHistoryFile>]
> 	[-list [all]]
> 	[-list-active-trackers]
> 	[-list-blacklisted-trackers]
> 	[-list-attempt-ids <job-id> <task-type> <task-state>]. Valid values for <task-type> are REDUCE MAP. Valid values for <task-state> are running, completed
> 	[-kill-task <task-attempt-id>]
> 	[-fail-task <task-attempt-id>]
> 	[-logs <job-id> <task-attempt-id>]
> 
> By running MR tasks within the Tez context, quite a bit of information is obviously lost. This is currently a big gap, partly because we have not looked at it and partly because it is an open design question as to what should be supported. For example, today there is no support for a task to provide any general update information back to the AM (which could then be exposed to the client). It becomes a tricky question as to what an overall task state means when a task consists of a single processor, multiple inputs and multiple outputs.
> 
> We will take any help we can get :). If you are specifically looking at MR compatibility, the above list can get you started. Or you can start by trying to run your existing MR jobs against Tez and looking for bugs/feature gaps.
> 
> thanks
> -- Hitesh
> 
> On Jan 10, 2014, at 11:19 AM, Jonathan Eagles wrote:
> 
>> I have seen some comments on missing functionality in Tez such as
>> 
>> "MapReduce on Tez is not 100% compatible with traditional MapReduce -
>> example the functionality available on the JobClient to track individual
>> tasks is missing."
>> 
>> It's not quite clear to me at this point what all the missing pieces are and
>> whether those are design limitations or just not enough hands to get to them
>> all due to other more pressing priorities. If the latter, I'd be happy to help
>> out to add these or other features if there is a need.
>> 
>> jeagles
>