
Posted to user@tez.apache.org by Filip Haase <ha...@googlemail.com> on 2014/06/13 17:03:27 UTC

Tez Debugging, JVMs and Containers and DataMovementEvents

Hi,

  I'm a student at TU Berlin working on the Stratosphere (or Apache 
Flink) project. Currently I'm working with TEZ with the goal of 
implementing "Stratosphere-On-TEZ", which will use TEZ as the runtime 
for executing Stratosphere jobs.
  In my GitHub repo https://github.com/filiphaase/incubator-tez I also 
have a first prototype that uses Stratosphere components such as 
serializers, comparators, and input formats in TEZ.

After working with TEZ a bit, I gathered a few questions:
- How to debug efficiently? Can I run a set of processors locally from 
within the IDE (IntelliJ)? And how best to attach to a container JVM? 
(Currently I'm submitting JARs to Hadoop and reading log files for 
debugging, which is very time-intensive.)
- Is the assumption that each task execution is in its own JVM correct?
- Is container reuse only used for the same AM, or across AMs? If it is 
across AMs, is it only within the same AM type? (Not across, e.g., 
Stratosphere and Hive.)
- My understanding is that TEZ always uses DataMovementEvents for 
shipping data between processors. And there are two ways to do this: 
Ship data in the user-payload of the event ("dataViaEvents") or have no 
data in the user-payload and only describe the location of the data. Is 
this correct? If yes: When should which variant be used?

Thanks and Regards,
Filip Haase

Re: Tez Debugging, JVMs and Containers and DataMovementEvents

Posted by Hitesh Shah <hi...@apache.org>.

Hello Filip, 

Answers inline.

On Jun 13, 2014, at 8:03 AM, Filip Haase <ha...@googlemail.com> wrote:

> Hi,
> 
> I'm a student at TU Berlin working on the Stratosphere (or Apache Flink) project. Currently I'm working with TEZ with the goal of implementing "Stratosphere-On-TEZ", which will use TEZ as the runtime for executing Stratosphere jobs.
> In my GitHub repo https://github.com/filiphaase/incubator-tez I also have a first prototype that uses Stratosphere components such as serializers, comparators, and input formats in TEZ.
> 
> After working with TEZ a bit, I gathered a few questions:
> - How to debug efficiently? Can I run a set of processors locally from within the IDE (IntelliJ)? And how best to attach to a container JVM? (Currently I'm submitting JARs to Hadoop and reading log files for debugging, which is very time-intensive.)

Debuggability from a developer's point of view has not been addressed properly yet. A couple of folks are working on this, and also on getting the full DAG to run within a single process. @Chen/@Oleg, could you provide pointers to help Filip?

> - Is the assumption that each task execution is in its own JVM correct?

A task runs within a single JVM. However, if container re-use is enabled, a single JVM could end up running multiple tasks (serially).

> - Is container reuse only used for the same AM, or across AMs? If it is across AMs, is it only within the same AM type? (Not across, e.g., Stratosphere and Hive.)

Container re-use happens only within the same application, i.e. the same AM.

> - My understanding is that TEZ always uses DataMovementEvents for shipping data between processors. And there are two ways to do this: Ship data in the user-payload of the event ("dataViaEvents") or have no data in the user-payload and only describe the location of the data. Is this correct? If yes: When should which variant be used?
> 

DataMovementEvents are effectively a way for an Output to inform an Input as to the location of the data. Each Input/Output pair could define its own semantics and pass information through the event’s payload. For some inbuilt Inputs/Outputs, we experimented with trying to send the data via the event itself in scenarios where the data size was very small. This feature is experimental so I would probably steer clear of it for now. Are you planning to write your own Input/Output pairs or planning to re-use some of the current ones? 

thanks
— Hitesh

Re: Tez Debugging, JVMs and Containers and DataMovementEvents

Posted by Siddharth Seth <ss...@apache.org>.

Reply inline.
On Fri, Jun 13, 2014 at 8:03 AM, Filip Haase <ha...@googlemail.com>
wrote:

> Hi,
>
>  I'm a student at TU Berlin working on the Stratosphere (or Apache Flink)
> project. Currently I'm working with TEZ with the goal of implementing
> "Stratosphere-On-TEZ", which will use TEZ as the runtime for executing
> Stratosphere jobs.
>  In my GitHub repo https://github.com/filiphaase/incubator-tez I also
> have a first prototype that uses Stratosphere components such as
> serializers, comparators, and input formats in TEZ.
>
> After working with TEZ a bit, I gathered a few questions:
> - How to debug efficiently? Can I run a set of processors locally from
> within the IDE (IntelliJ)? And how best to attach to a container JVM?
> (Currently I'm submitting JARs to Hadoop and reading log files for
> debugging, which is very time-intensive.)
>
Starting a processor, as part of an executing job, within an IDE can be
difficult. Oleg is doing some work to simplify this, and maybe he could
chip in. There is also some work on LocalMode
(https://issues.apache.org/jira/browse/TEZ-684), which runs everything
within the same JVM and could help with debugging, but it is still a work
in progress.
If you're looking to debug only a processor, your best bet would be to
write a test that runs the TezChild in-process with the correct set of
Inputs and Outputs, and to generate the required DataMovementEvents in the
test.
TEZ-1032 is another item to look at; it allows specific tasks to have
additional JVM launch parameters. You could try making use of this and then
connect remotely to the specific task.
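
As a rough illustration of the TEZ-1032 approach, the sketch below builds
the settings that would start one task with a JDWP debug agent so that an
IDE can attach to it. The -agentlib:jdwp flag is the standard JVM debug
option. The two property names and the "vertex[taskIds]" selector format
are written from memory of later Tez releases and should be treated as
assumptions to check against TezConfiguration in your build; the vertex
name "myVertex" and port 5005 are purely hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: settings that launch selected tasks with a JDWP debug agent. */
final class TaskDebugOpts {

    /** Standard JDWP agent flag; suspend=y makes the JVM wait for a debugger. */
    static String jdwpOpts(int port, boolean suspend) {
        return "-agentlib:jdwp=transport=dt_socket,server=y,suspend="
                + (suspend ? "y" : "n") + ",address=" + port;
    }

    /**
     * Assumed property names; verify them against TezConfiguration.
     * taskSpec selects the tasks, e.g. "myVertex[0]" for task 0 of a
     * (hypothetical) vertex named "myVertex".
     */
    static Map<String, String> debugConf(String taskSpec, int port) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("tez.task-specific.launch.cmd-opts.list", taskSpec);
        conf.put("tez.task-specific.launch.cmd-opts", jdwpOpts(port, true));
        return conf;
    }

    public static void main(String[] args) {
        // Print the key=value pairs that would go into the job configuration.
        debugConf("myVertex[0]", 5005)
                .forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

With suspend=y the selected task blocks at startup until a debugger
connects to the chosen port on the node running the container, so the port
has to be reachable from the machine running the IDE.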

> - Is the assumption that each task execution is in its own JVM correct?
>
At the moment, each JVM executes one task at a time. The same JVM, with
re-use, could execute multiple tasks - but only one at a time.
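
One practical consequence for code ported from another runtime: anything
held in static fields survives from one task to the next when the JVM is
re-used. The toy model below shows this in plain Java. It does not use any
Tez classes; Task and runSerially are hypothetical stand-ins for a task and
for a re-used container.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of container re-use: one JVM runs several tasks in sequence. */
final class ReuseDemo {

    /** Static state lives as long as the JVM, so it outlives a single task. */
    static int tasksSeenByThisJvm = 0;

    /** Hypothetical stand-in for a task; not a Tez class. */
    static final class Task {
        final String name;

        Task(String name) {
            this.name = name;
        }

        String run() {
            tasksSeenByThisJvm++; // carried over to the next task under re-use
            return name + " was task #" + tasksSeenByThisJvm + " in this JVM";
        }
    }

    /** Runs the tasks one at a time, the way a re-used container would. */
    static List<String> runSerially(List<Task> tasks) {
        List<String> log = new ArrayList<>();
        for (Task t : tasks) {
            log.add(t.run());
        }
        return log;
    }

    public static void main(String[] args) {
        List<Task> tasks = Arrays.asList(new Task("map-0"), new Task("map-1"));
        runSerially(tasks).forEach(System.out::println);
    }
}
```

If per-task isolation matters, either reset such state at the start of each
task or disable re-use; as far as I recall the switch is
tez.am.container.reuse.enabled, but please verify the name against your Tez
version.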

> - Is container reuse only used for the same AM, or across AMs? If it is
> across AMs, is it only within the same AM type? (Not across, e.g.,
> Stratosphere and Hive.)
>
Container re-use happens only within the same AM.

> - My understanding is that TEZ always uses DataMovementEvents for shipping
> data between processors. And there are two ways to do this: Ship data in
> the user-payload of the event ("dataViaEvents") or have no data in the
> user-payload and only describe the location of the data. Is this correct?
> If yes: When should which variant be used?
>
Using 'dataViaEvents' is not recommended. That is experimental (and will
likely be removed) on the UnsortedUnpartitionedOutput, and would only work
for really small data sets. DataMovementEvents should be used to specify
meta-information about where data is to be fetched from.
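
To make the "meta-information" variant concrete, here is a minimal sketch
of a payload that describes where data can be fetched from instead of
carrying the data itself. It deliberately avoids the Tez API: the class
LocationPayload and its fields (host, port, pathId) are hypothetical, and
the byte layout is just one possible choice that a custom Input/Output pair
could agree on.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Sketch: a payload that says where the data lives, not the data itself. */
final class LocationPayload {
    final String host;   // node serving the output
    final int port;      // port of the serving process
    final String pathId; // hypothetical identifier of the produced output

    LocationPayload(String host, int port, String pathId) {
        this.host = host;
        this.port = port;
        this.pathId = pathId;
    }

    /** Output side: serialize as [len][host bytes][port][len][pathId bytes]. */
    ByteBuffer encode() {
        byte[] h = host.getBytes(StandardCharsets.UTF_8);
        byte[] p = pathId.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + h.length + 4 + 4 + p.length);
        buf.putInt(h.length).put(h).putInt(port).putInt(p.length).put(p);
        buf.flip(); // switch the buffer from writing to reading
        return buf;
    }

    /** Input side: read the fields back in the same order. */
    static LocationPayload decode(ByteBuffer payload) {
        ByteBuffer in = payload.duplicate(); // leave the caller's position alone
        byte[] h = new byte[in.getInt()];
        in.get(h);
        int port = in.getInt();
        byte[] p = new byte[in.getInt()];
        in.get(p);
        return new LocationPayload(
                new String(h, StandardCharsets.UTF_8),
                port,
                new String(p, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        LocationPayload sent = new LocationPayload("node-7.example.org", 13562, "attempt_0_output");
        LocationPayload received = decode(sent.encode());
        System.out.println(received.host + ":" + received.port + " " + received.pathId);
    }
}
```

The Output would attach the encoded bytes to the event it emits, and the
matching Input would decode them and then fetch the actual data from the
described location.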

HTH
Thanks
- Sid


>
> Thanks and Regards,
> Filip Haase
>
>