Posted to dev@crunch.apache.org by Christian Tzolov <ch...@gmail.com> on 2014/07/01 14:16:37 UTC

Visualizing internal pipeline preparation stages

Hi,

While exploring the Crunch MR execution flow I decided to augment the
excellent pipeline DOT diagram with a few additional visualizations of some
interesting (for me) internal/intermediate pipeline preparation states,
such as the output-pcollection-targets structure (used for the pipeline
planning), the Graphs before and after the split-up of dependent GBK nodes,
and the RTNode hierarchy as persisted in the Configuration before the
execution of the pipeline.
For each diagram I've plotted some relevant internals like the PType
structures. The implementation hack includes 3 additional DotfileWriters
hooked inside MSCRPlanner#plan() to intercept the flow.

An example of the diagrams generated from the
org.apache.crunch.Breakpoint2IT#testBreakpoint() pipeline is linked below.

Do we need this kind of internals visualization? Something like a
visualization of the logical, mapping and physical (e.g. RTNodes) plans of
the pipeline preparation? What do you think?

Cheers,
Christian


Diagrams generated from the
org.apache.crunch.Breakpoint2IT#testBreakpoint() pipeline.

- Dotfile containing all graphs:
https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint.dot


1.
https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_main.png
- This is the existing diagram. It provides a very well-balanced view of the
pipeline, showing how the functional blocks are mapped into execution
Map/Reduce components and the dependencies between them.

2.
https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_pcollection_outputTargets.png
- Visualizes the outputs (Map<PCollectionImpl<?>, Set<Target>> outputs) held
by the MSCRPlanner while the plan() operation is executing:
- Each data flow is depicted with a different color to indicate the
overlapping execution paths.
- The PCollection name, class and PTypes are shown.
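
A rough illustration of the coloring idea in diagram 2: the sketch below
writes one DOT edge per (collection, target) pair and gives each collection
its own color. It uses plain strings in place of PCollectionImpl and Target,
and the class name, method name and palette are made up for this example. It
is not the code of the DotfileWriter hack described above.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class OutputTargetsDotSketch {

  // Hypothetical palette; colors repeat if there are more flows than entries.
  private static final String[] COLORS = {"blue", "darkgreen", "orange", "purple"};

  // Stand-in for Map<PCollectionImpl<?>, Set<Target>>: only names are kept.
  public static String toDot(Map<String, Set<String>> outputs) {
    StringBuilder sb = new StringBuilder("digraph outputs {\n");
    int flow = 0;
    for (Map.Entry<String, Set<String>> e : outputs.entrySet()) {
      // One color per collection, so overlapping paths stay distinguishable.
      String color = COLORS[flow % COLORS.length];
      flow++;
      for (String target : e.getValue()) {
        sb.append("  \"").append(e.getKey()).append("\" -> \"")
          .append(target).append("\" [color=").append(color).append("];\n");
      }
    }
    return sb.append("}\n").toString();
  }

  public static void main(String[] args) {
    Map<String, Set<String>> outputs = new LinkedHashMap<>();
    Set<String> targets = new LinkedHashSet<>();
    targets.add("Text(/out/a)");
    outputs.put("pcollection-1", targets);
    System.out.print(toDot(outputs));
  }
}
```

The resulting text can be rendered with the Graphviz dot tool, the same way
as the linked Dotfile.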

3.
https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_BaseGraph.png
- Visualizes the 'Base Graph' created in the MSCRPlanner#plan() method. It
draws the Vertices with their names, pcollection and ptype. Each arc label
lists the paths of the corresponding Graph edge.

4.
https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_FinalGraphWithComponents.png
- Graph created in MSCRPlanner#plan() after splitting up the dependent
GBK nodes and breaking the graph up into connected components, each bounded
by a red dashed line.
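
For readers unfamiliar with how such component boundaries are expressed in
DOT, here is a minimal sketch. Each connected component becomes a Graphviz
cluster subgraph with a dashed red border. The class name, method name and
the use of plain vertex names are assumptions made for this example; the
planner's own Graph and Vertex types are not used.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ComponentClusterDotSketch {

  // components: hypothetical component label -> names of the vertices in it.
  public static String toDot(Map<String, List<String>> components) {
    StringBuilder sb = new StringBuilder("digraph components {\n");
    int i = 0;
    for (Map.Entry<String, List<String>> c : components.entrySet()) {
      // Graphviz only draws a border for subgraphs whose name starts with "cluster".
      sb.append("  subgraph cluster_").append(i++).append(" {\n");
      sb.append("    label=\"").append(c.getKey()).append("\";\n");
      sb.append("    style=dashed; color=red;\n");
      for (String vertex : c.getValue()) {
        sb.append("    \"").append(vertex).append("\";\n");
      }
      sb.append("  }\n");
    }
    return sb.append("}\n").toString();
  }

  public static void main(String[] args) {
    Map<String, List<String>> components = new LinkedHashMap<>();
    components.put("component 0", Arrays.asList("input", "GBK-1"));
    components.put("component 1", Arrays.asList("GBK-2", "output"));
    System.out.print(toDot(components));
  }
}
```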

5.
https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_RTNodesAndFormatBundles.png
- Visualizes the RTNodes used inside the CrunchMapper and CrunchReducer as
well as the Inputs and Outputs.
- RTNodes are deserialized from the Job's
CRUNCH_WORKING_DIRECTORY/(MAP|REDUCE|COMBINE). Every RTNode is mapped to
the containing Map or Reduce task and parent Crunch Job. The relationship
between RTNodes (e.g. parent/children) is depicted with arrows.
- Named Outputs are deserialized from the CRUNCH_OUTPUTS into Map<String,
OutputConfig> and depicted in the magenta subgraph.
- Inputs are deserialized from the CRUNCH_INPUTS into Map<FormatBundle,
Map<Integer, List<Path>>> and depicted in the green subgraph.
- The inputs are mapped to the corresponding RTNode using the nodeIndex
reference.
- Outputs are mapped to the corresponding RTNode by the Output Name
references.
- There is no good way to print the anonymous DoFn instances.
- Note: the dependency between the Crunch jobs is not drawn, as it may
require access to the completion hook attributes.
- Note: in order to draw the RTNodes I had to expose their attributes via
public getters.

Re: Visualizing internal pipeline preparation stages

Posted by Josh Wills <jo...@gmail.com>.
Yes please.
On Jul 2, 2014 10:21 AM, "Christian Tzolov" <ch...@gmail.com>
wrote:

> cool :) What is the best way to continue? open a new Jira ticket for it?

Re: Visualizing internal pipeline preparation stages

Posted by Christian Tzolov <ch...@gmail.com>.
cool :) What is the best way to continue? open a new Jira ticket for it?


On Tue, Jul 1, 2014 at 3:22 PM, Josh Wills <jw...@cloudera.com> wrote:

> +1-- very cool. :)

Re: Visualizing internal pipeline preparation stages

Posted by Josh Wills <jw...@cloudera.com>.
+1-- very cool. :)


On Tue, Jul 1, 2014 at 5:28 AM, Gabriel Reid <ga...@gmail.com> wrote:

> Hey Christian,
>
> This looks awesome! There have been a bunch of times when I've been
> digging around in the planner and wanting to have something like this,
> so yes, I definitely think this is useful to have.
>
> - Gabriel



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Re: Visualizing internal pipeline preparation stages

Posted by Gabriel Reid <ga...@gmail.com>.
Hey Christian,

This looks awesome! There have been a bunch of times when I've been
digging around in the planner and wanting to have something like this,
so yes, I definitely think this is useful to have.

- Gabriel

