Posted to dev@crunch.apache.org by "Gabriel Reid (JIRA)" <ji...@apache.org> on 2014/07/28 17:53:40 UTC

[jira] [Commented] (CRUNCH-449) Add sequentialDo function for injecting arbitrary non-parallel code

    [ https://issues.apache.org/jira/browse/CRUNCH-449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076326#comment-14076326 ] 

Gabriel Reid commented on CRUNCH-449:
-------------------------------------

Sorry I took so long to take a look at this. Looks interesting -- at first I found it a bit difficult to figure out what exactly it would be used for (and what the advantage is over just calling Pipeline.run at certain points), but it looks like this opens up a whole lot of other opportunities to indirectly influence the job plan without having to worry about exactly how it's done.

I noticed that SeqDoFn.dependsOn(String, PCollection) is called implicitly from PCollectionImpl.sequentialDo, but SeqDoFn.dependsOn(String, Target) always needs to be called explicitly. I guess this makes sense, but maybe it would be handy to change PCollection.sequentialDo to accept a String argument that would be used as the label of the incoming PCollection dependency. I'm thinking that would make it easier to retrieve that PCollection later by name from within the SeqDoFn.
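
A rough sketch of what a labelled sequentialDo might look like. Everything here is a simplified stand-in made up for illustration: Coll and Fn are not the PCollection and SeqDoFn classes from the patch, the sequentialDo(String, Fn) signature and the getInput lookup are assumptions, and the function runs eagerly rather than being deferred to the job plan.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LabeledSequentialDoSketch {

    // Hypothetical stand-in for PCollection; not the real Crunch interface.
    static class Coll {
        final String name;

        Coll(String name) { this.name = name; }

        // Assumed signature: the label names this collection as a dependency of fn.
        <T> T sequentialDo(String label, Fn<T> fn) {
            fn.dependsOn(label, this);  // registered implicitly on the caller's behalf
            return fn.execute();        // eager here; a real pipeline would defer this
        }
    }

    // Hypothetical stand-in for SeqDoFn.
    static abstract class Fn<T> {
        private final Map<String, Coll> inputs = new LinkedHashMap<>();

        void dependsOn(String label, Coll c) { inputs.put(label, c); }

        // Look up a dependency by the label it was registered under.
        protected Coll getInput(String label) {
            Coll c = inputs.get(label);
            if (c == null) {
                throw new IllegalArgumentException("No dependency labelled " + label);
            }
            return c;
        }

        abstract T execute();
    }

    static String demo() {
        Coll events = new Coll("events");
        return events.sequentialDo("in", new Fn<String>() {
            @Override
            String execute() { return "processed " + getInput("in").name; }
        });
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The point of the label is that the function body can refer to its input by a name chosen at the call site instead of relying on registration order.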

Can the "Output" generic parameter of SeqDoFn be bounded by PCollection (i.e. <Output extends PCollection<?>>), just because that might make documenting it easier? Or is it possible to have a SeqDoFn that is bound to something other than a PCollection?
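
For reference, a minimal sketch of how such a bound behaves, using made-up stand-in types (Coll, ListColl and BoundedFn are hypothetical and only mirror the shape of PCollection and SeqDoFn; they are not the classes in the patch):

```java
import java.util.Arrays;
import java.util.List;

public class BoundedOutputSketch {

    // Hypothetical stand-in for PCollection.
    interface Coll<T> {
        List<T> values();
    }

    // Output is bounded, so every function must produce some kind of collection.
    static abstract class BoundedFn<Output extends Coll<?>> {
        abstract Output execute();
    }

    // Trivial in-memory collection used only for the demo.
    static class ListColl<T> implements Coll<T> {
        private final List<T> values;

        ListColl(List<T> values) { this.values = values; }

        @Override
        public List<T> values() { return values; }
    }

    static int demo() {
        BoundedFn<Coll<String>> fn = new BoundedFn<Coll<String>>() {
            @Override
            Coll<String> execute() {
                return new ListColl<>(Arrays.asList("a", "b"));
            }
        };
        // Declaring BoundedFn<String> would be rejected by the compiler,
        // because String does not satisfy the Coll<?> bound.
        return fn.execute().values().size();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The trade-off is that a function returning several collections, or nothing at all, would no longer fit the bound without some wrapper type.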

I noticed that the PCollection class has a commented-out version of the sequentialDo method that needs to be removed.

I know you're probably on top of this, but I'll just point it out anyway: more docs in SeqDoFn, particularly on the abstract methods, would be really good. It's not immediately obvious exactly how it is intended to be used.

Also, more tests demonstrating additional use cases (target isn't created, dependent on multiple targets, dependent on multiple PCollections, dependent on a combination of targets and PCollections) would be really handy, if only as a way of documenting how this new functionality can be used.

> Add sequentialDo function for injecting arbitrary non-parallel code
> -------------------------------------------------------------------
>
>                 Key: CRUNCH-449
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-449
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>         Attachments: CRUNCH-449.patch, CRUNCH-449b.patch
>
>
> I've been noodling on this one for a while: how to add the ability to execute some code if and only if one or more targets are created, and have that executed code (optionally) return one or more new PCollections as a result. I was thinking that this functionality could be wired into libraries to do things like bulk loading HBase tables or running Sqoop jobs as part of Crunch pipelines automatically.



--
This message was sent by Atlassian JIRA
(v6.2#6252)