Posted to user@crunch.apache.org by Peter Knap <pk...@yahoo.com> on 2013/03/09 03:53:20 UTC

MultipleOutput in crunch

Hi,

Is multiple output functionality supported by Crunch? I have looked at the source code but could not find a way to do it. I have the following scenario: the input file would be processed by multiple sequential filters, and the records passing the filter criteria need to be processed differently from the ones that do not pass. What's the best way to do it in Crunch? I know I can process the input data twice with two different filters, but this is not efficient. Any suggestions from you guys?

Thanks,
Piotr

Re: MultipleOutput in crunch

Posted by Josh Wills <jw...@cloudera.com>.

MultipleOutputs is baked in pretty deep to the Crunch system, although we
have our own impl (the class is named CrunchMultipleOutputs) to handle some
of the peculiarities around how we configure OutputFormats.

I would do something similar to what Micah suggested, but I would leave out
the groupByKey step. That is, I would start with a PCollection<T> and use a
MapFn to convert it to a PCollection<Pair<T, Boolean>> (or equivalently, a
PTable<T, Boolean>). Each of the filter fns in the sequence then checks the
current value of the boolean for each record: if it's already false, don't
bother doing the filter check, just pass along Pair.of(T, false); if it's
true, do the check, and emit Pair.of(T, true) if it passes and
Pair.of(T, false) if it fails.

After all of the filter checks are done, use two FilterFns to route the
records that passed the checks separately from the ones that didn't, either
to subsequent processing logic, or to separate files, or whatever. If you
can get away with doing everything in a single pass over the data using a
map-only job, that's the best of all worlds from a performance perspective.
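
The flag-carrying idea described above can be modeled in plain Java, as a
rough sketch only. Nothing here is Crunch API: the class FlagRouting, the
Tagged holder (a stand-in for Pair<T, Boolean>), and the method names are all
hypothetical, and in-memory lists stand in for PCollections. In a real
pipeline, tag() would correspond to the MapFn, each applyCheck() call to one
filter fn in the sequence, and the two select() calls to the two final
FilterFns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Plain-Java model of the approach; hypothetical names, not Crunch classes.
class FlagRouting {

    /** Hypothetical stand-in for Pair<T, Boolean>: a record plus its flag. */
    static final class Tagged<T> {
        final T value;
        final boolean passed;

        Tagged(T value, boolean passed) {
            this.value = value;
            this.passed = passed;
        }
    }

    /** Step 1: attach a flag to every record, initially true. */
    static <T> List<Tagged<T>> tag(List<T> records) {
        List<Tagged<T>> out = new ArrayList<>();
        for (T record : records) {
            out.add(new Tagged<>(record, true));
        }
        return out;
    }

    /** Step 2: one check in the sequence; records already marked false skip it. */
    static <T> List<Tagged<T>> applyCheck(List<Tagged<T>> in, Predicate<T> check) {
        List<Tagged<T>> out = new ArrayList<>();
        for (Tagged<T> t : in) {
            if (!t.passed) {
                out.add(t); // already failed: pass along unchanged, no check
            } else {
                out.add(new Tagged<>(t.value, check.test(t.value)));
            }
        }
        return out;
    }

    /** Step 3: keep only the records whose flag equals the wanted value. */
    static <T> List<T> select(List<Tagged<T>> in, boolean wanted) {
        List<T> out = new ArrayList<>();
        for (Tagged<T> t : in) {
            if (t.passed == wanted) {
                out.add(t.value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tagged<String>> tagged =
            tag(Arrays.asList("apple", "fig", "avocado", "kiwi"));
        tagged = applyCheck(tagged, s -> s.length() > 3);
        tagged = applyCheck(tagged, s -> s.startsWith("a"));

        System.out.println("passed: " + select(tagged, true));  // passed: [apple, avocado]
        System.out.println("failed: " + select(tagged, false)); // failed: [fig, kiwi]
    }
}
```

Every record stays in the collection until the final split, so both the
passing and the failing records are still available after the last check.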

Josh

On Fri, Mar 8, 2013 at 8:16 PM, Micah Whitacre <mk...@gmail.com> wrote:

> Instead of implementing a filter could you switch to using a DoFn and
> emit a Pair?  Then the first part of the pair would be the identifier
> for the category of data.  You can then group by key to process them
> differently or just keep processing them in the same DoFn, using the
> key as a flag for how to process each one.
>
> That being said I'm not really sure this would be any more efficient
> than filtering twice.

-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Re: MultipleOutput in crunch

Posted by Micah Whitacre <mk...@gmail.com>.

Instead of implementing a filter could you switch to using a DoFn and
emit a Pair?  Then the first part of the pair would be the identifier
for the category of data.  You can then group by key to process them
differently or just keep processing them in the same DoFn, using the
key as a flag for how to process each one.

That being said I'm not really sure this would be any more efficient
than filtering twice.
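
As a rough illustration of the category-key idea, here is a plain-Java
sketch. It does not use Crunch: the class CategoryRouting, the method
groupByCategory, and the sample log lines are hypothetical, and an in-memory
map stands in for the result of emitting Pair(category, record) from a DoFn
and then grouping by key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Plain-Java model of keying records by category; not Crunch classes.
class CategoryRouting {

    /**
     * Assigns each record a category key and collects the records per key,
     * which is what emitting a pair and grouping by key would produce.
     */
    static <K, T> Map<K, List<T>> groupByCategory(List<T> records,
                                                  Function<T, K> categorize) {
        Map<K, List<T>> groups = new LinkedHashMap<>();
        for (T record : records) {
            K key = categorize.apply(record);
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(record);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Made-up sample data: the first token of each line is the category.
        List<String> lines = Arrays.asList(
            "ERROR disk full", "INFO started", "ERROR timeout");

        Map<String, List<String>> groups =
            groupByCategory(lines, line -> line.substring(0, line.indexOf(' ')));

        // Each category can now be processed differently.
        for (Map.Entry<String, List<String>> e : groups.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

In a distributed job, the grouping step corresponds to a shuffle, which is
one reason this may not beat filtering twice.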
