You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@crunch.apache.org by Evan McClain <ae...@gmail.com> on 2016/01/14 18:08:03 UTC

Custom ReadableSource implementations that reuse a PType?

Hi list,

I am trying to implement a custom ReadableSource by extending
FileSourceImpl, but it doesn't look like my source is actually being
used since I am reusing AvroTypes (basically just trying to pass
through some avro getMeta() values).

It looks like cache() -> materialize() -> ptype.getDefaultFileSource()
is being used to perform the actual read (even though pipeline.read()
is being passed my custom Source).

Is there any way to do this without also implementing a custom PType?

Thanks
-- 
Evan McClain
https://keybase.io/aeroevan

Re: Custom ReadableSource implementations that reuse a PType?

Posted by Evan McClain <ae...@gmail.com>.
The PCollection that is being cached is the result of a parallelDo (so
it is a DoCollection). The parent InputCollection has the source I
defined. Since the first read is triggered by that .cache() call on the
DoCollection, the file source given in the InputCollection is never
actually used.

Stepping through a debugger shows that even after using
pipeline.read(...).cache(), my ReadableSource.read() doesn't get called
even though I see my source in outputTargetsToMaterialize.value.source.
I only see AvroFileSourceTarget performing the reads so I'm probably
missing something (I'm new to crunch).

Any pointers would be appreciated.

Evan

On Thu, 2016-01-14 at 15:29 -0800, Josh Wills wrote:
> Hey Evan,
> 
> So I must be missing some context here-- it looks to me like
> MRPipeline.getMaterializeSourceTarget first checks to see if you
> passed in
> an input collection type and then looks at its Source to see if it
> implements ReadableSource and uses that if it's available before
> defaulting
> to the ptype's default file source later on in the file via
> createIntermediateOutput. Is the PCollection you're trying to
> materialize
> not an input collection? e.g., is it an input collection that has had
> parallelDo called it on one or more times?
> 
> J
> 
> On Thu, Jan 14, 2016 at 9:08 AM, Evan McClain <ae...@gmail.com>
> wrote:
> 
> > Hi list,
> > 
> > I am trying to implement a custom ReadableSource by extending
> > FileSourceImpl, but it doesn't look like my source is actually
> > being
> > used since I am reusing AvroTypes (basically just trying to pass
> > through some avro getMeta() values).
> > 
> > It looks like cache() -> materialize() ->
> > ptype.getDefaultFileSource()
> > is being used to perform the actual read (even though
> > pipeline.read()
> > is being passed my custom Source).
> > 
> > Is there any way to do this without also implementing a custom
> > PType?
> > 
> > Thanks
> > --
> > Evan McClain
> > https://keybase.io/aeroevan
> > 
-- 
Evan McClain
https://keybase.io/aeroevan

Re: Custom ReadableSource implementations that reuse a PType?

Posted by Josh Wills <jo...@gmail.com>.
Hey Evan,

So I must be missing some context here-- it looks to me like
MRPipeline.getMaterializeSourceTarget first checks to see if you passed in
an input collection type and then looks at its Source to see if it
implements ReadableSource and uses that if it's available before defaulting
to the ptype's default file source later on in the file via
createIntermediateOutput. Is the PCollection you're trying to materialize
not an input collection? e.g., is it an input collection that has had
parallelDo called it on one or more times?

J

On Thu, Jan 14, 2016 at 9:08 AM, Evan McClain <ae...@gmail.com> wrote:

> Hi list,
>
> I am trying to implement a custom ReadableSource by extending
> FileSourceImpl, but it doesn't look like my source is actually being
> used since I am reusing AvroTypes (basically just trying to pass
> through some avro getMeta() values).
>
> It looks like cache() -> materialize() -> ptype.getDefaultFileSource()
> is being used to perform the actual read (even though pipeline.read()
> is being passed my custom Source).
>
> Is there any way to do this without also implementing a custom PType?
>
> Thanks
> --
> Evan McClain
> https://keybase.io/aeroevan
>