Posted to dev@crunch.apache.org by Victor Iacoban <vi...@gmail.com> on 2012/11/15 03:18:33 UTC

extending crunch

Hi,

I'm very interested in writing a wrapper library around Apache Crunch for
Clojure, something similar to the existing Scrunch.
How would you recommend getting started?

I was looking through the Crunch code, and it looks like I could integrate
it with Clojure fairly easily by adding a custom WritableType,
something like WritableType<Object, BytesWritable> with a custom converter
or inputFn/outputFn functions.

Unfortunately, there are several issues with this approach, and I would
instead have to duplicate all those type classes for a new type set:
* WritableType has a package-visible constructor, so I can neither extend
it nor instantiate it
* Converter is instantiated inside the WritableType constructor, so if I
need a different converter I'm stuck
* Writables has a factory method for WritableType, but it's private
* it looks like there is an attempt to support additional WritableTypes
through EXTENSIONS in Writables, but it would only work for cases where, in
WritableType<T, W>, both T and W are Hadoop Writables

So what do you think is the best solution: is it possible to open up the
API to support custom WritableTypes, or is my only option to implement a
new ClojurePType and all related classes?

I hope this isn't too much detail, but at this stage you are all probably
very familiar with the code.

Thanks,
Victor

Re: extending crunch

Posted by Josh Wills <jw...@cloudera.com>.
I'm not a Clojure expert by any means, but that looks really nice.

-- 
Director of Data Science
Cloudera
Twitter: @josh_wills

Re: extending crunch

Posted by Victor Iacoban <vi...@gmail.com>.
With the tentative name of Clutch, I've put together two examples of what
Clutch code would look like.
I should say that this might be too optimistic about how easy it is, but it
looks doable with the Crunch API:

Word Count from the Crunch examples:
https://gist.github.com/4090665

Average Bytes By IP from the Crunch examples:
https://gist.github.com/4091142

If you know any Clojure experts who would be interested in commenting,
please forward this to them.

Thanks




Re: extending crunch

Posted by Matthias Friedrich <ma...@mafr.de>.
On Thursday, 2012-11-15, Josh Wills wrote:
> On Thu, Nov 15, 2012 at 8:14 AM, Victor Iacoban
> <vi...@gmail.com> wrote:
 
>> "clunch" sounds like a good name to me ;)
 
> LOL. "Clutch" has a nice ring to it. ;-)

We could do even worse with "Crunjure" :)
 
Regards,
  Matthias

Re: extending crunch

Posted by Josh Wills <jw...@cloudera.com>.
On Thu, Nov 15, 2012 at 8:14 AM, Victor Iacoban
<vi...@gmail.com> wrote:
> I'm not a Clojure wizard myself, but it feels like a Clojure REPL with
> Crunch would be a terrific experimentation environment.
>
> I've tried Crunch from Java and I was impressed: it's very easy to connect
> non-standard sources and reasonably easy to define the flow.
>
> I tried to use Cascalog for my prototyping environment, but although it's
> very good at flow definition, Cascading lacks a lot of flexibility when you
> need to process anything other than text or sequence files.
>
> "clunch" sounds like a good name to me ;)

LOL. "Clutch" has a nice ring to it. ;-)


Re: extending crunch

Posted by Victor Iacoban <vi...@gmail.com>.
I'm not a Clojure wizard myself, but it feels like a Clojure REPL with
Crunch would be a terrific experimentation environment.

I've tried Crunch from Java and I was impressed: it's very easy to connect
non-standard sources and reasonably easy to define the flow.

I tried to use Cascalog for my prototyping environment, but although it's
very good at flow definition, Cascading lacks a lot of flexibility when you
need to process anything other than text or sequence files.

"clunch" sounds like a good name to me ;)

-- victor



Re: extending crunch

Posted by Joseph Adler <jo...@gmail.com>.
Personally, I'd love to see Crunch mixed with Clojure. I was thinking about
this myself, but I'd rather see someone who really knows Clojure take this
on.

Just don't call it Clunch.

-- Joe



Re: extending crunch

Posted by Victor Iacoban <vi...@gmail.com>.
Thanks Josh, will give this a try



Re: extending crunch

Posted by Josh Wills <jo...@gmail.com>.
I'm always glad to help people extend Crunch in ways that are useful for
them. I think that most things that involve type-related extensions can be
handled using the PTypes.derived() function, which creates a custom PType
that is mapped to an underlying serialized type, so that you
could do something like:

// Forgive my syntax errors, I'm doing this w/o an IDE.
// InputMapFn and OutputMapFn are placeholders for MapFn subclasses you write.
PType<Object> objectType = PTypes.derived(
    Object.class,
    new InputMapFn<BytesWritable, Object>(),
    new OutputMapFn<Object, BytesWritable>(),
    Writables.writables(BytesWritable.class));

...which is essentially how Scrunch works: the PTypes { } functionality in
Scrunch maps from Scala types to Java types using the derived functionality.
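
To make the two placeholder functions a little more concrete, here is a
minimal sketch of the conversions they would need to perform. It is plain
Java with no Crunch or Hadoop dependency, so it only shows the Object <->
bytes round trip, not the MapFn or BytesWritable wiring. The class name
ObjectBytesCodec and its methods toBytes/fromBytes are hypothetical, and
Java serialization is an assumption chosen only because it ships with the
JDK; it is not something this thread prescribes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Crunch: the two conversions that the
// InputMapFn/OutputMapFn placeholders would have to perform.
final class ObjectBytesCodec {
    private ObjectBytesCodec() {}

    // Output direction (Object -> bytes): the work an OutputMapFn would do
    // before wrapping the result in a BytesWritable.
    static byte[] toBytes(Object value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new IllegalStateException("value is not serializable", e);
        }
        return buffer.toByteArray();
    }

    // Input direction (bytes -> Object): the work an InputMapFn would do
    // after extracting the payload from a BytesWritable.
    static Object fromBytes(byte[] bytes) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("could not deserialize value", e);
        }
    }

    public static void main(String[] args) {
        // Round-trip a small record to check that both directions agree.
        List<Object> record = Arrays.asList("10.0.0.1", 512L);
        Object restored = fromBytes(toBytes(record));
        System.out.println(record.equals(restored));
    }
}
```

In a real derived PType, each method body would sit inside the map() method
of a MapFn subclass. One detail to watch on the input side: a BytesWritable's
backing array can be longer than its valid length, so the bytes would need to
be trimmed to getLength() before deserializing. A Clojure wrapper might well
prefer a more compact encoding than Java serialization.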

The Converter stuff is internal to Avro and Writable; I can't think of a
case where that would need to be exposed outside the package (i.e., once
you've decided on whether to use Writables or Avro as your serialization
framework, the choice of Converter is fixed).

If you have a use case where the derived type can't handle the conversion
or is a poor choice for whatever reason, I'm all about having a discussion
and trying out different designs.

Josh

