Posted to common-user@hadoop.apache.org by Guy Doulberg <Gu...@conduit.com> on 2011/02/16 16:06:24 UTC

DataCreator

Hey all,
I want to consult with you Hadoopers about a Map/Reduce application I want to build.

I want to build a map/reduce job that reads files from HDFS, performs some transformation on the file lines, and stores them in several partitions depending on the source of the file or its data.

I want this application to be as configurable as possible, so I designed interfaces to Parse, Decorate, and Partition (on HDFS) the data.

I want to be able to configure different data flows, with different parsers, decorators, and partitioners, using a config file.

Do you think you would use such an application? Would it fit as an open-source project?

Now, I have some technical questions:
I was thinking of using reflection to load all the classes I need, according to the configuration, during the Mapper's setup.
Do you think that is a good idea?
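
For example, a minimal sketch of what I have in mind for the Mapper's setup (LineParser and the "datacreator.parser.class" key are illustrative names of mine):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DataCreatorMapper extends Mapper<LongWritable, Text, Text, Text> {

    private LineParser parser; // one of the pluggable interfaces

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // The implementation class name comes from the job configuration,
        // so the flow can change without recompiling.
        String parserClass = context.getConfiguration().get("datacreator.parser.class");
        try {
            parser = (LineParser) Class.forName(parserClass).newInstance();
        } catch (Exception e) {
            throw new IOException("Cannot instantiate parser " + parserClass, e);
        }
    }
}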

Is there a way to pass objects or interfaces to the Mapper from the Job declaration?
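
And on the driver side, maybe the Job declaration can at least pass classes, if not live objects; a sketch using the stock Configuration.setClass/getClass and org.apache.hadoop.util.ReflectionUtils (the key and interface names are still illustrative):

// In the job driver:
Configuration conf = job.getConfiguration();
conf.setClass("datacreator.parser.class", RegexLineParser.class, LineParser.class);

// Back in the Mapper's setup():
Class<? extends LineParser> cls = context.getConfiguration()
    .getClass("datacreator.parser.class", null, LineParser.class);
LineParser parser = ReflectionUtils.newInstance(cls, context.getConfiguration());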



 Thanks,


Re: DataCreator

Posted by Lance Norskog <go...@gmail.com>.
Have a look at 'hamake' on Google Code.

On Thu, Feb 17, 2011 at 2:27 AM, Guy Doulberg <Gu...@conduit.com> wrote:
> Thank you all for your suggestions.
>
> The suggestions you gave are more about "how" I should develop my app, and not "what" I could use instead of building an app of my own.
>
> Going over Cascalog, Cascading, and Pig, I didn't find exactly what I need.
>
> I need a batch job that runs periodically and polls folders for data; if it finds data there, it transforms it according to preset transformations, which I want to be able to change easily. The transformations the data should go through are determined by the directory it came from or by a pattern in the data itself.
>
> This app sounds very similar to Flume, except that it digests all the data that has arrived in one map/reduce job.



-- 
Lance Norskog
goksron@gmail.com

RE: DataCreator

Posted by Guy Doulberg <Gu...@conduit.com>.
Thank you all for your suggestions.

The suggestions you gave are more about "how" I should develop my app, and not "what" I could use instead of building an app of my own.

Going over Cascalog, Cascading, and Pig, I didn't find exactly what I need.

I need a batch job that runs periodically and polls folders for data; if it finds data there, it transforms it according to preset transformations, which I want to be able to change easily. The transformations the data should go through are determined by the directory it came from or by a pattern in the data itself.
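
For instance, inside the Mapper I imagine deriving the source from the path of the input split, roughly like this (FileSplit is the stock Hadoop class, assuming a file-based input format; the transformation lookup is illustrative):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// In Mapper.setup(): find the directory this split came from and
// select the preset transformation configured for that source.
Path dir = ((FileSplit) context.getInputSplit()).getPath().getParent();
Transformation transformation = transformationsByDir.get(dir.getName());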

This app sounds very similar to Flume, except that it digests all the data that has arrived in one map/reduce job.



-----Original Message-----
From: Chris K Wensel [mailto:chris@wensel.net] 
Sent: Wednesday, February 16, 2011 10:18 PM
To: common-user@hadoop.apache.org
Subject: Re: DataCreator

> I was thinking of using Cascading, but Cascading requires me to recompile and deploy for each change in the data flow. Maybe Cascading can be part of the implementation, but not the whole solution.

Cascading is well suited for this.

Multitool was written with Cascading; you can spawn reasonably complex filtering, conversion, and join jobs from the command line (no recompiling). Amazon promotes this for searching S3 buckets from EMR.

Cascading.JRuby allows you to create complex jobs from a JRuby script, no compiling. Etsy uses this for their web site funnel analysis.

Cascalog is much more sophisticated and can be driven from a Clojure shell (REPL), so obviously no compiling there either. Quite a few companies use it to power their analytics.

All of which can be found here:
http://www.cascading.org/modules.html

A number of companies have built proprietary web UIs on top of Hadoop, with Cascading as the query planner and processing engine; some of these will ship as products this year.

FYI, there will be a Cascalog workshop this Saturday (I'll be attending):
http://www.cascading.org/2011/02/cascalog-workshop-february-19t.html

cheers,
chris

--
Chris K Wensel
chris@concurrentinc.com
http://www.concurrentinc.com

Re: DataCreator

Posted by Chris K Wensel <ch...@wensel.net>.
> I was thinking of using Cascading, but Cascading requires me to recompile and deploy for each change in the data flow. Maybe Cascading can be part of the implementation, but not the whole solution.

Cascading is well suited for this.

Multitool was written with Cascading; you can spawn reasonably complex filtering, conversion, and join jobs from the command line (no recompiling). Amazon promotes this for searching S3 buckets from EMR.
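
To give a flavor, a parameter-driven flow is only a few lines of Cascading; a rough (untested) sketch against the 1.x API, taking the filter regex from the command line so nothing needs recompiling:

import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.regex.RegexFilter;
import cascading.pipe.Each;
import cascading.pipe.Pipe;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Tap;
import cascading.tuple.Fields;

public class FilterTool {
    public static void main(String[] args) {
        String inPath = args[0], outPath = args[1], pattern = args[2];

        Tap source = new Hfs(new TextLine(), inPath);
        Tap sink = new Hfs(new TextLine(), outPath);

        // Keep only the lines matching the pattern given on the command line.
        Pipe pipe = new Each("filter", new Fields("line"), new RegexFilter(pattern));

        Flow flow = new FlowConnector(new Properties()).connect(source, sink, pipe);
        flow.complete();
    }
}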

Cascading.JRuby allows you to create complex jobs from a JRuby script, no compiling. Etsy uses this for their web site funnel analysis.

Cascalog is much more sophisticated and can be driven from a Clojure shell (REPL), so obviously no compiling there either. Quite a few companies use it to power their analytics.

All of which can be found here:
http://www.cascading.org/modules.html

A number of companies have built proprietary web UIs on top of Hadoop, with Cascading as the query planner and processing engine; some of these will ship as products this year.

FYI, there will be a Cascalog workshop this Saturday (I'll be attending):
http://www.cascading.org/2011/02/cascalog-workshop-february-19t.html

cheers,
chris

--
Chris K Wensel
chris@concurrentinc.com
http://www.concurrentinc.com

RE: DataCreator

Posted by Guy Doulberg <Gu...@conduit.com>.
How can I use Hive for that?

I was thinking of using Cascading, but Cascading requires me to recompile and deploy for each change in the data flow. Maybe Cascading can be part of the implementation, but not the whole solution.

As for Pig, I would need to look at how I can use it to achieve this purpose.

In my vision, a non-skilled person would have a UI in which he could assign transformations and partitions to each source.
What I am looking for is very similar to Flume, except that Flume is for event streaming, while what I am looking for works on chunks of data.



From: Ted Dunning [mailto:tdunning@maprtech.com]
Sent: Wednesday, February 16, 2011 5:19 PM
To: common-user@hadoop.apache.org
Cc: Guy Doulberg
Subject: Re: DataCreator

Sounds like Pig.  Or Cascading.  Or Hive.

Seriously, isn't this already available?


Re: DataCreator

Posted by Ted Dunning <td...@maprtech.com>.
Sounds like Pig.  Or Cascading.  Or Hive.

Seriously, isn't this already available?

On Wed, Feb 16, 2011 at 7:06 AM, Guy Doulberg <Gu...@conduit.com> wrote:

>
> Hey all,
> I want to consult with you Hadoopers about a Map/Reduce application I want to build.
>
> I want to build a map/reduce job that reads files from HDFS, performs some transformation on the file lines, and stores them in several partitions depending on the source of the file or its data.
>
> I want this application to be as configurable as possible, so I designed interfaces to Parse, Decorate, and Partition (on HDFS) the data.
>
> I want to be able to configure different data flows, with different parsers, decorators, and partitioners, using a config file.
>
> Do you think you would use such an application? Would it fit as an open-source project?
>
> Now, I have some technical questions:
> I was thinking of using reflection to load all the classes I need, according to the configuration, during the Mapper's setup.
> Do you think that is a good idea?
>
> Is there a way to pass objects or interfaces to the Mapper from the Job declaration?
>
>
>  Thanks,
>