Posted to user@pig.apache.org by Ricky Ho <rh...@adobe.com> on 2009/05/25 17:33:10 UTC

TF/IDF algorithm coded in PIG

I am a newcomer and have just started to look into PIG seriously; I am pretty impressed with its language model …

Given that PIG is just slightly slower than native Hadoop (I remember Alan mentioning 20% somewhere), I have started asking myself why people code the map() and reduce() functions directly in Java (or in other languages via Hadoop Streaming) rather than writing everything in PIG.  PIG seems to provide the necessary parallel programming constructs (FOREACH, FLATTEN, COGROUP, etc.) and also gives sufficient control back to the programmer, which a purely declarative approach like Hive doesn't.
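
For illustration, a tiny made-up sketch using those constructs might look like this (file names and schemas are invented):

  users   = LOAD 'users'  AS (uid:chararray, name:chararray);
  orders  = LOAD 'orders' AS (uid:chararray, amount:double);
  -- COGROUP collects both inputs by key; FOREACH/FLATTEN unwind the bags
  grouped = COGROUP users BY uid, orders BY uid;
  summary = FOREACH grouped GENERATE
                group               AS uid,
                FLATTEN(users.name) AS name,
                COUNT(orders)       AS num_orders,
                SUM(orders.amount)  AS total_spent;
  DUMP summary;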

Can anyone share their opinion/view on this?

To confirm that PIG is powerful enough, I plan to code some classical data-intensive processing and machine learning algorithms in PIG.

Here is the first one: TF-IDF (term frequency-inverse document frequency)
http://horicky.blogspot.com/2009/01/solving-tf-idf-using-map-reduce.html
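
As a starting point, here is a rough sketch of how the whole computation might look in Pig Latin.  The input format, file names, and aliases are my assumptions, and LOG stands in for a log UDF (Piggybank's, or the builtin in later Pig releases):

  -- Assumes the input has already been tokenized into (docid, word) pairs.
  docs      = LOAD 'doc_words' USING PigStorage('\t')
                  AS (docid:chararray, word:chararray);

  -- term frequency: how many times each word appears in each document
  by_dw     = GROUP docs BY (docid, word);
  tf        = FOREACH by_dw GENERATE
                  FLATTEN(group) AS (docid, word),
                  COUNT(docs)    AS term_freq;

  -- document frequency: how many documents contain each word,
  -- carried along with every (docid, word) row for that word
  by_w      = GROUP tf BY word;
  tf_df     = FOREACH by_w GENERATE
                  FLATTEN(tf) AS (docid, word, term_freq),
                  COUNT(tf)   AS doc_freq;

  -- total number of documents
  doc_ids   = FOREACH docs GENERATE docid;
  uniq_docs = DISTINCT doc_ids;
  all_docs  = GROUP uniq_docs ALL;
  ndocs     = FOREACH all_docs GENERATE COUNT(uniq_docs) AS n;

  -- attach the single-row document count and compute the score
  with_n    = CROSS tf_df, ndocs;
  tfidf     = FOREACH with_n GENERATE
                  tf_df::docid AS docid,
                  tf_df::word  AS word,
                  (double)tf_df::term_freq *
                      LOG((double)ndocs::n / (double)tf_df::doc_freq) AS score;

  STORE tfidf INTO 'tfidf_out';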

I'd love to hear comments/suggestions on how the PIG code can be further improved.

Rgds,
Ricky

Re: TF/IDF algorithm coded in PIG

Posted by Ted Dunning <te...@gmail.com>.
I have found pig useful and easy for developers to pick up.

It doesn't always provide enough flexibility, and because it isn't a full
scripting language you can wind up with a very fragmented program.

It also has difficulties because it supports only a limited number of
hadoop versions.  Somebody will be unhappy with this limitation no matter
what.

As far as overhead is concerned, there are many simple programs where pig
exceeds the speed that you would normally get from map-reduce programs as
coded by hand.  This is because pig is free to collapse filters together in
a way that would be difficult to maintain in primitive map-reduce land.  It
isn't that you *can't* write the same program in primitive map-reduce, it is
just that you *wouldn't*.
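
As an illustration (relation and file names invented), a script like this
writes the filters and projection as separate steps, and pig is free to
merge them into a single map phase; a hand-coded job would tend to
hard-wire all of that into one monolithic mapper:

  raw    = LOAD 'events' AS (user:chararray, action:chararray, ts:long);
  -- two filters plus a projection, written separately for readability
  recent = FILTER raw BY ts > 1243209600L;
  clicks = FILTER recent BY action == 'click';
  slim   = FOREACH clicks GENERATE user, ts;
  STORE slim INTO 'recent_clicks';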

Another system that has comparable capabilities (and more) is Cascading.

On Mon, May 25, 2009 at 8:33 AM, Ricky Ho <rh...@adobe.com> wrote:

> Can anyone share their opinion/view on this?




-- 
Ted Dunning, CTO
DeepDyve

Re: TF/IDF algorithm coded in PIG

Posted by Alan Gates <ga...@yahoo-inc.com>.
Just to clarify, I said 20% overhead over hadoop was our goal.  Currently  
pig is 50% over hadoop on trunk (based on the PigMix benchmark).

But I'm glad to see the semantics fit your needs and am interested to  
hear people's feedback.  Thanks for sharing your experience.

Alan.

On May 25, 2009, at 8:33 AM, Ricky Ho wrote:

> Given that PIG is just slightly slower than native Hadoop (I remember
> Alan mentioning 20% somewhere), I have started asking myself why
> people code the map() and reduce() functions directly in Java (or in
> other languages via Hadoop Streaming) rather than writing everything
> in PIG.
> [...]