Posted to dev@mahout.apache.org by Karl Wettin <ka...@gmail.com> on 2008/04/04 19:28:12 UTC

Pig for preprocessing in Mahout?

I sent this RTFM-level question to the Pig list the other day but never 
got any response. Anyone in here that could tell me if it makes sense?

http://www.mail-archive.com/pig-user@incubator.apache.org/msg00137.html

 > <http://lucene.apache.org/mahout/> need some formula 1A pre
 > processing filters such as resampling, discretization and what not.
 >
 > Would you agree it sounds about right to do that with Pig?
 >
 > People tell me there are constant changes to the API of Pig. If this
 > is true, how probable is it that features such as those listed above would
 > require a lot of tinkering every time one wants some new juicy feature
 > from your trunk?
 >
 > I don't know, perhaps Pig already can do some of this?


    karl

Re: Pig for preprocessing in Mahout?

Posted by Ted Dunning <td...@veoh.com>.


On 4/4/08 12:39 PM, "Karl Wettin" <ka...@gmail.com> wrote:

> Ted Dunning wrote:
>> 
>> I would say that it would be easier to use a system that has a full
>> extension language such as grool or JAQL than pig.  Resampling and
>> discretization are really pretty straightforward applications of map reduce
>> and should normally be collected as components into a larger composite
>> mapper.
> 
> I was thinking we would use Pig as that larger composite mapper. If we
> wanted to add discretization to Mahout we would then add it to Pig. They
> seem to have a framework to do a lot of the things I want in a pre
> processing module.

I think that Pig would be moderately difficult to integrate as a component
in a full-scale ML framework and it would not be suitable as a map
collector.

You can write functions that operate on bags of records, but the integration
is likely to be pretty clunky and the resilience to errors in the components
would be nil (AFAIK).  Given that the new ML components are likely to be
less than robust, that makes the overall process painful.

> But I don't know Pig enough to say if that could work for all the things
> we might want to do at pre processing time with Mahout.

To the extent that it does, I don't see a problem with using pig to build
datasets and then running ML on those datasets.  Since pig is all about
batch processing, this isn't a big deal.

> 
>> I should also have said that Pig is progressing very quickly.
> 
> When do you think Pig might be "stable"?

No clue.  They have a large amount of invested effort so far and have had a
pretty long time in the current state, but I can't say when they will cross
a magical threshold that makes Pig easy enough to use for most users.  I get
the impression that it is beginning to reach that threshold inside Yahoo
where there is a strong evangelism network, but for outside users without a
strong interest in the internals, I think it is a ways away.
  


Re: Pig for preprocessing in Mahout?

Posted by Karl Wettin <ka...@gmail.com>.
Ted Dunning wrote:
> 
> I would say that it would be easier to use a system that has a full
> extension language such as grool or JAQL than pig.  Resampling and
> discretization are really pretty straightforward applications of map reduce
> and should normally be collected as components into a larger composite
> mapper.

I was thinking we would use Pig as that larger composite mapper. If we 
wanted to add discretization to Mahout we would then add it to Pig. They 
seem to have a framework to do a lot of the things I want in a pre 
processing module.

But I don't know Pig enough to say if that could work for all the things 
we might want to do at pre processing time with Mahout.

 > I should also have said that Pig is progressing very quickly.

When do you think Pig might be "stable"?


    karl

Re: Pig for preprocessing in Mahout?

Posted by Ted Dunning <td...@veoh.com>.
(no caffeine this morning)

Should have been "too sensitive to really minor mistakes for my taste".

I should also have said that Pig is progressing very quickly.


On 4/4/08 10:34 AM, "Ted Dunning" <td...@veoh.com> wrote:

> I haven't tried pig for 2 months or so, but last time I did try it, I found
> it too sensitive to really minor for my taste.


Re: Pig for preprocessing in Mahout?

Posted by Ted Dunning <td...@veoh.com>.

I haven't tried pig for 2 months or so, but last time I did try it, I found
it too sensitive to really minor for my taste.

I currently use hand-written scripts in grool, a system of my own devising
that allows simple MR programs to be written simply.

I would say that it would be easier to use a system that has a full
extension language such as grool or JAQL than pig.  Resampling and
discretization are really pretty straightforward applications of map reduce
and should normally be collected as components into a larger composite
mapper.
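
To illustrate the composite-mapper idea, here is a minimal sketch in plain
Python. It is not grool, JAQL, Pig or Hadoop code; the record shape and the
names discretize, resample and composite_mapper are hypothetical, chosen
only for this example. Each preprocessing step is a small stage over a
stream of records, and the composite applies all of them in a single pass.

```python
import random
from typing import Callable, Iterable, Iterator, Tuple

# Hypothetical record shape for this sketch: (key, numeric value).
Record = Tuple[str, float]
Stage = Callable[[Iterable[Record]], Iterator[Record]]


def discretize(bin_width: float) -> Stage:
    """Replace each value with the lower edge of its equal-width bin."""
    def stage(records: Iterable[Record]) -> Iterator[Record]:
        for key, value in records:
            yield key, (value // bin_width) * bin_width
    return stage


def resample(rate: float, seed: int = 0) -> Stage:
    """Keep each record independently with probability `rate`."""
    rng = random.Random(seed)

    def stage(records: Iterable[Record]) -> Iterator[Record]:
        for record in records:
            if rng.random() < rate:
                yield record
    return stage


def composite_mapper(*stages: Stage) -> Stage:
    """Chain the stages so one pass over the input applies all of them."""
    def mapper(records: Iterable[Record]) -> Iterator[Record]:
        for stage in stages:
            records = stage(records)
        return iter(records)
    return mapper


if __name__ == "__main__":
    preprocess = composite_mapper(discretize(0.5), resample(1.0))
    print(list(preprocess([("a", 1.2), ("b", 0.4)])))
```

In a real MapReduce job the composite would sit inside the map function, so
adding a new preprocessing filter means adding one more stage rather than
one more job.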


On 4/4/08 10:28 AM, "Karl Wettin" <ka...@gmail.com> wrote:

> I sent this RTFM-level question to the Pig list the other day but never
> got any response. Anyone in here that could tell me if it makes sense?
> 
> http://www.mail-archive.com/pig-user@incubator.apache.org/msg00137.html
> 
>> <http://lucene.apache.org/mahout/> need some formula 1A pre
>> processing filters such as resampling, discretization and what not.
>> 
>> Would you agree it sounds about right to do that with Pig?
>> 
>> People tell me there are constant changes to the API of Pig. If this
>> is true, how probable is it that features such as those listed above would
>> require a lot of tinkering every time one wants some new juicy feature
>> from your trunk?
>> 
>> I don't know, perhaps Pig already can do some of this?
> 
> 
>     karl