Posted to user@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2010/05/27 23:12:39 UTC

M/R Job for Log file to FPG

I'd like to take a bunch of logs and extract a bit of each line and then put them into format for FPG.  Was thinking a simple M/R job that took in a regex would suffice and then output in the format for FPG.  Is that generally useful or am I missing something obvious?  I want to do FPG on my query logs and it seems like a generally useful conversion.  I suppose, in fact, it isn't even log specific.

-Grant

Re: M/R Job for Log file to FPG

Posted by Jake Mannix <ja...@gmail.com>.
Heh, if you're in the mood to accept pokes, I'd say SGD comes first!

On Thu, May 27, 2010 at 5:34 PM, Ted Dunning <te...@gmail.com> wrote:

> Poke accepted.
>
> I am thinking about it.  But I think that I have to get the classifier
> examples for the book in first.  SGD needs committing too.
>
> On Thu, May 27, 2010 at 4:14 PM, Jake Mannix <ja...@gmail.com>
> wrote:
>
> > Sounds great, Ted!  When do you think you'll code that up? ;P
> >
>

Re: M/R Job for Log file to FPG

Posted by Ted Dunning <te...@gmail.com>.
Poke accepted.

I am thinking about it.  But I think that I have to get the classifier
examples for the book in first.  SGD needs committing too.

On Thu, May 27, 2010 at 4:14 PM, Jake Mannix <ja...@gmail.com> wrote:

> Sounds great, Ted!  When do you think you'll code that up? ;P
>

Re: M/R Job for Log file to FPG

Posted by Jake Mannix <ja...@gmail.com>.
On Thu, May 27, 2010 at 4:06 PM, Ted Dunning <te...@gmail.com> wrote:
>
>
> But once you jump on that slippery slope, why not allow a tiny Groovy
> closure to be injected?  Or to pass in an object that will extract a map of
> values from each line?
>

Sounds great, Ted!  When do you think you'll code that up? ;P

  -jake

Re: M/R Job for Log file to FPG

Posted by Grant Ingersoll <gs...@apache.org>.
In thinking more about this, it seems that it would be even better to just incorporate some of the ideas of this into the DocumentProcessor, except I think it is useful to not have to go to SeqFile first.  Also, it might be worth grabbing Solr's FilterFactory stuff for configuring the Lucene analyzers.  Not sure how easy that would be to do.

-Grant
On May 28, 2010, at 4:53 PM, Grant Ingersoll wrote:

> OK, I posted a draft patch of this.  Would appreciate a review.  I think it's even the case that one could slip Groovy into it (or whatever) through the proper implementation of one interface.  Feedback welcome on M-403.
> 
> 
> On May 28, 2010, at 10:05 AM, Grant Ingersoll wrote:
> 
>> https://issues.apache.org/jira/browse/MAHOUT-403
>> 
>> On May 28, 2010, at 8:58 AM, Grant Ingersoll wrote:
>> 
>>> 
>>> On May 27, 2010, at 7:06 PM, Ted Dunning wrote:
>>> 
>>>> That should be a small change (and helpful for a lot of mining tasks).
>>>> 
>>>> But once you jump on that slippery slope, why not allow a tiny Groovy
>>>> closure to be injected?  Or to pass in an object that will extract a map of
>>>> values from each line?
>>> 
>>> Expanding on this, I think we could do the following:
>>> 
>>> Map capturing groups to labels, then have pluggable output so that one could easily output to FPG, Classifiers, etc.
>>> 
>>> I'm not all that familiar w/ Groovy, so I'll put up my variation and then people can expand on it.
>>> 
>>> -Grant
>> 
>> 
> 
> 

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/



Re: M/R Job for Log file to FPG

Posted by Grant Ingersoll <gs...@apache.org>.
OK, I posted a draft patch of this.  Would appreciate a review.  I think it's even the case that one could slip Groovy into it (or whatever) through the proper implementation of one interface.  Feedback welcome on M-403.


On May 28, 2010, at 10:05 AM, Grant Ingersoll wrote:

> https://issues.apache.org/jira/browse/MAHOUT-403
> 
> On May 28, 2010, at 8:58 AM, Grant Ingersoll wrote:
> 
>> 
>> On May 27, 2010, at 7:06 PM, Ted Dunning wrote:
>> 
>>> That should be a small change (and helpful for a lot of mining tasks).
>>> 
>>> But once you jump on that slippery slope, why not allow a tiny Groovy
>>> closure to be injected?  Or to pass in an object that will extract a map of
>>> values from each line?
>> 
>> Expanding on this, I think we could do the following:
>> 
>> Map capturing groups to labels, then have pluggable output so that one could easily output to FPG, Classifiers, etc.
>> 
>> I'm not all that familiar w/ Groovy, so I'll put up my variation and then people can expand on it.
>> 
>> -Grant
> 
> 



Re: M/R Job for Log file to FPG

Posted by Robin Anil <ro...@gmail.com>.
oops that should be 4 instead of 3.

Re: M/R Job for Log file to FPG

Posted by Robin Anil <ro...@gmail.com>.
On Fri, May 28, 2010 at 7:39 PM, Grant Ingersoll <gs...@apache.org> wrote:
> Robin,
>
> What I'll do here is make the code reusable so that we can use it in FPG directly as well.
>
Cool.

Btw, there is one more thing missing. Make sure each itemset passed
to the algorithm is made up of unique tokens. I don't think it's
well handled in the sequential run; the mapreduce run converts the info
into a transaction tree, so the error goes away there.
If deduplicating removes information, then you need to split the record
into multiple transactions as follows


for example if a log record has
A B A A A C D
create following transactions from it to keep the correct co-occurrence counts
A B, 3
A C, 3
A D, 3
B C D, 1

Robin
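
One possible reading of the splitting scheme above, as a minimal Python sketch. It is not taken from Mahout: the function name split_record and the (itemset, weight) output shape are assumptions made for illustration. Each repeated token is paired with every other distinct token, weighted by how often the repeated token occurs, and the tokens that occur once form a single transaction of weight 1. Because A occurs four times in the example record, the sketch yields a weight of 4, in line with the follow-up correction in this thread.

```python
from collections import Counter


def split_record(tokens):
    """Split one record into (itemset, weight) transactions with unique tokens.

    Hypothetical helper: it covers the case illustrated in the thread
    (one repeated token). Records with several repeated tokens would be
    double counted and need a rule the thread does not specify.
    """
    counts = Counter(tokens)
    order = list(counts)  # distinct tokens in first-seen order
    repeated = [t for t in order if counts[t] > 1]
    singles = [t for t in order if counts[t] == 1]

    transactions = []
    for rep in repeated:
        for other in order:
            if other != rep:
                # weight the pair by the number of occurrences of the repeated token
                transactions.append(((rep, other), counts[rep]))
    if singles:
        # tokens seen exactly once stay together in one transaction
        transactions.append((tuple(singles), 1))
    return transactions


for itemset, weight in split_record("A B A A A C D".split()):
    print(" ".join(itemset) + ",", weight)
```

A record without repeated tokens passes through as one transaction of weight 1.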

Re: M/R Job for Log file to FPG

Posted by Grant Ingersoll <gs...@apache.org>.
Robin,

What I'll do here is make the code reusable so that we can use it in FPG directly as well.

On May 28, 2010, at 10:05 AM, Grant Ingersoll wrote:

> https://issues.apache.org/jira/browse/MAHOUT-403
> 
> On May 28, 2010, at 8:58 AM, Grant Ingersoll wrote:
> 
>> 
>> On May 27, 2010, at 7:06 PM, Ted Dunning wrote:
>> 
>>> That should be a small change (and helpful for a lot of mining tasks).
>>> 
>>> But once you jump on that slippery slope, why not allow a tiny Groovy
>>> closure to be injected?  Or to pass in an object that will extract a map of
>>> values from each line?
>> 
>> Expanding on this, I think we could do the following:
>> 
>> Map capturing groups to labels, then have pluggable output so that one could easily output to FPG, Classifiers, etc.
>> 
>> I'm not all that familiar w/ Groovy, so I'll put up my variation and then people can expand on it.
>> 
>> -Grant
> 
> 



Re: M/R Job for Log file to FPG

Posted by Grant Ingersoll <gs...@apache.org>.
https://issues.apache.org/jira/browse/MAHOUT-403

On May 28, 2010, at 8:58 AM, Grant Ingersoll wrote:

> 
> On May 27, 2010, at 7:06 PM, Ted Dunning wrote:
> 
>> That should be a small change (and helpful for a lot of mining tasks).
>> 
>> But once you jump on that slippery slope, why not allow a tiny Groovy
>> closure to be injected?  Or to pass in an object that will extract a map of
>> values from each line?
> 
> Expanding on this, I think we could do the following:
> 
> Map capturing groups to labels, then have pluggable output so that one could easily output to FPG, Classifiers, etc.
> 
> I'm not all that familiar w/ Groovy, so I'll put up my variation and then people can expand on it.
> 
> -Grant



Re: M/R Job for Log file to FPG

Posted by Grant Ingersoll <gs...@apache.org>.
On May 27, 2010, at 7:06 PM, Ted Dunning wrote:

> That should be a small change (and helpful for a lot of mining tasks).
> 
> But once you jump on that slippery slope, why not allow a tiny Groovy
> closure to be injected?  Or to pass in an object that will extract a map of
> values from each line?

Expanding on this, I think we could do the following:

Map capturing groups to labels, then have pluggable output so that one could easily output to FPG, Classifiers, etc.

I'm not all that familiar w/ Groovy, so I'll put up my variation and then people can expand on it.

-Grant 
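
The idea of mapping capturing groups to labels and plugging in different outputs could look roughly like the Python sketch below. Everything in it is an assumption made for illustration, not the MAHOUT-403 patch: the function names, the use of named regex groups as labels, and the two output formats (comma-separated items for FPG, label-tab-text for a classifier).

```python
import re


def extract_labeled(line, pattern):
    """Return {label: captured text} for the first match, or None if no match."""
    match = re.search(pattern, line)
    return match.groupdict() if match else None


def fpg_output(record, separator=","):
    """Assumed FPG-style line: one transaction, items joined by a separator."""
    return separator.join(value for value in record.values() if value)


def classifier_output(record, label_field):
    """Assumed classifier-style line: label, a tab, then the remaining fields."""
    text = " ".join(v for k, v in record.items() if k != label_field and v)
    return record[label_field] + "\t" + text


def convert(lines, pattern, formatter):
    """Apply one pluggable formatter to every line that matches the pattern."""
    for line in lines:
        record = extract_labeled(line, pattern)
        if record:
            yield formatter(record)


pattern = r"(?P<status>\d{3}) (?P<path>/\S+)"
lines = ["200 /select", "not a log line", "404 /admin"]
for out in convert(lines, pattern, fpg_output):
    print(out)
```

Swapping the output is then a matter of passing a different formatter, for example lambda r: classifier_output(r, "status").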

Re: M/R Job for Log file to FPG

Posted by Ted Dunning <te...@gmail.com>.
That should be a small change (and helpful for a lot of mining tasks).

But once you jump on that slippery slope, why not allow a tiny Groovy
closure to be injected?  Or to pass in an object that will extract a map of
values from each line?

On Thu, May 27, 2010 at 2:59 PM, Grant Ingersoll <gs...@apache.org> wrote:

> Actually, I take it back, as it is not what I want.
>
> Given a line like:
> 127.0.0.1 -  -  [24/05/2010:01:19:22 +0000] "GET /solr/select?q=import
> statement&start=1(MORE HERE) HTTP/1.1" 200 37571
>
> I want to be able to grab the q parameter (i.e. "import statement") and put
> that as my item for that document.  It needs to split the value within &q as
> well, so it should be "import" and "statement".
>
> I guess I was proposing that instead of splitting on the parameter, I would
> be only including those items that match a capturing group (all by default,
> but also optionally passed in)
>

Re: M/R Job for Log file to FPG

Posted by Grant Ingersoll <gs...@apache.org>.
Actually, I take it back, as it is not what I want.

Given a line like:
127.0.0.1 -  -  [24/05/2010:01:19:22 +0000] "GET /solr/select?q=import statement&start=1(MORE HERE) HTTP/1.1" 200 37571

I want to be able to grab the q parameter (i.e. "import statement") and put that as my item for that document.  It needs to split the value within &q as well, so it should be "import" and "statement".

I guess I was proposing that instead of splitting on the parameter, I would be only including those items that match a capturing group (all by default, but also optionally passed in)
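
As a rough illustration of the extraction described above, here is a small Python sketch. It is only an assumption about how this could be done, not code from Mahout: the regex, the name query_terms, and the choice to URL-decode the value before splitting on whitespace are all mine.

```python
import re
from urllib.parse import unquote_plus

# Capture the value of q up to the next '&', the ' HTTP/' suffix, or end of line.
# The lazy match is needed because the sample line has a literal space in the query.
Q_PARAM = re.compile(r"[?&]q=(.*?)(?:&| HTTP/|$)")


def query_terms(log_line):
    """Return the whitespace-separated terms of the q parameter, or [] if absent."""
    match = Q_PARAM.search(log_line)
    if match is None:
        return []
    # Decode %20 and '+' so encoded queries split the same way as raw ones
    return unquote_plus(match.group(1)).split()


line = ('127.0.0.1 -  -  [24/05/2010:01:19:22 +0000] '
        '"GET /solr/select?q=import statement&start=1(MORE HERE) HTTP/1.1" 200 37571')
print(query_terms(line))  # ['import', 'statement']
```

Only the capturing group contributes items here; the rest of the line is ignored rather than split.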

On May 27, 2010, at 5:25 PM, Grant Ingersoll wrote:

> Cool, glad I asked.  It's almost what I want and good enough for now.  However, what if I have multiple matching groups in my regex?  I was thinking it would be nice to take in a list of the matching groups to include and then iterate over them and append by the separator.
> 
> On May 27, 2010, at 5:14 PM, Robin Anil wrote:
> 
>> fpg uses regex to split. Just add another option for using the regex
>> to match instead of splitting. Less work I guess
>> 
>> 
>> 
>> On Fri, May 28, 2010 at 2:42 AM, Grant Ingersoll <gs...@apache.org> wrote:
>>> I'd like to take a bunch of logs and extract a bit of each line and then put them into format for FPG.  Was thinking a simple M/R job that took in a regex would suffice and then output in the format for FPG.  Is that generally useful or am I missing something obvious?  I want to do FPG on my query logs and it seems like a generally useful conversion.  I suppose, in fact, it isn't even log specific.
>>> 
>>> -Grant
> 



Re: M/R Job for Log file to FPG

Posted by Grant Ingersoll <gs...@apache.org>.
Cool, glad I asked.  It's almost what I want and good enough for now.  However, what if I have multiple matching groups in my regex?  I was thinking it would be nice to take in a list of the matching groups to include and then iterate over them and append by the separator.
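
A minimal Python sketch of that idea: take a list of group indexes to keep, then join the captured values with the separator. The function name extract_groups and its parameters are hypothetical, chosen only for this illustration.

```python
import re


def extract_groups(line, pattern, groups=None, separator=" "):
    """Join selected capturing groups of the first match with a separator.

    groups is a list of 1-based group indexes; None means all groups.
    Returns None when the pattern does not match the line at all.
    """
    match = re.search(pattern, line)
    if match is None:
        return None
    if groups is None:
        groups = range(1, match.re.groups + 1)
    values = [match.group(i) for i in groups]
    # Groups that did not participate in the match are None and are skipped
    return separator.join(v for v in values if v)


line = "2010-05-27 GET /select 200"
pattern = r"(\S+) (\S+) (\S+) (\d+)"
print(extract_groups(line, pattern, groups=[2, 4], separator=","))  # GET,200
```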

On May 27, 2010, at 5:14 PM, Robin Anil wrote:

> fpg uses regex to split. Just add another option for using the regex
> to match instead of splitting. Less work I guess
> 
> 
> 
> On Fri, May 28, 2010 at 2:42 AM, Grant Ingersoll <gs...@apache.org> wrote:
>> I'd like to take a bunch of logs and extract a bit of each line and then put them into format for FPG.  Was thinking a simple M/R job that took in a regex would suffice and then output in the format for FPG.  Is that generally useful or am I missing something obvious?  I want to do FPG on my query logs and it seems like a generally useful conversion.  I suppose, in fact, it isn't even log specific.
>> 
>> -Grant


Re: M/R Job for Log file to FPG

Posted by Grant Ingersoll <gs...@apache.org>.
Also, will it handle the case where the split produces no items?  From the looks of it, it will write an empty string with a count of 1.
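
To make the concern concrete, here is a small Python sketch of a guard that skips lines producing no items instead of counting an empty transaction. It does not reflect the actual FPG code; the function name count_transactions and the default split pattern are assumptions.

```python
import re
from collections import Counter


def count_transactions(lines, split_pattern=r"[ ,\t]+"):
    """Count identical transactions, ignoring lines that yield no items."""
    splitter = re.compile(split_pattern)
    counts = Counter()
    for line in lines:
        # Splitting an empty string gives [''], so drop empty tokens explicitly
        items = [tok for tok in splitter.split(line.strip()) if tok]
        if not items:
            continue  # nothing extracted: do not record an empty transaction
        counts[tuple(items)] += 1
    return counts


counts = count_transactions(["a b", "", "   ", "a b", "c"])
print(counts)
```

Without the guard, the blank lines in this input would be recorded as an empty-string transaction.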

On May 27, 2010, at 5:14 PM, Robin Anil wrote:

> fpg uses regex to split. Just add another option for using the regex
> to match instead of splitting. Less work I guess
> 
> 
> 
> On Fri, May 28, 2010 at 2:42 AM, Grant Ingersoll <gs...@apache.org> wrote:
>> I'd like to take a bunch of logs and extract a bit of each line and then put them into format for FPG.  Was thinking a simple M/R job that took in a regex would suffice and then output in the format for FPG.  Is that generally useful or am I missing something obvious?  I want to do FPG on my query logs and it seems like a generally useful conversion.  I suppose, in fact, it isn't even log specific.
>> 
>> -Grant



Re: M/R Job for Log file to FPG

Posted by Robin Anil <ro...@gmail.com>.
fpg uses regex to split. Just add another option for using the regex
to match instead of splitting. Less work I guess
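
The difference between the two modes can be sketched in a few lines of Python. This is an illustration only, not the FPG driver: the function tokenize and its mode parameter are hypothetical names.

```python
import re


def tokenize(line, pattern, mode="split"):
    """Turn one line into a list of items.

    mode="split": the pattern describes the separators between items.
    mode="match": the pattern describes the items themselves.
    """
    regex = re.compile(pattern)
    if mode == "split":
        parts = regex.split(line)
    elif mode == "match":
        # group(0) is the full text of each match
        parts = [m.group(0) for m in regex.finditer(line)]
    else:
        raise ValueError("mode must be 'split' or 'match'")
    # Drop empty pieces so a blank line yields no items
    return [p for p in parts if p]


print(tokenize("milk, bread,eggs", r"\s*,\s*"))
print(tokenize("id=42 user=alice", r"\w+=\w+", mode="match"))
```

Matching is the better fit when most of each line is noise and only a small part of it should become items.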



On Fri, May 28, 2010 at 2:42 AM, Grant Ingersoll <gs...@apache.org> wrote:
> I'd like to take a bunch of logs and extract a bit of each line and then put them into format for FPG.  Was thinking a simple M/R job that took in a regex would suffice and then output in the format for FPG.  Is that generally useful or am I missing something obvious?  I want to do FPG on my query logs and it seems like a generally useful conversion.  I suppose, in fact, it isn't even log specific.
>
> -Grant