Posted to user@pig.apache.org by Chris Hartjes <ch...@littlehart.net> on 2010/01/11 19:29:31 UTC

Analyzing MySQL slow query logs using Pig + Hadoop

My apologies if this is the wrong mailing list to ask this question.  I've
started playing around with Pig and Hadoop, with the intention of using it
to do some analysis of a collection of MySQL slow query log files.  I am not
a Java programmer (been using PHP for a very long time, dabbled in other
languages as required) so I am slightly intimidated by the documentation in
Pig for writing your own UDF's.

If anyone has done anything like this, I would appreciate some tips and some
pointers on how to approach it.  Sure, I could hunker down and learn to use
some CLI tools for analyzing the slow query log, but then I couldn't use Pig
and Hadoop. ;)

-- 
Chris Hartjes

Re: Analyzing MySQL slow query logs using Pig + Hadoop

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
Chris Hartjes wrote:
> My apologies if this is the wrong mailing list to ask this question.  I've
> started playing around with Pig and Hadoop, with the intention of using it
> to do some analysis of a collection of MySQL slow query log files.  I am not
> a Java programmer (been using PHP for a very long time, dabbled in other
> languages as required) so I am slightly intimidated by the documentation in
> Pig for writing your own UDF's.
> 
> If anyone has done anything like this, I would appreciate some tips and some
> pointers on how to approach it.  Sure, I could hunker down and learn to use
> some CLI tools for analyzing the slow query log, but then I couldn't use Pig
> and Hadoop. ;)
> 

Until you are comfortable with Pig UDFs, you could use streaming and work 
in the language you are already comfortable with! There are a few things 
you won't be able to do that way (custom load/store functions, for 
example), but pretty much everything else should be possible.
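
For example, a rough sketch (parse_slow_log.php and its tab-separated
output fields are placeholders for whatever script you end up writing,
and PHP is assumed to be installed on the task nodes; note that a
slow-log entry spans several lines, so the script has to buffer stdin
until it sees the next '# Time:' header):

DEFINE parse_slow `php parse_slow_log.php` ship ('parse_slow_log.php');

-- one line of the slow query log per record
raw = LOAD 'mysql-slow.log' USING TextLoader() AS (line:chararray);

-- pipe the whole log through the PHP parser, which emits one
-- tab-separated record per slow-log entry
parsed = STREAM raw THROUGH parse_slow
         AS (query_time:double, lock_time:double, rows_sent:long, query:chararray);

slowest = ORDER parsed BY query_time DESC;
STORE slowest INTO 'slowest_queries';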


Regards,
Mridul


Re: Analyzing MySQL slow query logs using Pig + Hadoop

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Thanks for the plug, Ricardo!
A word of warning regarding that blog post -- it's written to explain
things, not to show how one would run them in production. So it's a
bit verbose and does silly things like calling out to awk. Don't take
it as a style guide :-).

Someone recently commented that it's way too long for the job it does,
so I shrank it -- here's an equivalent but terser version:

register /home/dvryaboy/src/pig/trunk/piggybank.jar;
DEFINE LogLoader storage.apachelog.CombinedLogLoader();
DEFINE iplookup `ipwrapper.sh $GEO` ship ('ipwrapper.sh')
    cache('/home/dvryaboy/tmp/$GEO#$GEO');

logs = LOAD '$LOGS' USING LogLoader as
   (remoteAddr, remoteLogname, user, time, method,
    uri, proto, status, bytes, referer, userAgent);

notbots = FILTER logs BY bytes != '-' AND uri matches '/apache.*'
    AND (NOT filtering.IsBotUA(userAgent));

with_country = STREAM notbots THROUGH iplookup
       AS (country_code, country, state, city, ip, time, uri, bytes, userAgent);

geo_uri_group_counts =
   ORDER (FOREACH (GROUP with_country BY country_code)
              GENERATE group, COUNT($1) AS cnt, SUM($1.bytes) AS total_bytes)
   BY cnt DESC;

STORE geo_uri_group_counts INTO 'by_country.tsv';

us_only = FILTER with_country BY country_code == 'US';

by_state_cnt =
   ORDER (FOREACH (GROUP us_only BY state)
              GENERATE group, COUNT(us_only.state) AS cnt, SUM(us_only.bytes) AS total_bytes)
   BY cnt;

STORE by_state_cnt into 'by_state.tsv';

-- and more of the same for project_count

-Dmitriy

On Mon, Jan 11, 2010 at 11:35 AM, Ricardo Varela <ph...@gmail.com> wrote:
> hey Chris,
>
> If you find it hard to define UDFs, maybe you can start with scripts
> written in PHP, since you feel more comfortable with it, or even with
> shell scripts. You can do that with the Pig streaming interface
> (http://wiki.apache.org/pig/PigStreamingFunctionalSpec). It won't
> perform as well as proper UDFs, but it is useful for trying things out
> (I often prototype with STREAM first and then create UDFs if needed).
>
> I found the examples in the doc and in the following article from
> Dmitriy Ryaboy very useful to start with:
>
> http://www.cloudera.com/blog/2009/06/17/analyzing-apache-logs-with-pig/
>
> Good luck in your tests and hope you like Pig and Hadoop!
>
> Saludos!
>
> ---
> ricardo
>
> On Mon, Jan 11, 2010 at 6:29 PM, Chris Hartjes <ch...@littlehart.net> wrote:
>> My apologies if this is the wrong mailing list to ask this question.  I've
>> started playing around with Pig and Hadoop, with the intention of using it
>> to do some analysis of a collection of MySQL slow query log files.  I am not
>> a Java programmer (been using PHP for a very long time, dabbled in other
>> languages as required) so I am slightly intimidated by the documentation in
>> Pig for writing your own UDF's.
>>
>> If anyone has done anything like this, I would appreciate some tips and some
>> pointers on how to approach it.  Sure, I could hunker down and learn to use
>> some CLI tools for analyzing the slow query log, but then I couldn't use Pig
>> and Hadoop. ;)
>>
>> --
>> Chris Hartjes
>>
>
>
>
> --
> Ricardo Varela  -  http://phobeo.com  -  http://twitter.com/phobeo
> "Though this be madness, yet there's method in 't"
>

Re: Analyzing MySQL slow query logs using Pig + Hadoop

Posted by Ricardo Varela <ph...@gmail.com>.
hey Chris,

If you find it hard to define UDFs, maybe you can start with scripts
written in PHP, since you feel more comfortable with it, or even with
shell scripts. You can do that with the Pig streaming interface
(http://wiki.apache.org/pig/PigStreamingFunctionalSpec). It won't
perform as well as proper UDFs, but it is useful for trying things out
(I often prototype with STREAM first and then create UDFs if needed).
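
For instance, a rough sketch of the prototype-then-replace approach
(parse_slow_log.php and the ParseSlowLogLine UDF are hypothetical names,
only there to show where each piece would go):

-- prototype: ship the PHP script to the nodes and stream through it
DEFINE parse_log `php parse_slow_log.php` ship ('parse_slow_log.php');

raw    = LOAD 'mysql-slow.log' USING TextLoader() AS (line:chararray);
parsed = STREAM raw THROUGH parse_log
         AS (query_time:double, query:chararray);

-- later, a Java UDF of your own would replace only this one step, e.g.
-- parsed = FOREACH raw GENERATE FLATTEN(ParseSlowLogLine(line));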

I found the examples in the doc and in the following article from
Dmitriy Ryaboy very useful to start with:

http://www.cloudera.com/blog/2009/06/17/analyzing-apache-logs-with-pig/

Good luck in your tests and hope you like Pig and Hadoop!

Saludos!

---
ricardo

On Mon, Jan 11, 2010 at 6:29 PM, Chris Hartjes <ch...@littlehart.net> wrote:
> My apologies if this is the wrong mailing list to ask this question.  I've
> started playing around with Pig and Hadoop, with the intention of using it
> to do some analysis of a collection of MySQL slow query log files.  I am not
> a Java programmer (been using PHP for a very long time, dabbled in other
> languages as required) so I am slightly intimidated by the documentation in
> Pig for writing your own UDF's.
>
> If anyone has done anything like this, I would appreciate some tips and some
> pointers on how to approach it.  Sure, I could hunker down and learn to use
> some CLI tools for analyzing the slow query log, but then I couldn't use Pig
> and Hadoop. ;)
>
> --
> Chris Hartjes
>



-- 
Ricardo Varela  -  http://phobeo.com  -  http://twitter.com/phobeo
"Though this be madness, yet there's method in 't"

Re: Analyzing MySQL slow query logs using Pig + Hadoop

Posted by Rekha Joshi <re...@yahoo-inc.com>.
Hi Chris,

For simple analysis, the default functions provided in Pig (see the Pig
Latin reference and the cookbook) should be sufficient; UDFs only come into
the picture if you need something customized. Even then, most of your needs
can be met directly from the piggybank.

If you need to write one, please look into http://hadoop.apache.org/pig/docs/r0.5.0/udf.html. It has some sample UDFs and usage.
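
As a rough sketch of what the built-ins alone can do, assuming the slow
log has already been flattened into a tab-separated file with one record
per query (e.g. by a small one-off script; the file and field names below
are just placeholders):

entries = LOAD 'slow_queries.tsv'
          AS (ts:chararray, user:chararray, query_time:double,
              lock_time:double, rows_sent:long, query:chararray);

-- per-user summary using only built-in functions
grouped = GROUP entries BY user;
by_user = FOREACH grouped GENERATE
              group                   AS user,
              COUNT(entries)          AS num_queries,
              AVG(entries.query_time) AS avg_time,
              MAX(entries.query_time) AS worst_time;

sorted = ORDER by_user BY avg_time DESC;
STORE sorted INTO 'slow_by_user';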

Cheers,
/R

On 1/11/10 11:59 PM, "Chris Hartjes" <ch...@littlehart.net> wrote:

My apologies if this is the wrong mailing list to ask this question.  I've
started playing around with Pig and Hadoop, with the intention of using it
to do some analysis of a collection of MySQL slow query log files.  I am not
a Java programmer (been using PHP for a very long time, dabbled in other
languages as required) so I am slightly intimidated by the documentation in
Pig for writing your own UDF's.

If anyone has done anything like this, I would appreciate some tips and some
pointers on how to approach it.  Sure, I could hunker down and learn to use
some CLI tools for analyzing the slow query log, but then I couldn't use Pig
and Hadoop. ;)

--
Chris Hartjes