Posted to user@pig.apache.org by David Riccitelli <da...@insideout.io> on 2011/08/19 12:06:14 UTC

calculate requests and visits (nested groups?)

I'm analyzing a daily Apache log file. I'd like to get the number of
requests and the number of visits (distinct clients) by hour.

I managed to get the requests, but how do I get the visits?

grunt> RAW_LOGS = LOAD '<log-file>' USING TextLoader() AS (line:chararray);
grunt> LOGS_BASE = FOREACH RAW_LOGS GENERATE
  FLATTEN(
    REGEX_EXTRACT_ALL(line, '(\\S+) (\\S+) \\[(\\d{2}/\\w{3}/\\d{4})\\:(\\d{2})\\:(\\d{2})\\:(\\d{2}) (\\+\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" (\\S+) (\\S+)')
  ) AS (
    client:   chararray,
    username: chararray,
    date: chararray,
    hour: chararray,
    minute: chararray,
    second: chararray,
    timeZone: chararray,
    request:  chararray,
    statusCode: int,
    bytesSent: chararray,
    referer:  chararray,
    userAgent: chararray,
    remoteUser: chararray,
    timeTaken: chararray
);
grunt> A = GROUP LOGS_BASE BY hour;
grunt> DESCRIBE A;
A: {group: chararray,LOGS_BASE: {(client: chararray,username:
chararray,date: chararray,hour: chararray,minute: chararray,second:
chararray,timeZone: chararray,request: chararray,statusCode: int,bytesSent:
chararray,referer: chararray,userAgent: chararray,remoteUser:
chararray,timeTaken: chararray)}}
grunt> B = FOREACH A GENERATE group AS hour, COUNT( $1 );
grunt> C = ORDER B BY hour; -- requests by hour

How can I now get the distinct count of clients per hour?

Thanks for your help!

-- 
David Riccitelli

********************************************************************************
InsideOut10 s.r.l.
P.IVA: IT-11381771002
Fax: +39 0110708239
---
LinkedIn: http://it.linkedin.com/in/riccitelli
Twitter: ziodave
---
Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
********************************************************************************

Re: calculate requests and visits (nested groups?)

Posted by David Riccitelli <da...@insideout.io>.
OK, it seems I've been able to solve it by casting the projected bag to long,
i.e. (bag{tuple(long)})logs.timeTaken. I changed this:
by_hour =
 foreach (group logs by hour) {
  dist_clients = distinct logs.client;
  max_time_taken = logs.timeTaken;
  generate
   group as hour,
   COUNT(dist_clients) as num_dist_clients,
   COUNT(logs) as total_requests,
   MAX( max_time_taken );

};

to this:
by_hour =
 foreach (group logs by hour) {
  dist_clients = distinct logs.client;
  max_time_taken = (bag{tuple(long)})logs.timeTaken;
  generate
   group as hour,
   COUNT(dist_clients) as num_dist_clients,
   COUNT(logs) as total_requests,
   MAX( max_time_taken );

};
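
A possible alternative to the bag cast, as a sketch only and not run against
this log: keep every field captured by REGEX_EXTRACT_ALL declared as chararray
in the AS clause (as in the original LOGS_BASE schema), then convert timeTaken
with an explicit scalar cast before grouping. The alias logs_typed is
hypothetical, and the sketch assumes logs exposes client, hour and timeTaken
as chararray fields.

```pig
-- Sketch under assumptions: 'logs_typed' is a made-up alias, and 'logs'
-- is assumed to declare client, hour and timeTaken as chararray.
logs_typed = FOREACH logs GENERATE
    client,
    hour,
    (long) timeTaken AS timeTaken;  -- explicit chararray -> long cast

by_hour = FOREACH (GROUP logs_typed BY hour) {
    dist_clients = DISTINCT logs_typed.client;
    GENERATE
        group AS hour,
        COUNT(dist_clients) AS num_dist_clients,
        COUNT(logs_typed) AS total_requests,
        MAX(logs_typed.timeTaken) AS max_time_taken;
};
```

With this layout MAX should receive values that are already long, so the
cast inside the nested block would no longer be needed. Values that cannot
be parsed as a number are expected to become null rather than raise
the ClassCastException.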


David

On Fri, Aug 19, 2011 at 7:44 PM, David Riccitelli <da...@insideout.io>wrote:

> Sorry for this long sequence of messages, but I'm posting things as I
> continue testing/investigating.
>
> Might this be relevant to my case?
>
> http://www.mail-archive.com/user@pig.apache.org/msg02258.html
>
> Thanks,
> David
>
>
> On Fri, Aug 19, 2011 at 7:31 PM, David Riccitelli <da...@insideout.io>wrote:
>
>> I tried changing this line, from:
>> RAW_LOGS = LOAD
>> '/Users/david/Documents/Work/OTT-Tunisiana/access_log_test' USING
>> TextLoader() AS (line:chararray);
>>
>> to:
>> RAW_LOGS = LOAD
>> '/Users/david/Documents/Work/OTT-Tunisiana/access_log_test' USING
>> PigStorage() AS (line:chararray);
>>
>> It does not fix the issue, as it depends on the REGEX_EXTRACT_ALL
>> that produces the logs schema.
>>
>> Is there any incompatibility between the REGEX_EXTRACT_ALL and the MAX
>> function?
>>
>> Thanks for your help,
>> David
>>
>> On Fri, Aug 19, 2011 at 7:28 PM, David Riccitelli <da...@insideout.io>wrote:
>>
>>> I noticed that this issue arises only if I load the initial data with the
>>> TextLoader() and using the REGEX_EXTRACT_ALL.
>>>
>>> If I use the PigStorage (splitting spaces, not using RegExp, i.e. w/o
>>> REGEX_EXTRACT_ALL), it works.
>>>
>>> But I need the REGEX_EXTRACT_ALL in order to correctly parse the lines...
>>>
>>> Does it make sense?
>>>
>>> David
>>>
>>>
>>> On Fri, Aug 19, 2011 at 6:02 PM, David Riccitelli <da...@insideout.io>wrote:
>>>
>>>> I still can't manage to accomplish my objectives. I'm trying to get now
>>>> the max time taken so, as a test, I do:
>>>> grunt> A = GROUP logs BY client;
>>>>
>>>> then (timeTaken is long):
>>>> B = FOREACH A GENERATE group, MAX( logs.timeTaken );
>>>>
>>>> when I dump it, I get the following error:
>>>> org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error
>>>> while computing max in Initial
>>>> (...)
>>>> Caused by: java.lang.ClassCastException: java.lang.String cannot be cast
>>>> to java.lang.Long
>>>> at org.apache.pig.builtin.LongMax$Initial.exec(LongMax.java:76)
>>>>
>>>> Initially I thought that I had some timeTaken not compatible with long
>>>> data type, but I checked and re-checked. I also get the timeTaken as \d+
>>>> regular expression.
>>>>
>>>> What am I doing wrong?
>>>>
>>>> Thanks!
>>>> David
>>>>
>>>> On Fri, Aug 19, 2011 at 5:25 PM, David Riccitelli <da...@insideout.io>wrote:
>>>>
>>>>> I tried with another log file and that does not happen, so I suppose
>>>>> there's some 'corrupted' line in the one I was testing.
>>>>>
>>>>>
>>>>> On Fri, Aug 19, 2011 at 4:56 PM, David Riccitelli <da...@insideout.io>wrote:
>>>>>
>>>>>> There's something strange in the results however:
>>>>>> (00,129,30096)
>>>>>> (01,91,16487)
>>>>>> (02,57,11686)
>>>>>> (03,41,6041)
>>>>>> (04,30,4882)
>>>>>> (05,33,4154)
>>>>>> (06,65,8031)
>>>>>> (07,66,12260)
>>>>>> (08,95,17924)
>>>>>> (09,131,21187)
>>>>>> (10,162,26607)
>>>>>> (11,155,28503)
>>>>>> (12,146,27863)
>>>>>> (13,152,29130)
>>>>>> (14,159,32784)
>>>>>> (15,150,28898)
>>>>>> (16,143,28973)
>>>>>> (17,169,29024)
>>>>>> (18,199,26585)
>>>>>> (19,182,28803)
>>>>>> (20,224,32511)
>>>>>> (21,232,38584)
>>>>>> (22,225,39924)
>>>>>> (23,191,33606)
>>>>>> (,0,0)
>>>>>>
>>>>>>
>>>>>> What is the last line:
>>>>>>  (,0,0)
>>>>>> the count is zero, it shouldn't really be there, correct?
>>>>>>
>>>>>> (Using pig 0.9.0)
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> David
>>>>>>
>>>>>> On Fri, Aug 19, 2011 at 3:58 PM, Dmitriy Ryaboy <dv...@gmail.com>wrote:
>>>>>>
>>>>>>> Right, that should read "by_hour_client.num_reqs".
>>>>>>>
>>>>>>> Don't trust relative measurements you get for small data on a single
>>>>>>> computer in local mode. Things change when you start running on hundreds of
>>>>>>> gigs with real skew on a cluster.
>>>>>>>
>>>>>>> D
>>>>>>>
>>>>>>> On Fri, Aug 19, 2011 at 5:48 AM, David Riccitelli <david@insideout.io> wrote:
>>>>>>>
>>>>>>> > Thanks Dmitriy,
>>>>>>> >
>>>>>>> > The second method took less than 26 secs. on my computer (~550.000 lines).
>>>>>>> > The first method is giving me the following error:
>>>>>>> >
>>>>>>> > ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025:
>>>>>>> > <line 34, column 7> Invalid field projection. Projected field [num_reqs]
>>>>>>> > does not exist in schema:
>>>>>>> > group:chararray,by_hour_client:bag{:tuple(hour:chararray,client:chararray,num_reqs:long)}.
>>>>>>> >
>>>>>>> > when I try to set the by_hour (after having set the by_hour_client):
>>>>>>> >
>>>>>>> > grunt> by_hour_client =
>>>>>>> > >>  foreach
>>>>>>> > >>    (group logs by (hour, client))
>>>>>>> > >>  generate
>>>>>>> > >>    flatten(group) as (hour, client),
>>>>>>> > >>    COUNT(logs) as num_reqs;
>>>>>>> > grunt> by_hour =
>>>>>>> > >>  foreach
>>>>>>> > >>    (group by_hour_client by hour)
>>>>>>> > >>  generate
>>>>>>> > >>    group as hour,
>>>>>>> > >>    COUNT(by_hour_client) as num_dist_clients,
>>>>>>> > >>    SUM(num_reqs) as total_requests;
>>>>>>> >
>>>>>>> > If I understood correctly that's because the num_reqs is in the bag,
>>>>>>> > as a result of the
>>>>>>> >     (group by_hour_client by hour)
>>>>>>> > correct? So I changed the last line to
>>>>>>> >     SUM(by_hour_client.num_reqs) as total_requests;
>>>>>>> > and it worked (it took a little more than 29 seconds).
>>>>>>> >
>>>>>>> > Thanks for your help,
>>>>>>> > David
>>>>>>> >
>>>>>>> >
>>>>>>> > On Fri, Aug 19, 2011 at 2:51 PM, Dmitriy Ryaboy <dvryaboy@gmail.com> wrote:
>>>>>>> >
>>>>>>> > > by_hour_client =
>>>>>>> > >  foreach
>>>>>>> > >    (group logs by (hour, client) parallel $p)
>>>>>>> > >  generate
>>>>>>> > >    flatten(group) as (hour, client),
>>>>>>> > >    COUNT(logs) as num_reqs;
>>>>>>> > >
>>>>>>> > > by_hour =
>>>>>>> > >  foreach
>>>>>>> > >    (group by_hour_client by hour parallel $p2)
>>>>>>> > >  generate
>>>>>>> > >    group as hour,
>>>>>>> > >    COUNT(by_hour_client) as num_dist_clients,
>>>>>>> > >    SUM(num_reqs) as total_requests;
>>>>>>> > >
>>>>>>> > > You can also do this using a nested distinct, but depending on what your
>>>>>>> > > data looks like, it might be a bad idea, as it can put a lot of pressure on
>>>>>>> > > individual reducers that have to do the inner distinct in memory (although
>>>>>>> > > they do push part of this up to the mappers):
>>>>>>> > >
>>>>>>> > > by_hour =
>>>>>>> > >  foreach (group logs by hour) {
>>>>>>> > >   dist_clients = distinct logs.client;
>>>>>>> > >   generate
>>>>>>> > >    group as hour,
>>>>>>> > >    COUNT(dist_clients) as num_dist_clients,
>>>>>>> > >    COUNT(logs) as total_requests;
>>>>>>> > > }
>>>>>>> > >
>>>>>>> > > D
>>>>>>> > >

Re: calculate requests and visits (nested groups?)

Posted by David Riccitelli <da...@insideout.io>.
Sorry for this long sequence of messages, but I'm posting things as I
continue testing/investigating.

Might this be relevant to my case?

http://www.mail-archive.com/user@pig.apache.org/msg02258.html

Thanks,
David




Re: calculate requests and visits (nested groups?)

Posted by David Riccitelli <da...@insideout.io>.
I tried changing this line, from:
RAW_LOGS = LOAD '/Users/david/Documents/Work/OTT-Tunisiana/access_log_test'
USING TextLoader() AS (line:chararray);

to:
RAW_LOGS = LOAD '/Users/david/Documents/Work/OTT-Tunisiana/access_log_test'
USING PigStorage() AS (line:chararray);

It does not fix the issue, as it depends on the REGEX_EXTRACT_ALL that
produces the logs schema.

Is there any incompatibility between the REGEX_EXTRACT_ALL and the MAX
function?

Thanks for your help,
David


Re: calculate requests and visits (nested groups?)

Posted by David Riccitelli <da...@insideout.io>.
I noticed that this issue arises only if I load the initial data with
TextLoader() and parse it with REGEX_EXTRACT_ALL.

If I use PigStorage (splitting on spaces, i.e. without REGEX_EXTRACT_ALL),
it works.

But I need REGEX_EXTRACT_ALL in order to parse the lines correctly...

Does this make sense?

David


Re: calculate requests and visits (nested groups?)

Posted by David Riccitelli <da...@insideout.io>.
I still can't manage to accomplish my objectives. I'm now trying to get the
max time taken so, as a test, I do:
grunt> A = GROUP logs BY client;

then (timeTaken is long):
B = FOREACH A GENERATE group, MAX( logs.timeTaken );

when I dump it, I get the following error:
org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing max in Initial
(...)
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Long
    at org.apache.pig.builtin.LongMax$Initial.exec(LongMax.java:76)

Initially I thought that some timeTaken values were not compatible with the
long data type, but I checked and re-checked. I also extract timeTaken with a
\d+ regular expression.

What am I doing wrong?

Thanks!
David
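
A possible explanation, offered as an assumption rather than something
confirmed in this thread: REGEX_EXTRACT_ALL hands back every capture group as
text, and declaring timeTaken as long in the AS clause may only relabel the
field without converting it, so MAX ends up receiving a String. If that is the
cause, declaring the field as chararray and casting it explicitly (for
example with a (long) cast in a later FOREACH) would be the thing to try. The
Python sketch below only models the same pitfall; the line format, the
pattern and the sample data are made up for illustration.

```python
import re

# Hypothetical, simplified line format: "<client> <anything> <timeTaken>".
LINE_RE = re.compile(r'(\S+) .* (\d+)$')

def max_time_taken(lines):
    """Return {client: max timeTaken}, converting the captured text to int."""
    best = {}
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip lines that do not match the pattern
        client, raw = m.groups()  # both values are still str here
        value = int(raw)          # explicit conversion before aggregating
        if client not in best or value > best[client]:
            best[client] = value
    return best

# Made-up sample lines, not taken from the real log.
sample = [
    "10.0.0.1 GET /a 9",
    "10.0.0.1 GET /b 10",
    "10.0.0.2 GET /c 5",
]
result = max_time_taken(sample)
```

Without the int() call the values would be compared as text, where "9" sorts
after "10", so the conversion step is what makes the maximum meaningful.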


Re: calculate requests and visits (nested groups?)

Posted by David Riccitelli <da...@insideout.io>.
I tried with another log file and the empty (,0,0) row does not appear, so I
suppose there's some 'corrupted' line in the one I was testing.


Re: calculate requests and visits (nested groups?)

Posted by David Riccitelli <da...@insideout.io>.
There's something strange in the results, however:
(00,129,30096)
(01,91,16487)
(02,57,11686)
(03,41,6041)
(04,30,4882)
(05,33,4154)
(06,65,8031)
(07,66,12260)
(08,95,17924)
(09,131,21187)
(10,162,26607)
(11,155,28503)
(12,146,27863)
(13,152,29130)
(14,159,32784)
(15,150,28898)
(16,143,28973)
(17,169,29024)
(18,199,26585)
(19,182,28803)
(20,224,32511)
(21,232,38584)
(22,225,39924)
(23,191,33606)
(,0,0)


What is the last line,
 (,0,0)?
Its counts are zero, so it shouldn't really be there, correct?

(Using Pig 0.9.0)


Thanks,
David
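
One plausible explanation for the (,0,0) row, offered as a hedged sketch rather than a confirmed diagnosis: a line that does not match the regex yields null fields, GROUP collects the null keys into a single group, and COUNT skips tuples whose first field is null. The plain-Python model below mimics that behaviour. The sample lines, the `parse` and `count_non_null` helpers, and the choice to list the null row last are all assumptions made for illustration; the regex is a Python transcription of the one in the original post.

```python
import re
from collections import defaultdict

# Python transcription of the pattern from the original post (14 capture groups).
LOG_RE = re.compile(
    r'(\S+) (\S+) \[(\d{2}/\w{3}/\d{4}):(\d{2}):(\d{2}):(\d{2}) (\+\d{4})\] '
    r'"(.+?)" (\S+) (\S+) "([^"]*)" "([^"]*)" (\S+) (\S+)'
)

def parse(line):
    """Return (client, hour); both are None when the whole line does not match
    (assumed to mirror the null fields Pig produces for such a line)."""
    m = LOG_RE.fullmatch(line)
    return (m.group(1), m.group(4)) if m else (None, None)

def count_non_null(bag):
    """Model of COUNT: tuples whose first field is null are not counted."""
    return sum(1 for t in bag if t[0] is not None)

def by_hour(lines):
    # Step 1: group by (hour, client) and count the requests of each pair.
    pairs = defaultdict(list)
    for line in lines:
        client, hour = parse(line)
        pairs[(hour, client)].append((client, hour))
    by_hour_client = [(h, c, count_non_null(bag)) for (h, c), bag in pairs.items()]
    # Step 2: group by hour, count the clients and sum their requests.
    hours = defaultdict(list)
    for h, c, n in by_hour_client:
        hours[h].append((h, c, n))
    rows = [(h, count_non_null(bag), sum(n for _, _, n in bag)) for h, bag in hours.items()]
    # Listing the null-hour row last is only a display choice for this sketch.
    return sorted(rows, key=lambda row: (row[0] is None, row[0] or ""))

# Hypothetical sample: two well-formed lines and one truncated ("corrupted") line.
SAMPLE = [
    '10.0.0.1 - [19/Aug/2011:00:15:02 +0200] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0" - 12',
    '10.0.0.2 - [19/Aug/2011:00:20:41 +0200] "GET /a HTTP/1.1" 200 99 "-" "Mozilla/5.0" - 7',
    '10.0.0.3 - [19/Aug/2011:01:',
]

for row in by_hour(SAMPLE):
    print(row)
# ('00', 2, 2)
# (None, 0, 0)
```

If this is what happens, filtering out rows whose hour is null before grouping should make the extra row disappear.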


Re: calculate requests and visits (nested groups?)

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Right, that should read "by_hour_client.num_reqs".

Don't trust relative measurements you get for small data on a single
computer in local mode. Things change when you start running on hundreds of
gigs with real skew on a cluster.

D


Re: calculate requests and visits (nested groups?)

Posted by David Riccitelli <da...@insideout.io>.
Thanks Dmitriy,

The second method took less than 26 seconds on my computer (~550,000 lines).
The first method is giving me the following error:

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025:
<line 34, column 7> Invalid field projection. Projected field [num_reqs]
does not exist in schema:
group:chararray,by_hour_client:bag{:tuple(hour:chararray,client:chararray,num_reqs:long)}.

when I try to define by_hour (after having defined by_hour_client):

grunt> by_hour_client =
>>  foreach
>>    (group logs by (hour, client))
>>  generate
>>    flatten(group) as (hour, client),
>>    COUNT(logs) as num_reqs;
grunt> by_hour =
>>  foreach
>>    (group by_hour_client by hour)
>>  generate
>>    group as hour,
>>    COUNT(by_hour_client) as num_dist_clients,
>>    SUM(num_reqs) as total_requests;

If I understood correctly, that's because num_reqs now lives inside the bag
produced by
    (group by_hour_client by hour)
correct? So I changed the last line to
    SUM(by_hour_client.num_reqs) as total_requests;
and it worked (it took a little more than 29 seconds).

Thanks for your help,
David
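
The point about num_reqs living inside the bag can be pictured with a small plain-Python model. This is only a sketch: the rows are made up, and the list comprehension stands in for what `SUM(by_hour_client.num_reqs)` does in the corrected Pig statement.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows of by_hour_client: (hour, client, num_reqs).
by_hour_client = [
    ("00", "10.0.0.1", 3),
    ("00", "10.0.0.2", 1),
    ("01", "10.0.0.1", 5),
]

# Model of `group by_hour_client by hour`: every output row is (group, bag),
# so the row itself has no num_reqs field any more.
grouped = [
    (hour, list(bag))
    for hour, bag in groupby(sorted(by_hour_client, key=itemgetter(0)), key=itemgetter(0))
]

# num_reqs has to be read from the tuples inside the bag, which is what the
# projection by_hour_client.num_reqs expresses.
by_hour = [
    (hour, len(bag), sum(num_reqs for _, _, num_reqs in bag))
    for hour, bag in grouped
]

print(by_hour)  # [('00', 2, 4), ('01', 1, 5)]
```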



Re: calculate requests and visits (nested groups?)

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
by_hour_client =
  foreach
    (group logs by (hour, client) parallel $p)
  generate
    flatten(group) as (hour, client),
    COUNT(logs) as num_reqs;

by_hour =
  foreach
    (group by_hour_client by hour parallel $p2)
  generate
    group as hour,
    COUNT(by_hour_client) as num_dist_clients,
    SUM(num_reqs) as total_requests;

You can also do this using a nested distinct, but depending on what your
data looks like, it might be a bad idea, as it can put a lot of pressure on
individual reducers that have to do the inner distinct in memory (although
they do push part of this up to the mappers):

by_hour =
  foreach (group logs by hour) {
   dist_clients = distinct logs.client;
   generate
    group as hour,
    COUNT(dist_clients) as num_dist_clients,
    COUNT(logs) as total_requests;
}

D
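
The two approaches above compute the same numbers and differ mainly in where the de-duplication work is done. The plain-Python sketch below models both on a handful of made-up (hour, client) pairs; the function names `two_step` and `nested_distinct` are hypothetical, and the code says nothing about how Pig actually schedules the work on a cluster.

```python
from collections import Counter, defaultdict

# Hypothetical input: one (hour, client) pair per request.
logs = [
    ("00", "a"), ("00", "a"), ("00", "b"),
    ("01", "a"), ("01", "c"), ("01", "c"), ("01", "c"),
]

def two_step(logs):
    """Model of the two GROUPs: first per (hour, client), then per hour."""
    per_pair = Counter(logs)  # step 1: requests per (hour, client)
    clients = Counter()
    requests = Counter()
    for (hour, _client), n in per_pair.items():
        clients[hour] += 1    # one row per distinct client in that hour
        requests[hour] += n   # sum of that client's requests
    return {h: (clients[h], requests[h]) for h in clients}

def nested_distinct(logs):
    """Model of the nested DISTINCT: one set of clients is built per hour,
    so a very busy hour means one very large set held in one place."""
    seen = defaultdict(set)
    total = Counter()
    for hour, client in logs:
        seen[hour].add(client)
        total[hour] += 1
    return {h: (len(seen[h]), total[h]) for h in seen}

print(two_step(logs))         # {'00': (2, 3), '01': (2, 4)}
print(nested_distinct(logs))  # {'00': (2, 3), '01': (2, 4)}
```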


calculate requests and visits (nested groups?)

Posted by David Riccitelli <da...@insideout.io>.
I'm analyzing a daily Apache log file. I'd like to get the number of
requests and of visits by hour.

I managed to get the requests, but how do I get the visits?

grunt> RAW_LOGS = LOAD '<log-file>' USING TextLoader() AS (line:chararray);
grunt> LOGS_BASE = FOREACH RAW_LOGS GENERATE
  FLATTEN(
    REGEX_EXTRACT_ALL(line, '(\\S+) (\\S+) \\[(\\d{2}/\\w{3}/\\d{4})\\:(\\d{2})\\:(\\d{2})\\:(\\d{2}) (\\+\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" (\\S+) (\\S+)')
  ) AS (
    client:   chararray,
    username: chararray,
    date: chararray,
    hour: chararray,
    minute: chararray,
    second: chararray,
    timeZone: chararray,
    request:  chararray,
    statusCode: int,
    bytesSent: chararray,
    referer:  chararray,
    userAgent: chararray,
    remoteUser: chararray,
    timeTaken: chararray
);
grunt> A = GROUP LOGS_BASE BY hour;
grunt> DESCRIBE A;
A: {group: chararray,LOGS_BASE: {(client: chararray,username: chararray,date: chararray,hour: chararray,minute: chararray,second: chararray,timeZone: chararray,request: chararray,statusCode: int,bytesSent: chararray,referer: chararray,userAgent: chararray,remoteUser: chararray,timeTaken: chararray)}}
grunt> B = FOREACH A GENERATE group AS hour, COUNT( $1 );
grunt> C = ORDER B BY hour; -- requests by hour

How can I now get the distinct count of clients per hour?

Thanks for your help!
