Posted to user@pig.apache.org by David Riccitelli <da...@insideout.io> on 2011/08/19 12:06:14 UTC
calculate requests and visits (nested groups?)
I'm analyzing a daily Apache log file. I'd like to get the number of
requests and of visits per hour.
I managed to get the requests, but how do I get the visits?
grunt> RAW_LOGS = LOAD '<log-file>' USING TextLoader() AS (line:chararray);
grunt> LOGS_BASE = FOREACH RAW_LOGS GENERATE
    FLATTEN(
        REGEX_EXTRACT_ALL(line, '(\\S+) (\\S+) \\[(\\d{2}/\\w{3}/\\d{4})\\:(\\d{2})\\:(\\d{2})\\:(\\d{2}) (\\+\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" (\\S+) (\\S+)')
    ) AS (
        client: chararray,
        username: chararray,
        date: chararray,
        hour: chararray,
        minute: chararray,
        second: chararray,
        timeZone: chararray,
        request: chararray,
        statusCode: int,
        bytesSent: chararray,
        referer: chararray,
        userAgent: chararray,
        remoteUser: chararray,
        timeTaken: chararray
    );
grunt> A = GROUP LOGS_BASE BY hour;
grunt> DESCRIBE A;
A: {group: chararray,LOGS_BASE: {(client: chararray,username: chararray,date: chararray,hour: chararray,minute: chararray,second: chararray,timeZone: chararray,request: chararray,statusCode: int,bytesSent: chararray,referer: chararray,userAgent: chararray,remoteUser: chararray,timeTaken: chararray)}}
grunt> B = FOREACH A GENERATE group AS hour, COUNT( $1 );
grunt> C = ORDER B BY hour; -- requests by hour
How can I now get the distinct count of clients per hour?
Thanks for your help!
--
David Riccitelli
********************************************************************************
InsideOut10 s.r.l.
P.IVA: IT-11381771002
Fax: +39 0110708239
---
LinkedIn: http://it.linkedin.com/in/riccitelli
Twitter: ziodave
---
Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
********************************************************************************
Re: calculate requests and visits (nested groups?)
Posted by David Riccitelli <da...@insideout.io>.
OK, it seems I've been able to solve it by casting the timeTaken bag to long,
i.e. (bag{tuple(long)})logs.timeTaken. I changed this:
by_hour =
    foreach (group logs by hour) {
        dist_clients = distinct logs.client;
        max_time_taken = logs.timeTaken;
        generate
            group as hour,
            COUNT(dist_clients) as num_dist_clients,
            COUNT(logs) as total_requests,
            MAX( max_time_taken );
    };
to this:
by_hour =
    foreach (group logs by hour) {
        dist_clients = distinct logs.client;
        max_time_taken = (bag{tuple(long)})logs.timeTaken;
        generate
            group as hour,
            COUNT(dist_clients) as num_dist_clients,
            COUNT(logs) as total_requests,
            MAX( max_time_taken );
    };
David
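A possible refinement of the script above, offered as an untested sketch: the hourly results quoted further down in this thread end with a (,0,0) row. One plausible cause is that some lines did not match the regular expression, so their fields (including hour) came out null; COUNT skips tuples whose first field is null, which would explain the zeros. Filtering those rows out before grouping should remove the empty group. The aliases unparsed, logs_ok and num_unparsed are made up for this example, and logs is assumed to be the parsed relation used above.

```pig
-- Assumption: logs is the relation produced by FLATTEN(REGEX_EXTRACT_ALL(...)).
-- Rows whose hour is null are taken to be lines the regex did not match.
unparsed = FILTER logs BY hour IS NULL;
logs_ok  = FILTER logs BY hour IS NOT NULL;

-- COUNT_STAR also counts tuples whose first field is null, unlike COUNT.
num_unparsed = FOREACH (GROUP unparsed ALL) GENERATE COUNT_STAR(unparsed);

-- Same aggregation as in the thread, built on the filtered relation.
by_hour =
    FOREACH (GROUP logs_ok BY hour) {
        dist_clients = DISTINCT logs_ok.client;
        GENERATE
            group AS hour,
            COUNT(dist_clients) AS num_dist_clients,
            COUNT(logs_ok) AS total_requests;
    };
```

Dumping num_unparsed gives a quick idea of how many lines were skipped, which may help decide whether the regular expression needs adjusting.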
On Fri, Aug 19, 2011 at 7:44 PM, David Riccitelli <da...@insideout.io> wrote:
> Sorry for this long sequence of messages, but I'm posting things as I
> continue testing/investigating.
>
> Could this be relevant to my case?
>
> http://www.mail-archive.com/user@pig.apache.org/msg02258.html
>
> Thanks,
> David
>
>
> On Fri, Aug 19, 2011 at 7:31 PM, David Riccitelli <da...@insideout.io> wrote:
>
>> I tried changing this line, from:
>> RAW_LOGS = LOAD '/Users/david/Documents/Work/OTT-Tunisiana/access_log_test'
>> USING TextLoader() AS (line:chararray);
>>
>> to:
>> RAW_LOGS = LOAD '/Users/david/Documents/Work/OTT-Tunisiana/access_log_test'
>> USING PigStorage() AS (line:chararray);
>>
>> It does not fix the issue, which seems to depend on the REGEX_EXTRACT_ALL
>> call that produces the logs schema.
>>
>> Is there any incompatibility between the REGEX_EXTRACT_ALL and the MAX
>> function?
>>
>> Thanks for your help,
>> David
>>
>> On Fri, Aug 19, 2011 at 7:28 PM, David Riccitelli <da...@insideout.io> wrote:
>>
>>> I noticed that this issue arises only if I load the initial data with
>>> TextLoader() and parse it with REGEX_EXTRACT_ALL.
>>>
>>> If I use PigStorage (splitting on spaces, i.e. without
>>> REGEX_EXTRACT_ALL), it works.
>>>
>>> But I need REGEX_EXTRACT_ALL in order to parse the lines correctly...
>>>
>>> Does that make sense?
>>>
>>> David
>>>
>>>
>>> On Fri, Aug 19, 2011 at 6:02 PM, David Riccitelli <da...@insideout.io> wrote:
>>>
>>>> I still can't manage to accomplish my objectives. I'm now trying to get
>>>> the max time taken so, as a test, I do:
>>>> grunt> A = GROUP logs BY client;
>>>>
>>>> then (timeTaken is long):
>>>> B = FOREACH A GENERATE group, MAX( logs.timeTaken );
>>>>
>>>> When I dump it, I get the following error:
>>>> org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing max in Initial
>>>> (...)
>>>> Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Long
>>>>     at org.apache.pig.builtin.LongMax$Initial.exec(LongMax.java:76)
>>>>
>>>> Initially I thought that some timeTaken values were not compatible with
>>>> the long data type, but I checked and re-checked. I also tried capturing
>>>> timeTaken with a \d+ regular expression.
>>>>
>>>> What am I doing wrong?
>>>>
>>>> Thanks!
>>>> David
>>>>
>>>> On Fri, Aug 19, 2011 at 5:25 PM, David Riccitelli <da...@insideout.io> wrote:
>>>>
>>>>> I tried with another log file and that does not happen, so I suppose
>>>>> there's some 'corrupted' line in the one I was testing.
>>>>>
>>>>>
>>>>> On Fri, Aug 19, 2011 at 4:56 PM, David Riccitelli <da...@insideout.io> wrote:
>>>>>
>>>>>> There's something strange in the results, however:
>>>>>> (00,129,30096)
>>>>>> (01,91,16487)
>>>>>> (02,57,11686)
>>>>>> (03,41,6041)
>>>>>> (04,30,4882)
>>>>>> (05,33,4154)
>>>>>> (06,65,8031)
>>>>>> (07,66,12260)
>>>>>> (08,95,17924)
>>>>>> (09,131,21187)
>>>>>> (10,162,26607)
>>>>>> (11,155,28503)
>>>>>> (12,146,27863)
>>>>>> (13,152,29130)
>>>>>> (14,159,32784)
>>>>>> (15,150,28898)
>>>>>> (16,143,28973)
>>>>>> (17,169,29024)
>>>>>> (18,199,26585)
>>>>>> (19,182,28803)
>>>>>> (20,224,32511)
>>>>>> (21,232,38584)
>>>>>> (22,225,39924)
>>>>>> (23,191,33606)
>>>>>> (,0,0)
>>>>>>
>>>>>>
>>>>>> What is the last line,
>>>>>> (,0,0)?
>>>>>> The count is zero, so it shouldn't really be there, correct?
>>>>>>
>>>>>> (Using Pig 0.9.0)
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> David
>>>>>>
>>>>>> On Fri, Aug 19, 2011 at 3:58 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:
>>>>>>
>>>>>>> Right, that should read "by_hour_client.num_reqs".
>>>>>>>
>>>>>>> Don't trust relative measurements you get for small data on a single
>>>>>>> computer in local mode. Things change when you start running on
>>>>>>> hundreds of gigs with real skew on a cluster.
>>>>>>>
>>>>>>> D
>>>>>>>
>>>>>>> On Fri, Aug 19, 2011 at 5:48 AM, David Riccitelli <david@insideout.io> wrote:
>>>>>>>
>>>>>>> > Thanks Dmitriy,
>>>>>>> >
>>>>>>> > The second method took less than 26 seconds on my computer (~550,000 lines).
>>>>>>> > The first method is giving me the following error:
>>>>>>> >
>>>>>>> > ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025:
>>>>>>> > <line 34, column 7> Invalid field projection. Projected field [num_reqs]
>>>>>>> > does not exist in schema:
>>>>>>> > group:chararray,by_hour_client:bag{:tuple(hour:chararray,client:chararray,num_reqs:long)}.
>>>>>>> >
>>>>>>> > when I try to set by_hour (after having set by_hour_client):
>>>>>>> >
>>>>>>> > grunt> by_hour_client =
>>>>>>> > >> foreach
>>>>>>> > >> (group logs by (hour, client))
>>>>>>> > >> generate
>>>>>>> > >> flatten(group) as (hour, client),
>>>>>>> > >> COUNT(logs) as num_reqs;
>>>>>>> > grunt> by_hour =
>>>>>>> > >> foreach
>>>>>>> > >> (group by_hour_client by hour)
>>>>>>> > >> generate
>>>>>>> > >> group as hour,
>>>>>>> > >> COUNT(by_hour_client) as num_dist_clients,
>>>>>>> > >> SUM(num_reqs) as total_requests;
>>>>>>> >
>>>>>>> > If I understood correctly, that's because num_reqs is inside the bag, as a
>>>>>>> > result of the
>>>>>>> > (group by_hour_client by hour)
>>>>>>> > correct? So I changed the last line to
>>>>>>> > SUM(by_hour_client.num_reqs) as total_requests;
>>>>>>> > and it worked (it took a little more than 29 seconds).
>>>>>>> >
>>>>>>> > Thanks for your help,
>>>>>>> > David
>>>>>>> >
>>>>>>> >
>>>>>>> > On Fri, Aug 19, 2011 at 2:51 PM, Dmitriy Ryaboy <dvryaboy@gmail.com> wrote:
>>>>>>> >
>>>>>>> > > by_hour_client =
>>>>>>> > > foreach
>>>>>>> > > (group logs by (hour, client) parallel $p)
>>>>>>> > > generate
>>>>>>> > > flatten(group) as (hour, client),
>>>>>>> > > COUNT(logs) as num_reqs;
>>>>>>> > >
>>>>>>> > > by_hour =
>>>>>>> > > foreach
>>>>>>> > > (group by_hour_client by hour parallel $p2)
>>>>>>> > > generate
>>>>>>> > > group as hour,
>>>>>>> > > COUNT(by_hour_client) as num_dist_clients,
>>>>>>> > > SUM(num_reqs) as total_requests;
>>>>>>> > >
>>>>>>> > > You can also do this using a nested distinct, but depending on what your
>>>>>>> > > data looks like, it might be a bad idea, as it can put a lot of pressure on
>>>>>>> > > individual reducers that have to do the inner distinct in memory (although
>>>>>>> > > they do push part of this up to the mappers):
>>>>>>> > >
>>>>>>> > > by_hour =
>>>>>>> > > foreach (group logs by hour) {
>>>>>>> > > dist_clients = distinct logs.client;
>>>>>>> > > generate
>>>>>>> > > group as hour,
>>>>>>> > > COUNT(dist_clients) as num_dist_clients,
>>>>>>> > > COUNT(logs) as total_requests;
>>>>>>> > > }
>>>>>>> > >
>>>>>>> > > D
>>>>>>> > >
Re: calculate requests and visits (nested groups?)
Posted by David Riccitelli <da...@insideout.io>.
Sorry for this long sequence of messages, but I'm posting things as I
continue testing/investigating.
Could this be relevant to my case?
http://www.mail-archive.com/user@pig.apache.org/msg02258.html
Thanks,
David
Re: calculate requests and visits (nested groups?)
Posted by David Riccitelli <da...@insideout.io>.
I tried changing this line, from:
RAW_LOGS = LOAD '/Users/david/Documents/Work/OTT-Tunisiana/access_log_test'
USING TextLoader() AS (line:chararray);
to:
RAW_LOGS = LOAD '/Users/david/Documents/Work/OTT-Tunisiana/access_log_test'
USING PigStorage() AS (line:chararray);
It does not fix the issue, which seems to depend on the REGEX_EXTRACT_ALL call that
produces the logs schema.
Is there any incompatibility between the REGEX_EXTRACT_ALL and the MAX
function?
Thanks for your help,
David
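In hindsight, the ClassCastException reported in this thread (java.lang.String cannot be cast to java.lang.Long) suggests that the fields extracted by REGEX_EXTRACT_ALL are still strings at runtime, whatever types the AS clause declares, which would also explain why changing the loader makes no difference. A workaround that might be worth trying, sketched here without having been run against Pig 0.9.0, is to declare every extracted field as chararray and cast the numeric ones explicitly in a second FOREACH. The aliases parsed and logs_typed are hypothetical; parsed stands for the FLATTEN(REGEX_EXTRACT_ALL(...)) output with all fields declared as chararray.

```pig
-- Assumption: parsed has the same fields as LOGS_BASE, all declared chararray.
-- Cast the numeric fields once, so later aggregates see real numbers.
logs_typed = FOREACH parsed GENERATE
    client,
    hour,
    (int) statusCode AS statusCode,
    (long) timeTaken AS timeTaken;

by_hour =
    FOREACH (GROUP logs_typed BY hour) {
        dist_clients = DISTINCT logs_typed.client;
        GENERATE
            group AS hour,
            COUNT(dist_clients) AS num_dist_clients,
            COUNT(logs_typed) AS total_requests,
            MAX(logs_typed.timeTaken) AS max_time_taken;
    };
```

If this behaves as expected, MAX receives a bag of longs and no cast is needed inside the nested foreach.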
>>>> On Fri, Aug 19, 2011 at 3:58 PM, Dmitriy Ryaboy <dv...@gmail.com>wrote:
>>>>
>>>>> Right, that should read "by_hour_client.num_reqs".
>>>>>
>>>>> Don't trust relative measurements you get for small data on a single
>>>>> computer in local mode. Things change when you start running on
>>>>> hundreds of
>>>>> gigs with real skew on a cluster.
>>>>>
>>>>> D
>>>>>
>>>>> On Fri, Aug 19, 2011 at 5:48 AM, David Riccitelli <david@insideout.io
>>>>> >wrote:
>>>>>
>>>>> > Thanks Dmitriy,
>>>>> >
>>>>> > The second method took less than 26 secs. on my computer (~550.000
>>>>> lines).
>>>>> > The first method is giving me the following error:
>>>>> >
>>>>> > ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025:
>>>>> > <line 34, column 7> Invalid field projection. Projected field
>>>>> [num_reqs]
>>>>> > does not exist in schema:
>>>>> >
>>>>> >
>>>>> group:chararray,by_hour_client:bag{:tuple(hour:chararray,client:chararray,num_reqs:long)}.
>>>>> >
>>>>> > when I try to set the by_hour (after having set the by_hour_client):
>>>>> >
>>>>> > grunt> by_hour_client =
>>>>> > >> foreach
>>>>> > >> (group logs by (hour, client))
>>>>> > >> generate
>>>>> > >> flatten(group) as (hour, client),
>>>>> > >> COUNT(logs) as num_reqs;
>>>>> > grunt> by_hour =
>>>>> > >> foreach
>>>>> > >> (group by_hour_client by hour)
>>>>> > >> generate
>>>>> > >> group as hour,
>>>>> > >> COUNT(by_hour_client) as num_dist_clients,
>>>>> > >> SUM(num_reqs) as total_requests;
>>>>> >
>>>>> > If I understood correctly that's because the num_reqs is in the bag,
>>>>> as a
>>>>> > result of the
>>>>> > * (group by_hour_client by hour)*
>>>>> > correct? So I changed the last line to
>>>>> > *SUM(by_hour_client.num_reqs) as total_requests;*
>>>>> > and it worked (it took a little more than 29 seconds).
>>>>> >
>>>>> > Thanks for your help,
>>>>> > David
>>>>> >
>>>>> >
>>>>> > On Fri, Aug 19, 2011 at 2:51 PM, Dmitriy Ryaboy <dv...@gmail.com>
>>>>> > wrote:
>>>>> >
>>>>> > > by_hour_client =
>>>>> > > foreach
>>>>> > > (group logs by (hour, client) parallel $p)
>>>>> > > generate
>>>>> > > flatten(group) as (hour, client),
>>>>> > > COUNT(logs) as num_reqs;
>>>>> > >
>>>>> > > by_hour =
>>>>> > > foreach
>>>>> > > (group by_hour_client by hour parallel $p2)
>>>>> > > generate
>>>>> > > group as hour,
>>>>> > > COUNT(by_hour_client) as num_dist_clients,
>>>>> > > SUM(num_reqs) as total_requests;
>>>>> > >
>>>>> > > You can also do this using a nested distinct, but depending on what
>>>>> your
>>>>> > > data looks like, it might be a bad idea, as it can put a lot of
>>>>> pressure
>>>>> > on
>>>>> > > individual reducers that have to do the inner distinct in memory
>>>>> > (although
>>>>> > > they do push part of this up to the mappers):
>>>>> > >
>>>>> > > by_hour =
>>>>> > > foreach (group logs by hour) {
>>>>> > > dist_clients = distinct logs.client;
>>>>> > > generate
>>>>> > > group as hour,
>>>>> > > COUNT(dist_clients) as num_dist_clients,
>>>>> > > COUNT(logs) as total_requests;
>>>>> > > }
>>>>> > >
>>>>> > > D
>>>>> > >
>>>>> > > On Fri, Aug 19, 2011 at 3:09 AM, David Riccitelli <
>>>>> david@insideout.io
>>>>> > > >wrote:
>>>>> > >
>>>>> > > > I'm analyzing a daily apache log file. I'd like to get the number
>>>>> of
>>>>> > > > requests and of visits by hour.
>>>>> > > >
>>>>> > > > I managed to get the requests, but how do I get the visits?
>>>>> > > >
>>>>> > > > grunt> RAW_LOGS = LOAD '<log-file>' USING TextLoader() AS
>>>>> > > (line:chararray);
>>>>> > > > grunt> LOGS_BASE = FOREACH RAW_LOGS GENERATE
>>>>> > > > FLATTEN(
>>>>> > > > REGEX_EXTRACT_ALL(line, '(\\S+) (\\S+)
>>>>> > > > \\[(\\d{2}/\\w{3}/\\d{4})\\:(\\d{2})\\:(\\d{2})\\:(\\d{2})
>>>>> > (\\+\\d{4})\\]
>>>>> > > > "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" (\\S+) (\\S+)')
>>>>> > > > ) AS (
>>>>> > > > client: chararray,
>>>>> > > > username: chararray,
>>>>> > > > date: chararray,
>>>>> > > > hour: chararray,
>>>>> > > > minute: chararray,
>>>>> > > > second: chararray,
>>>>> > > > timeZone: chararray,
>>>>> > > > request: chararray,
>>>>> > > > statusCode: int,
>>>>> > > > bytesSent: chararray,
>>>>> > > > referer: chararray,
>>>>> > > > userAgent: chararray,
>>>>> > > > remoteUser: chararray,
>>>>> > > > timeTaken: chararray
>>>>> > > > );
>>>>> > > > grunt> A = GROUP LOGS_BASE BY hour;
>>>>> > > > DESCRIBE A;
>>>>> > > > A: {group: chararray,LOGS_BASE: {(client: chararray,username:
>>>>> > > > chararray,date: chararray,hour: chararray,minute:
>>>>> chararray,second:
>>>>> > > > chararray,timeZone: chararray,request: chararray,statusCode:
>>>>> > > int,bytesSent:
>>>>> > > > chararray,referer: chararray,userAgent: chararray,remoteUser:
>>>>> > > > chararray,timeTaken: chararray)}}
>>>>> > > > grunt> B = FOREACH A GENERATE group AS hour, COUNT( $1 );
>>>>> > > > grunt> C = ORDER B BY hour; -- requests by hour
>>>>> > > >
>>>>> > > > How can I now get the distinct count of clients per hour?
>>>>> > > >
>>>>> > > > Thanks for your help!
>>>>> > > >
>>>>> > > > --
>>>>> > > > David Riccitelli
>>>>> > > >
>>>>> > > >
>>>>> > > >
>>>>> > >
>>>>> >
>>>>> ********************************************************************************
>>>>> > > > InsideOut10 s.r.l.
>>>>> > > > P.IVA: IT-11381771002
>>>>> > > > Fax: +39 0110708239
>>>>> > > > ---
>>>>> > > > LinkedIn: http://it.linkedin.com/in/riccitelli
>>>>> > > > Twitter: ziodave
>>>>> > > > ---
>>>>> > > > Layar Partner Network<
>>>>> > > >
>>>>> > >
>>>>> >
>>>>> http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1
>>>>> > > > >
>>>>> > > >
>>>>> > > >
>>>>> > >
>>>>> >
>>>>> ********************************************************************************
>>>>> > > >
>>>>> > > >
>>>>> > > >
>>>>> > > >
>>>>> > > > --
>>>>> > > > David Riccitelli
>>>>> > > >
>>>>> > > >
>>>>> > > >
>>>>> > >
>>>>> >
>>>>> ********************************************************************************
>>>>> > > > InsideOut10 s.r.l.
>>>>> > > > P.IVA: IT-11381771002
>>>>> > > > Fax: +39 0110708239
>>>>> > > > ---
>>>>> > > > LinkedIn: http://it.linkedin.com/in/riccitelli
>>>>> > > > Twitter: ziodave
>>>>> > > > ---
>>>>> > > > Layar Partner Network<
>>>>> > > >
>>>>> > >
>>>>> >
>>>>> http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1
>>>>> > > > >
>>>>> > > >
>>>>> > > >
>>>>> > >
>>>>> >
>>>>> ********************************************************************************
>>>>> > > >
>>>>> > >
>>>>> >
>>>>> >
>>>>> >
>>>>> > --
>>>>> > David Riccitelli
>>>>> >
>>>>> >
>>>>> >
>>>>> ********************************************************************************
>>>>> > InsideOut10 s.r.l.
>>>>> > P.IVA: IT-11381771002
>>>>> > Fax: +39 0110708239
>>>>> > ---
>>>>> > LinkedIn: http://it.linkedin.com/in/riccitelli
>>>>> > Twitter: ziodave
>>>>> > ---
>>>>> > Layar Partner Network<
>>>>> >
>>>>> http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1
>>>>> > >
>>>>> >
>>>>> >
>>>>> ********************************************************************************
>>>>> >
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> David Riccitelli
>>>>
>>>>
>>>> ********************************************************************************
>>>> InsideOut10 s.r.l.
>>>> P.IVA: IT-11381771002
>>>> Fax: +39 0110708239
>>>> ---
>>>> LinkedIn: http://it.linkedin.com/in/riccitelli
>>>> Twitter: ziodave
>>>> ---
>>>> Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>>>>
>>>> ********************************************************************************
>>>>
>>>>
>>>
>>>
>>> --
>>> David Riccitelli
>>>
>>>
>>> ********************************************************************************
>>> InsideOut10 s.r.l.
>>> P.IVA: IT-11381771002
>>> Fax: +39 0110708239
>>> ---
>>> LinkedIn: http://it.linkedin.com/in/riccitelli
>>> Twitter: ziodave
>>> ---
>>> Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>>>
>>> ********************************************************************************
>>>
>>>
>>
>>
>> --
>> David Riccitelli
>>
>>
>> ********************************************************************************
>> InsideOut10 s.r.l.
>> P.IVA: IT-11381771002
>> Fax: +39 0110708239
>> ---
>> LinkedIn: http://it.linkedin.com/in/riccitelli
>> Twitter: ziodave
>> ---
>> Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>>
>> ********************************************************************************
>>
>>
>
>
> --
> David Riccitelli
>
>
> ********************************************************************************
> InsideOut10 s.r.l.
> P.IVA: IT-11381771002
> Fax: +39 0110708239
> ---
> LinkedIn: http://it.linkedin.com/in/riccitelli
> Twitter: ziodave
> ---
> Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>
> ********************************************************************************
>
>
--
David Riccitelli
********************************************************************************
InsideOut10 s.r.l.
P.IVA: IT-11381771002
Fax: +39 0110708239
---
LinkedIn: http://it.linkedin.com/in/riccitelli
Twitter: ziodave
---
Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
********************************************************************************
Re: calculate requests and visits (nested groups?)
Posted by David Riccitelli <da...@insideout.io>.
I noticed that this issue arises only if I load the initial data with
TextLoader() and parse it with REGEX_EXTRACT_ALL.
If I use PigStorage (splitting on spaces, i.e. without
REGEX_EXTRACT_ALL), it works.
But I need REGEX_EXTRACT_ALL in order to parse the lines correctly...
Does that make sense?
David
Re: calculate requests and visits (nested groups?)
Posted by David Riccitelli <da...@insideout.io>.
I still can't manage to accomplish my objectives. I'm now trying to get the
max time taken so, as a test, I do:
grunt> A = GROUP logs BY client;
then (timeTaken is long):
B = FOREACH A GENERATE group, MAX( logs.timeTaken );
when I dump it, I get the following error:
org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing max in Initial
(...)
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Long
at org.apache.pig.builtin.LongMax$Initial.exec(LongMax.java:76)
Initially I thought that some timeTaken values were not compatible with the
long data type, but I checked and re-checked. I also capture timeTaken with a
\d+ regular expression.
What am I doing wrong?
Thanks!
David
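A possible cause, not confirmed anywhere in this thread: REGEX_EXTRACT_ALL
returns a tuple whose fields are all chararray, and a type such as long
declared in the AS clause after FLATTEN may change only the declared schema
while the underlying values stay Java Strings. MAX then picks its long
implementation from the schema and fails on the first String it sees. A
minimal sketch of a workaround, assuming LOGS_BASE is the relation from the
original post with every regex field declared as chararray, is to cast
explicitly in a separate FOREACH (max_time_taken is an illustrative name):

```pig
-- Assumption: LOGS_BASE comes from FLATTEN(REGEX_EXTRACT_ALL(...)) and
-- timeTaken is declared as chararray there, not as long.
logs = FOREACH LOGS_BASE GENERATE
    client,
    hour,
    (long) timeTaken AS timeTaken;  -- explicit chararray-to-long cast

A = GROUP logs BY client;

-- MAX now receives real long values instead of strings.
B = FOREACH A GENERATE group AS client, MAX(logs.timeTaken) AS max_time_taken;
```

Values that cannot be parsed as a long become null under the cast, and MAX
ignores nulls, so a few malformed lines should not abort the job.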
Re: calculate requests and visits (nested groups?)
Posted by David Riccitelli <da...@insideout.io>.
I tried with another log file and the empty (,0,0) row does not appear, so I
suppose there's some 'corrupted' line in the one I was testing.
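One way to check the 'corrupted line' guess outside Pig is to run the log through the same kind of pattern and list the lines it rejects. The Python sketch below is only an illustration: LINE_RE is a simplified, hypothetical pattern (not the full regex from the script) and non_matching_lines is a made-up helper name.

```python
import re

# Hypothetical, simplified pattern; substitute the full regex from the Pig script.
LINE_RE = re.compile(
    r'\S+ \S+ \[\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4}\] ".+?" .*'
)

def non_matching_lines(lines, pattern=LINE_RE):
    """Return (line_number, text) for every line the pattern rejects."""
    return [
        (number, text)
        for number, text in enumerate(lines, start=1)
        if pattern.fullmatch(text.rstrip("\n")) is None
    ]

if __name__ == "__main__":
    sample = [
        '1.2.3.4 - [19/Aug/2011:00:01:02 +0200] "GET / HTTP/1.1" 200 10\n',
        'truncated [19/Aug\n',
    ]
    for number, text in non_matching_lines(sample):
        print(number, repr(text))
```

Passing an open file object as lines works too, since the helper only iterates over it once.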
Re: calculate requests and visits (nested groups?)
Posted by David Riccitelli <da...@insideout.io>.
There's something strange in the results, however:
(00,129,30096)
(01,91,16487)
(02,57,11686)
(03,41,6041)
(04,30,4882)
(05,33,4154)
(06,65,8031)
(07,66,12260)
(08,95,17924)
(09,131,21187)
(10,162,26607)
(11,155,28503)
(12,146,27863)
(13,152,29130)
(14,159,32784)
(15,150,28898)
(16,143,28973)
(17,169,29024)
(18,199,26585)
(19,182,28803)
(20,224,32511)
(21,232,38584)
(22,225,39924)
(23,191,33606)
(,0,0)
What is the last line, (,0,0)?
Its counts are zero, so it shouldn't really be there, correct?
(Using Pig 0.9.0)
Thanks,
David
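A possible explanation for the (,0,0) row, sketched in Python rather than Pig. It rests on two assumptions about Pig that are worth verifying against the documentation: that a line the regex does not match ends up with every extracted field null, and that COUNT skips tuples whose first field is null. The pattern and the names parse, count_non_null and by_hour are hypothetical and much simpler than the real script.

```python
import re
from collections import defaultdict

# Hypothetical, simplified pattern: captures the client and the hour only.
PATTERN = re.compile(r'(\S+) \S+ \[\d{2}/\w{3}/\d{4}:(\d{2}):\d{2}:\d{2} [+-]\d{4}\].*')

def parse(line):
    """Return (client, hour); both are None when the line does not match."""
    match = PATTERN.fullmatch(line)
    return match.groups() if match else (None, None)

def count_non_null(values):
    """Assumed COUNT behaviour: entries whose value is null are not counted."""
    return sum(1 for value in values if value is not None)

def by_hour(lines):
    # Stage 1: group by (hour, client) and count requests (first field: client).
    stage1 = defaultdict(list)
    for line in lines:
        client, hour = parse(line)
        stage1[(hour, client)].append(client)
    by_hour_client = [
        (hour, client, count_non_null(clients))
        for (hour, client), clients in stage1.items()
    ]
    # Stage 2: regroup by hour (first field: hour) and sum the request counts.
    stage2 = defaultdict(list)
    for hour, _client, num_reqs in by_hour_client:
        stage2[hour].append((hour, num_reqs))
    return {
        hour: (count_non_null(h for h, _ in bag), sum(n for _, n in bag))
        for hour, bag in stage2.items()
    }

if __name__ == "__main__":
    sample = [
        '1.2.3.4 - [19/Aug/2011:00:01:02 +0200] "GET / HTTP/1.1" 200 10',
        'garbage line',
    ]
    print(by_hour(sample))
```

Under those assumptions the unparseable line forms its own group with a null key and zero counts, which is the shape of the extra row. Dropping rows with a null hour before grouping would remove it.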
Re: calculate requests and visits (nested groups?)
Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Right, that should read "by_hour_client.num_reqs".
Don't trust relative measurements you get for small data on a single
computer in local mode. Things change when you start running on hundreds of
gigs with real skew on a cluster.
D
Re: calculate requests and visits (nested groups?)
Posted by David Riccitelli <da...@insideout.io>.
Thanks Dmitriy,
The second method took less than 26 seconds on my computer (~550,000 lines).
The first method is giving me the following error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025:
<line 34, column 7> Invalid field projection. Projected field [num_reqs]
does not exist in schema:
group:chararray,by_hour_client:bag{:tuple(hour:chararray,client:chararray,num_reqs:long)}.
when I try to set the by_hour (after having set the by_hour_client):
grunt> by_hour_client =
>> foreach
>> (group logs by (hour, client))
>> generate
>> flatten(group) as (hour, client),
>> COUNT(logs) as num_reqs;
grunt> by_hour =
>> foreach
>> (group by_hour_client by hour)
>> generate
>> group as hour,
>> COUNT(by_hour_client) as num_dist_clients,
>> SUM(num_reqs) as total_requests;
If I understood correctly, that's because num_reqs is inside the bag produced by
(group by_hour_client by hour), correct? So I changed the last line to
SUM(by_hour_client.num_reqs) as total_requests;
and it worked (it took a little more than 29 seconds).
Thanks for your help,
David
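The schema in the error message can be mimicked in plain Python to show why the field has to be reached through the bag. This is only an analogy, not Pig: after grouping, each row is a (group, bag) pair, so num_reqs exists only inside the tuples of the bag. The sample rows and the names group_by_hour and summarize are invented for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Rows shaped like by_hour_client: (hour, client, num_reqs). Sample data is invented.
by_hour_client = [("00", "a", 2), ("00", "b", 1), ("01", "a", 1)]

def group_by_hour(rows):
    """Return rows shaped like the grouped relation: (group, bag of tuples)."""
    ordered = sorted(rows, key=itemgetter(0))
    return [(hour, list(bag)) for hour, bag in groupby(ordered, key=itemgetter(0))]

def summarize(grouped):
    """Per hour: number of tuples in the bag and the sum of their num_reqs."""
    result = []
    for group, bag in grouped:
        # Analogue of the projection by_hour_client.num_reqs: pull one field
        # out of every tuple in the bag before aggregating it.
        num_reqs = [row[2] for row in bag]
        result.append((group, len(bag), sum(num_reqs)))
    return result

if __name__ == "__main__":
    print(summarize(group_by_hour(by_hour_client)))
```

At the top level a grouped row has only two fields, the key and the bag, which matches the schema printed in the error.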
Re: calculate requests and visits (nested groups?)
Posted by Dmitriy Ryaboy <dv...@gmail.com>.
by_hour_client =
    foreach
        (group logs by (hour, client) parallel $p)
    generate
        flatten(group) as (hour, client),
        COUNT(logs) as num_reqs;
by_hour =
    foreach
        (group by_hour_client by hour parallel $p2)
    generate
        group as hour,
        COUNT(by_hour_client) as num_dist_clients,
        -- per the follow-up in this thread, this must read SUM(by_hour_client.num_reqs)
        SUM(num_reqs) as total_requests;
You can also do this using a nested distinct, but depending on what your
data looks like, it might be a bad idea, as it can put a lot of pressure on
individual reducers that have to do the inner distinct in memory (although
they do push part of this up to the mappers):
by_hour =
    foreach (group logs by hour) {
        dist_clients = distinct logs.client;
        generate
            group as hour,
            COUNT(dist_clients) as num_dist_clients,
            COUNT(logs) as total_requests;
    }
D
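For anyone who wants to sanity-check that the two approaches give the same numbers, here is a small Python sketch of both on in-memory data. It says nothing about how Pig executes them or about cluster performance. The function names and the sample records are hypothetical.

```python
from collections import Counter, defaultdict

def two_stage(logs):
    """logs: iterable of (hour, client) pairs; returns {hour: (clients, requests)}."""
    # Stage 1: one entry per (hour, client) with its request count.
    by_hour_client = Counter(logs)
    # Stage 2: regroup by hour; count entries and sum the per-client counts.
    by_hour = defaultdict(lambda: [0, 0])
    for (hour, _client), num_reqs in by_hour_client.items():
        by_hour[hour][0] += 1
        by_hour[hour][1] += num_reqs
    return {hour: tuple(counts) for hour, counts in by_hour.items()}

def nested_distinct(logs):
    """Same result, keeping a set of distinct clients per hour in memory."""
    clients = defaultdict(set)
    totals = Counter()
    for hour, client in logs:
        clients[hour].add(client)
        totals[hour] += 1
    return {hour: (len(clients[hour]), totals[hour]) for hour in totals}

if __name__ == "__main__":
    sample = [("00", "a"), ("00", "a"), ("00", "b"), ("01", "a")]
    print(two_stage(sample))
    print(nested_distinct(sample))
```

The second function holds every distinct client of an hour in one set, which loosely mirrors the memory pressure on a single reducer mentioned above.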
calculate requests and visits (nested groups?)
Posted by David Riccitelli <da...@insideout.io>.
I'm analyzing a daily apache log file. I'd like to get the number of
requests and of visits by hour.
I managed to get the requests, but how do I get the visits?
grunt> RAW_LOGS = LOAD '<log-file>' USING TextLoader() AS (line:chararray);
grunt> LOGS_BASE = FOREACH RAW_LOGS GENERATE
    FLATTEN(
        REGEX_EXTRACT_ALL(line, '(\\S+) (\\S+) \\[(\\d{2}/\\w{3}/\\d{4})\\:(\\d{2})\\:(\\d{2})\\:(\\d{2}) (\\+\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" (\\S+) (\\S+)')
    ) AS (
        client: chararray,
        username: chararray,
        date: chararray,
        hour: chararray,
        minute: chararray,
        second: chararray,
        timeZone: chararray,
        request: chararray,
        statusCode: int,
        bytesSent: chararray,
        referer: chararray,
        userAgent: chararray,
        remoteUser: chararray,
        timeTaken: chararray
    );
grunt> A = GROUP LOGS_BASE BY hour;
grunt> DESCRIBE A;
A: {group: chararray,LOGS_BASE: {(client: chararray,username: chararray,date: chararray,hour: chararray,minute: chararray,second: chararray,timeZone: chararray,request: chararray,statusCode: int,bytesSent: chararray,referer: chararray,userAgent: chararray,remoteUser: chararray,timeTaken: chararray)}}
grunt> B = FOREACH A GENERATE group AS hour, COUNT( $1 );
grunt> C = ORDER B BY hour; -- requests by hour
How can I now get the distinct count of clients per hour?
Thanks for your help!