Posted to dev@pig.apache.org by "siddhartha Pattanaik (JIRA)" <ji...@apache.org> on 2013/01/10 11:14:12 UTC

[jira] [Created] (PIG-3119) REGEX_EXTRACT_ALL custom with aggregation function

siddhartha Pattanaik created PIG-3119:
-----------------------------------------

             Summary: REGEX_EXTRACT_ALL custom with aggregation function
                 Key: PIG-3119
                 URL: https://issues.apache.org/jira/browse/PIG-3119
             Project: Pig
          Issue Type: Bug
          Components: build, grunt
    Affects Versions: 0.9.1
         Environment: OS -version
================================
Linux version 2.6.18-194.3.1.el5 (mockbuild@builder10.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-48))

software installed
=======================
hadoop-1.0.4
pig-0.9.1

Hardware details
====================================
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 26
model name      : Intel(R) Xeon(R) CPU           X5560  @ 2.80GHz
stepping        : 4
cpu MHz         : 2800.098
cache size      : 8192 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 11


            Reporter: siddhartha Pattanaik
            Priority: Critical
             Fix For: 0.9.1


Hi,

I have the following use case in my project.

The input file consists of lines in the following pattern:

192.168.90.36 - - [16/May/2012:16:00:11 -0700] "GET /img/explore/encyclopedia/characters/yoda_card.jpg HTTP/1.1" 200 22620 "http://www.starwars.com/explore/encyclopedia/characters/2/featured/" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0)" "Wookie-Cookie=474ca6b302a46696a1ec55f4b656f8c3; __utma=181359608.119611689.1337206567.1337206567.1337206567.1; __utmb=181359608.79.9.1337209104786; __utmc=181359608; __utmz=181359608.1337206567.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); JSESSIONID=aHX_NQheRq08" "-" 0

I want to run an aggregate function on the output of REGEX_EXTRACT_ALL to extract the desired data.
The input file appears to parse, but the aggregate function does not work on the parsed result.

Please find the Pig script below:

***************Ip_adress-count************************
Ip_adress_count.pig
 
A = LOAD 'starwar_log1' USING TextLoader AS (line:chararray);
B = FOREACH A GENERATE FLATTEN (REGEX_EXTRACT_ALL(line,'^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" "([^"]*)" (\\S+) ') ) AS 
(
remoteAddr: chararray, 
remoteLogname: chararray, 
user: chararray,  
time: chararray, 
request: chararray, 
status: int, 
bytes_string: chararray, 
referrer: chararray, 
Mozilla: chararray,
wookie_cookie: chararray,
browser3: chararray,
acess_status:int
);
C = GROUP B BY remoteAddr;
D = FOREACH C GENERATE group AS remoteAddr, COUNT(B) AS ip_adress_count;
E = ORDER D BY ip_adress_count;
STORE E INTO 'ip_adress_count/' USING PigStorage(',');
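
One thing that may be worth checking is whether the pattern really covers a whole log line. The sketch below is only a stand-in, not Pig itself: it uses Python's re.fullmatch on the assumption that REGEX_EXTRACT_ALL returns fields only when the pattern matches the entire line. The names extract_all and CANDIDATE_PATTERN are hypothetical, and the candidate pattern (a fourth quoted field, no trailing space) is a guess based on the sample line above, not a confirmed fix.

```python
import re

# Sample log line from the report, split across literals for readability.
LINE = (
    '192.168.90.36 - - [16/May/2012:16:00:11 -0700] '
    '"GET /img/explore/encyclopedia/characters/yoda_card.jpg HTTP/1.1" '
    '200 22620 '
    '"http://www.starwars.com/explore/encyclopedia/characters/2/featured/" '
    '"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0)" '
    '"Wookie-Cookie=474ca6b302a46696a1ec55f4b656f8c3; '
    '__utma=181359608.119611689.1337206567.1337206567.1337206567.1; '
    '__utmb=181359608.79.9.1337209104786; __utmc=181359608; '
    '__utmz=181359608.1337206567.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); '
    'JSESSIONID=aHX_NQheRq08" "-" 0'
)

# Pattern from the script, with Pig's doubled backslashes written as single
# ones. Note the trailing space after the last group.
SCRIPT_PATTERN = (
    r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(.+?)" (\S+) (\S+) '
    r'"([^"]*)" "([^"]*)" "([^"]*)" (\S+) '
)

# Hypothetical variant (an assumption, not from the report): four quoted
# fields before the final token, and no trailing space.
CANDIDATE_PATTERN = (
    r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(.+?)" (\S+) (\S+) '
    r'"([^"]*)" "([^"]*)" "([^"]*)" "([^"]*)" (\S+)'
)


def extract_all(pattern, line):
    """Return all captured groups if the whole line matches, else None."""
    match = re.fullmatch(pattern, line)
    return match.groups() if match else None


if __name__ == "__main__":
    # Number of capture groups in each pattern versus the 12 declared fields.
    print(re.compile(SCRIPT_PATTERN).groups)
    print(re.compile(CANDIDATE_PATTERN).groups)
    # Whole-line match results for the sample line.
    print(extract_all(SCRIPT_PATTERN, LINE))
    print(extract_all(CANDIDATE_PATTERN, LINE))
```

Under these assumptions the script's pattern has 11 capture groups against 12 declared fields and does not match the whole sample line, so every extracted field, including remoteAddr, could come back null before the GROUP and COUNT steps run.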

Expected output
===========================

ip_adress_count
remoteAddr,ip_adress_count

192.168.90.36,19
192.168.90.37,1
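
As a rough illustration of the result being asked for, the grouping and counting can be sketched outside Pig. This is a minimal sketch, not the Pig implementation: count_by_ip is a hypothetical helper, the input list is made up to reproduce the totals listed above, and skipping missing addresses is an assumption about how unparsed lines should be treated.

```python
from collections import Counter


def count_by_ip(remote_addrs):
    """Count occurrences of each address, largest count first.

    Addresses that are None (for example, lines that failed to parse)
    are skipped; that choice is an assumption of this sketch.
    """
    counts = Counter(addr for addr in remote_addrs if addr is not None)
    # Sort by count, descending, to mirror the order of the listing above.
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up input: 19 requests from one address and 1 from another.
    addrs = ["192.168.90.36"] * 19 + ["192.168.90.37"]
    for addr, count in count_by_ip(addrs):
        print(f"{addr},{count}")
```

The listing above shows the larger count first, so the sketch sorts in descending order. In Pig, ORDER sorts ascending unless DESC is given.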

There appears to be no parsing issue, but the aggregate function COUNT() does not work on the output of REGEX_EXTRACT_ALL.

Please look into this. The requirement is to get the count of each IP address in the input data.

thanks,
siddharth
contact -8763666372


