Posted to dev@pig.apache.org by "siddhartha Pattanaik (JIRA)" <ji...@apache.org> on 2013/01/10 11:14:12 UTC
[jira] [Created] (PIG-3119) REGEX_EXTRACT_ALL custom with aggregation function
siddhartha Pattanaik created PIG-3119:
-----------------------------------------
Summary: REGEX_EXTRACT_ALL custom with aggregation function
Key: PIG-3119
URL: https://issues.apache.org/jira/browse/PIG-3119
Project: Pig
Issue Type: Bug
Components: build, grunt
Affects Versions: 0.9.1
Environment: OS -version
================================
Linux version 2.6.18-194.3.1.el5 (mockbuild@builder10.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-48))
software installed
=======================
hadoop-1.0.4
pig-0.9.1
Hardware details
====================================
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU X5560 @ 2.80GHz
stepping : 4
cpu MHz : 2800.098
cache size : 8192 KB
fpu : yes
fpu_exception : yes
cpuid level : 11
Reporter: siddhartha Pattanaik
Priority: Critical
Fix For: 0.9.1
Hi,
I have a use case in my project.
The input file consists of lines in the following pattern:
192.168.90.36 - - [16/May/2012:16:00:11 -0700] "GET /img/explore/encyclopedia/characters/yoda_card.jpg HTTP/1.1" 200 22620 "http://www.starwars.com/explore/encyclopedia/characters/2/featured/" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0)" "Wookie-Cookie=474ca6b302a46696a1ec55f4b656f8c3; __utma=181359608.119611689.1337206567.1337206567.1337206567.1; __utmb=181359608.79.9.1337209104786; __utmc=181359608; __utmz=181359608.1337206567.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); JSESSIONID=aHX_NQheRq08" "-" 0
I want to run an aggregate function on the output of REGEX_EXTRACT_ALL to extract the desired data.
Although the input file is parsed, the aggregate function does not work on the result.
Please find the Pig script below:
***************Ip_adress-count************************
Ip_adress_count.pig
A = LOAD 'starwar_log1' USING TextLoader AS (line:chararray);
B = FOREACH A GENERATE FLATTEN (REGEX_EXTRACT_ALL(line,'^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" "([^"]*)" (\\S+) ') ) AS
(
remoteAddr: chararray,
remoteLogname: chararray,
user: chararray,
time: chararray,
request: chararray,
status: int,
bytes_string: chararray,
referrer: chararray,
Mozilla: chararray,
wookie_cookie: chararray,
browser3: chararray,
acess_status:int
);
C = group B by remoteAddr;
D = foreach C generate COUNT(B) as ip_adress_count;
E = order D by ip_adress_count;
STORE E INTO 'ip_adress_count/' using PigStorage(',');
Expected output
===========================
ip_adress_count
remoteAddr,ip_adress_count
192.168.90.36,19
192.168.90.37,1
There is no parsing issue, but the aggregate function COUNT() does not work on the output of REGEX_EXTRACT_ALL.
Please do the needful. The requirement is a count of the occurrences of each IP address in the input data.
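A point that may be worth checking, offered as a hedged note and not a confirmed diagnosis: the pattern in the script has 11 capture groups while the AS clause declares 12 fields, and the pattern ends with a literal space. If REGEX_EXTRACT_ALL requires the whole line to match (as Java's Matcher.matches() does, which is an assumption here), a line that ends in 0 with no trailing space would produce null fields. The Python sketch below only approximates that behaviour with re.fullmatch on the sample line. The names REPORTED, ADJUSTED, FIELDS and LINE are illustrative, and ADJUSTED is a hypothetical 12-group variant, not a verified fix.

```python
import re

# Pattern from the script, with Pig's doubled backslashes reduced to single ones.
# Note the literal space at the very end.
REPORTED = re.compile(
    r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(.+?)" (\S+) (\S+) '
    r'"([^"]*)" "([^"]*)" "([^"]*)" (\S+) '
)

# Hypothetical variant (an assumption, not from the report): a fourth quoted
# field is added and the trailing space is dropped, giving 12 groups.
ADJUSTED = re.compile(
    r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(.+?)" (\S+) (\S+) '
    r'"([^"]*)" "([^"]*)" "([^"]*)" "([^"]*)" (\S+)$'
)

# Field names declared in the AS clause of the script.
FIELDS = [
    "remoteAddr", "remoteLogname", "user", "time", "request", "status",
    "bytes_string", "referrer", "Mozilla", "wookie_cookie", "browser3",
    "acess_status",
]

# Sample log line from the report, assumed to have no trailing whitespace.
LINE = (
    '192.168.90.36 - - [16/May/2012:16:00:11 -0700] '
    '"GET /img/explore/encyclopedia/characters/yoda_card.jpg HTTP/1.1" 200 22620 '
    '"http://www.starwars.com/explore/encyclopedia/characters/2/featured/" '
    '"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0)" '
    '"Wookie-Cookie=474ca6b302a46696a1ec55f4b656f8c3; '
    '__utma=181359608.119611689.1337206567.1337206567.1337206567.1; '
    '__utmb=181359608.79.9.1337209104786; __utmc=181359608; '
    '__utmz=181359608.1337206567.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); '
    'JSESSIONID=aHX_NQheRq08" "-" 0'
)

for name, pattern in (("reported", REPORTED), ("adjusted", ADJUSTED)):
    whole = pattern.fullmatch(LINE)  # whole-line match, like Matcher.matches()
    print(name, "groups:", pattern.groups, "fields:", len(FIELDS),
          "matches whole line:", whole is not None)
    if whole is not None:
        # Pair each declared field name with the captured value.
        print(dict(zip(FIELDS, whole.groups()))["remoteAddr"])
```

Under these assumptions the reported pattern matches only a prefix of the sample line, while the adjusted one matches the whole line and yields one value per declared field. Whether this explains the COUNT result on the actual data would need to be confirmed in Pig, for example by running DUMP B before the GROUP step.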
thanks,
siddharth
contact -8763666372