Posted to dev@pig.apache.org by "Rohit Laddha (JIRA)" <ji...@apache.org> on 2014/02/28 08:44:20 UTC

[jira] [Commented] (PIG-3119) Aggregation not working in conjunction with REGEX_EXTRACT_ALL

    [ https://issues.apache.org/jira/browse/PIG-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915532#comment-13915532 ] 

Rohit Laddha commented on PIG-3119:
-----------------------------------

I think the problem is not the aggregation used with REGEX_EXTRACT_ALL. The problem is that B does not have the expected output: it contains empty tuples. So the problem lies in the REGEX_EXTRACT_ALL step, which is not giving the expected output.
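
One possible explanation for the empty tuples can be checked outside Pig. The sketch below is Python, not Pig, and rests on an assumption: that REGEX_EXTRACT_ALL only returns groups when the pattern matches the whole line (the behaviour of java.util.regex.Matcher.matches()), which re.fullmatch imitates here. The names LINE, REPORTED, ADJUSTED and extract_all are illustrative only, and ADJUSTED is a hypothetical change to the pattern, not a confirmed fix.

```python
import re

# Sample line from the report; the cookie value is shortened for readability.
LINE = (
    '192.168.90.36 - - [16/May/2012:16:00:11 -0700] '
    '"GET /img/explore/encyclopedia/characters/yoda_card.jpg HTTP/1.1" 200 22620 '
    '"http://www.starwars.com/explore/encyclopedia/characters/2/featured/" '
    '"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0)" '
    '"Wookie-Cookie=474ca6b302a46696a1ec55f4b656f8c3; JSESSIONID=aHX_NQheRq08" "-" 0'
)

# Pattern from the script with Pig's doubled backslashes undone.
# It has 11 groups and ends with a literal space.
REPORTED = (
    r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(.+?)" (\S+) (\S+) '
    r'"([^"]*)" "([^"]*)" "([^"]*)" (\S+) '
)

# Hypothetical adjustment: a 12th group after the trailing space,
# so the final "0" on the line is consumed as well.
ADJUSTED = REPORTED + r'(\S+)'


def extract_all(pattern, line):
    """Return all groups if the pattern matches the whole line, else None."""
    m = re.fullmatch(pattern, line)
    return m.groups() if m else None


if __name__ == '__main__':
    print(extract_all(REPORTED, LINE))       # whole-line match fails
    print(len(extract_all(ADJUSTED, LINE)))  # number of extracted fields
```

Under that assumption, the reported pattern stops before the final field of the line, so no whole-line match exists and every row of B would come out empty. It also defines 11 groups while the AS clause declares 12 fields. If this is the cause, the count itself is fine and only the pattern needs changing.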

> Aggregation not working in conjunction with REGEX_EXTRACT_ALL
> -------------------------------------------------------------
>
>                 Key: PIG-3119
>                 URL: https://issues.apache.org/jira/browse/PIG-3119
>             Project: Pig
>          Issue Type: Bug
>          Components: build, grunt
>    Affects Versions: 0.9.1
>         Environment: OS -version
> ================================
> Linux version 2.6.18-194.3.1.el5 (mockbuild@builder10.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-48))
> software installed
> =======================
> hadoop-1.0.4
> pig-0.9.1
> Hardware details
> ====================================
> processor       : 0
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 26
> model name      : Intel(R) Xeon(R) CPU           X5560  @ 2.80GHz
> stepping        : 4
> cpu MHz         : 2800.098
> cache size      : 8192 KB
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 11
>            Reporter: siddhartha Pattanaik
>            Priority: Critical
>              Labels: newbie
>             Fix For: 0.9.1
>
>         Attachments: starwar_log1.txt
>
>   Original Estimate: 276h
>  Remaining Estimate: 276h
>
> Hi,
> I have a use case in my project.
> The input file consists of lines in the following pattern:
> 192.168.90.36 - - [16/May/2012:16:00:11 -0700] "GET /img/explore/encyclopedia/characters/yoda_card.jpg HTTP/1.1" 200 22620 "http://www.starwars.com/explore/encyclopedia/characters/2/featured/" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0)" "Wookie-Cookie=474ca6b302a46696a1ec55f4b656f8c3; __utma=181359608.119611689.1337206567.1337206567.1337206567.1; __utmb=181359608.79.9.1337209104786; __utmc=181359608; __utmz=181359608.1337206567.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); JSESSIONID=aHX_NQheRq08" "-" 0
> I want to run an aggregate function along with REGEX_EXTRACT_ALL to extract the desired data.
> Even though the input file is parsed, I have an issue with the aggregate function working on it.
> Please find the Pig script below:
> ***************Ip_adress-count************************
> Ip_adress_count.pig
>  
> A = LOAD 'starwar_log1' USING TextLoader AS (line:chararray);
> B = FOREACH A GENERATE FLATTEN (REGEX_EXTRACT_ALL(line,'^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" "([^"]*)" (\\S+) ') ) AS 
> (
> remoteAddr: chararray, 
> remoteLogname: chararray, 
> user: chararray,  
> time: chararray, 
> request: chararray, 
> status: int, 
> bytes_string: chararray, 
> referrer: chararray, 
> Mozilla: chararray,
> wookie_cookie: chararray,
> browser3: chararray,
> acess_status:int
> );
> C = group B by remoteAddr;
> D = foreach C generate COUNT(B) as ip_adress_count;
> E = order D by ip_adress_count;
> STORE E INTO 'ip_adress_count/' USING PigStorage(',');
> Expected output
> ===========================
> ip_adress_count
> remoteAddr,ip_adress_count
> 192.168.90.36,19
> 192.168.90.37,1
> There is no parsing issue, but the aggregate function COUNT() is not working over the output of REGEX_EXTRACT_ALL.
> Please do the needful. The requirement is that I need the count of each IP address in the input data.
> thanks,
> siddharth
> contact -8763666372


