You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by Alan Gates <ga...@yahoo-inc.com> on 2011/04/01 01:45:33 UTC

Re: transform sql to pig

In Pig Latin data flows are more linear than SQL, so SQL subqueries  
tend to come first in Pig Latin scripts.  Given that, your first SQL  
query would look roughly like:

-- Find the records from s that will form the 'in' clause
A = load 'supplier' as (<whatever it's schema is>); --
B = filter A by s_comment matches '.*Customer.*Complaints.*';
C = foreach B generate s_suppkey; -- project out everything but the  
field you're interested in

D = load 'partsupp' as (<schema>);
E = load 'part' as (<schema>);
F = join D by ps_partkey, E by p_partkey; -- do the join
G = filter F by p_brand != '[BRAND]'and p_type not matches  
'[TYPE].*'and p_size == '[SIZE1]' or p_size == '[SIZE2]' ... -- do the  
big harry filter, pretty much the same except Pig doesn't support the  
IN syntax

-- Rewrite the subquery as an anti-join
H = cogroup G by ps_suppkey, C by s_suppkey);
I = filter H by COUNT(C) == 0;
J = foreach I generate flatten(G);

-- Now do the group by
K = group J by _brand,p_type,p_sizeorder bysupplier_cnt  
desc,p_brand,p_type,p_size;
L = foreach K {
	L1 = J.ps_suppkey;
	L2 = distinct L1;
	generate group, COUNT(L2);
}

I think that will more or less do it.

Alan.

On Mar 27, 2011, at 4:48 AM, hiba houimli wrote:

>
> Hi,
> I like to know how I can transform a query sql  to a script pig  
> especially when I have a query like that :
> selectp_brand,p_type,p_size,count(distinct ps_suppkey) as  
> supplier_cntfrompartsupp,partwherep_partkey = ps_partkeyand p_brand  
> <> '[BRAND]'and p_type not like '[TYPE]%'and p_size in ([SIZE1],  
> [SIZE2], [SIZE3], [SIZE4], [SIZE5], [SIZE6], [SIZE7], [SIZE8])and  
> ps_suppkey not in (selects_suppkeyfromsupplierwheres_comment like  
> '%Customer%Complaints%')group byp_brand,p_type,p_sizeorder  
> bysupplier_cnt desc,p_brand,p_type,p_size;
> or like that:
> selectc_count, count(*) as custdistfrom  
> (selectc_custkey,count(o_orderkey)fromcustomer left outer join  
> orders onc_custkey = o_custkeyand o_comment not like ‘%[WORD1]% 
> [WORD2]%’group byc_custkey)as c_orders (c_custkey, c_count)group  
> byc_countorder bycustdist desc,c_count desc;
>
> and I like to know how I transform the clause exists, in or not in  
> to a pig operator and thank you ?
>
>