You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by hiba houimli <ho...@hotmail.fr> on 2011/03/27 13:48:59 UTC

transform sql to pig

Hi,
I like to know how I can transform a query sql  to a script pig especially when I have a query like that :
selectp_brand,p_type,p_size,count(distinct ps_suppkey) as supplier_cntfrompartsupp,partwherep_partkey = ps_partkeyand p_brand <> '[BRAND]'and p_type not like '[TYPE]%'and p_size in ([SIZE1], [SIZE2], [SIZE3], [SIZE4], [SIZE5], [SIZE6], [SIZE7], [SIZE8])and ps_suppkey not in (selects_suppkeyfromsupplierwheres_comment like '%Customer%Complaints%')group byp_brand,p_type,p_sizeorder bysupplier_cnt desc,p_brand,p_type,p_size;
or like that:
selectc_count, count(*) as custdistfrom (selectc_custkey,count(o_orderkey)fromcustomer left outer join orders onc_custkey = o_custkeyand o_comment not like ‘%[WORD1]%[WORD2]%’group byc_custkey)as c_orders (c_custkey, c_count)group byc_countorder bycustdist desc,c_count desc;

and I like to know how I transform the clause exists, in or not in to a pig operator and thank you ? 

 		 	   		  

Re: transform sql to pig

Posted by Alan Gates <ga...@yahoo-inc.com>.
In Pig Latin data flows are more linear than SQL, so SQL subqueries  
tend to come first in Pig Latin scripts.  Given that, your first SQL  
query would look roughly like:

-- Find the records from s that will form the 'in' clause
A = load 'supplier' as (<whatever it's schema is>); --
B = filter A by s_comment matches '.*Customer.*Complaints.*';
C = foreach B generate s_suppkey; -- project out everything but the  
field you're interested in

D = load 'partsupp' as (<schema>);
E = load 'part' as (<schema>);
F = join D by ps_partkey, E by p_partkey; -- do the join
G = filter F by p_brand != '[BRAND]'and p_type not matches  
'[TYPE].*'and p_size == '[SIZE1]' or p_size == '[SIZE2]' ... -- do the  
big harry filter, pretty much the same except Pig doesn't support the  
IN syntax

-- Rewrite the subquery as an anti-join
H = cogroup G by ps_suppkey, C by s_suppkey);
I = filter H by COUNT(C) == 0;
J = foreach I generate flatten(G);

-- Now do the group by
K = group J by _brand,p_type,p_sizeorder bysupplier_cnt  
desc,p_brand,p_type,p_size;
L = foreach K {
	L1 = J.ps_suppkey;
	L2 = distinct L1;
	generate group, COUNT(L2);
}

I think that will more or less do it.

Alan.

On Mar 27, 2011, at 4:48 AM, hiba houimli wrote:

>
> Hi,
> I like to know how I can transform a query sql  to a script pig  
> especially when I have a query like that :
> selectp_brand,p_type,p_size,count(distinct ps_suppkey) as  
> supplier_cntfrompartsupp,partwherep_partkey = ps_partkeyand p_brand  
> <> '[BRAND]'and p_type not like '[TYPE]%'and p_size in ([SIZE1],  
> [SIZE2], [SIZE3], [SIZE4], [SIZE5], [SIZE6], [SIZE7], [SIZE8])and  
> ps_suppkey not in (selects_suppkeyfromsupplierwheres_comment like  
> '%Customer%Complaints%')group byp_brand,p_type,p_sizeorder  
> bysupplier_cnt desc,p_brand,p_type,p_size;
> or like that:
> selectc_count, count(*) as custdistfrom  
> (selectc_custkey,count(o_orderkey)fromcustomer left outer join  
> orders onc_custkey = o_custkeyand o_comment not like ‘%[WORD1]% 
> [WORD2]%’group byc_custkey)as c_orders (c_custkey, c_count)group  
> byc_countorder bycustdist desc,c_count desc;
>
> and I like to know how I transform the clause exists, in or not in  
> to a pig operator and thank you ?
>
>