You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by hiba houimli <ho...@hotmail.fr> on 2011/03/27 13:48:59 UTC
transform sql to pig
Hi,
I like to know how I can transform a query sql to a script pig especially when I have a query like that :
selectp_brand,p_type,p_size,count(distinct ps_suppkey) as supplier_cntfrompartsupp,partwherep_partkey = ps_partkeyand p_brand <> '[BRAND]'and p_type not like '[TYPE]%'and p_size in ([SIZE1], [SIZE2], [SIZE3], [SIZE4], [SIZE5], [SIZE6], [SIZE7], [SIZE8])and ps_suppkey not in (selects_suppkeyfromsupplierwheres_comment like '%Customer%Complaints%')group byp_brand,p_type,p_sizeorder bysupplier_cnt desc,p_brand,p_type,p_size;
or like that:
selectc_count, count(*) as custdistfrom (selectc_custkey,count(o_orderkey)fromcustomer left outer join orders onc_custkey = o_custkeyand o_comment not like ‘%[WORD1]%[WORD2]%’group byc_custkey)as c_orders (c_custkey, c_count)group byc_countorder bycustdist desc,c_count desc;
and I like to know how I transform the clause exists, in or not in to a pig operator and thank you ?
Re: transform sql to pig
Posted by Alan Gates <ga...@yahoo-inc.com>.
In Pig Latin data flows are more linear than SQL, so SQL subqueries
tend to come first in Pig Latin scripts. Given that, your first SQL
query would look roughly like:
-- Find the records from s that will form the 'in' clause
A = load 'supplier' as (<whatever it's schema is>); --
B = filter A by s_comment matches '.*Customer.*Complaints.*';
C = foreach B generate s_suppkey; -- project out everything but the
field you're interested in
D = load 'partsupp' as (<schema>);
E = load 'part' as (<schema>);
F = join D by ps_partkey, E by p_partkey; -- do the join
G = filter F by p_brand != '[BRAND]'and p_type not matches
'[TYPE].*'and p_size == '[SIZE1]' or p_size == '[SIZE2]' ... -- do the
big harry filter, pretty much the same except Pig doesn't support the
IN syntax
-- Rewrite the subquery as an anti-join
H = cogroup G by ps_suppkey, C by s_suppkey);
I = filter H by COUNT(C) == 0;
J = foreach I generate flatten(G);
-- Now do the group by
K = group J by _brand,p_type,p_sizeorder bysupplier_cnt
desc,p_brand,p_type,p_size;
L = foreach K {
L1 = J.ps_suppkey;
L2 = distinct L1;
generate group, COUNT(L2);
}
I think that will more or less do it.
Alan.
On Mar 27, 2011, at 4:48 AM, hiba houimli wrote:
>
> Hi,
> I like to know how I can transform a query sql to a script pig
> especially when I have a query like that :
> selectp_brand,p_type,p_size,count(distinct ps_suppkey) as
> supplier_cntfrompartsupp,partwherep_partkey = ps_partkeyand p_brand
> <> '[BRAND]'and p_type not like '[TYPE]%'and p_size in ([SIZE1],
> [SIZE2], [SIZE3], [SIZE4], [SIZE5], [SIZE6], [SIZE7], [SIZE8])and
> ps_suppkey not in (selects_suppkeyfromsupplierwheres_comment like
> '%Customer%Complaints%')group byp_brand,p_type,p_sizeorder
> bysupplier_cnt desc,p_brand,p_type,p_size;
> or like that:
> selectc_count, count(*) as custdistfrom
> (selectc_custkey,count(o_orderkey)fromcustomer left outer join
> orders onc_custkey = o_custkeyand o_comment not like ‘%[WORD1]%
> [WORD2]%’group byc_custkey)as c_orders (c_custkey, c_count)group
> byc_countorder bycustdist desc,c_count desc;
>
> and I like to know how I transform the clause exists, in or not in
> to a pig operator and thank you ?
>
>