You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by Alan Gates <ga...@yahoo-inc.com> on 2011/04/01 01:45:33 UTC
Re: transform sql to pig
In Pig Latin data flows are more linear than SQL, so SQL subqueries
tend to come first in Pig Latin scripts. Given that, your first SQL
query would look roughly like:
-- Find the records from s that will form the 'in' clause
A = load 'supplier' as (<whatever it's schema is>); --
B = filter A by s_comment matches '.*Customer.*Complaints.*';
C = foreach B generate s_suppkey; -- project out everything but the
field you're interested in
D = load 'partsupp' as (<schema>);
E = load 'part' as (<schema>);
F = join D by ps_partkey, E by p_partkey; -- do the join
G = filter F by p_brand != '[BRAND]'and p_type not matches
'[TYPE].*'and p_size == '[SIZE1]' or p_size == '[SIZE2]' ... -- do the
big harry filter, pretty much the same except Pig doesn't support the
IN syntax
-- Rewrite the subquery as an anti-join
H = cogroup G by ps_suppkey, C by s_suppkey);
I = filter H by COUNT(C) == 0;
J = foreach I generate flatten(G);
-- Now do the group by
K = group J by _brand,p_type,p_sizeorder bysupplier_cnt
desc,p_brand,p_type,p_size;
L = foreach K {
L1 = J.ps_suppkey;
L2 = distinct L1;
generate group, COUNT(L2);
}
I think that will more or less do it.
Alan.
On Mar 27, 2011, at 4:48 AM, hiba houimli wrote:
>
> Hi,
> I like to know how I can transform a query sql to a script pig
> especially when I have a query like that :
> selectp_brand,p_type,p_size,count(distinct ps_suppkey) as
> supplier_cntfrompartsupp,partwherep_partkey = ps_partkeyand p_brand
> <> '[BRAND]'and p_type not like '[TYPE]%'and p_size in ([SIZE1],
> [SIZE2], [SIZE3], [SIZE4], [SIZE5], [SIZE6], [SIZE7], [SIZE8])and
> ps_suppkey not in (selects_suppkeyfromsupplierwheres_comment like
> '%Customer%Complaints%')group byp_brand,p_type,p_sizeorder
> bysupplier_cnt desc,p_brand,p_type,p_size;
> or like that:
> selectc_count, count(*) as custdistfrom
> (selectc_custkey,count(o_orderkey)fromcustomer left outer join
> orders onc_custkey = o_custkeyand o_comment not like ‘%[WORD1]%
> [WORD2]%’group byc_custkey)as c_orders (c_custkey, c_count)group
> byc_countorder bycustdist desc,c_count desc;
>
> and I like to know how I transform the clause exists, in or not in
> to a pig operator and thank you ?
>
>