Posted to user@pig.apache.org by patcharee <Pa...@uni.no> on 2015/05/29 10:35:18 UTC

pig performance on reading/filtering orc file

Hi,

I am using Pig 0.14 to work on a partitioned ORC file, and I am trying to 
improve the performance of my script. I am curious why filtering at the 
beginning (approach 1) does not help, and in fact takes longer than a 
replicated join (approach 2). The filter is supposed to greatly reduce the 
amount of data read from the ORC file. Is this related to how the ORC file 
is partitioned? Any guidelines or suggestions are appreciated.

---------------
Approach 1
---------------
coordinate = LOAD 'coordinate' USING org.apache.hive.hcatalog.pig.HCatLoader();
coordinate_zone = FILTER coordinate BY zone == 2;
....
coordinate_xy = LIMIT coordinate_zone 1;

rawdata_u = LOAD 'u' USING org.apache.hive.hcatalog.pig.HCatLoader();
rawdata_u_1 = foreach rawdata_u generate 
date,hh,(double)xlong_u,(double)xlat_u,height,u,zone,year,month;
u_filter = FILTER rawdata_u_1 by zone == 2;

/**** Here I filter early and expect better performance, but it does 
not help ****/
u_filter = FILTER u_filter by xlong_u == coordinate_xy.xlong_u and 
xlat_u == coordinate_xy.xlat_u;

---------------
Approach 2
---------------
coordinate = LOAD 'coordinate' USING org.apache.hive.hcatalog.pig.HCatLoader();
coordinate_zone = FILTER coordinate BY zone == 2;
....
coordinate_xy = LIMIT coordinate_zone 1;

rawdata_u = LOAD 'u' USING org.apache.hive.hcatalog.pig.HCatLoader();
u_filter = FILTER rawdata_u by zone == 2;
join_u_coordinate_cossin = join u_filter by (xlong_u, xlat_u), 
coordinate_xy by (xlong_u, xlat_u) USING 'replicated';


Best,
Patcharee