Posted to user@pig.apache.org by Shuai Zheng <sz...@gmail.com> on 2015/12/16 22:50:24 UTC

How to partially disable sampling in PIG order by process

Hi All,

I use Pig to process some of my data, and I have run into an issue.

I have a lot of data. I want it sorted and also grouped by key, then written
to files (so that other Java programs can process them later).

For example, my data has the columns: col-k1, col-k2, col-v1

I want the data ordered by col-k1 and col-k2 and, at the same time, the
output files split by the key col-k1 only.
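
To make the requirement concrete, here is a rough sketch of the intended
result in plain Python. It is not Pig, and the sample rows and the name
sort_and_split are made up for illustration only.

```python
from collections import defaultdict


def sort_and_split(rows):
    """Sort rows by (col_k1, col_k2), then put each col_k1 in its own bucket."""
    buckets = defaultdict(list)
    # Sorting first means every bucket is filled in (col_k1, col_k2) order.
    for k1, k2, v1 in sorted(rows, key=lambda r: (r[0], r[1])):
        buckets[k1].append((k1, k2, v1))
    return dict(buckets)


# Hypothetical sample rows: (col_k1, col_k2, col_v1)
rows = [
    ("e", 2, "v3"),
    ("a", 1, "v1"),
    ("e", 1, "v2"),
    ("c", 5, "v4"),
]

result = sort_and_split(rows)
for k1, bucket in result.items():
    print(k1, bucket)
```

Each bucket stands for one output file: rows are ordered by col-k1 and then
col-k2, and every row with the same col-k1 lands in the same bucket.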

The ORDER BY behavior I found is shown below:

[inline image: image001.png]

I like the idea of sampling, but how can I still force all e rows (in the
example above) into one file? I want a balanced result set, but I don't want
a single key to go to different reducers. How can I do this simply in Pig?

I know I can do this with my own MR code, but that would be quite troublesome
because I have a lot of similar requirements. Does anyone have any ideas?

BTW: I remember that this sampling is a feature that was added later (earlier
versions did not have it, and the result was always grouped by key). Is there
any parameter I can use to tune or disable this feature in Pig?

Regards,

Shuai