Posted to user@pig.apache.org by Something Something <ma...@gmail.com> on 2013/05/22 09:21:37 UTC

pig.keyDistFile

Hello,

Our data is skewed, so we are using a 'skewed' join, but the join is still
taking a long time.  From the documentation, it appears that Pig samples the
data and creates a file that is passed via the 'pig.keyDistFile' config
property.  It also appears that, for our data, this sample is a bit biased.

The data for this join is fairly static, and we think we can create a better
sample ourselves.  Our questions are:

1)  If we pass -Dpig.keyDistFile=/path/to/our data, would that work?
2)  Is the format of this file:

key, from, to   e.g.   key1, 0, 4   (would key1 then be distributed to the
first 4 reducers?)

3)  We want to specify only 10 keys in this file.  The rest can go through
normal processing, so should we just omit them from this file?

4)  Feel free to say this is a terrible idea and that we shouldn't do it :-)
But then please suggest a better one ;)
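
To make question 3 concrete, here is a minimal sketch of how the 10 heaviest
join keys could be picked from a delimited dump of the join column.  The
tab-separated input layout, the function name and the sample rows are
assumptions for illustration only; the sketch says nothing about the format
Pig actually expects in 'pig.keyDistFile'.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def heaviest_keys(lines: Iterable[str], key_index: int = 0,
                  top_n: int = 10, sep: str = "\t") -> List[Tuple[str, int]]:
    """Return the top_n most frequent join keys with their row counts.

    Hypothetical helper: assumes one record per line, fields split by `sep`,
    and the join key at position `key_index`.
    """
    counts: Counter = Counter()
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        fields = line.rstrip("\n").split(sep)
        if key_index < len(fields):  # skip rows too short to hold the key
            counts[fields[key_index]] += 1
    return counts.most_common(top_n)


if __name__ == "__main__":
    # Made-up sample rows; real input would be a dump of the join column.
    sample = ["a\t1", "a\t2", "b\t3", "a\t4", "c\t5", "b\t6"]
    for key, n in heaviest_keys(sample, top_n=2):
        print(key, n)
```

Keys that do not make the top 10 are simply left out of the result, which is
the "omit them" behaviour the question asks about.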

Thanks for your time.