Posted to dev@pig.apache.org by Gang Luo <lg...@yahoo.com.cn> on 2010/06/25 15:32:15 UTC

compile load to mr plan

Hi,
Multiple load operators in a script start the same number of streams. Some of these streams are merged later (e.g. by a join) and some are not. How do we know which MR operator each load should be placed in? For example, suppose we have a script like this:
a = load file1
b = load file2
..
dump

If we join a and b between the loads and the dump, the two loads (a and b) should be placed in the same MR operator. If we sort a and b independently, the two loads should be placed in separate MR operators. How do we identify whether these two streams are correlated?
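
To make the question concrete, here is a minimal sketch of the kind of check I have in mind, assuming the plan is given as a list of (producer, consumer) edges. The function name group_loads and the operator names are hypothetical and only for illustration; this is not how Pig's MR compiler is implemented. It only tells whether two load streams ever meet in the plan, not where the MR job boundaries should fall.

```python
from collections import defaultdict


def group_loads(edges, loads):
    """Group load operators whose streams meet somewhere in the plan.

    edges: list of (producer, consumer) pairs in the dataflow plan.
    loads: names of the load operators.
    Hypothetical helper for illustration; not part of Pig.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    # Operators connected by an edge belong to the same component.
    for producer, consumer in edges:
        union(producer, consumer)

    # Collect the loads that ended up in each component.
    groups = defaultdict(list)
    for load in loads:
        groups[find(load)].append(load)
    return sorted(sorted(g) for g in groups.values())


# a and b feed the same join, so they land in one group.
joined = group_loads([("a", "join"), ("b", "join"), ("join", "dump")], ["a", "b"])
# a and b are sorted independently, so they stay in separate groups.
independent = group_loads([("a", "sort_a"), ("b", "sort_b")], ["a", "b"])
print(joined)
print(independent)
```

In the join case both loads fall into one group, and in the independent sort case each load gets its own group. What I do not see is how the compiler makes this decision while it walks the plan.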

A further question: can we specify a directory so that load reads all the files in that directory? Each reducer of an MR job produces a single file, so when a subsequent MR job needs to read all of these files, what do we do?

Thanks,
-Gang