Posted to user@pig.apache.org by Stuart White <st...@gmail.com> on 2012/02/20 20:50:44 UTC
Writing sorted output after merge join
Hello. I'm a long-time MapReduce user, brand new Pig user.
My problem statement is:
- I have a large "master" file that I maintain sorted by a key field.
- Periodically, "transaction" files arrive. They are much smaller
than the master file. I sort the transaction file to match the sort
order of the master file, then perform a join on the two files. As
part of this join process, I write a new version of the master file,
containing any new information learned from the transaction file.
With this process, I never have to sort the master file.
I have an existing application (written in Java/MapReduce) that
performs this process using a map-side join: I sort/partition the
transaction files to match the sort/partitioning of the master file,
then join the two files map-side.
I just started learning Pig last week, and I'm trying to implement
this same process using Pig. Here's what I've learned so far:
- In Pig, what I want to perform is a "merge join".
- Because I require a full outer join, I need to use the Zebra
TableLoader as the loader for both the master and transaction files.
For a simple example, imagine my master file contains a single field
('key') and a single record:
KEY_1
Now imagine a transaction file arrives containing a single field
('key') and a single record:
KEY_2
After the join, the new master file should contain both records, and
still be sorted:
KEY_1
KEY_2
Here's the Pig Latin I've written for this:
register /path/to/zebra-0.8.0-dev.jar
A = load 'master' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');
B = load 'transaction' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');
C = join A by key full outer, B by key using 'merge';
D = foreach C generate (A::key is not null ? A::key : B::key);
store D into 'master.v2' using org.apache.hadoop.zebra.pig.TableStorer('');
After running this script, master.v2 is the new revision of the master
file, containing new information learned from the transaction file.
And its records are sorted. The problem is: the Zebra TableStorer
doesn't *know* that the records are sorted. So, when I try to run my
process again the next time another transaction file arrives, I get a
"table is not sorted" error.
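Concretely, the failure shows up on the second run, when I reload the
newly-written master with the 'sorted' option (a reproduction sketch,
assuming the same TableLoader arguments as above):

```
-- Second run: master.v2 was written without sort metadata,
-- so asking the loader for a sorted table fails.
A = load 'master.v2' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');
-- fails with: "table is not sorted"
```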
Can someone provide any suggestions on how to implement something like
this? Something where I can maintain a large master file in sorted
order, and repeatedly join it to smaller transaction files, and never
have to sort the master file?
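One idea I've considered (untested, and it seems to defeat the
purpose) is forcing an explicit ORDER BY before the store, on the
assumption that Zebra only records sort information for relations
produced by an ORDER statement:

```
-- Untested idea: make the sort order explicit to Pig/Zebra.
-- C is merge-join output already in key order, so the sort is
-- redundant on the data itself, but it still adds a shuffle.
E = order D by key;
store E into 'master.v2' using org.apache.hadoop.zebra.pig.TableStorer('');
```

If there's instead a way to tell TableStorer directly that its input
is already sorted, that would be ideal.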
Thanks in advance!