Posted to user@pig.apache.org by Stuart White <st...@gmail.com> on 2012/02/20 20:50:44 UTC
Writing sorted output after merge join
Hello. I'm a long-time MapReduce user, brand new Pig user.
My problem statement is:
- I have a large "master" file that I maintain sorted by a key field.
- Periodically, "transaction" files arrive. They are much smaller
than the master file. I sort the transaction file to match the sort
order of the master file, then perform a join on the two files. As
part of this join process, I write a new version of the master file,
containing any new information learned from the transaction file.
With this process, I never have to sort the master file.
I have an existing application (written in Java/MapReduce) that
performs this process using a map-side join: I sort/partition the
transaction files to match the sort/partitioning of the master file,
then join the two files map-side.
I just started learning Pig last week, and I'm trying to implement
this same process using Pig. Here's what I've learned so far:
- In Pig, what I want to perform is a "merge join".
- Because I require a full outer join, I need to use the Zebra
TableLoader as the loader for both the master and transaction files.
For a simple example, imagine my master file contains a single field
('key') and a single record:
KEY_1
Now imagine a transaction file arrives containing a single field
('key') and a single record:
KEY_2
After the join, the new master file should contain both records, and
still be sorted:
KEY_1
KEY_2
Here's the Pig Latin I've written for this:
register /path/to/zebra-0.8.0-dev.jar
A = load 'master' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');
B = load 'transaction' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');
C = join A by key full outer, B by key using 'merge';
D = foreach C generate (A::key is not null ? A::key : B::key);
store D into 'master.v2' using org.apache.hadoop.zebra.pig.TableStorer('');
After running this script, master.v2 is the new revision of the master
file, containing new information learned from the transaction file.
And its records are sorted. The problem is: the Zebra TableStorer
doesn't *know* that the records are sorted. So, when I try to run my
process again the next time another transaction file arrives, I get a
"table is not sorted" error.
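Concretely, the failure shows up on the second run, when I reload the
newly-written master with the 'sorted' option (a reproduction sketch,
assuming the same TableLoader arguments as above):

```
-- Second run: master.v2 was written without sort metadata,
-- so asking the loader for a sorted table fails.
A = load 'master.v2' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');
-- fails with: "table is not sorted"
```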
Can someone provide any suggestions on how to implement something like
this? Something where I can maintain a large master file in sorted
order, and repeatedly join it to smaller transaction files, and never
have to sort the master file?
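One idea I've considered (untested, and it seems to defeat the
purpose) is forcing an explicit ORDER BY before the store, on the
assumption that Zebra only records sort information for relations
produced by an ORDER statement:

```
-- Untested idea: make the sort order explicit to Pig/Zebra.
-- C is merge-join output already in key order, so the sort is
-- redundant on the data itself, but it still adds a shuffle.
E = order D by key;
store E into 'master.v2' using org.apache.hadoop.zebra.pig.TableStorer('');
```

If there's instead a way to tell TableStorer directly that its input
is already sorted, that would be ideal.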
Thanks in advance!