Posted to user@pig.apache.org by Mark Olliver <ma...@infectiousmedia.com> on 2013/10/01 07:38:45 UTC

Index list

Hi All,

I am a little stuck on something and am after advice on the best path. I
have a file containing a list of domain names, and from that I can obtain
a unique list of the domains I have.
The harder part is that I now need to index that list so that I end up
with a list of (domain name, index) pairs. I have considered using a
Python UDF, as I don't think this is possible in Pig directly, but how do
I then pass in the whole list and get the whole list back with indexes?
On top of that, once I have the new list I will store it on S3 (the easy
bit). But the next time I run the job I need to merge that list with the
new domains, keeping the old indexes and giving any new records an index
counting up from there. I expect I could join the old and new unique
lists before passing them to Python, then in Python I would know that I
only need to add an index to the new records. I presume I can do the
Python part by effectively just creating an array, but I guess my real
issue is how to ensure this part of my job is handled only as a reduce
and not as a map, after which I would then be running more map jobs.
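
To make the indexing and merge steps above concrete, here is a rough
sketch of the logic in plain Python. It is not a Pig UDF and is only an
illustration: the function name assign_indexes, the choice to start
indexes at 0, and holding the old list as a dict of domain to index are
all assumptions of mine.

```python
def assign_indexes(new_domains, old_index=None):
    """Return a dict of domain -> index (hypothetical helper).

    Domains already present in old_index keep their index; unseen
    domains are numbered upward from the highest existing index.
    """
    index = dict(old_index or {})
    # Start at 0 on the first run, otherwise one past the largest old index.
    next_id = max(index.values(), default=-1) + 1
    # Sort the de-duplicated input so repeated runs give the same numbering.
    for domain in sorted(set(new_domains)):
        if domain not in index:
            index[domain] = next_id
            next_id += 1
    return index


# First run: no previous list exists yet.
first_run = assign_indexes(["b.com", "a.com", "b.com"])
# Later run: merge newly seen domains with the stored list.
second_run = assign_indexes(["c.com", "a.com"], old_index=first_run)
print(second_run)
```

This only works if the whole list reaches a single call, which is the
reduce-side concern above. In Pig, GROUP ... ALL is one way to bring every
record into a single group on one reducer, assuming the unique list is
small enough for that.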

Thanks

Mark

Mark Olliver
DevOps
InfectiousMedia