Posted to user@pig.apache.org by Xuri Nagarin <se...@gmail.com> on 2014/05/11 23:18:22 UTC

Regex Lookup table

Hi,

Let's say I have a large data set, A, that looks like:

user, verb, action, location

Example:
joe, said, I had a nice day, Tokyo
jane, paid, two dollars for a nice cup of coffee, Melbourne
jack, watched, an interesting movie, New York
jamie, said, I am interested in hiking, Austin

Another, smaller data set, B, has a list of regexes to match against
"action", and each regex has another attribute associated with it,
say, a category of action.
Example:
.*interest.*, explore
.*bank.*, account
.*tax.*, account
.*play.*, sports

What I want is that if "action" matches "regex", sets A and B are
joined such that I end up with the tuple (user, verb, category of
action, location).

Right now, I do this with a Java UDF in which each A::action is
evaluated against each B::regex for a match. If there is a match, the
UDF returns the desired tuple.
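
For illustration, here is a minimal sketch of that matching logic in
plain Java, outside of Pig. The class and method names (RegexLookup,
add, categoriesFor) are hypothetical and not from my actual UDF. It
assumes each regex must match the whole action string, and it compiles
each regex once up front instead of once per row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: holds the rows of data set B as compiled patterns.
class RegexLookup {
    // One row of B: a compiled regex plus its category.
    static final class Rule {
        final Pattern pattern;
        final String category;

        Rule(String regex, String category) {
            this.pattern = Pattern.compile(regex); // compiled once, reused per row
            this.category = category;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    void add(String regex, String category) {
        rules.add(new Rule(regex, category));
    }

    // Returns the category of every rule whose regex matches the whole action.
    List<String> categoriesFor(String action) {
        List<String> matched = new ArrayList<>();
        for (Rule rule : rules) {
            if (rule.pattern.matcher(action).matches()) {
                matched.add(rule.category);
            }
        }
        return matched;
    }
}
```

Each action is still tested against every regex, so the cost per row
grows linearly with the size of B.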

However, performance is slow. I am wondering if there is a better
strategy to do what I think is essentially a lookup table. I have seen
threads where replicated join has been recommended but obviously a
simple "join" isn't going to work for regex matching.

Any recommendations?

Thanks,

Xuri

Re: Regex Lookup table

Posted by Xuri Nagarin <se...@gmail.com>.

Reading around a bit more, it looks like the best way to do this is to:
1. Copy the smaller dataset, B, to the distributed cache.
2. In the UDF args, tell the UDF how to parse B and what field from
the smaller dataset to use as "regex" (specify delimiter and index #)
3. Load the smaller dataset within the UDF into an instance variable
of some sort (definitely do not read it inside exec()), because Pig
will instantiate one UDF instance per mapper, whereas exec() gets
called for each row/tuple of dataset A.
4. Pass the UDF the relation to be matched (dataset A), the location
of the file representing dataset B, the delimiter for each row of
dataset B, and the index number of the field that contains the regex
in B.
5. Return bag (and schema).
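
A rough sketch of steps 2 to 4 in plain Java, without the Pig classes,
so treat it as an outline rather than a working UDF. The names
(CachedRegexTable, match) and the extra categoryIndex parameter are my
own assumptions. In a real UDF this state would live in the EvalFunc
subclass and match() would be called from exec(). The file is read and
the regexes compiled only on the first call. The sketch assumes the
delimiter never appears inside a regex and that every row has both
fields.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical holder for dataset B, loaded lazily and kept per instance.
class CachedRegexTable {
    private final String path;
    private final String delimiter;
    private final int regexIndex;
    private final int categoryIndex;   // assumption: category is also in B
    private List<Pattern> patterns;    // null until the first match() call
    private List<String> categories;

    CachedRegexTable(String path, String delimiter,
                     int regexIndex, int categoryIndex) {
        this.path = path;
        this.delimiter = delimiter;
        this.regexIndex = regexIndex;
        this.categoryIndex = categoryIndex;
    }

    // Reads and compiles B once; later calls return immediately.
    private void loadIfNeeded() throws IOException {
        if (patterns != null) {
            return;
        }
        List<Pattern> p = new ArrayList<>();
        List<String> c = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(
                Paths.get(path), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines
                }
                // Treat the delimiter literally, not as a regex.
                String[] fields = line.split(Pattern.quote(delimiter), -1);
                p.add(Pattern.compile(fields[regexIndex].trim()));
                c.add(fields[categoryIndex].trim());
            }
        }
        patterns = p;
        categories = c;
    }

    // Called once per row of A; returns all matching categories.
    List<String> match(String action) throws IOException {
        loadIfNeeded();
        List<String> matched = new ArrayList<>();
        for (int i = 0; i < patterns.size(); i++) {
            if (patterns.get(i).matcher(action).matches()) {
                matched.add(categories.get(i));
            }
        }
        return matched;
    }
}
```

With this shape the per-row cost is only the matching itself. The file
I/O and Pattern.compile() calls happen once per instance.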

Use the UDF as:
joinedAndmatched = FOREACH A GENERATE
matchAndJoin(action, filePath, delimiter, index);

Suggestions/comments?
