Posted to user@pig.apache.org by jamal sasha <ja...@gmail.com> on 2013/10/31 17:41:31 UTC

simple pig logic

Hi,
I have two datasets.
main_data.txt
{"id":"foo", "some_field":12354, "score":0}
{"id":"foobar", "some_field":12354, "score":0}


score_data.txt
{"id":"foo", "score":1}
{"id":"foobar","score":20}
....


In main_data, score is initialized to 0.
main_data and score_data have some ids in common.

For the ids that are common, I want to replace "score" in main_data with the
score from score_data.

If an id is absent from score_data, I want to leave its score at 0.

Re: simple pig logic

Posted by Pradeep Gollakota <pr...@gmail.com>.
If I understood your question correctly, given the following input:

main_data.txt
{"id": "foo", "some_field": 12354, "score": 0}
{"id": "foobar", "some_field": 12354, "score": 0}
{"id": "baz", "some_field": 12345, "score": 0}

score_data.txt
{"id": "foo", "score": 1}
{"id": "foobar", "score": 20}

you want the following output:

{"id": "foo", "some_field": 12354, "score": 1}
{"id": "foobar", "some_field": 12354, "score": 20}
{"id": "baz", "some_field": 12345, "score": 0}

If that is correct, you can do a LEFT OUTER join on the two relations.

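-- The inputs are JSON, one object per line, so load them with the built-in
-- JsonLoader rather than the default tab-delimited PigStorage.
main = LOAD 'main_data.txt' USING JsonLoader('id:chararray, some_field:int, score:int');
scores = LOAD 'score_data.txt' USING JsonLoader('id:chararray, score:int');
-- Keep every row of main; rows without a match get null in all scores:: fields.
both = JOIN main BY id LEFT OUTER, scores BY id;
-- Pig tests for null with IS NULL, not == null.
final = FOREACH both GENERATE
    main::id AS id,
    main::some_field AS some_field,
    (scores::score IS NULL ? main::score : scores::score) AS score;
DUMP final;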

After the join, check whether scores::score is null. If it is, the id had no
match in score_data, so fall back to the default main::score; otherwise take
scores::score.
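
If you want to see which ids fall back to the default, you can filter the
joined relation on a null right-hand key. This is only a sketch: it assumes
the sample files above, the built-in JsonLoader, and input fields that appear
in the same order as the schema string. The alias "unmatched" is just a name
I made up for illustration.

```pig
-- Load both JSON inputs (one object per line) with explicit schemas.
main   = LOAD 'main_data.txt'  USING JsonLoader('id:chararray, some_field:int, score:int');
scores = LOAD 'score_data.txt' USING JsonLoader('id:chararray, score:int');

-- Left outer join: every row of main survives, matched or not.
both = JOIN main BY id LEFT OUTER, scores BY id;

-- Rows of main with no partner in scores carry null in scores::id.
-- "unmatched" is a hypothetical alias used only in this sketch.
unmatched = FILTER both BY scores::id IS NULL;
DUMP unmatched;
```

With the sample input above, only the "baz" row should come through, since it
is the only id missing from score_data.txt.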

Hope this helps!