Posted to user@pig.apache.org by Jonathan Holloway <jo...@gmail.com> on 2011/04/01 17:50:00 UTC

Pig Query

Hi all,

I'm trying to do something with Pig and I'm not sure whether it's possible.
I'm hoping somebody can give me some help on how to proceed.

I have a log file whose lines have relationships with each other.
The structure of a log line is:

DATE, UUID, CATNAME, DESCRIPTION, ID, PARENTS

Examples of this include:

DATE, UUID, Apple, this is a log line, id:9, parent:8,9
DATE, UUID, Vegetables, this is a log line, id:4
DATE, UUID, Carrots, this is a log line, id:6, parent:4,5
DATE, UUID, Pineapple, this is a log line, id:8, parent:7
DATE, UUID, Potato, id:12, parent:11,12, this is a log line,
DATE, UUID, Parsnip, this is a log line, id:5, parent:4
DATE, UUID, Fruit, this is a log line, id:7
DATE, UUID, Vegetables, id:10, this is a log line,
DATE, UUID, Beetroot, id:11, parent:10, this is a log line,
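
To make the field layout concrete, here is a rough Python sketch of how one of these lines breaks down into fields. It is not my actual Pig loader: the name `parse_line` and the dict keys are made up for illustration, and it assumes the first three comma-separated fields are always DATE, UUID, CATNAME, while the id: and parent: fields may come before or after the description (as in the examples above).

```python
import re

# Assumed field formats, taken from the example lines: "id:9", "parent:8,9".
ID_RE = re.compile(r"\bid:(\d+)")
PARENT_RE = re.compile(r"\bparent:(\d+(?:,\d+)*)")


def parse_line(line):
    """Split one log line into a dict (keys are illustrative, not a real schema)."""
    # The first three fields are positional; everything else stays in `rest`.
    date, uuid, catname, rest = [part.strip() for part in line.split(",", 3)]

    id_match = ID_RE.search(rest)
    parent_match = PARENT_RE.search(rest)

    # Whatever is left once id:/parent: are removed is treated as the description.
    description = PARENT_RE.sub("", ID_RE.sub("", rest)).strip(" ,")

    return {
        "date": date,
        "uuid": uuid,
        "catname": catname,
        "description": description,
        "id": int(id_match.group(1)) if id_match else None,
        # A missing parent: field becomes an empty list.
        "parents": [int(p) for p in parent_match.group(1).split(",")]
        if parent_match
        else [],
    }
```

A line with fewer than four comma-separated fields raises a ValueError in this sketch; real log data would probably need more forgiving handling.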

I'm currently extracting these into a schema using Pig, and I can order them
by UUID, CATNAME, ID, PARENTS to give me an ordered list of lines.  The above
would be transformed into the following (including the description):

UUID, Vegetables, id:4, this is a log line,
UUID, Parsnip, id:5, parent:4, this is a log line,
UUID, Carrots, id:6, parent:4,5, this is a log line,
UUID, Fruit, id:7, this is a log line,
UUID, Apple, id:9, parent:8,9, this is a log line,
UUID, Pineapple, id:8, parent:7, this is a log line,
UUID, Vegetables, id:10, this is a log line,
UUID, Beetroot, id:11, parent:10, this is a log line,
UUID, Potato, id:12, parent:11,12, this is a log line,

What I'm then trying to do is generate a report for each CATNAME that
includes the children for that operation.  If I specified 'Vegetables', the
resulting report should look like this:

UUID, Vegetables, this is a log line, id:4
 - UUID, Parsnip, this is a log line, id:5, parent:4
 - UUID, Carrots, this is a log line, id:6, parent:4,5

UUID, Vegetables, this is a log line, id:10
 - UUID, Beetroot, this is a log line, id:11, parent:10
 - UUID, Potato, this is a log line, id:12, parent:11,12

I'm not quite sure how to do this in Pig; the difficulty is the relationship
between lines.  I was thinking I could:

A = GROUP all log lines by CATNAME
B = FILTER the log lines down to those with a non-null parent field
C = FOREACH line in B, extract and parse the parent: field, then look up
each log line whose id matches one of the ids in that parent field,
possibly using a custom UDF.

I could have thousands of CATNAMEs though.
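
To show the logic I'm after (this is plain Python, not Pig), here is a minimal sketch of the lookup. It rests on a few assumptions of mine: a "top-level" line is one with no parent field, children are found by following parent ids transitively (in my example Potato only reaches Vegetables id:10 via Beetroot), and a line that lists its own id as a parent is ignored for that id. The names `build_report` and `SAMPLE` are made up for illustration.

```python
from collections import defaultdict

# The example lines above, reduced to the fields the lookup needs.
SAMPLE = [
    {"catname": "Apple", "id": 9, "parents": [8, 9]},
    {"catname": "Vegetables", "id": 4, "parents": []},
    {"catname": "Carrots", "id": 6, "parents": [4, 5]},
    {"catname": "Pineapple", "id": 8, "parents": [7]},
    {"catname": "Potato", "id": 12, "parents": [11, 12]},
    {"catname": "Parsnip", "id": 5, "parents": [4]},
    {"catname": "Fruit", "id": 7, "parents": []},
    {"catname": "Vegetables", "id": 10, "parents": []},
    {"catname": "Beetroot", "id": 11, "parents": [10]},
]


def build_report(records, catname):
    """Return [(top_level_record, [descendant records sorted by id]), ...]."""
    # Index every record under each parent id it lists (skipping self-references).
    children_of = defaultdict(list)
    for rec in records:
        for parent_id in rec["parents"]:
            if parent_id != rec["id"]:
                children_of[parent_id].append(rec)

    report = []
    for root in records:
        # Assumption: a report starts at lines of this CATNAME with no parents.
        if root["catname"] != catname or root["parents"]:
            continue
        seen = {root["id"]}
        descendants = []
        queue = [root["id"]]
        # Breadth-first walk over parent -> child links, guarding against cycles.
        while queue:
            current = queue.pop(0)
            for child in children_of.get(current, []):
                if child["id"] not in seen:
                    seen.add(child["id"])
                    descendants.append(child)
                    queue.append(child["id"])
        descendants.sort(key=lambda rec: rec["id"])
        report.append((root, descendants))
    return report


if __name__ == "__main__":
    for root, descendants in build_report(SAMPLE, "Vegetables"):
        print(root["catname"], "id:%d" % root["id"])
        for child in descendants:
            print(" -", child["catname"], "id:%d" % child["id"])
```

In Pig terms I imagine the index step would be something like flattening the parent ids and joining back on id, but that is exactly the part I don't know how to express, especially the transitive walk.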

Does anybody have an idea on how I could do this in Pig?

Many thanks in advance,
Jon.