You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@pig.apache.org by Renato Marroquín Mogrovejo <re...@gmail.com> on 2010/06/10 03:54:49 UTC

Help with a tricky query

Hi everyone, today I came across with a particular query that I don't know
how to model in PIG. Part of my data looks like this:

Id1 Id2 Sc Va P1 P2
--------- --------- ----- --------- ----- ----
770011 990201 401 1e-125 100 65
990201 770011 440 1e-125 100 42
770011 770083 524 1e-120 89 12
770083 770011 494 1e-120 39 100
990201 770083 341 1e-125 73 41
770083 990201 421 1e-125 90 85
.
.
.

what I would like to retrieve is something like
this:                                             770011 990201 770083
because they are records actually related.
Any kind of ideas are highly appreciated. Thanks in advanced.

Renato M.

Re: Help with a tricky query

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.

Hi everybody, sorry for not being able to answer back before, but I have
just been way too busy lately.
Thanks a lot I think that will work, I haven't been able to test yet, but I
am getting there these week, I will post about it later. Thanks again.

Renato M.

2010/6/13 Dmitriy Ryaboy <dv...@gmail.com>

> I don't see why not.
>
> If you have (start, end) pairs: (a, b), (b, c), (c, a)
> And let's say there's something non-triangular also: (c, d), (d, e), (e, f)
>
> You want to self join on rel.start == rel.end, which will produce
>
> a, (a, b), (c, a)
> b, (b, c), (a, b)
> d, (d, e), (c, d)
>
> You will now want to pull out start, middle, and end. That's easy -- it's
> just ($2.$0, $0, $1.$1)
>
> So now, after a simple join and a foreach-generate, we have
>
> (c, a, b)
> (a, b, c)
> (c, d, e)
>
> Now we want to match, all segments where the start is equal to the end of
> another segment, and the middle is the beginning of that segment
>
> (c, a), (c, a, b), (a, b, c)
>
> Your triple is either $1 or $2 -- they are essentially the same.
>
> You might be able to do a group instead of a join for the last step, if
> certain conditions hold (for example, you must be guaranteed that this does
> not exist: a->b->c->b->a)
> In that case, you could sort the contents of the triples and group on the
> result, saving only those results that have > 1 entry in the group. This
> would be faster as you would need to shuffle only a single copy of the
> data.
>
> -D
>
> On Sat, Jun 12, 2010 at 10:39 PM, Renato Marroquín Mogrovejo <
> renatoj.marroquin@gmail.com> wrote:
>
> > Hi everybody, thanks again for the responses. I am really obtaining many
> > ideas (:
> >
> > About the depth of my data, yeah I am only looking for connected
> components
> > of depth three which I think will be a great help because my BFS would be
> > limited in the number of iterations.
> >
> > And about the example I must have these records x -> y, y -> z and z -> x
> > in
> > order to output the x, y, z result record. But there can probably be
> > elements who do not close the graph. For example, I might have a record
> > like
> > x -> b, and b doesn't references nobody, so I would have to ignore it. Is
> > that what you meant Tanton?
> > What about a self-join three times?Would that be a bad idea?
> > Thanks again.
> >
> > Renato M.
> >
> > 2010/6/11 Dmitriy Ryaboy <dv...@gmail.com>
> >
> > > hc, self-join should just work, but if it doesn't:
> > >
> > > split table into table_2 if 1==1, table3 if 1==1;
> > >
> > > OR
> > >
> > > table_2 = foreach table generate *;
> > > table3 = foreach table generate *;
> > >
> > > AND THEN
> > >
> > > T = join table by id1, table_2 by id2, table_3 by id3
> > >
> > > -D
> > >
> > > On Fri, Jun 11, 2010 at 10:59 AM, hc busy <hc...@gmail.com> wrote:
> > >
> > > > Yeah, that IS hard in pig. I'm not even sure how to do a self-join in
> > > Pig.
> > > > Like you can't really say
> > > >
> > > > T = join Table by id1, Table by id2, Table by id3;
> > > >
> > > > I think PigLatin will complain that it's confused which Table is and
> > > which
> > > > id1 goes with which table.
> > > >
> > > > I had been proposing that we allow PigLatin to allow recursive
> > functions,
> > > > that way we can do loops. But recursive functions doesn't fit in the
> > > > data-flow language paradigm.
> > > >
> > > > But I think people have offered many alternatives in terms of
> scripting
> > > > language that can wrap PigLatin that has loop and flow control
> > > constructs.
> > > >
> > > >
> > > > On Thu, Jun 10, 2010 at 9:55 PM, Tanton Gibbs <
> tanton.gibbs@gmail.com
> > > > >wrote:
> > > >
> > > > > It really depends on the depth of your graph.  Are you only dealing
> > > > > with connected components of depth 3 or could they be deeper.
> > > > >
> > > > > For instance, can you have x -> y, y -> z, and z -> a?  If so, do
> you
> > > > > need your record to be x, y, z, a?  Or, are you guaranteed that it
> > > > > will always be x -> z and y -> z?
> > > > >
> > > > > I really need more information about your data before I can offer
> too
> > > > > much advice.
> > > > >
> > > > > On Thu, Jun 10, 2010 at 5:38 PM, Renato Marroquín Mogrovejo
> > > > > <re...@gmail.com> wrote:
> > > > > > Hi everybody, thanks a lot for your responses.
> > > > > >
> > > > > > I am actually not looking for a transitive closure, I am not
> trying
> > > to
> > > > > > "infer" that given *x -> y* and *y -> z*, then *x -> z
> *(@hc.busy:
> > > that
> > > > > is
> > > > > > in short terms a transitive closure ^^)  because I have the data
> > that
> > > > > proves
> > > > > > it. And yeah I am aware of first logic expressive power, but
> maybe
> > I
> > > > will
> > > > > > think about giving an inference engine a try some time in the
> > future.
> > > > > >
> > > > > > I am actually looking at a connected graph-like problem. The
> sample
> > > > > records
> > > > > > resemble a bidirectional triangle.
> > > > > >
> > > > > >            990201
> > > > > >         /              \
> > > > > >        /                \
> > > > > >       /                  \
> > > > > >      /                    \
> > > > > > 770011  ------- 770083
> > > > > >
> > > > > > I tried using a smaller version of the problem by using SQL and
> got
> > a
> > > > > > horribly huge query which is a non-scalable possibility for me. I
> > > have
> > > > > over
> > > > > > 500Gbs in structured files. I used a triple join but I had to
> > > retrieve
> > > > > all
> > > > > > existing possibilities.
> > > > > > So that is why I thought on using something like Hc.busy
> mentioned.
> > > > > >
> > > > > > Queries: idx1, idx2, sc, va, p1, p2
> > > > > > possibilities: possibs = foreach queries generate idx1, flatten(a
> > > > triple
> > > > > > self join of the data)
> > > > > >
> > > > > > and by using the flattening command, get all possible
> combinations,
> > > but
> > > > I
> > > > > am
> > > > > > not sure that would be the correct approach. Anyhow I thought
> maybe
> > > > > > performing some breadth first search but that would give me an
> > > > algorithm
> > > > > > O(n2)  so  )=
> > > > > > Hey Tanton how would you implement and use an union-find
> structure?
> > > Do
> > > > > you
> > > > > > think it is possible with PIG?
> > > > > >
> > > > > > Thanks again.
> > > > > >
> > > > > >
> > > > > > 2010/6/10 Tanton Gibbs <ta...@gmail.com>
> > > > > >
> > > > > >> To be specific, he's looking for connected components in the
> > graph.
> > > > > >>
> > > > > >> It's not overly easy to do in an ETL tool (or in pig), but if
> you
> > > can
> > > > > >> wrap the script in a loop it is possible.
> > > > > >>
> > > > > >> There are actually some really interesting parallel algorithms
> for
> > > > > >> finding connected components.  If you know you are only going to
> > > have
> > > > > >> two keys, it is a bit simpler to code (but not any more
> > efficient).
> > > > > >> Essentially, you can implement your union-find algorithm as a
> > series
> > > > > >> of sorts and merges, which on a large parallel system is
> actually
> > > > > >> quite fast.
> > > > > >>
> > > > > >> Tanton
> > > > > >>
> > > > > >> On Thu, Jun 10, 2010 at 3:33 PM, hc busy <hc...@gmail.com>
> > wrote:
> > > > > >> > What's a transitive closure?
> > > > > >> >
> > > > > >> > On Thu, Jun 10, 2010 at 2:34 PM, Gianmarco <
> > > gianmarco.dfm@gmail.com
> > > > >
> > > > > >> wrote:
> > > > > >> >
> > > > > >> >> I think what he wants is a transitive closure of the
> relation,
> > > > which
> > > > > >> >> is not achievable in SQL-like languages alone (first order
> > logic
> > > > > >> >> expressive power).
> > > > > >> >> I suppose Pig Latin falls in this category.
> > > > > >> >>
> > > > > >> >>
> > > > > >> >> -- Gianmarco
> > > > > >> >>
> > > > > >> >> On Thu, Jun 10, 2010 at 19:54, hc busy <hc...@gmail.com>
> > > wrote:
> > > > > >> >> > Is this like a tricky interview question? I don't see the
> > > pattern
> > > > > >> between
> > > > > >> >> > those three numbers you listed and the sample of the table.
> > > > > >> >> >
> > > > > >> >> > 770011 770083 524 1e-120 89 12
> > > > > >> >> > 770083 770011 494 1e-120 39 100
> > > > > >> >> >
> > > > > >> >> > ahh, I guess these are related because id1=id2 an
> id2=id1...
> > > > Here's
> > > > > a
> > > > > >> >> first
> > > > > >> >> > pass at the problem. Project:
> > > > > >> >> >
> > > > > >> >> > P1 = foreach table generate id1 as id1, id2 as id2, *;
> > > > > >> >> > P2 = foreach table generate id2 as id1, id1 as id2, *;
> > > > > >> >> > J = join P1 by (id1, id2), P2 by (id1,id2);
> > > > > >> >> >
> > > > > >> >> > and now J contains pairs of rows from original table where
> > id1
> > > > and
> > > > > id2
> > > > > >> >> are
> > > > > >> >> > reversed.
> > > > > >> >> >
> > > > > >> >> > is this what you want?
> > > > > >> >> >
> > > > > >> >> > On Wed, Jun 9, 2010 at 6:54 PM, Renato Marroquín Mogrovejo
> <
> > > > > >> >> > renatoj.marroquin@gmail.com> wrote:
> > > > > >> >> >
> > > > > >> >> >> Hi everyone, today I came across with a particular query
> > that
> > > I
> > > > > don't
> > > > > >> >> know
> > > > > >> >> >> how to model in PIG. Part of my data looks like this:
> > > > > >> >> >>
> > > > > >> >> >> Id1 Id2 Sc Va P1 P2
> > > > > >> >> >> --------- --------- ----- --------- ----- ----
> > > > > >> >> >> 770011 990201 401 1e-125 100 65
> > > > > >> >> >> 990201 770011 440 1e-125 100 42
> > > > > >> >> >> 770011 770083 524 1e-120 89 12
> > > > > >> >> >> 770083 770011 494 1e-120 39 100
> > > > > >> >> >> 990201 770083 341 1e-125 73 41
> > > > > >> >> >> 770083 990201 421 1e-125 90 85
> > > > > >> >> >> .
> > > > > >> >> >> .
> > > > > >> >> >> .
> > > > > >> >> >>
> > > > > >> >> >> what I would like to retrieve is something like
> > > > > >> >> >> this:                                             770011
> > > 990201
> > > > > >> 770083
> > > > > >> >> >> because they are records actually related.
> > > > > >> >> >> Any kind of ideas are highly appreciated. Thanks in
> > advanced.
> > > > > >> >> >>
> > > > > >> >> >> Renato M.
> > > > > >> >> >>
> > > > > >> >> >
> > > > > >> >>
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Help with a tricky query

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

I don't see why not.

If you have (start, end) pairs: (a, b), (b, c), (c, a)
And let's say there's something non-triangular also: (c, d), (d, e), (e, f)

You want to self join on rel.start == rel.end, which will produce

a, (a, b), (c, a)
b, (b, c), (a, b)
d, (d, e), (c, d)

You will now want to pull out start, middle, and end. That's easy -- it's
just ($2.$0, $0, $1.$1)

So now, after a simple join and a foreach-generate, we have

(c, a, b)
(a, b, c)
(c, d, e)

Now we want to match, all segments where the start is equal to the end of
another segment, and the middle is the beginning of that segment

(c, a), (c, a, b), (a, b, c)

Your triple is either $1 or $2 -- they are essentially the same.

You might be able to do a group instead of a join for the last step, if
certain conditions hold (for example, you must be guaranteed that this does
not exist: a->b->c->b->a)
In that case, you could sort the contents of the triples and group on the
result, saving only those results that have > 1 entry in the group. This
would be faster as you would need to shuffle only a single copy of the data.

-D

On Sat, Jun 12, 2010 at 10:39 PM, Renato Marroquín Mogrovejo <
renatoj.marroquin@gmail.com> wrote:

> Hi everybody, thanks again for the responses. I am really obtaining many
> ideas (:
>
> About the depth of my data, yeah I am only looking for connected components
> of depth three which I think will be a great help because my BFS would be
> limited in the number of iterations.
>
> And about the example I must have these records x -> y, y -> z and z -> x
> in
> order to output the x, y, z result record. But there can probably be
> elements who do not close the graph. For example, I might have a record
> like
> x -> b, and b doesn't references nobody, so I would have to ignore it. Is
> that what you meant Tanton?
> What about a self-join three times?Would that be a bad idea?
> Thanks again.
>
> Renato M.
>
> 2010/6/11 Dmitriy Ryaboy <dv...@gmail.com>
>
> > hc, self-join should just work, but if it doesn't:
> >
> > split table into table_2 if 1==1, table3 if 1==1;
> >
> > OR
> >
> > table_2 = foreach table generate *;
> > table3 = foreach table generate *;
> >
> > AND THEN
> >
> > T = join table by id1, table_2 by id2, table_3 by id3
> >
> > -D
> >
> > On Fri, Jun 11, 2010 at 10:59 AM, hc busy <hc...@gmail.com> wrote:
> >
> > > Yeah, that IS hard in pig. I'm not even sure how to do a self-join in
> > Pig.
> > > Like you can't really say
> > >
> > > T = join Table by id1, Table by id2, Table by id3;
> > >
> > > I think PigLatin will complain that it's confused which Table is and
> > which
> > > id1 goes with which table.
> > >
> > > I had been proposing that we allow PigLatin to allow recursive
> functions,
> > > that way we can do loops. But recursive functions doesn't fit in the
> > > data-flow language paradigm.
> > >
> > > But I think people have offered many alternatives in terms of scripting
> > > language that can wrap PigLatin that has loop and flow control
> > constructs.
> > >
> > >
> > > On Thu, Jun 10, 2010 at 9:55 PM, Tanton Gibbs <tanton.gibbs@gmail.com
> > > >wrote:
> > >
> > > > It really depends on the depth of your graph.  Are you only dealing
> > > > with connected components of depth 3 or could they be deeper.
> > > >
> > > > For instance, can you have x -> y, y -> z, and z -> a?  If so, do you
> > > > need your record to be x, y, z, a?  Or, are you guaranteed that it
> > > > will always be x -> z and y -> z?
> > > >
> > > > I really need more information about your data before I can offer too
> > > > much advice.
> > > >
> > > > On Thu, Jun 10, 2010 at 5:38 PM, Renato Marroquín Mogrovejo
> > > > <re...@gmail.com> wrote:
> > > > > Hi everybody, thanks a lot for your responses.
> > > > >
> > > > > I am actually not looking for a transitive closure, I am not trying
> > to
> > > > > "infer" that given *x -> y* and *y -> z*, then *x -> z *(@hc.busy:
> > that
> > > > is
> > > > > in short terms a transitive closure ^^)  because I have the data
> that
> > > > proves
> > > > > it. And yeah I am aware of first logic expressive power, but maybe
> I
> > > will
> > > > > think about giving an inference engine a try some time in the
> future.
> > > > >
> > > > > I am actually looking at a connected graph-like problem. The sample
> > > > records
> > > > > resemble a bidirectional triangle.
> > > > >
> > > > >            990201
> > > > >         /              \
> > > > >        /                \
> > > > >       /                  \
> > > > >      /                    \
> > > > > 770011  ------- 770083
> > > > >
> > > > > I tried using a smaller version of the problem by using SQL and got
> a
> > > > > horribly huge query which is a non-scalable possibility for me. I
> > have
> > > > over
> > > > > 500Gbs in structured files. I used a triple join but I had to
> > retrieve
> > > > all
> > > > > existing possibilities.
> > > > > So that is why I thought on using something like Hc.busy mentioned.
> > > > >
> > > > > Queries: idx1, idx2, sc, va, p1, p2
> > > > > possibilities: possibs = foreach queries generate idx1, flatten(a
> > > triple
> > > > > self join of the data)
> > > > >
> > > > > and by using the flattening command, get all possible combinations,
> > but
> > > I
> > > > am
> > > > > not sure that would be the correct approach. Anyhow I thought maybe
> > > > > performing some breadth first search but that would give me an
> > > algorithm
> > > > > O(n2)  so  )=
> > > > > Hey Tanton how would you implement and use an union-find structure?
> > Do
> > > > you
> > > > > think it is possible with PIG?
> > > > >
> > > > > Thanks again.
> > > > >
> > > > >
> > > > > 2010/6/10 Tanton Gibbs <ta...@gmail.com>
> > > > >
> > > > >> To be specific, he's looking for connected components in the
> graph.
> > > > >>
> > > > >> It's not overly easy to do in an ETL tool (or in pig), but if you
> > can
> > > > >> wrap the script in a loop it is possible.
> > > > >>
> > > > >> There are actually some really interesting parallel algorithms for
> > > > >> finding connected components.  If you know you are only going to
> > have
> > > > >> two keys, it is a bit simpler to code (but not any more
> efficient).
> > > > >> Essentially, you can implement your union-find algorithm as a
> series
> > > > >> of sorts and merges, which on a large parallel system is actually
> > > > >> quite fast.
> > > > >>
> > > > >> Tanton
> > > > >>
> > > > >> On Thu, Jun 10, 2010 at 3:33 PM, hc busy <hc...@gmail.com>
> wrote:
> > > > >> > What's a transitive closure?
> > > > >> >
> > > > >> > On Thu, Jun 10, 2010 at 2:34 PM, Gianmarco <
> > gianmarco.dfm@gmail.com
> > > >
> > > > >> wrote:
> > > > >> >
> > > > >> >> I think what he wants is a transitive closure of the relation,
> > > which
> > > > >> >> is not achievable in SQL-like languages alone (first order
> logic
> > > > >> >> expressive power).
> > > > >> >> I suppose Pig Latin falls in this category.
> > > > >> >>
> > > > >> >>
> > > > >> >> -- Gianmarco
> > > > >> >>
> > > > >> >> On Thu, Jun 10, 2010 at 19:54, hc busy <hc...@gmail.com>
> > wrote:
> > > > >> >> > Is this like a tricky interview question? I don't see the
> > pattern
> > > > >> between
> > > > >> >> > those three numbers you listed and the sample of the table.
> > > > >> >> >
> > > > >> >> > 770011 770083 524 1e-120 89 12
> > > > >> >> > 770083 770011 494 1e-120 39 100
> > > > >> >> >
> > > > >> >> > ahh, I guess these are related because id1=id2 an id2=id1...
> > > Here's
> > > > a
> > > > >> >> first
> > > > >> >> > pass at the problem. Project:
> > > > >> >> >
> > > > >> >> > P1 = foreach table generate id1 as id1, id2 as id2, *;
> > > > >> >> > P2 = foreach table generate id2 as id1, id1 as id2, *;
> > > > >> >> > J = join P1 by (id1, id2), P2 by (id1,id2);
> > > > >> >> >
> > > > >> >> > and now J contains pairs of rows from original table where
> id1
> > > and
> > > > id2
> > > > >> >> are
> > > > >> >> > reversed.
> > > > >> >> >
> > > > >> >> > is this what you want?
> > > > >> >> >
> > > > >> >> > On Wed, Jun 9, 2010 at 6:54 PM, Renato Marroquín Mogrovejo <
> > > > >> >> > renatoj.marroquin@gmail.com> wrote:
> > > > >> >> >
> > > > >> >> >> Hi everyone, today I came across with a particular query
> that
> > I
> > > > don't
> > > > >> >> know
> > > > >> >> >> how to model in PIG. Part of my data looks like this:
> > > > >> >> >>
> > > > >> >> >> Id1 Id2 Sc Va P1 P2
> > > > >> >> >> --------- --------- ----- --------- ----- ----
> > > > >> >> >> 770011 990201 401 1e-125 100 65
> > > > >> >> >> 990201 770011 440 1e-125 100 42
> > > > >> >> >> 770011 770083 524 1e-120 89 12
> > > > >> >> >> 770083 770011 494 1e-120 39 100
> > > > >> >> >> 990201 770083 341 1e-125 73 41
> > > > >> >> >> 770083 990201 421 1e-125 90 85
> > > > >> >> >> .
> > > > >> >> >> .
> > > > >> >> >> .
> > > > >> >> >>
> > > > >> >> >> what I would like to retrieve is something like
> > > > >> >> >> this:                                             770011
> > 990201
> > > > >> 770083
> > > > >> >> >> because they are records actually related.
> > > > >> >> >> Any kind of ideas are highly appreciated. Thanks in
> advanced.
> > > > >> >> >>
> > > > >> >> >> Renato M.
> > > > >> >> >>
> > > > >> >> >
> > > > >> >>
> > > > >> >
> > > > >>
> > > > >
> > > >
> > >
> >
>

Re: Help with a tricky query

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.

Hi everybody, thanks again for the responses. I am really obtaining many
ideas (:

About the depth of my data, yeah I am only looking for connected components
of depth three which I think will be a great help because my BFS would be
limited in the number of iterations.

And about the example I must have these records x -> y, y -> z and z -> x in
order to output the x, y, z result record. But there can probably be
elements who do not close the graph. For example, I might have a record like
x -> b, and b doesn't references nobody, so I would have to ignore it. Is
that what you meant Tanton?
What about a self-join three times?Would that be a bad idea?
Thanks again.

Renato M.

2010/6/11 Dmitriy Ryaboy <dv...@gmail.com>

> hc, self-join should just work, but if it doesn't:
>
> split table into table_2 if 1==1, table3 if 1==1;
>
> OR
>
> table_2 = foreach table generate *;
> table3 = foreach table generate *;
>
> AND THEN
>
> T = join table by id1, table_2 by id2, table_3 by id3
>
> -D
>
> On Fri, Jun 11, 2010 at 10:59 AM, hc busy <hc...@gmail.com> wrote:
>
> > Yeah, that IS hard in pig. I'm not even sure how to do a self-join in
> Pig.
> > Like you can't really say
> >
> > T = join Table by id1, Table by id2, Table by id3;
> >
> > I think PigLatin will complain that it's confused which Table is and
> which
> > id1 goes with which table.
> >
> > I had been proposing that we allow PigLatin to allow recursive functions,
> > that way we can do loops. But recursive functions doesn't fit in the
> > data-flow language paradigm.
> >
> > But I think people have offered many alternatives in terms of scripting
> > language that can wrap PigLatin that has loop and flow control
> constructs.
> >
> >
> > On Thu, Jun 10, 2010 at 9:55 PM, Tanton Gibbs <tanton.gibbs@gmail.com
> > >wrote:
> >
> > > It really depends on the depth of your graph.  Are you only dealing
> > > with connected components of depth 3 or could they be deeper.
> > >
> > > For instance, can you have x -> y, y -> z, and z -> a?  If so, do you
> > > need your record to be x, y, z, a?  Or, are you guaranteed that it
> > > will always be x -> z and y -> z?
> > >
> > > I really need more information about your data before I can offer too
> > > much advice.
> > >
> > > On Thu, Jun 10, 2010 at 5:38 PM, Renato Marroquín Mogrovejo
> > > <re...@gmail.com> wrote:
> > > > Hi everybody, thanks a lot for your responses.
> > > >
> > > > I am actually not looking for a transitive closure, I am not trying
> to
> > > > "infer" that given *x -> y* and *y -> z*, then *x -> z *(@hc.busy:
> that
> > > is
> > > > in short terms a transitive closure ^^)  because I have the data that
> > > proves
> > > > it. And yeah I am aware of first logic expressive power, but maybe I
> > will
> > > > think about giving an inference engine a try some time in the future.
> > > >
> > > > I am actually looking at a connected graph-like problem. The sample
> > > records
> > > > resemble a bidirectional triangle.
> > > >
> > > >            990201
> > > >         /              \
> > > >        /                \
> > > >       /                  \
> > > >      /                    \
> > > > 770011  ------- 770083
> > > >
> > > > I tried using a smaller version of the problem by using SQL and got a
> > > > horribly huge query which is a non-scalable possibility for me. I
> have
> > > over
> > > > 500Gbs in structured files. I used a triple join but I had to
> retrieve
> > > all
> > > > existing possibilities.
> > > > So that is why I thought on using something like Hc.busy mentioned.
> > > >
> > > > Queries: idx1, idx2, sc, va, p1, p2
> > > > possibilities: possibs = foreach queries generate idx1, flatten(a
> > triple
> > > > self join of the data)
> > > >
> > > > and by using the flattening command, get all possible combinations,
> but
> > I
> > > am
> > > > not sure that would be the correct approach. Anyhow I thought maybe
> > > > performing some breadth first search but that would give me an
> > algorithm
> > > > O(n2)  so  )=
> > > > Hey Tanton how would you implement and use an union-find structure?
> Do
> > > you
> > > > think it is possible with PIG?
> > > >
> > > > Thanks again.
> > > >
> > > >
> > > > 2010/6/10 Tanton Gibbs <ta...@gmail.com>
> > > >
> > > >> To be specific, he's looking for connected components in the graph.
> > > >>
> > > >> It's not overly easy to do in an ETL tool (or in pig), but if you
> can
> > > >> wrap the script in a loop it is possible.
> > > >>
> > > >> There are actually some really interesting parallel algorithms for
> > > >> finding connected components.  If you know you are only going to
> have
> > > >> two keys, it is a bit simpler to code (but not any more efficient).
> > > >> Essentially, you can implement your union-find algorithm as a series
> > > >> of sorts and merges, which on a large parallel system is actually
> > > >> quite fast.
> > > >>
> > > >> Tanton
> > > >>
> > > >> On Thu, Jun 10, 2010 at 3:33 PM, hc busy <hc...@gmail.com> wrote:
> > > >> > What's a transitive closure?
> > > >> >
> > > >> > On Thu, Jun 10, 2010 at 2:34 PM, Gianmarco <
> gianmarco.dfm@gmail.com
> > >
> > > >> wrote:
> > > >> >
> > > >> >> I think what he wants is a transitive closure of the relation,
> > which
> > > >> >> is not achievable in SQL-like languages alone (first order logic
> > > >> >> expressive power).
> > > >> >> I suppose Pig Latin falls in this category.
> > > >> >>
> > > >> >>
> > > >> >> -- Gianmarco
> > > >> >>
> > > >> >> On Thu, Jun 10, 2010 at 19:54, hc busy <hc...@gmail.com>
> wrote:
> > > >> >> > Is this like a tricky interview question? I don't see the
> pattern
> > > >> between
> > > >> >> > those three numbers you listed and the sample of the table.
> > > >> >> >
> > > >> >> > 770011 770083 524 1e-120 89 12
> > > >> >> > 770083 770011 494 1e-120 39 100
> > > >> >> >
> > > >> >> > ahh, I guess these are related because id1=id2 an id2=id1...
> > Here's
> > > a
> > > >> >> first
> > > >> >> > pass at the problem. Project:
> > > >> >> >
> > > >> >> > P1 = foreach table generate id1 as id1, id2 as id2, *;
> > > >> >> > P2 = foreach table generate id2 as id1, id1 as id2, *;
> > > >> >> > J = join P1 by (id1, id2), P2 by (id1,id2);
> > > >> >> >
> > > >> >> > and now J contains pairs of rows from original table where id1
> > and
> > > id2
> > > >> >> are
> > > >> >> > reversed.
> > > >> >> >
> > > >> >> > is this what you want?
> > > >> >> >
> > > >> >> > On Wed, Jun 9, 2010 at 6:54 PM, Renato Marroquín Mogrovejo <
> > > >> >> > renatoj.marroquin@gmail.com> wrote:
> > > >> >> >
> > > >> >> >> Hi everyone, today I came across with a particular query that
> I
> > > don't
> > > >> >> know
> > > >> >> >> how to model in PIG. Part of my data looks like this:
> > > >> >> >>
> > > >> >> >> Id1 Id2 Sc Va P1 P2
> > > >> >> >> --------- --------- ----- --------- ----- ----
> > > >> >> >> 770011 990201 401 1e-125 100 65
> > > >> >> >> 990201 770011 440 1e-125 100 42
> > > >> >> >> 770011 770083 524 1e-120 89 12
> > > >> >> >> 770083 770011 494 1e-120 39 100
> > > >> >> >> 990201 770083 341 1e-125 73 41
> > > >> >> >> 770083 990201 421 1e-125 90 85
> > > >> >> >> .
> > > >> >> >> .
> > > >> >> >> .
> > > >> >> >>
> > > >> >> >> what I would like to retrieve is something like
> > > >> >> >> this:                                             770011
> 990201
> > > >> 770083
> > > >> >> >> because they are records actually related.
> > > >> >> >> Any kind of ideas are highly appreciated. Thanks in advanced.
> > > >> >> >>
> > > >> >> >> Renato M.
> > > >> >> >>
> > > >> >> >
> > > >> >>
> > > >> >
> > > >>
> > > >
> > >
> >
>

Re: Help with a tricky query

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

hc, self-join should just work, but if it doesn't:

split table into table_2 if 1==1, table3 if 1==1;

OR

table_2 = foreach table generate *;
table3 = foreach table generate *;

AND THEN

T = join table by id1, table_2 by id2, table_3 by id3

-D

On Fri, Jun 11, 2010 at 10:59 AM, hc busy <hc...@gmail.com> wrote:

> Yeah, that IS hard in pig. I'm not even sure how to do a self-join in Pig.
> Like you can't really say
>
> T = join Table by id1, Table by id2, Table by id3;
>
> I think PigLatin will complain that it's confused which Table is and which
> id1 goes with which table.
>
> I had been proposing that we allow PigLatin to allow recursive functions,
> that way we can do loops. But recursive functions doesn't fit in the
> data-flow language paradigm.
>
> But I think people have offered many alternatives in terms of scripting
> language that can wrap PigLatin that has loop and flow control constructs.
>
>
> On Thu, Jun 10, 2010 at 9:55 PM, Tanton Gibbs <tanton.gibbs@gmail.com
> >wrote:
>
> > It really depends on the depth of your graph.  Are you only dealing
> > with connected components of depth 3 or could they be deeper.
> >
> > For instance, can you have x -> y, y -> z, and z -> a?  If so, do you
> > need your record to be x, y, z, a?  Or, are you guaranteed that it
> > will always be x -> z and y -> z?
> >
> > I really need more information about your data before I can offer too
> > much advice.
> >
> > On Thu, Jun 10, 2010 at 5:38 PM, Renato Marroquín Mogrovejo
> > <re...@gmail.com> wrote:
> > > Hi everybody, thanks a lot for your responses.
> > >
> > > I am actually not looking for a transitive closure, I am not trying to
> > > "infer" that given *x -> y* and *y -> z*, then *x -> z *(@hc.busy: that
> > is
> > > in short terms a transitive closure ^^)  because I have the data that
> > proves
> > > it. And yeah I am aware of first logic expressive power, but maybe I
> will
> > > think about giving an inference engine a try some time in the future.
> > >
> > > I am actually looking at a connected graph-like problem. The sample
> > records
> > > resemble a bidirectional triangle.
> > >
> > >            990201
> > >         /              \
> > >        /                \
> > >       /                  \
> > >      /                    \
> > > 770011  ------- 770083
> > >
> > > I tried using a smaller version of the problem by using SQL and got a
> > > horribly huge query which is a non-scalable possibility for me. I have
> > over
> > > 500Gbs in structured files. I used a triple join but I had to retrieve
> > all
> > > existing possibilities.
> > > So that is why I thought on using something like Hc.busy mentioned.
> > >
> > > Queries: idx1, idx2, sc, va, p1, p2
> > > possibilities: possibs = foreach queries generate idx1, flatten(a
> triple
> > > self join of the data)
> > >
> > > and by using the flattening command, get all possible combinations, but
> I
> > am
> > > not sure that would be the correct approach. Anyhow I thought maybe
> > > performing some breadth first search but that would give me an
> algorithm
> > > O(n2)  so  )=
> > > Hey Tanton how would you implement and use an union-find structure? Do
> > you
> > > think it is possible with PIG?
> > >
> > > Thanks again.
> > >
> > >
> > > 2010/6/10 Tanton Gibbs <ta...@gmail.com>
> > >
> > >> To be specific, he's looking for connected components in the graph.
> > >>
> > >> It's not overly easy to do in an ETL tool (or in pig), but if you can
> > >> wrap the script in a loop it is possible.
> > >>
> > >> There are actually some really interesting parallel algorithms for
> > >> finding connected components.  If you know you are only going to have
> > >> two keys, it is a bit simpler to code (but not any more efficient).
> > >> Essentially, you can implement your union-find algorithm as a series
> > >> of sorts and merges, which on a large parallel system is actually
> > >> quite fast.
> > >>
> > >> Tanton
> > >>
> > >> On Thu, Jun 10, 2010 at 3:33 PM, hc busy <hc...@gmail.com> wrote:
> > >> > What's a transitive closure?
> > >> >
> > >> > On Thu, Jun 10, 2010 at 2:34 PM, Gianmarco <gianmarco.dfm@gmail.com
> >
> > >> wrote:
> > >> >
> > >> >> I think what he wants is a transitive closure of the relation,
> which
> > >> >> is not achievable in SQL-like languages alone (first order logic
> > >> >> expressive power).
> > >> >> I suppose Pig Latin falls in this category.
> > >> >>
> > >> >>
> > >> >> -- Gianmarco
> > >> >>
> > >> >> On Thu, Jun 10, 2010 at 19:54, hc busy <hc...@gmail.com> wrote:
> > >> >> > Is this like a tricky interview question? I don't see the pattern
> > >> between
> > >> >> > those three numbers you listed and the sample of the table.
> > >> >> >
> > >> >> > 770011 770083 524 1e-120 89 12
> > >> >> > 770083 770011 494 1e-120 39 100
> > >> >> >
> > >> >> > ahh, I guess these are related because id1=id2 an id2=id1...
> Here's
> > a
> > >> >> first
> > >> >> > pass at the problem. Project:
> > >> >> >
> > >> >> > P1 = foreach table generate id1 as id1, id2 as id2, *;
> > >> >> > P2 = foreach table generate id2 as id1, id1 as id2, *;
> > >> >> > J = join P1 by (id1, id2), P2 by (id1,id2);
> > >> >> >
> > >> >> > and now J contains pairs of rows from original table where id1
> and
> > id2
> > >> >> are
> > >> >> > reversed.
> > >> >> >
> > >> >> > is this what you want?
> > >> >> >
> > >> >> > On Wed, Jun 9, 2010 at 6:54 PM, Renato Marroquín Mogrovejo <
> > >> >> > renatoj.marroquin@gmail.com> wrote:
> > >> >> >
> > >> >> >> Hi everyone, today I came across with a particular query that I
> > don't
> > >> >> know
> > >> >> >> how to model in PIG. Part of my data looks like this:
> > >> >> >>
> > >> >> >> Id1 Id2 Sc Va P1 P2
> > >> >> >> --------- --------- ----- --------- ----- ----
> > >> >> >> 770011 990201 401 1e-125 100 65
> > >> >> >> 990201 770011 440 1e-125 100 42
> > >> >> >> 770011 770083 524 1e-120 89 12
> > >> >> >> 770083 770011 494 1e-120 39 100
> > >> >> >> 990201 770083 341 1e-125 73 41
> > >> >> >> 770083 990201 421 1e-125 90 85
> > >> >> >> .
> > >> >> >> .
> > >> >> >> .
> > >> >> >>
> > >> >> >> what I would like to retrieve is something like
> > >> >> >> this:                                             770011 990201
> > >> 770083
> > >> >> >> because they are records actually related.
> > >> >> >> Any kind of ideas are highly appreciated. Thanks in advanced.
> > >> >> >>
> > >> >> >> Renato M.
> > >> >> >>
> > >> >> >
> > >> >>
> > >> >
> > >>
> > >
> >
>

Re: Help with a tricky query

Posted by hc busy <hc...@gmail.com>.

Yeah, that IS hard in pig. I'm not even sure how to do a self-join in Pig.
Like you can't really say

T = join Table by id1, Table by id2, Table by id3;

I think PigLatin will complain that it's confused which Table is and which
id1 goes with which table.

I had been proposing that we allow PigLatin to allow recursive functions,
that way we can do loops. But recursive functions doesn't fit in the
data-flow language paradigm.

But I think people have offered many alternatives in terms of scripting
language that can wrap PigLatin that has loop and flow control constructs.


On Thu, Jun 10, 2010 at 9:55 PM, Tanton Gibbs <ta...@gmail.com>wrote:

> It really depends on the depth of your graph.  Are you only dealing
> with connected components of depth 3 or could they be deeper.
>
> For instance, can you have x -> y, y -> z, and z -> a?  If so, do you
> need your record to be x, y, z, a?  Or, are you guaranteed that it
> will always be x -> z and y -> z?
>
> I really need more information about your data before I can offer too
> much advice.
>
> On Thu, Jun 10, 2010 at 5:38 PM, Renato Marroquín Mogrovejo
> <re...@gmail.com> wrote:
> > Hi everybody, thanks a lot for your responses.
> >
> > I am actually not looking for a transitive closure, I am not trying to
> > "infer" that given *x -> y* and *y -> z*, then *x -> z *(@hc.busy: that
> is
> > in short terms a transitive closure ^^)  because I have the data that
> proves
> > it. And yeah I am aware of first logic expressive power, but maybe I will
> > think about giving an inference engine a try some time in the future.
> >
> > I am actually looking at a connected graph-like problem. The sample
> records
> > resemble a bidirectional triangle.
> >
> >            990201
> >         /              \
> >        /                \
> >       /                  \
> >      /                    \
> > 770011  ------- 770083
> >
> > I tried using a smaller version of the problem by using SQL and got a
> > horribly huge query which is a non-scalable possibility for me. I have
> over
> > 500Gbs in structured files. I used a triple join but I had to retrieve
> all
> > existing possibilities.
> > So that is why I thought on using something like Hc.busy mentioned.
> >
> > Queries: idx1, idx2, sc, va, p1, p2
> > possibilities: possibs = foreach queries generate idx1, flatten(a triple
> > self join of the data)
> >
> > and by using the flattening command, get all possible combinations, but I
> am
> > not sure that would be the correct approach. Anyhow I thought maybe
> > performing some breadth first search but that would give me an algorithm
> > O(n2)  so  )=
> > Hey Tanton how would you implement and use an union-find structure? Do
> you
> > think it is possible with PIG?
> >
> > Thanks again.
> >
> >
> > 2010/6/10 Tanton Gibbs <ta...@gmail.com>
> >
> >> To be specific, he's looking for connected components in the graph.
> >>
> >> It's not overly easy to do in an ETL tool (or in pig), but if you can
> >> wrap the script in a loop it is possible.
> >>
> >> There are actually some really interesting parallel algorithms for
> >> finding connected components.  If you know you are only going to have
> >> two keys, it is a bit simpler to code (but not any more efficient).
> >> Essentially, you can implement your union-find algorithm as a series
> >> of sorts and merges, which on a large parallel system is actually
> >> quite fast.
> >>
> >> Tanton
> >>
> >> On Thu, Jun 10, 2010 at 3:33 PM, hc busy <hc...@gmail.com> wrote:
> >> > What's a transitive closure?
> >> >
> >> > On Thu, Jun 10, 2010 at 2:34 PM, Gianmarco <gi...@gmail.com>
> >> wrote:
> >> >
> >> >> I think what he wants is a transitive closure of the relation, which
> >> >> is not achievable in SQL-like languages alone (first order logic
> >> >> expressive power).
> >> >> I suppose Pig Latin falls in this category.
> >> >>
> >> >>
> >> >> -- Gianmarco
> >> >>
> >> >> On Thu, Jun 10, 2010 at 19:54, hc busy <hc...@gmail.com> wrote:
> >> >> > Is this like a tricky interview question? I don't see the pattern
> >> between
> >> >> > those three numbers you listed and the sample of the table.
> >> >> >
> >> >> > 770011 770083 524 1e-120 89 12
> >> >> > 770083 770011 494 1e-120 39 100
> >> >> >
> >> >> > ahh, I guess these are related because id1=id2 an id2=id1... Here's
> a
> >> >> first
> >> >> > pass at the problem. Project:
> >> >> >
> >> >> > P1 = foreach table generate id1 as id1, id2 as id2, *;
> >> >> > P2 = foreach table generate id2 as id1, id1 as id2, *;
> >> >> > J = join P1 by (id1, id2), P2 by (id1,id2);
> >> >> >
> >> >> > and now J contains pairs of rows from original table where id1 and
> id2
> >> >> are
> >> >> > reversed.
> >> >> >
> >> >> > is this what you want?
> >> >> >
> >> >> > On Wed, Jun 9, 2010 at 6:54 PM, Renato Marroquín Mogrovejo <
> >> >> > renatoj.marroquin@gmail.com> wrote:
> >> >> >
> >> >> >> Hi everyone, today I came across with a particular query that I
> don't
> >> >> know
> >> >> >> how to model in PIG. Part of my data looks like this:
> >> >> >>
> >> >> >> Id1 Id2 Sc Va P1 P2
> >> >> >> --------- --------- ----- --------- ----- ----
> >> >> >> 770011 990201 401 1e-125 100 65
> >> >> >> 990201 770011 440 1e-125 100 42
> >> >> >> 770011 770083 524 1e-120 89 12
> >> >> >> 770083 770011 494 1e-120 39 100
> >> >> >> 990201 770083 341 1e-125 73 41
> >> >> >> 770083 990201 421 1e-125 90 85
> >> >> >> .
> >> >> >> .
> >> >> >> .
> >> >> >>
> >> >> >> what I would like to retrieve is something like
> >> >> >> this:                                             770011 990201
> >> 770083
> >> >> >> because they are records actually related.
> >> >> >> Any kind of ideas are highly appreciated. Thanks in advanced.
> >> >> >>
> >> >> >> Renato M.
> >> >> >>
> >> >> >
> >> >>
> >> >
> >>
> >
>

Re: Help with a tricky query

Posted by Tanton Gibbs <ta...@gmail.com>.

It really depends on the depth of your graph.  Are you only dealing
with connected components of depth 3 or could they be deeper.

For instance, can you have x -> y, y -> z, and z -> a?  If so, do you
need your record to be x, y, z, a?  Or, are you guaranteed that it
will always be x -> z and y -> z?

I really need more information about your data before I can offer too
much advice.

On Thu, Jun 10, 2010 at 5:38 PM, Renato Marroquín Mogrovejo
<re...@gmail.com> wrote:
> Hi everybody, thanks a lot for your responses.
>
> I am actually not looking for a transitive closure, I am not trying to
> "infer" that given *x -> y* and *y -> z*, then *x -> z *(@hc.busy: that is
> in short terms a transitive closure ^^)  because I have the data that proves
> it. And yeah I am aware of first logic expressive power, but maybe I will
> think about giving an inference engine a try some time in the future.
>
> I am actually looking at a connected graph-like problem. The sample records
> resemble a bidirectional triangle.
>
>            990201
>         /              \
>        /                \
>       /                  \
>      /                    \
> 770011  ------- 770083
>
> I tried using a smaller version of the problem by using SQL and got a
> horribly huge query which is a non-scalable possibility for me. I have over
> 500Gbs in structured files. I used a triple join but I had to retrieve all
> existing possibilities.
> So that is why I thought on using something like Hc.busy mentioned.
>
> Queries: idx1, idx2, sc, va, p1, p2
> possibilities: possibs = foreach queries generate idx1, flatten(a triple
> self join of the data)
>
> and by using the flattening command, get all possible combinations, but I am
> not sure that would be the correct approach. Anyhow I thought maybe
> performing some breadth first search but that would give me an algorithm
> O(n2)  so  )=
> Hey Tanton how would you implement and use an union-find structure? Do you
> think it is possible with PIG?
>
> Thanks again.
>
>
> 2010/6/10 Tanton Gibbs <ta...@gmail.com>
>
>> To be specific, he's looking for connected components in the graph.
>>
>> It's not overly easy to do in an ETL tool (or in pig), but if you can
>> wrap the script in a loop it is possible.
>>
>> There are actually some really interesting parallel algorithms for
>> finding connected components.  If you know you are only going to have
>> two keys, it is a bit simpler to code (but not any more efficient).
>> Essentially, you can implement your union-find algorithm as a series
>> of sorts and merges, which on a large parallel system is actually
>> quite fast.
>>
>> Tanton
>>
>> On Thu, Jun 10, 2010 at 3:33 PM, hc busy <hc...@gmail.com> wrote:
>> > What's a transitive closure?
>> >
>> > On Thu, Jun 10, 2010 at 2:34 PM, Gianmarco <gi...@gmail.com>
>> wrote:
>> >
>> >> I think what he wants is a transitive closure of the relation, which
>> >> is not achievable in SQL-like languages alone (first order logic
>> >> expressive power).
>> >> I suppose Pig Latin falls in this category.
>> >>
>> >>
>> >> -- Gianmarco
>> >>
>> >> On Thu, Jun 10, 2010 at 19:54, hc busy <hc...@gmail.com> wrote:
>> >> > Is this like a tricky interview question? I don't see the pattern
>> between
>> >> > those three numbers you listed and the sample of the table.
>> >> >
>> >> > 770011 770083 524 1e-120 89 12
>> >> > 770083 770011 494 1e-120 39 100
>> >> >
>> >> > ahh, I guess these are related because id1=id2 an id2=id1... Here's a
>> >> first
>> >> > pass at the problem. Project:
>> >> >
>> >> > P1 = foreach table generate id1 as id1, id2 as id2, *;
>> >> > P2 = foreach table generate id2 as id1, id1 as id2, *;
>> >> > J = join P1 by (id1, id2), P2 by (id1,id2);
>> >> >
>> >> > and now J contains pairs of rows from original table where id1 and id2
>> >> are
>> >> > reversed.
>> >> >
>> >> > is this what you want?
>> >> >
>> >> > On Wed, Jun 9, 2010 at 6:54 PM, Renato Marroquín Mogrovejo <
>> >> > renatoj.marroquin@gmail.com> wrote:
>> >> >
>> >> >> Hi everyone, today I came across with a particular query that I don't
>> >> know
>> >> >> how to model in PIG. Part of my data looks like this:
>> >> >>
>> >> >> Id1 Id2 Sc Va P1 P2
>> >> >> --------- --------- ----- --------- ----- ----
>> >> >> 770011 990201 401 1e-125 100 65
>> >> >> 990201 770011 440 1e-125 100 42
>> >> >> 770011 770083 524 1e-120 89 12
>> >> >> 770083 770011 494 1e-120 39 100
>> >> >> 990201 770083 341 1e-125 73 41
>> >> >> 770083 990201 421 1e-125 90 85
>> >> >> .
>> >> >> .
>> >> >> .
>> >> >>
>> >> >> what I would like to retrieve is something like
>> >> >> this:                                             770011 990201
>> 770083
>> >> >> because they are records actually related.
>> >> >> Any kind of ideas are highly appreciated. Thanks in advanced.
>> >> >>
>> >> >> Renato M.
>> >> >>
>> >> >
>> >>
>> >
>>
>

Re: Help with a tricky query

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.

Hi everybody, thanks a lot for your responses.

I am actually not looking for a transitive closure, I am not trying to
"infer" that given *x -> y* and *y -> z*, then *x -> z *(@hc.busy: that is
in short terms a transitive closure ^^)  because I have the data that proves
it. And yeah I am aware of first logic expressive power, but maybe I will
think about giving an inference engine a try some time in the future.

I am actually looking at a connected graph-like problem. The sample records
resemble a bidirectional triangle.

            990201
         /              \
        /                \
       /                  \
      /                    \
770011  ------- 770083

I tried using a smaller version of the problem by using SQL and got a
horribly huge query which is a non-scalable possibility for me. I have over
500Gbs in structured files. I used a triple join but I had to retrieve all
existing possibilities.
So that is why I thought on using something like Hc.busy mentioned.

Queries: idx1, idx2, sc, va, p1, p2
possibilities: possibs = foreach queries generate idx1, flatten(a triple
self join of the data)

and by using the flattening command, get all possible combinations, but I am
not sure that would be the correct approach. Anyhow I thought maybe
performing some breadth first search but that would give me an algorithm
O(n2)  so  )=
Hey Tanton how would you implement and use an union-find structure? Do you
think it is possible with PIG?

Thanks again.


2010/6/10 Tanton Gibbs <ta...@gmail.com>

> To be specific, he's looking for connected components in the graph.
>
> It's not overly easy to do in an ETL tool (or in pig), but if you can
> wrap the script in a loop it is possible.
>
> There are actually some really interesting parallel algorithms for
> finding connected components.  If you know you are only going to have
> two keys, it is a bit simpler to code (but not any more efficient).
> Essentially, you can implement your union-find algorithm as a series
> of sorts and merges, which on a large parallel system is actually
> quite fast.
>
> Tanton
>
> On Thu, Jun 10, 2010 at 3:33 PM, hc busy <hc...@gmail.com> wrote:
> > What's a transitive closure?
> >
> > On Thu, Jun 10, 2010 at 2:34 PM, Gianmarco <gi...@gmail.com>
> wrote:
> >
> >> I think what he wants is a transitive closure of the relation, which
> >> is not achievable in SQL-like languages alone (first order logic
> >> expressive power).
> >> I suppose Pig Latin falls in this category.
> >>
> >>
> >> -- Gianmarco
> >>
> >> On Thu, Jun 10, 2010 at 19:54, hc busy <hc...@gmail.com> wrote:
> >> > Is this like a tricky interview question? I don't see the pattern
> between
> >> > those three numbers you listed and the sample of the table.
> >> >
> >> > 770011 770083 524 1e-120 89 12
> >> > 770083 770011 494 1e-120 39 100
> >> >
> >> > ahh, I guess these are related because id1=id2 an id2=id1... Here's a
> >> first
> >> > pass at the problem. Project:
> >> >
> >> > P1 = foreach table generate id1 as id1, id2 as id2, *;
> >> > P2 = foreach table generate id2 as id1, id1 as id2, *;
> >> > J = join P1 by (id1, id2), P2 by (id1,id2);
> >> >
> >> > and now J contains pairs of rows from original table where id1 and id2
> >> are
> >> > reversed.
> >> >
> >> > is this what you want?
> >> >
> >> > On Wed, Jun 9, 2010 at 6:54 PM, Renato Marroquín Mogrovejo <
> >> > renatoj.marroquin@gmail.com> wrote:
> >> >
> >> >> Hi everyone, today I came across with a particular query that I don't
> >> know
> >> >> how to model in PIG. Part of my data looks like this:
> >> >>
> >> >> Id1 Id2 Sc Va P1 P2
> >> >> --------- --------- ----- --------- ----- ----
> >> >> 770011 990201 401 1e-125 100 65
> >> >> 990201 770011 440 1e-125 100 42
> >> >> 770011 770083 524 1e-120 89 12
> >> >> 770083 770011 494 1e-120 39 100
> >> >> 990201 770083 341 1e-125 73 41
> >> >> 770083 990201 421 1e-125 90 85
> >> >> .
> >> >> .
> >> >> .
> >> >>
> >> >> what I would like to retrieve is something like
> >> >> this:                                             770011 990201
> 770083
> >> >> because they are records actually related.
> >> >> Any kind of ideas are highly appreciated. Thanks in advanced.
> >> >>
> >> >> Renato M.
> >> >>
> >> >
> >>
> >
>

Re: Help with a tricky query

Posted by Tanton Gibbs <ta...@gmail.com>.

To be specific, he's looking for connected components in the graph.

It's not overly easy to do in an ETL tool (or in pig), but if you can
wrap the script in a loop it is possible.

There are actually some really interesting parallel algorithms for
finding connected components.  If you know you are only going to have
two keys, it is a bit simpler to code (but not any more efficient).
Essentially, you can implement your union-find algorithm as a series
of sorts and merges, which on a large parallel system is actually
quite fast.

Tanton

On Thu, Jun 10, 2010 at 3:33 PM, hc busy <hc...@gmail.com> wrote:
> What's a transitive closure?
>
> On Thu, Jun 10, 2010 at 2:34 PM, Gianmarco <gi...@gmail.com> wrote:
>
>> I think what he wants is a transitive closure of the relation, which
>> is not achievable in SQL-like languages alone (first order logic
>> expressive power).
>> I suppose Pig Latin falls in this category.
>>
>>
>> -- Gianmarco
>>
>> On Thu, Jun 10, 2010 at 19:54, hc busy <hc...@gmail.com> wrote:
>> > Is this like a tricky interview question? I don't see the pattern between
>> > those three numbers you listed and the sample of the table.
>> >
>> > 770011 770083 524 1e-120 89 12
>> > 770083 770011 494 1e-120 39 100
>> >
>> > ahh, I guess these are related because id1=id2 an id2=id1... Here's a
>> first
>> > pass at the problem. Project:
>> >
>> > P1 = foreach table generate id1 as id1, id2 as id2, *;
>> > P2 = foreach table generate id2 as id1, id1 as id2, *;
>> > J = join P1 by (id1, id2), P2 by (id1,id2);
>> >
>> > and now J contains pairs of rows from original table where id1 and id2
>> are
>> > reversed.
>> >
>> > is this what you want?
>> >
>> > On Wed, Jun 9, 2010 at 6:54 PM, Renato Marroquín Mogrovejo <
>> > renatoj.marroquin@gmail.com> wrote:
>> >
>> >> Hi everyone, today I came across with a particular query that I don't
>> know
>> >> how to model in PIG. Part of my data looks like this:
>> >>
>> >> Id1 Id2 Sc Va P1 P2
>> >> --------- --------- ----- --------- ----- ----
>> >> 770011 990201 401 1e-125 100 65
>> >> 990201 770011 440 1e-125 100 42
>> >> 770011 770083 524 1e-120 89 12
>> >> 770083 770011 494 1e-120 39 100
>> >> 990201 770083 341 1e-125 73 41
>> >> 770083 990201 421 1e-125 90 85
>> >> .
>> >> .
>> >> .
>> >>
>> >> what I would like to retrieve is something like
>> >> this:                                             770011 990201 770083
>> >> because they are records actually related.
>> >> Any kind of ideas are highly appreciated. Thanks in advanced.
>> >>
>> >> Renato M.
>> >>
>> >
>>
>

Re: Help with a tricky query

Posted by hc busy <hc...@gmail.com>.

What's a transitive closure?

On Thu, Jun 10, 2010 at 2:34 PM, Gianmarco <gi...@gmail.com> wrote:

> I think what he wants is a transitive closure of the relation, which
> is not achievable in SQL-like languages alone (first order logic
> expressive power).
> I suppose Pig Latin falls in this category.
>
>
> -- Gianmarco
>
> On Thu, Jun 10, 2010 at 19:54, hc busy <hc...@gmail.com> wrote:
> > Is this like a tricky interview question? I don't see the pattern between
> > those three numbers you listed and the sample of the table.
> >
> > 770011 770083 524 1e-120 89 12
> > 770083 770011 494 1e-120 39 100
> >
> > ahh, I guess these are related because id1=id2 an id2=id1... Here's a
> first
> > pass at the problem. Project:
> >
> > P1 = foreach table generate id1 as id1, id2 as id2, *;
> > P2 = foreach table generate id2 as id1, id1 as id2, *;
> > J = join P1 by (id1, id2), P2 by (id1,id2);
> >
> > and now J contains pairs of rows from original table where id1 and id2
> are
> > reversed.
> >
> > is this what you want?
> >
> > On Wed, Jun 9, 2010 at 6:54 PM, Renato Marroquín Mogrovejo <
> > renatoj.marroquin@gmail.com> wrote:
> >
> >> Hi everyone, today I came across with a particular query that I don't
> know
> >> how to model in PIG. Part of my data looks like this:
> >>
> >> Id1 Id2 Sc Va P1 P2
> >> --------- --------- ----- --------- ----- ----
> >> 770011 990201 401 1e-125 100 65
> >> 990201 770011 440 1e-125 100 42
> >> 770011 770083 524 1e-120 89 12
> >> 770083 770011 494 1e-120 39 100
> >> 990201 770083 341 1e-125 73 41
> >> 770083 990201 421 1e-125 90 85
> >> .
> >> .
> >> .
> >>
> >> what I would like to retrieve is something like
> >> this:                                             770011 990201 770083
> >> because they are records actually related.
> >> Any kind of ideas are highly appreciated. Thanks in advanced.
> >>
> >> Renato M.
> >>
> >
>

Re: Help with a tricky query

Posted by Gianmarco <gi...@gmail.com>.

I think what he wants is a transitive closure of the relation, which
is not achievable in SQL-like languages alone (first order logic
expressive power).
I suppose Pig Latin falls in this category.


-- Gianmarco

On Thu, Jun 10, 2010 at 19:54, hc busy <hc...@gmail.com> wrote:
> Is this like a tricky interview question? I don't see the pattern between
> those three numbers you listed and the sample of the table.
>
> 770011 770083 524 1e-120 89 12
> 770083 770011 494 1e-120 39 100
>
> ahh, I guess these are related because id1=id2 an id2=id1... Here's a first
> pass at the problem. Project:
>
> P1 = foreach table generate id1 as id1, id2 as id2, *;
> P2 = foreach table generate id2 as id1, id1 as id2, *;
> J = join P1 by (id1, id2), P2 by (id1,id2);
>
> and now J contains pairs of rows from original table where id1 and id2 are
> reversed.
>
> is this what you want?
>
> On Wed, Jun 9, 2010 at 6:54 PM, Renato Marroquín Mogrovejo <
> renatoj.marroquin@gmail.com> wrote:
>
>> Hi everyone, today I came across with a particular query that I don't know
>> how to model in PIG. Part of my data looks like this:
>>
>> Id1 Id2 Sc Va P1 P2
>> --------- --------- ----- --------- ----- ----
>> 770011 990201 401 1e-125 100 65
>> 990201 770011 440 1e-125 100 42
>> 770011 770083 524 1e-120 89 12
>> 770083 770011 494 1e-120 39 100
>> 990201 770083 341 1e-125 73 41
>> 770083 990201 421 1e-125 90 85
>> .
>> .
>> .
>>
>> what I would like to retrieve is something like
>> this:                                             770011 990201 770083
>> because they are records actually related.
>> Any kind of ideas are highly appreciated. Thanks in advanced.
>>
>> Renato M.
>>
>

Re: Help with a tricky query

Posted by hc busy <hc...@gmail.com>.

Is this like a tricky interview question? I don't see the pattern between
those three numbers you listed and the sample of the table.

770011 770083 524 1e-120 89 12
770083 770011 494 1e-120 39 100

ahh, I guess these are related because id1=id2 an id2=id1... Here's a first
pass at the problem. Project:

P1 = foreach table generate id1 as id1, id2 as id2, *;
P2 = foreach table generate id2 as id1, id1 as id2, *;
J = join P1 by (id1, id2), P2 by (id1,id2);

and now J contains pairs of rows from original table where id1 and id2 are
reversed.

is this what you want?

On Wed, Jun 9, 2010 at 6:54 PM, Renato Marroquín Mogrovejo <
renatoj.marroquin@gmail.com> wrote:

> Hi everyone, today I came across with a particular query that I don't know
> how to model in PIG. Part of my data looks like this:
>
> Id1 Id2 Sc Va P1 P2
> --------- --------- ----- --------- ----- ----
> 770011 990201 401 1e-125 100 65
> 990201 770011 440 1e-125 100 42
> 770011 770083 524 1e-120 89 12
> 770083 770011 494 1e-120 39 100
> 990201 770083 341 1e-125 73 41
> 770083 990201 421 1e-125 90 85
> .
> .
> .
>
> what I would like to retrieve is something like
> this:                                             770011 990201 770083
> because they are records actually related.
> Any kind of ideas are highly appreciated. Thanks in advanced.
>
> Renato M.
>