Posted to user@pig.apache.org by Rohini Palaniswamy <ro...@gmail.com> on 2014/05/02 20:33:16 UTC

Re: Strange CROSS behavior

This looks like a bug. Can you please file a jira with steps to reproduce?


On Fri, Apr 18, 2014 at 2:45 PM, Alex Rasmussen <al...@trifacta.com> wrote:

> I'm using PigStorage(',') for all stores.
>
> I agree about the expensiveness of CROSS, but I'm still kind of confused as
> to why it would lose records in this case.
>
> --Alex
>
>
> On Fri, Apr 18, 2014 at 2:28 PM, Pradeep Gollakota <pradeepg26@gmail.com> wrote:
>
> > What storage func are you using? My guess is that there is some shared
> > state in the storage func. Take a look at this SO answer, which deals
> > with shared state in stores:
> >
> > http://stackoverflow.com/questions/20225842/apache-pig-append-one-dataset-to-another-one/20235592#20235592
> >
> > The reason this doesn't occur is that PigStorage doesn't have shared
> > state. So, in T3, you're loading from text files instead of your original
> > store func.
> >
> > CROSS is pretty expensive by nature. If one of your datasets is small
> > enough to load into memory, you can use a fragment-replicate join instead.
> >
> >
> > On Fri, Apr 18, 2014 at 11:43 AM, Alex Rasmussen <alexras@trifacta.com> wrote:
> >
> > > I'm noticing some really strange behavior with a CROSS operation in one of
> > > my scripts.
> > >
> > > I'm CROSSing a table T1 with another table T2 to produce T3. T1 has one
> > > row, and T2 has 2,982,035 rows.
> > >
> > > If I STORE both T1 and T2 before CROSSing them together to get T3, like so:
> > >
> > > -- ... Long script that, among other things, creates T1 and T2 ...
> > > STORE T1 INTO 'hdfs://namenode/x/T1' USING PigStorage(',');
> > > STORE T2 INTO 'hdfs://namenode/x/T2' USING PigStorage(',');
> > > T3 = CROSS T2, T1;
> > >
> > > then I get what I expect; T3 has 2,982,035 records.
> > >
> > > However, if I omit the STOREs and run the CROSS directly, T3 only has
> > > 1,492,977 records.
> > >
> > > I've run EXPLAIN on both the script with the STOREs and the script
> > > without, and their query plans are identical.
> > >
> > > I'm going to end up refactoring the script to get rid of the CROSS anyway
> > > since it's expensive, but I'm curious as to whether I'm doing something
> > > wrong or if there may be a subtle bug in CROSS.
> > >
> > > I'm using Pig version 0.11.0-cdh4.5.0
> > >
> > > Any insight you could give me here would be greatly appreciated.
> > >
> > > Thanks,
> > > --Alex
> > >
> >
>
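
On the fragment-replicate join suggested in the quoted reply: a minimal sketch of how the CROSS might be expressed that way, assuming T1 (one row) and T2 are already defined as in the script above. The relation names T1_keyed, T2_keyed, and T3_joined and the constant column join_key are hypothetical, introduced only to give JOIN a key to match on. This has not been run against the script in question, and it does not address the record loss itself.

```pig
-- Assumption: T1 (one row) and T2 (large) are defined earlier in the script.
-- join_key is a hypothetical constant column so that every row of T2
-- matches the single row of T1.
T1_keyed = FOREACH T1 GENERATE 1 AS join_key, *;
T2_keyed = FOREACH T2 GENERATE 1 AS join_key, *;

-- In a replicated join the small relation is listed last; Pig loads it
-- into memory and performs the join on the map side.
T3_joined = JOIN T2_keyed BY join_key, T1_keyed BY join_key USING 'replicated';
```

T3_joined still carries both join_key columns, so a final FOREACH would be needed to project them away before comparing the result with the output of CROSS.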