Posted to user@pig.apache.org by Russell Jurney <ru...@gmail.com> on 2010/05/07 21:21:18 UTC

Re: Having UNION'd several datasets (and lost schema), how can I re-cast to a bag?

I have tried to UNION the results before the group, and I have found that
once I UNION, I can never recreate a schema.  Is this a bug?

> DESCRIBE thing1;
thing1: {name: chararray,property1: chararray,property2: double}
> DESCRIBE thing2;
thing2: {name: chararray,property1: chararray,property2: double}

combined_things = UNION thing1, thing2;
> DESCRIBE combined_things;
Schema for combined_things unknown.

> DUMP combined_things;
Output is fine!

> combined_things = FOREACH combined_things GENERATE $0 AS name:chararray,
$1 AS property1:chararray, $2 AS property2:double;
2010-05-07 11:49:17,015 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1022: Type mismatch merging schema prefix. Field Schema: bytearray.
Other Field Schema: given: chararray

> combined_things = FOREACH combined_things GENERATE (chararray)$0 AS
name:chararray, (chararray)$1 AS property1:chararray, (double )$2 AS
property2:double;
2010-05-07 11:52:53,305 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 2999: Unexpected internal error.
org.apache.pig.impl.logicalLayer.FrontendException cannot be cast to
java.lang.Error

My schema is gone, and I can never ever have it back because I have unioned?
Is that a bug, or is this the intended behavior?

Russ

On Thu, May 6, 2010 at 5:22 PM, Russell Jurney <ru...@gmail.com> wrote:

> I have a bunch of grouped datasets that I need to union and store.  When I
> union them, they lose their schema.  I need the schema for my output storage
> function to work.  How do I recreate a schema with a bag of tuples in it
> with a GENERATE/AS?
>
> The schema of each union'd source (all the same) was: g_records: {key:
> chararray,values: {A2: chararray,A3: double}}
>
> Code:
>
> ------------------
>
> records = LOAD 'records' USING PigStorage('\t') AS (A1:chararray,
> A2:chararray, A3:double);
> g_records = GROUP records BY A1;
> g_records = FOREACH g_records GENERATE $0 AS key:chararray, $1 AS values;
> g_records = FOREACH g_records GENERATE key, values.(A2, A3);
>
> > DESCRIBE g_records: {key: chararray,values: {A2: chararray,A3: double}}
>
> all_g_records = UNION g_records, g_records_2, g_records_3, g_records_4;
>
> /* Problem for me: */
> > DESCRIBE all_g_records: Schema for all_g_records unknown.
>
> output_records = FOREACH all_g_records GENERATE $0 AS key:chararray, $1 AS
> values:bag []  # errr... how?
>
> ------------------
>
> Thanks!
>
> Russell Jurney
> russell.jurney@gmail.com
>
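
For the bag case asked about above, here is a minimal sketch. It assumes a
Pig release whose cast operator accepts complex types in the documented
(bag{tuple(...)}) form; nothing in this thread establishes that 0.6 or the
poster's branch accepts it, and it was not run against the poster's data.
The relation and field names are the ones from the question.

```pig
-- Sketch only: assumes all_g_records is the schema-less result of the
-- UNION above, so Pig treats $0 and $1 as bytearray.
-- The cast names types only; the inner field names A2/A3 are not restored.
output_records = FOREACH all_g_records GENERATE
    (chararray)$0 AS key,
    (bag{tuple(chararray, double)})$1 AS values;

-- Check what schema Pig now reports before handing it to the store function.
DESCRIBE output_records;
```

If the store function also needs the inner names, a second FOREACH that
re-projects the bag fields is one option, but that is likewise untested here.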

Re: Having UNION'd several datasets (and lost schema), how can I re-cast to a bag?

Posted by Russell Jurney <ru...@gmail.com>.
I have my own branch, which differs even from LinkedIn's branch.  I'll look
at updating to something newer.

On Fri, May 7, 2010 at 1:16 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> I am on the official release of 0.6 with a few patches, mostly adding
> piggybank stuff.
> You can grab my particular blend from the pig-twttr branch on my
> github fork. It's all existing patches, just a bit of occasional
> backporting.
> No guarantees regarding stability, for that you need to hit up Cloudera
> :-).
>
> -D

Re: Having UNION'd several datasets (and lost schema), how can I re-cast to a bag?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
I am on the official release of 0.6 with a few patches, mostly adding
piggybank stuff.
You can grab my particular blend from the pig-twttr branch on my
github fork. It's all existing patches, just a bit of occasional
backporting.
No guarantees regarding stability; for that you need to hit up Cloudera :-).

-D

On Fri, May 7, 2010 at 1:04 PM, Russell Jurney <ru...@gmail.com> wrote:
> Thanks, I'm on Apache Pig version 0.6.1-dev (rexported).  Perhaps I should
> upgrade!
>
> I was able to code my way out of the schema black hole with this:
>
> all_things = UNION thing1, thing2, ...;
> all_things = FOREACH all_things GENERATE $0 AS field1, $1 AS field2, $2
> AS field3;
> all_things = FOREACH all_things GENERATE (chararray) field1 AS
> field1:chararray, (chararray)field2 AS field2:chararray,
> (double)field3 AS field3:double;
>
> Apparently in 0.6 you could cast to a named bytearray, then cast that
> bytearray to any named type.
>
> Russ

Re: Having UNION'd several datasets (and lost schema), how can I re-cast to a bag?

Posted by Russell Jurney <ru...@gmail.com>.
Thanks, I'm on Apache Pig version 0.6.1-dev (rexported).  Perhaps I should
upgrade!

I was able to code my way out of the schema black hole with this:

all_things = UNION thing1, thing2, ...;
all_things = FOREACH all_things GENERATE $0 AS field1, $1 AS field2, $2
AS field3;
all_things = FOREACH all_things GENERATE (chararray) field1 AS
field1:chararray, (chararray)field2 AS field2:chararray,
(double)field3 AS field3:double;

Apparently in 0.6 you can first give each positional field a name (leaving
it a bytearray), then cast that named bytearray to any type in a second
FOREACH.

Russ
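
The two-step workaround described above can be laid out as a commented
script. This is a sketch under the thread's own assumptions: thing1 and
thing2 are already defined with the schema shown in the first message, and
the intermediate aliases named_things and typed_things are hypothetical
names introduced here for readability. Behavior on versions other than the
poster's 0.6.1-dev build is not confirmed.

```pig
-- After this UNION the poster's build reports "Schema ... unknown".
all_things = UNION thing1, thing2;

-- Step 1: only attach names to the positional fields; no types yet,
-- so each field stays a bytearray.
named_things = FOREACH all_things GENERATE
    $0 AS name, $1 AS property1, $2 AS property2;

-- Step 2: cast the named bytearray fields to the intended types.
typed_things = FOREACH named_things GENERATE
    (chararray)name      AS name:chararray,
    (chararray)property1 AS property1:chararray,
    (double)property2    AS property2:double;

DESCRIBE typed_things;
```

Splitting naming from casting is the point of the design: the single-step
attempts in the first message are the ones that raised ERROR 1022 and
ERROR 2999.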

On Fri, May 7, 2010 at 1:00 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> What version of pig are you using?
>
> [dmitriy@sjc1j039 ~]$ pig -x local
> 2010-05-07 19:58:12,905 [main] INFO  org.apache.pig.Main - Logging
> error messages to: /var/log/pig/pig_1273262292904.log
> grunt> set1 = load 'tmp/numbers' as (a:chararray, b:int, c:int);
> grunt> set2 = load 'tmp/numbers' as (a:chararray, b:int, c:int);
> grunt> describe set1;
> set1: {a: chararray,b: int,c: int}
> grunt> describe set2;
> set2: {a: chararray,b: int,c: int}
> grunt> unioned = union set1, set2;
> grunt> describe unioned;
> unioned: {a: chararray,b: int,c: int}

Re: Having UNION'd several datasets (and lost schema), how can I re-cast to a bag?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
What version of pig are you using?

[dmitriy@sjc1j039 ~]$ pig -x local
2010-05-07 19:58:12,905 [main] INFO  org.apache.pig.Main - Logging
error messages to: /var/log/pig/pig_1273262292904.log
grunt> set1 = load 'tmp/numbers' as (a:chararray, b:int, c:int);
grunt> set2 = load 'tmp/numbers' as (a:chararray, b:int, c:int);
grunt> describe set1;
set1: {a: chararray,b: int,c: int}
grunt> describe set2;
set2: {a: chararray,b: int,c: int}
grunt> unioned = union set1, set2;
grunt> describe unioned;
unioned: {a: chararray,b: int,c: int}
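
For readers on later Pig releases, the UNION operator also has an ONSCHEMA
form that merges inputs by field name. It is not part of 0.6 (it is believed
to have arrived in the 0.8 line), so the sketch below is an aside rather
than a fix for the version discussed in this thread. It reuses the
tmp/numbers file and aliases from the session above.

```pig
set1 = LOAD 'tmp/numbers' AS (a:chararray, b:int, c:int);
set2 = LOAD 'tmp/numbers' AS (a:chararray, b:int, c:int);

-- ONSCHEMA requires every input relation to have a schema and
-- lines the fields up by name rather than by position.
unioned = UNION ONSCHEMA set1, set2;

DESCRIBE unioned;
```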


On Fri, May 7, 2010 at 12:21 PM, Russell Jurney
<ru...@gmail.com> wrote:
> I have tried to UNION the results before the group, and I have found that
> once I UNION, I can never recreate a schema.  Is this a bug?
>
>> DESCRIBE thing1;
> thing1: {name: chararray,property1: chararray,property2: double}
>> DESCRIBE thing2;
> thing2: {name: chararray,property1: chararray,property2: double}
>
> combined_things = UNION thing1, thing2;
>> DESCRIBE combined_things;
> Schema for combined_things unknown.
>
>> DUMP combined_things;
> Output is fine!
>
>> combined_things = FOREACH combined_things GENERATE $0 AS name:chararray,
> $1 AS property1:chararray, $2 AS property2:double;
> 2010-05-07 11:49:17,015 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1022: Type mismatch merging schema prefix. Field Schema: bytearray.
> Other Field Schema: given: chararray
>
>> combined_things = FOREACH combined_things GENERATE (chararray)$0 AS
> name:chararray, (chararray)$1 AS property1:chararray, (double )$2 AS
> property2:double;
> 2010-05-07 11:52:53,305 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 2999: Unexpected internal error.
> org.apache.pig.impl.logicalLayer.FrontendException cannot be cast to
> java.lang.Error
>
> My schema is gone, and I can never ever have it back because I have unioned?
>  Is that a bug, or is this the intended behavior?
>
> Russ