Posted to user@pig.apache.org by Jonathan Coveney <jc...@gmail.com> on 2010/12/28 23:08:53 UTC

Possible deficiency in describe?

So, I made a dumb little Python script that parses a Pig script, sees what
stores there are, uses Pig's DESCRIBE to get the schema of each relation
being stored, and then uses that info to generate a new file that has the
proper loader/schema. I felt this was useful because I found myself making
intermediate stores and then having a hard time writing the proper loader
when there were a lot of columns (especially remembering the types).
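
For illustration, the first step of such a script (finding the STORE
statements and building a DESCRIBE for each stored alias) might look like
the minimal Python sketch below. The regex and the names find_stores and
describe_statements are assumptions made for this example, not the original
script. The pattern is deliberately naive: it does not skip comments or
handle statements split across lines.

```python
import re

# Matches statements such as: STORE t3 INTO 'output.txt';
# Group 1 is the alias being stored, group 2 is the output path.
STORE_RE = re.compile(
    r"^\s*STORE\s+(\w+)\s+INTO\s+'([^']+)'",
    re.IGNORECASE | re.MULTILINE,
)


def find_stores(script_text):
    """Return (alias, path) pairs for every STORE statement, in script order."""
    return STORE_RE.findall(script_text)


def describe_statements(script_text):
    """Build one DESCRIBE statement per stored alias."""
    return ["DESCRIBE %s;" % alias for alias, _ in find_stores(script_text)]


script = """
t1 = LOAD 'test.txt' AS (n1:int, n2:int);
t2 = GROUP t1 BY n1;
STORE t2 INTO 'grouped';
"""

print(find_stores(script))
print(describe_statements(script))
```

The generated DESCRIBE statements would then be run through Pig and their
output collected, which is the part that depends on the local setup.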

However, it seems that the result from DESCRIBE is not adequate to do a
load. For example, I have test.txt, which is literally just random pairs of
numbers, i.e.:

1 2
1 3
1 4
2 5
2 6
3 7
3 8
4 9
5 10
6 11
7 12
8 13
8 14
8 15

and so on.

I do this:

t1 = LOAD 'test.txt' AS (n1:int, n2:int);
t2 = GROUP t1 BY n1;
t3 = GROUP t2 BY group;

DESCRIBE t3;
STORE t3 INTO 'output.txt';

The query runs without a hitch; however, there is an issue.

This is what DESCRIBE gives:

t3: {group: int,t2: {group: int,t1: {n1: int,n2: int}}}

However, this won't let you load the file.

The output has the form
x{(y,{(a,b)})}

And I'm not really sure how to even go about creating a loader that would
properly load it. Suffice it to say, it seems pretty complicated to store
and then load anything that isn't a flat file. Is this by design? Is there
an easier way to go from the schema, as per DESCRIBE, to the schema you'd
use to load it?

I'm curious what people do in practice. I could probably extend the script I
made to go from DESCRIBE schema -> loading schema (if the Pig loader can even
load things that have brackets and all that), but I want to know what the
limitations are.
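
As a rough idea of what the DESCRIBE schema -> loading schema step could look
like, here is a minimal Python sketch. It assumes the DESCRIBE output format
shown above, where a nested {...} denotes a bag, and it also accepts the
variant where the tuple inside a bag is printed with parentheses. The function
names and the tuple alias "t" are hypothetical choices for this example.
Whether the text loader can parse the bracketed data back depends on the Pig
version, and field names that collide with Pig keywords (such as group) may
need renaming, which the sketch does not attempt.

```python
def split_top_level(text):
    """Split on commas that are not nested inside braces or parentheses."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch in "{(":
            depth += 1
        elif ch in "})":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(text[start:i])
            start = i + 1
    parts.append(text[start:])
    return parts


def convert_fields(body):
    """Convert 'a: int,b: {c: int}' into 'a:int, b:bag{t:tuple(c:int)}'."""
    out = []
    for field in split_top_level(body):
        if not field.strip():
            continue
        name, _, ftype = field.strip().partition(":")
        name, ftype = name.strip(), ftype.strip()
        if ftype.startswith("{") and ftype.endswith("}"):
            inner = ftype[1:-1].strip()
            # Some versions print the tuple parentheses inside a bag.
            if inner.startswith("(") and inner.endswith(")"):
                inner = inner[1:-1]
            # "t" is a hypothetical tuple alias chosen for this sketch.
            ftype = "bag{t:tuple(%s)}" % convert_fields(inner)
        elif ftype.startswith("(") and ftype.endswith(")"):
            ftype = "tuple(%s)" % convert_fields(ftype[1:-1])
        out.append("%s:%s" % (name, ftype))
    return ", ".join(out)


def describe_to_load_schema(describe_line):
    """Turn one line of DESCRIBE output into an AS (...) schema string."""
    _, _, schema = describe_line.partition(":")
    schema = schema.strip()
    if not (schema.startswith("{") and schema.endswith("}")):
        raise ValueError("expected '<alias>: {...}', got %r" % describe_line)
    return "(%s)" % convert_fields(schema[1:-1])


line = "t3: {group: int,t2: {group: int,t1: {n1: int,n2: int}}}"
print(describe_to_load_schema(line))
# (group:int, t2:bag{t:tuple(group:int, t1:bag{t:tuple(n1:int, n2:int)})})
```

The resulting string is meant to go after AS in a LOAD statement. Simple
types pass through unchanged, so the sketch only has to care about nesting.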

As always, I apologize if there is an easy answer to this. Thanks.

Re: Possible deficiency in describe?

Posted by Thejas M Nair <te...@yahoo-inc.com>.

The BinStorage format should not change between Pig versions. It is like an interface: it should not change unless there is a very strong reason.
It used to be the format used to (de)serialize data between Pig stages, but when that format was optimized as part of JIRA PIG-1472, a new format/loader was introduced instead of changing BinStorage.

-Thejas




On 12/28/10 3:41 PM, "Dmitriy Ryaboy" <dv...@gmail.com> wrote:

BinStorage is more efficient and doesn't have the trouble with nested data
representations you encountered in PigStorage. The downside is only that
it's not human-readable, and that it might change between versions of Pig
(though so far we have resisted the urge, iirc)

D


Re: Possible deficiency in describe?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

BinStorage is more efficient and doesn't have the trouble with nested data
representations you encountered in PigStorage. The downside is only that
it's not human-readable, and that it might change between versions of Pig
(though so far we have resisted the urge, iirc).

D

On Tue, Dec 28, 2010 at 3:24 PM, Jonathan Coveney <jc...@gmail.com>wrote:

> Thanks. Is there any particular downside to this if you get to the millions
> and hundreds of millions of rows, or is it just the lack of simple use with
> nonpig systems?

Re: Possible deficiency in describe?

Posted by Jonathan Coveney <jc...@gmail.com>.

Thanks. Is there any particular downside to this once you get to millions or hundreds of millions of rows, or is it just the lack of simple use with non-Pig systems?

-----Original Message-----
From: Dmitriy Ryaboy <dv...@gmail.com>
Date: Tue, 28 Dec 2010 15:08:15 
To: <us...@pig.apache.org>
Reply-To: user@pig.apache.org
Subject: Re: Possible deficiency in describe?

Try using BinStorage instead of the text-based PigStorage

D


Re: Possible deficiency in describe?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

Try using BinStorage instead of the text-based PigStorage

D
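
For a script that generates loaders, switching to BinStorage could amount to
emitting a matching STORE/LOAD pair. The Python sketch below shows one way
to do that. The helper names are hypothetical, and the optional schema
argument is an assumption: it is only appended as an AS clause when given.

```python
def binstorage_store(alias, path):
    """Build a STORE statement that writes the alias with BinStorage."""
    return "STORE %s INTO '%s' USING BinStorage();" % (alias, path)


def binstorage_load(alias, path, schema=None):
    """Build the matching LOAD statement, with an optional AS (...) schema."""
    stmt = "%s = LOAD '%s' USING BinStorage()" % (alias, path)
    if schema:
        stmt += " AS %s" % schema
    return stmt + ";"


print(binstorage_store("t3", "output"))
print(binstorage_load("t3", "output"))
```

Because both statements name the same storage function, the text format of
nested bags and tuples never comes into play.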
