Posted to user@pig.apache.org by Olga Natkovich <ol...@yahoo-inc.com> on 2008/09/17 21:57:49 UTC

Question about semantics of "as" on the load statement

Hi,
 
If I run the query below (it is based on an actual user query):
 
-- Note that data1 has more than 1 column, but "as" declares only a single one
A = load 'data1' as (x);
B = load 'data2' as (x, y, z);
C = JOIN A by x, B by x;
D = foreach C generate y,z;
store D into 'output';
 
the current Pig implementation produces wrong results. The reason is
that load currently assumes that the complete schema is given to it. The
user's intention was that (s)he only cares about the first column and
the rest of the data can be thrown out. In effect, the user is treating
"as" as a projection.

Do Pig users/developers have a strong opinion on how Pig should handle
this case? If so, please provide use cases.
 
Thanks,
 
Olga

Re: Question about semantics of "as" on the load statement

Posted by Chris Olston <ol...@yahoo-inc.com>.
Excellent point ... you've changed my mind -- I agree!

-Chris

On Sep 18, 2008, at 11:50 AM, Prashanth Pappu wrote:

> I agree with the ideas in principle.
>
> But projection during load has code/functionality upgrade advantages. And is
> very desirable since
> (a) Chaining of PIG jobs is very common
> (b) Outputs of intermediate PIG jobs (which are used as input to other pig
> jobs) are frequently changed to support new jobs
>
> Here's a more descriptive version of the example. Consider version 1 and
> version 2 of PIG/hadoop based jobs
>
> Version 1 (P1-> P2):
>
> Pig job P1
>> load 'p1-in' as (a,b);
>> ... some processing
>> store  (a,b,c) into 'p1-out';
>
> Pig job P2
>> load 'p1-out' as (a,b,c);
>> ... some processing
>
> Version 2 (P1->P2, P1->P3):
>
> Pig job P1
>> load 'p1-in' as (a,b);
>> ... some processing
>> store (a,b,c,d) into 'p1-out';
>
> Pig job P2
> -- same as version 1
>
> Pig job P3
>> load 'p1-out' as (a,b,c,d);
>> ..some processing
>
> In developing version 2, note that currently all three scripts P1, P2, P3
> have to be changed. In P2 specifically, the 'load' statement has to be
> changed to use the new output schema of P1. But if 'load' were defined to
> only load the first few fields defined in the load-statement, no changes
> have to be made for P2!
>
> I have run into this problem many times before. And the issue is common in
> databases too.
>
> 1. The dictum that "adding fields to an sql table will not break old sql
> queries" is very useful in upgrading the tables to include newer fields.
>
> 2. In PIG, if we can claim that "appending fields to a data file will not
> break old pig scripts" then it will solve many of the upgrade problems.
>
> And in this context, it is useful to limit 'LOAD ... AS to read in only the
> first X fields of the raw log where X is the number of fields in the load
> statement.
>
> Prashanth
>

--
Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research



Re: Question about semantics of "as" on the load statement

Posted by Prashanth Pappu <pr...@conviva.com>.
I agree with the ideas in principle.

But projection during load has advantages when code and functionality are
upgraded, and it is very desirable because:
(a) Chaining of Pig jobs is very common.
(b) Outputs of intermediate Pig jobs (which are used as input to other Pig
jobs) are frequently changed to support new jobs.

Here's a more detailed version of the example. Consider version 1 and
version 2 of a set of Pig/Hadoop-based jobs:

Version 1 (P1-> P2):

Pig job P1
> load 'p1-in' as (a,b);
> ... some processing
> store (a,b,c) into 'p1-out';

Pig job P2
> load 'p1-out' as (a,b,c);
> ... some processing

Version 2 (P1->P2, P1->P3):

Pig job P1
> load 'p1-in' as (a,b);
> ... some processing
> store (a,b,c,d) into 'p1-out';

Pig job P2
-- same as version 1

Pig job P3
> load 'p1-out' as (a,b,c,d);
> ... some processing

In developing version 2, note that currently all three scripts (P1, P2, P3)
have to be changed. In P2 specifically, the 'load' statement has to be
changed to use the new output schema of P1. But if 'load' were defined to
load only the first few fields, namely those named in the load statement, no
changes would have to be made to P2!

I have run into this problem many times before, and the issue is common in
databases too.

1. The dictum that "adding fields to an SQL table will not break old SQL
queries" is very useful when upgrading tables to include newer fields.

2. In Pig, if we can claim that "appending fields to a data file will not
break old Pig scripts", then it will solve many of the upgrade problems.

In this context, it is useful to limit 'LOAD ... AS' to reading only the
first X fields of the raw log, where X is the number of fields in the load
statement.
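
A minimal sketch of how P2 could be insulated from appended fields using explicit projection, rather than the proposed change to LOAD ... AS. It assumes a load without an AS clause and positional references ($0, $1, ...); the aliases raw and p2_in are made up for illustration:

```pig
-- P2, written so that fields appended to 'p1-out' do not affect it.
-- 'raw' and 'p2_in' are hypothetical alias names.
raw = load 'p1-out';
-- keep only the first three fields and give them names
p2_in = foreach raw generate $0 as a, $1 as b, $2 as c;
-- ... some processing on p2_in
```

With this form, P1 can start writing (a,b,c,d) without P2's script changing, at the cost of an extra foreach that the proposal would make unnecessary.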

Prashanth

On Thu, Sep 18, 2008 at 11:24 AM, Chris Olston <ol...@yahoo-inc.com> wrote:

> I don't like the idea that there are two separate mechanisms to do
> projection of unwanted fields.
>
> I prefer:
>  * LOAD ... AS has to give the full schema (we can even consider enforcing
> this at run-time, if it's not too expensive ... and I suspect it's not)
>  * if you want to project you do FOREACH ... GENERATE <list of fields you
> want to retain>
>
> Besides, the purpose of AS is to enable referring to fields by name rather
> than by position, but if you start using AS for projection then you're
> projecting by position (i.e., only retaining a K-prefix of the fields),
> which seems yucky.
>
> The downside to my approach is that if you have 100 fields but you only
> want the first one, you have to tediously list them all in the LOAD command,
> only to drop them right after. But in the long run the Pig project intends
> to introduce stored schemas, and we envision that for data with more than a
> handful of columns people will use stored schemas, and only use on-the-fly
> schemas for very simple data sets for which stored schemas may be overkill
> and exasperate users (e.g., a unary relation that simply lists a bunch of
> companies; or a graph represented as a binary (source, destination)
> relation).
>
> -Chris
>
>
>

Re: Question about semantics of "as" on the load statement

Posted by Chris Olston <ol...@yahoo-inc.com>.
I don't like the idea that there are two separate mechanisms to do  
projection of unwanted fields.

I prefer:
   * LOAD ... AS has to give the full schema (we can even consider  
enforcing this at run-time, if it's not too expensive ... and I  
suspect it's not)
   * if you want to project you do FOREACH ... GENERATE <list of  
fields you want to retain>
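
As a sketch, the two bullets above applied to the query from the start of this thread might look like the following. It assumes, purely for illustration, that data1 has exactly three columns; the field names u and v and the alias A_full are made up:

```pig
-- Hypothetical: data1 is assumed to have three columns; u and v are invented names.
A_full = load 'data1' as (x, u, v);   -- AS declares the full schema
A = foreach A_full generate x;        -- projection is done explicitly, by name
B = load 'data2' as (x, y, z);
C = join A by x, B by x;
D = foreach C generate y, z;
store D into 'output';
```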

Besides, the purpose of AS is to enable referring to fields by name  
rather than by position, but if you start using AS for projection  
then you're projecting by position (i.e., only retaining a K-prefix  
of the fields), which seems yucky.

The downside to my approach is that if you have 100 fields but you
only want the first one, you have to tediously list them all in the
LOAD command, only to drop them right after. But in the long run the
Pig project intends to introduce stored schemas. We envision that for
data with more than a handful of columns people will use stored
schemas, and will use on-the-fly schemas only for very simple data
sets for which stored schemas may be overkill and exasperate users
(e.g., a unary relation that simply lists a bunch of companies, or a
graph represented as a binary (source, destination) relation).

-Chris


On Sep 17, 2008, at 9:24 PM, Prashanth Pappu wrote:

> I think loading only the first column and throwing away the rest of the data
> is better.
>
> Here's my primary use-case:
>
> I often chain pig-jobs. So say, p2 uses 'load' to consume the output of p1
> (saved with 'store').
> Now, if we want p1 to dump more fields that are useful for a third job p3,
> currently, we're required to change p2's code (load statement specifically).
> But ideally, I just want to append the newer fields to p1's old schema and
> have p2's load statement working without any changes.
>
> Prashanth

--
Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research



Re: Question about semantics of "as" on the load statement

Posted by Prashanth Pappu <pr...@conviva.com>.
I think loading only the first column and throwing away the rest of the data
is better.

Here's my primary use case:

I often chain Pig jobs. Say p2 uses 'load' to consume the output of p1
(saved with 'store').
Now, if we want p1 to emit more fields that are useful for a third job p3,
we are currently required to change p2's code (specifically, its load statement).
Ideally, I just want to append the newer fields to p1's old schema and
have p2's load statement keep working without any changes.

Prashanth
On Wed, Sep 17, 2008 at 12:57 PM, Olga Natkovich <ol...@yahoo-inc.com> wrote:

> Hi,
>
> If I ran the query below (and this is based on actual user query):
>
> -- Note that data1 has more than 1 column, but "as" declares only a single one
> A = load 'data1' as (x);
> B = load 'data2' as (x, y, z);
> C = JOIN A by x, B by x;
> D = foreach C generate y,z;
> store D into 'output';
>
> the current pig implementation produces wrong results. The reason is
> that currently load assumes that complete schema is  given to it. The
> intention of the user was that (s)he only cares about the first column
> as the rest of the data could be thrown out. So in fact, "as" is treated
> as project.
>
> Do Pig users/developers have a strong opinion on how Pig should handle
> this case? If so, please, provide use cases.
>
> Thanks,
>
> Olga
>

Re: Question about semantics of "as" on the load statement

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
Olga Natkovich wrote:
> Hi,
>  
> If I ran the query below (and this is based on actual user query):
>  
> -- Note that data1 has more than 1 column, but "as" declares only a single one
> A = load 'data1' as (x);
> B = load 'data2' as (x, y, z);
> C = JOIN A by x, B by x;
> D = foreach C generate y,z;
> store D into 'output';
>  
> the current pig implementation produces wrong results. The reason is
> that currently load assumes that complete schema is  given to it. The
> intention of the user was that (s)he only cares about the first column
> as the rest of the data could be thrown out. So in fact, "as" is treated
> as project.
>  
> Do Pig users/developers have a strong opinion on how Pig should handle
> this case? If so, please, provide use cases.

If you look at the use cases enabled by each:

a) If the intention is to restrict the fields to what is specified in
the schema, then a project following the load would do that for the user;
the implicit project just does the same thing. So not supporting this
requirement would not hamper expressibility or usability.

b) If the intention is to 'use' the fields specified in the schema in the
script but leave the others as-is, to be propagated all the way to the
output (which might be processed by some other program/script), then a
restrictive load would make this use case near-impossible (unless users
stop using a schema; I'm not sure how pig2.0 behaves in that case).
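
For reference, the schema-less route mentioned in (b) might look like the sketch below. It assumes fields can be addressed by position when no AS clause is given; the alias kept and the filter condition are made up for illustration, and whether the untouched fields actually reach the output in the Pig version under discussion is exactly the open question above:

```pig
-- No AS clause: nothing is declared, so fields are addressed by position.
A = load 'data1';
-- Hypothetical condition that uses only the first field.
kept = filter A by $0 is not null;
-- The intent is that every field of each surviving tuple is written out.
store kept into 'output';
```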


Regards,
Mridul


>  
> Thanks,
>  
> Olga
>