Posted to dev@pig.apache.org by Alan Gates <ga...@yahoo-inc.com> on 2008/10/01 22:53:48 UTC

Pig and missing metadata

In the types branch we have changed pig to allow users to specify types 
(int, chararray, bag, etc.) when they load their data.  We have also 
changed the backend to work with data in different types, and cast data 
when necessary.  But we have sought to maintain the feature that if the 
user doesn't tell pig what type the data is, everything will still 
work.  Given this, there are some semantics we need to clarify and a few 
changes that need to be made to support all possible cases.

So now in pig, data can be handled in one of three ways:

1) The data is typed (that is, it's an integer or a chararray or 
whatever) and pig knows it because the user has told pig the type.  Pig 
can be told of the type by the user as part of the script (a = load 
'myfile' as (x:int, y:chararray);) or by the load function through 
determineSchema or by an eval function via outputSchema.

2) The data is not typed (that is, it's a bytearray).  If pig needs to 
then convert the data to typed (for example, the user adds an integer to 
it) it will depend on the load function that loaded that data to provide 
a cast.  Pig uses the load function in this case because it has no idea 
how data is represented inside a byte array (a code sketch of cases 1 
and 2 follows this list).

3) The data is typed, but pig doesn't know about it.  This might be 
because neither the user nor the load function told it.  It could be 
because it's returned from an evaluation function that didn't implement 
outputSchema.  It could be because there was an operation such as UNION 
that can co-mingle data of various types.  It could also be because the 
data was contained in another datum that may not have been completely 
specified (such as a tuple or bag) or could not be completely specified 
(like a map).  Note that it is legitimate for the user, load function, 
or eval function not to inform pig of the type.  Perhaps the type 
changes from row to row and so it cannot be described in a schema.
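
To make cases 1 and 2 concrete, here is a minimal sketch against the 
types-branch UDF interfaces (signatures are recalled from that API and 
may differ in detail; StrLen and the assumed byte layout are invented 
for illustration):

-- start --
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;

// Case 1: an eval function that tells pig its output is always an int,
// so pig never has to guess or insert a cast for its result.
public class StrLen extends EvalFunc<Integer> {
    public Integer exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) return null;
        return ((String) input.get(0)).length();
    }

    public Schema outputSchema(Schema input) {
        return new Schema(new Schema.FieldSchema("len", DataType.INTEGER));
    }
}

// Case 2: the cast hook on a load function.  When pig must turn one of
// this loader's bytearrays into an int (say, for "$0 + 1"), it calls
// back into the loader, since only the loader knows the byte layout.
// (Fragment of a LoadFunc implementation; assumes UTF-8 decimal text.)
public Integer bytesToInteger(byte[] b) throws IOException {
    return b == null ? null : Integer.valueOf(new String(b, "UTF-8"));
}
-- end --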

In addition, pig now attempts to guess types if the user does not 
provide them.  So, for a script like

a = load 'myfile' using MyLoader();
b = foreach a generate $0 + 1;

it appears that the user believes $0 to be an integer, so pig will 
attempt to convert it to be an integer (or if it happens to already be 
one leave it as one).

Case 3 is not yet supported, and supporting it will require some changes 
to pig's backend implementation.  Specifically it will need to be able 
to handle the case where pig guessed that a datum was of one type, but 
it turns out to be another.  To use the example above, if MyLoader 
actually loaded $0 as a double, then pig needs to adapt to this.
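
Purely as an illustration of what "adapt" could mean here (this is not 
pig's actual backend code), the operator evaluating the guessed-int 
"$0 + 1" would need a numeric-widening fallback, something like:

-- start --
// Illustrative only: apply the int guess, but survive a wrong guess.
static Object addOneGuessingInt(Object datum) {
    if (datum instanceof Integer) {
        return (Integer) datum + 1;         // guess was right: int math
    }
    if (datum instanceof Number) {
        // Guess was wrong but the datum is numeric (e.g. MyLoader
        // produced a double): widen instead of failing.
        return ((Number) datum).doubleValue() + 1.0;
    }
    if (datum instanceof byte[]) {
        // Untyped bytearray: the loader's cast applies (case 2 above);
        // decimal text is assumed here for simplicity.
        return Integer.valueOf(new String((byte[]) datum)) + 1;
    }
    return null;                            // incompatible type; see rule 4 below
}
-- end --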

In order to handle all of this, we need some semantics that make clear 
to users, pig udf developers, and pig developers how pig interacts with 
these three types of data.  I propose the following semantics:

1) Don't lie to the pig.  If users or udf developers tell pig that a 
datum is of a certain type (via specification in a script, use of 
LoadFunc.determineSchema(), or EvalFunc.outputSchema()) then it is safe 
for pig to assume that datum is of that type.  It need not check or 
place casts in the pipeline.  If the datum is not of that type, then it 
is an error, and an error message will be emitted.

2) Pigs fly.  We want to choose performance over generality.  In the 
example above, it is safer to always convert $0 to double, because as 
long as $0 is some kind of number you can do the conversion.  If $0 
really is a double and pig treats it as an int it will be truncating 
it.  But treating it as an int is 10 times faster than treating it as a 
double.  And the user can specify it as "$0 + 1.0" if they really want 
the double.

3) Pigs eat anything, with reasonable speed.  Pig will be able to run 
faster in certain cases when it knows the data type.  This is 
particularly true if the data coming in is typed.  On the fly data 
conversion will be more expensive than up front knowing the right 
types.  Plus pig may be able to make better optimization choices when it 
knows the types.  But we cannot build the system in a way that 
punishes those who do not declare their types, or whose data does not 
lend itself to being declared. 

4) Pigs are friendly when treated nicely.  In the cases where the user 
or udf didn't tell pig the type, it isn't an error if the type of the 
datum doesn't match the operation.  Again, using the example above, if 
$0 turns out (at least in some cases) to be a chararray which cannot be 
cast to int, then a null datum plus a warning will be emitted rather 
than an error (a rough code sketch of this behavior follows).
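
A rough sketch of that lenient cast (a hypothetical helper, not pig's 
actual code):

-- start --
// Rule 4: a datum of undeclared type that cannot be cast to int yields
// a warning and a null rather than killing the pipeline.
static Integer castToInt(Object datum) {
    if (datum == null) return null;
    if (datum instanceof Integer) return (Integer) datum;
    if (datum instanceof Number) return ((Number) datum).intValue();
    try {
        return Integer.valueOf(datum.toString());   // chararray "42" is fine
    } catch (NumberFormatException e) {
        System.err.println("WARN: cannot cast '" + datum
                + "' to int; emitting null");
        return null;                                // chararray "abc" lands here
    }
}
-- end --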

Thoughts?

Alan.

Re: Pig and missing metadata

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
Alan Gates wrote:
> 
> 
> Mridul Muralidharan wrote:
>> Alan Gates wrote:
>>>
>>>
>>> Case 3 is not yet supported, and supporting it will require some 
>>> changes to pig's backend implementation.  Specifically it will need 
>>> to be able to handle the case where pig guessed that a datum was of 
>>> one type, but it turns out to be another.  To use the example above, 
>>> if MyLoader actually loaded $0 as a double, then pig needs to adapt 
>>> to this.
>>
>>
>> union is quite common actually - so some way to handle it would be 
>> quite useful.
> We certainly plan to support union fully.


Great! Thanks for clarifying ... this is something we use extensively 
(along with cross).


>>
>>>
>>> 2) Pigs fly.  We want to choose performance over generality.  In the 
>>> example above, it is safer to always convert $0 to double, because as 
>>> long as $0 is some kind of number you can do the conversion.  If $0 
>>> really is a double and pig treats it as an int it will be truncating 
>>> it.  But treating it as an int is 10 times faster than treating it as 
>>> a double.  And the user can specify it as "$0 + 1.0" if they really 
>>> want the double.
>>
>>
>> I disagree - it is better to treat it as a double, and warn the user about 
>> the performance implications - than to treat it as an int and generate 
>> incorrect results.
>> Correctness is more important than performance.
> This is not a correctness issue.  When we are guessing the type, we will 
> always be wrong sometimes.  If we say $0 + 1 implies an int, and $0 has 
> double data then we'll return 3 when the user wanted 3.14.  If we say $0 
> + 1 is a double and $0 has int data, then we'll return 42.0 when the 
> user wanted 42.  42.0 is closer to 42 than 3 is to 3.14, but if the user 
> has given us all int data and added an integer to it, and we output 
> double data, that's still not what the user wanted.
> 
> Given that we will always be wrong sometimes, the question is when do we 
> want to be wrong.  In this case I advocate in favor of ints for 2 reasons:
> 
> 1) Performance, as noted above.  Integer computations are about 10x 
> faster than double computations.
> 2) Frequency of use.  In my experience integral numbers are far more 
> common in databases than floating points (obviously this depends on the 
> data you're processing).
> 
> So 90% of the time we'll produce what the user wants and run 10x faster 
> given this assumption, and the other 10% we'll produce a number that 
> isn't exactly what the user wanted.  If the user wants the double, he 
> can explicitly cast $0 or add 1.0 (instead of 1) to it.


A simple snippet which would handle output for both integer and double 
is given below. Double.toString() in code paths which fall under this 
category could be replaced with it.

-- start --
import java.text.NumberFormat;

public class Fmt {
    public static void main(String[] args) {
        NumberFormat nf = NumberFormat.getInstance();
        nf.setMinimumFractionDigits(0);   // integral values print without ".0"
        nf.setGroupingUsed(false);        // no "1,000"-style separators

        for (double value : new double[] { 42.0, 3.14 }) {
            String doubleString = nf.format(value);   // "42", then "3.14"
            System.out.println(doubleString);
        }
    }
}
-- end --

Performance characteristics are definitely worse than directly using 
Integer.toString()/Double.toString() - but the results will always be 
correct.

The snippet above is illustrative - you could hack up something faster 
& better (even something which delegates to 
Long.toString()/Double.toString() depending on input, for example; a 
rough sketch of that follows).
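
Such a delegating helper might look like this (numberToString is a 
hypothetical name, and the integral-double check is just one possible 
heuristic):

-- start --
// Delegate to the cheap toString() paths, dropping the spurious ".0"
// only when a double is in fact integral.
static String numberToString(Number n) {
    if (n instanceof Integer || n instanceof Long) {
        return Long.toString(n.longValue());   // fast integral path
    }
    double d = n.doubleValue();
    if (d == Math.rint(d) && !Double.isInfinite(d)) {
        return Long.toString((long) d);        // 42.0 -> "42"
    }
    return Double.toString(d);                 // 3.14 -> "3.14"
}
-- end --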


Implicit assumptions made about user input should always satisfy the 
principle of least astonishment - even at the cost of performance ... 
imho performance is always secondary to correctness and functionality. 
Warning/error messages to indicate the loss of performance are 
definitely required though.




>>
>>> 4) Pigs are friendly when treated nicely.  In the cases where the 
>>> user or udf didn't tell pig the type, it isn't an error if the type 
>>> of the datum doesn't match the operation.  Again, using the example 
>>> above, if $0 turns out (at least in some cases) to be a chararray 
>>> which cannot be cast to int, then a null datum plus a warning will be 
>>> emitted rather than an error.
>>
>>
>> This looks more like incompatibility of the input data with the 
>> script/udf, no?
>> For example, if the script declares column 1 is an integer, and it 
>> turns out in the file to be a chararray, then either:
>> a) it is a schema error in the script - and it is useless to continue 
>> the pipeline.
>> b) it is an anomaly in input data.
>> c) space for rent.
>>
>>
>> Different use cases might want to handle (b) differently - (a) is 
>> universally something which should result in flagging the script as an 
>> error. Not really sure how you will make the distinction between (a) 
>> and (b) though ...
>>
> In case 4 here I'm not talking about the situation where the user gave 
> us a schema and it turns out to be wrong.  That falls under case 1, 
> don't lie to the pig.  I'm thinking here of situations where the user 
> doesn't tell us what the data is or where the data is different row to 
> row because of a union or just inconsistent data, which pig does allow.

My assumption about (b) was not a result of incorrect data or an 
incorrect schema - but due to things like co-mingling of different 
tables through union etc., udf output, etc. - which result in no schema 
being specified - and where inference is used.

Not something I normally hit - though if I do, I would prefer exceptions 
to silent creeping errors ... though it is quite logical to expect 
different behavior too.


My 2 cents, of course YMMV :-)


Regards,
Mridul

> 
> Alan.


Re: Pig and missing metadata

Posted by Alan Gates <ga...@yahoo-inc.com>.

Mridul Muralidharan wrote:
> Alan Gates wrote:
>>
>>
>> Case 3 is not yet supported, and supporting it will require some 
>> changes to pig's backend implementation.  Specifically it will need 
>> to be able to handle the case where pig guessed that a datum was of 
>> one type, but it turns out to be another.  To use the example above, 
>> if MyLoader actually loaded $0 as a double, then pig needs to adapt 
>> to this.
>
>
> union is quite common actually - so some way to handle it would be 
> quite useful.
We certainly plan to support union fully.
>
>>
>> 2) Pigs fly.  We want to choose performance over generality.  In the 
>> example above, it is safer to always convert $0 to double, because as 
>> long as $0 is some kind of number you can do the conversion.  If $0 
>> really is a double and pig treats it as an int it will be truncating 
>> it.  But treating it as an int is 10 times faster than treating it as 
>> a double.  And the user can specify it as "$0 + 1.0" if they really 
>> want the double.
>
>
> I disagree - it is better to treat it as a double, and warn the user about 
> the performance implications - than to treat it as an int and generate 
> incorrect results.
> Correctness is more important than performance.
This is not a correctness issue.  When we are guessing the type, we will 
always be wrong sometimes.  If we say $0 + 1 implies an int, and $0 has 
double data then we'll return 3 when the user wanted 3.14.  If we say $0 
+ 1 is a double and $0 has int data, then we'll return 42.0 when the 
user wanted 42.  42.0 is closer to 42 than 3 is to 3.14, but if the user 
has given us all int data and added an integer to it, and we output 
double data, that's still not what the user wanted.

Given that we will always be wrong sometimes, the question is when do we 
want to be wrong.  In this case I advocate in favor of ints for 2 reasons:

1) Performance, as noted above.  Integer computations are about 10x 
faster than double computations.
2) Frequency of use.  In my experience integral numbers are far more 
common in databases than floating points (obviously this depends on the 
data you're processing).

So 90% of the time we'll produce what the user wants and run 10x faster 
given this assumption, and the other 10% we'll produce a number that 
isn't exactly what the user wanted.  If the user wants the double, he 
can explicitly cast $0 or add 1.0 (instead of 1) to it.
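
The same trade-off as a toy calculation (plain Java, just to spell the 
two guesses out):

-- start --
public class GuessDemo {
    public static void main(String[] args) {
        double d0 = 2.14;                 // loader actually produced a double
        System.out.println((int) d0 + 1); // int guess    -> 3    (user wanted 3.14)
        System.out.println(d0 + 1.0);     // double guess -> 3.14

        int i0 = 41;                      // loader actually produced an int
        System.out.println(i0 + 1);       // int guess    -> 42
        System.out.println(i0 + 1.0);     // double guess -> 42.0 (user wanted 42)
    }
}
-- end --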
>
>> 4) Pigs are friendly when treated nicely.  In the cases where the 
>> user or udf didn't tell pig the type, it isn't an error if the type 
>> of the datum doesn't match the operation.  Again, using the example 
>> above, if $0 turns out (at least in some cases) to be a chararray 
>> which cannot be cast to int, then a null datum plus a warning will be 
>> emitted rather than an error.
>
>
> This looks more like incompatibility of the input data with the 
> script/udf, no?
> For example, if the script declares column 1 is an integer, and it 
> turns out in the file to be a chararray, then either:
> a) it is a schema error in the script - and it is useless to continue 
> the pipeline.
> b) it is an anomaly in input data.
> c) space for rent.
>
>
> Different use cases might want to handle (b) differently - (a) is 
> universally something which should result in flagging the script as an 
> error. Not really sure how you will make the distinction between (a) 
> and (b) though ...
>
In case 4 here I'm not talking about the situation where the user gave 
us a schema and it turns out to be wrong.  That falls under case 1, 
don't lie to the pig.  I'm thinking here of situations where the user 
doesn't tell us what the data is or where the data is different row to 
row because of a union or just inconsistent data, which pig does allow.

Alan.

Re: Pig and missing metadata

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
Alan Gates wrote:
> In the types branch we have changed pig to allow users to specify types 
> (int, chararray, bag, etc.) when they load their data.  We have also 
> changed the backend to work with data in different types, and cast data 
> when necessary.  But we have sought to maintain the feature that if the 
> user doesn't tell pig what type the data is, everything will still 
> work.  Given this, there are some semantics we need to clarify and a few 
> changes that need to be made to support all possible cases.
> 
> So now in pig, data can be handled in one of three ways:
> 
> 1) The data is typed (that is, it's an integer or a chararray or 
> whatever) and pig knows it because the user has told pig the type.  Pig 
> can be told of the type by the user as part of the script (a = load 
> 'myfile' as (x:int, y:chararray);) or by the load function through 
> determineSchema or by an eval function via outputSchema.
> 
> 2) The data is not typed (that is, it's a bytearray).  If pig needs to 
> then convert the data to typed (for example, the user adds an integer to 
> it) it will depend on the load function that loaded that data to provide 
> a cast.  Pig uses the load function in this case because it has no idea 
> how data is represented inside a byte array.
> 
> 3) The data is typed, but pig doesn't know about it.  This might be 
> because neither the user nor the load function told it.  It could be 
> because it's returned from an evaluation function that didn't implement 
> outputSchema.  It could be because there was an operation such as UNION 
> that can co-mingle data of various types.  It could also be because the 
> data was contained in another datum that may not have been completely 
> specified (such as a tuple or bag) or could not be completely specified 
> (like a map).  Note that it is legitimate for the user, load function, 
> or eval function not to inform pig of the type.  Perhaps the type 
> changes from row to row and so it cannot be described in a schema.
> 
> In addition, pig now attempts to guess types if the user does not 
> provide them.  So, for a script like
> 
> a = load 'myfile' using MyLoader();
> b = foreach a generate $0 + 1;
> 
> it appears that the user believes $0 to be an integer, so pig will 
> attempt to convert it to be an integer (or if it happens to already be 
> one leave it as one).
> 
> Case 3 is not yet supported, and supporting it will require some changes 
> to pig's backend implementation.  Specifically it will need to be able 
> to handle the case where pig guessed that a datum was of one type, but 
> it turns out to be another.  To use the example above, if MyLoader 
> actually loaded $0 as a double, then pig needs to adapt to this.


union is quite common actually - so some way to handle it would be quite 
useful.


> 
> In order to handle all of this, we need some semantics that make clear 
> to users, pig udf developers, and pig developers how pig interacts with 
> these three types of data.  I propose the following semantics:
> 
> 1) Don't lie to the pig.  If users or udf developers tell pig that a 
> datum is of a certain type (via specification in a script, use of 
> LoadFunc.determineSchema(), or EvalFunc.outputSchema()) then it is safe 
> for pig to assume that datum is of that type.  It need not check or 
> place casts in the pipeline.  If the datum is not of that type, then it 
> is an error, and an error message will be emitted.


This makes sense. If the user (script/udf) declares it to be of some 
type, then it can be expected to be of that type.

> 
> 2) Pigs fly.  We want to choose performance over generality.  In the 
> example above, it is safer to always convert $0 to double, because as 
> long as $0 is some kind of number you can do the conversion.  If $0 
> really is a double and pig treats it as an int it will be truncating 
> it.  But treating it as an int is 10 times faster than treating it as a 
> double.  And the user can specify it as "$0 + 1.0" if they really want 
> the double.


I disagree - it is better to treat it as a double, and warn the user about 
the performance implications - than to treat it as an int and generate 
incorrect results.
Correctness is more important than performance.


> 
> 3) Pigs eat anything, with reasonable speed.  Pig will be able to run 
> faster in certain cases when it knows the data type.  This is 
> particularly true if the data coming in is typed.  On the fly data 
> conversion will be more expensive than up front knowing the right 
> types.  Plus pig may be able to make better optimization choices when it 
> knows the types.  But we cannot build the system in a way that 
> punishes those who do not declare their types, or whose data does not 
> lend itself to being declared.

It is acceptable to punish the user with performance penalties (with 
suitable warning messages, of course) in case there is insufficient 
info for pig to optimize ... better that than being unusable to the user.
In general, the assumption that the udf author, the script snippet 
author, and the script executor are all the same person is not really 
valid in non-trivial cases ...


> 4) Pigs are friendly when treated nicely.  In the cases where the user 
> or udf didn't tell pig the type, it isn't an error if the type of the 
> datum doesn't match the operation.  Again, using the example above, if 
> $0 turns out (at least in some cases) to be a chararray which cannot be 
> cast to int, then a null datum plus a warning will be emitted rather 
> than an error.


This looks more like incompatibility of the input data with the 
script/udf, no?
For example, if the script declares column 1 is an integer, and it 
turns out in the file to be a chararray, then either:
a) it is a schema error in the script - and it is useless to continue 
the pipeline.
b) it is an anomaly in input data.
c) space for rent.


Different use cases might want to handle (b) differently - (a) is 
universally something which should result in flagging the script as an 
error. Not really sure how you will make the distinction between (a) and 
(b) though ...



Regards,
Mridul


> 
> Thoughts?
> 
> Alan.