Posted to dev@pig.apache.org by Alan Gates <ga...@yahoo-inc.com> on 2008/10/01 22:53:48 UTC
Pig and missing metadata
In the types branch we have changed pig to allow users to specify types
(int, chararray, bag, etc.) when they load their data. We have also
changed the backend to work with data in different types, and cast data
when necessary. But we have sought to maintain the feature that if the
user doesn't tell pig what type the data is, everything will still
work. Given this, there are some semantics we need to clarify and a few
changes that need to be made to support all possible cases.
So now in pig, data can be handled in one of three ways:
1) The data is typed (that is, it's an integer or a chararray or
whatever) and pig knows it because the user has told pig the type. Pig
can be told of the type by the user as part of the script (a = load
'myfile' as (x:int, y:chararray);) or by the load function through
determineSchema or by an eval function via outputSchema.
2) The data is not typed (that is, it's a bytearray). If pig then needs
to convert the data to a typed value (for example, the user adds an
integer to it), it will depend on the load function that loaded that
data to provide a cast. Pig uses the load function in this case because
it has no idea how data is represented inside a byte array.
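To make case 2 concrete, here is a hedged sketch (the class and method names are invented for illustration, not Pig's actual LoadFunc API) of why the cast must come from the loader: two loaders can store the very same int in incompatible byte layouts, so only the loader knows how to decode its bytearrays.

```java
import java.nio.charset.StandardCharsets;

class TextLoaderCasts {
    // If the loader wrote numbers as UTF-8 text, its int cast is simply:
    public static Integer bytesToInteger(byte[] b) {
        return Integer.valueOf(new String(b, StandardCharsets.UTF_8));
    }

    // A loader that wrote 4-byte big-endian ints would implement the same
    // cast completely differently -- which is why pig itself cannot guess.
    public static Integer bigEndianBytesToInteger(byte[] b) {
        return ((b[0] & 0xff) << 24) | ((b[1] & 0xff) << 16)
             | ((b[2] & 0xff) << 8)  |  (b[3] & 0xff);
    }
}
```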
3) The data is typed, but pig doesn't know about it. This might be
because neither the user nor the load function told it. It could be
because it's returned from an evaluation function that didn't implement
outputSchema. It could be because there was an operation such as UNION
that can co-mingle data of various types. It could also be because the
data was contained in another datum that may not have been completely
specified (such as a tuple or bag) or could not be completely specified
(like a map). Note that it is legitimate for the user, load function,
or eval function not to inform pig of the type. Perhaps the type
changes from row to row and so it cannot be described in a schema.
In addition, pig now attempts to guess types if the user does not
provide them. So, for a script like
a = load 'myfile' using MyLoader();
b = foreach a generate $0 + 1;
it appears that the user believes $0 to be an integer, so pig will
attempt to convert it to an integer (or, if it already happens to be
one, leave it alone).
Case 3 is not yet supported, and supporting it will require some changes
to pig's backend implementation. Specifically it will need to be able
to handle the case where pig guessed that a datum was of one type, but
it turns out to be another. To use the example above, if MyLoader
actually loaded $0 as a double, then pig needs to adapt to this.
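A hypothetical sketch of the runtime check the backend would need for case 3 (none of this is Pig's real operator code): pig guessed int for $0, but must cope when the loader actually produced a different numeric type.

```java
class GuessedIntCast {
    // Pig guessed int for "$0 + 1"; adapt when the runtime type differs.
    public static Integer toGuessedInt(Object datum) {
        if (datum instanceof Integer) {
            return (Integer) datum;            // guess was right: no work
        }
        if (datum instanceof Number) {
            // the loader produced a double, long, etc.; convert (truncating)
            // rather than failing, since the guess was pig's, not the user's
            return ((Number) datum).intValue();
        }
        return null;  // non-numeric datum: null plus a warning, per semantic 4
    }
}
```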
In order to handle all of this, we need some semantics that make clear
to users, pig udf developers, and pig developers how pig interacts with
these three types of data. I propose the following semantics:
1) Don't lie to the pig. If users or udf developers tell pig that a
datum is of a certain type (via specification in a script, use of
LoadFunc.determineSchema(), or EvalFunc.outputSchema()) then it is safe
for pig to assume that datum is of that type. It need not check or
place casts in the pipeline. If the datum is not of that type, then it
is an error, and an error message will be emitted.
2) Pigs fly. We want to choose performance over generality. In the
example above, it is safer to always convert $0 to double, because as
long as $0 is some kind of number you can do the conversion. If $0
really is a double and pig treats it as an int it will be truncating
it. But treating it as an int is 10 times faster than treating it as a
double. And the user can specify it as "$0 + 1.0" if they really want
the double.
3) Pigs eat anything, with reasonable speed. Pig will be able to run
faster in certain cases when it knows the data type. This is
particularly true if the data coming in is typed. On-the-fly data
conversion will be more expensive than knowing the right types up
front. Plus pig may be able to make better optimization choices when it
knows the types. But we cannot build the system in a way that punishes
those who do not declare their types, or whose data does not lend
itself to being declared.
4) Pigs are friendly when treated nicely. In the cases where the user
or udf didn't tell pig the type, it isn't an error if the type of the
datum doesn't match the operation. Again, using the example above, if
$0 turns out (at least in some cases) to be a chararray which cannot be
cast to int, then a null datum plus a warning will be emitted rather
than an error.
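Semantic 4 might look something like the following sketch (illustrative only; the warning mechanism here is just stderr, not Pig's actual logging):

```java
class FriendlyCast {
    // Undeclared-type data: a failed cast yields null plus a warning,
    // never a job-killing error.
    public static Integer castToIntOrNull(Object datum) {
        try {
            if (datum instanceof Number) {
                return ((Number) datum).intValue();
            }
            if (datum instanceof String) {
                return Integer.valueOf((String) datum);  // may throw
            }
        } catch (NumberFormatException e) {
            // fall through to the null-plus-warning path
        }
        System.err.println("WARN: could not cast " + datum
                + " to int; emitting null");
        return null;
    }
}
```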
Thoughts?
Alan.
Re: Pig and missing metadata
Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
Alan Gates wrote:
>
>
> Mridul Muralidharan wrote:
>> Alan Gates wrote:
>>>
>>>
>>> Case 3 is not yet supported, and supporting it will require some
>>> changes to pig's backend implementation. Specifically it will need
>>> to be able to handle the case where pig guessed that a datum was of
>>> one type, but it turns out to be another. To use the example above,
>>> if MyLoader actually loaded $0 as a double, then pig needs to adapt
>>> to this.
>>
>>
>> union is quite common actually - so some way to handle it would be
>> quite useful.
> We certainly plan to support union fully.
Great! Thanks for clarifying ... this is something we use extensively
(along with cross).
>>
>>>
>>> 2) Pigs fly. We want to choose performance over generality. In the
>>> example above, it is safer to always convert $0 to double, because as
>>> long as $0 is some kind of number you can do the conversion. If $0
>>> really is a double and pig treats it as an int it will be truncating
>>> it. But treating it as an int is 10 times faster than treating it as
>>> a double. And the user can specify it as "$0 + 1.0" if they really
>>> want the double.
>>
>>
>> I disagree - it is better to treat it as a double, and warn user about
>> the performance implications - than to treat it as an int and generate
>> incorrect results.
>> Correctness is more important than performance.
> This is not a correctness issue. When we are guessing the type, we will
> always be wrong sometimes. If we say $0 + 1 implies an int, and $0 has
> double data then we'll return 3 when the user wanted 3.14. If we say $0
> + 1 is a double and $0 has int data, then we'll return 42.0 when the
> user wanted 42. 42.0 is closer to 42 than 3 is to 3.14, but if the user
> has given us all int data and added an integer to it, and we output
> double data, that's still not what the user wanted.
>
> Given that we will always be wrong sometimes, the question is when do we
> want to be wrong. In this case I advocate in favor of ints for 2 reasons:
>
> 1) Performance, as noted above. Integer computations are about 10x
> faster than double computations.
> 2) Frequency of use. In my experience integral numbers are far more
> common in databases than floating points (obviously this depends on the
> data you're processing).
>
> So 90% of the time we'll produce what the user wants and run 10x faster
> given this assumption, and the other 10% we'll produce a number that
> isn't exactly what the user wanted. If the user wants the double, he
> can explicitly cast $0 or add 1.0 (instead of 1) to it.
A simple snippet which would handle output for both integer and double
is given below. Double.toString() in codepaths which fall under this
category could be replaced with it.
-- start --
import java.text.NumberFormat;

NumberFormat nf = NumberFormat.getInstance();
nf.setMinimumFractionDigits(0);
nf.setGroupingUsed(false);
// inside the per-row loop, where value holds the number to render:
String doubleString = nf.format(value);
-- end --
The performance characteristics are definitely worse than directly
using Integer.toString() or Double.toString() - but the results will
always be correct.
The snippet above is illustrative - you could hack up something which
is faster & better (even something which delegates to
Long.toString()/Double.toString() depending on the input, for example).
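One possible shape for that faster delegating variant (a sketch, not tuned production code): render values with no fractional part via Long.toString(), and fall back to Double.toString() otherwise.

```java
class NumberFormatting {
    public static String format(double v) {
        // no fractional part and exactly representable as a long
        // -> integral rendering (2^53 bounds the exact-integer range)
        if (v == Math.rint(v) && !Double.isInfinite(v)
                && Math.abs(v) < 9007199254740992.0) {
            return Long.toString((long) v);   // 42.0 -> "42"
        }
        return Double.toString(v);            // 3.14 -> "3.14"
    }
}
```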
Implicit assumptions made about user input should always satisfy the
principle of least astonishment - even at the cost of performance ...
imho performance is always secondary to correctness and functionality.
Warning/error messages to indicate the loss of performance are
definitely required though.
>>
>>> 4) Pigs are friendly when treated nicely. In the cases where the
>>> user or udf didn't tell pig the type, it isn't an error if the type
>>> of the datum doesn't match the operation. Again, using the example
>>> above, if $0 turns out (at least in some cases) to be a chararray
>>> which cannot be cast to int, then a null datum plus a warning will be
>>> emitted rather than an error.
>>
>>
>> This looks more like incompatibility of the input data with the
>> script/udf, no ?
>> For example, if the script declares column 1 as integer, and it turns
>> out in the file to be chararray, then either :
>> a) it is a schema error in the script - and it is useless to continue
>> the pipeline.
>> b) it is an anomaly in input data.
>> c) space for rent.
>>
>>
>> Different usecases might want to handle (b) differently - (a) is
>> universally something which should result in flagging the script as an
>> error. Not really sure how you will make the distinction between (a)
>> and (b) though ...
>>
> In case 4 here I'm not talking about the situation where the user gave
> us a schema and it turns out to be wrong. That falls under case 1,
> don't lie to the pig. I'm thinking here of situations where the user
> doesn't tell us what the data is or where the data is different row to
> row because of a union or just inconsistent data, which pig does allow.
My assumption about (b) was not a result of incorrect data or an
incorrect schema - but due to things like the co-mingling of different
tables through union/etc., udf output, etc. - which result in no schema
being specified - and where inference is used.
Not something I normally hit - though if I do, I would prefer exceptions
to silent creeping errors ... though it is quite logical to expect
different behavior too.
My 2 cents, of course YMMV :-)
Regards,
Mridul
>
> Alan.
Re: Pig and missing metadata
Posted by Alan Gates <ga...@yahoo-inc.com>.
Mridul Muralidharan wrote:
> Alan Gates wrote:
>>
>>
>> Case 3 is not yet supported, and supporting it will require some
>> changes to pig's backend implementation. Specifically it will need
>> to be able to handle the case where pig guessed that a datum was of
>> one type, but it turns out to be another. To use the example above,
>> if MyLoader actually loaded $0 as a double, then pig needs to adapt
>> to this.
>
>
> union is quite common actually - so some way to handle it would be
> quite useful.
We certainly plan to support union fully.
>
>>
>> 2) Pigs fly. We want to choose performance over generality. In the
>> example above, it is safer to always convert $0 to double, because as
>> long as $0 is some kind of number you can do the conversion. If $0
>> really is a double and pig treats it as an int it will be truncating
>> it. But treating it as an int is 10 times faster than treating it as
>> a double. And the user can specify it as "$0 + 1.0" if they really
>> want the double.
>
>
> I disagree - it is better to treat it as a double, and warn user about
> the performance implications - than to treat it as an int and generate
> incorrect results.
> Correctness is more important than performance.
This is not a correctness issue. When we are guessing the type, we will
always be wrong sometimes. If we say $0 + 1 implies an int, and $0 has
double data then we'll return 3 when the user wanted 3.14. If we say $0
+ 1 is a double and $0 has int data, then we'll return 42.0 when the
user wanted 42. 42.0 is closer to 42 than 3 is to 3.14, but if the user
has given us all int data and added an integer to it, and we output
double data, that's still not what the user wanted.
Given that we will always be wrong sometimes, the question is when do we
want to be wrong. In this case I advocate in favor of ints for 2 reasons:
1) Performance, as noted above. Integer computations are about 10x
faster than double computations.
2) Frequency of use. In my experience integral numbers are far more
common in databases than floating points (obviously this depends on the
data you're processing).
So 90% of the time we'll produce what the user wants and run 10x faster
given this assumption, and the other 10% we'll produce a number that
isn't exactly what the user wanted. If the user wants the double, he
can explicitly cast $0 or add 1.0 (instead of 1) to it.
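The two wrong-guess outcomes can be made concrete (the sample column values below are invented so that the arithmetic reproduces the numbers in the text):

```java
class GuessExamples {
    // "$0 + 1" evaluated under the int guess: a double column is truncated.
    public static int intGuess(double column) {
        return (int) column + 1;    // column = 2.14 -> 3, user wanted 3.14
    }

    // "$0 + 1" evaluated under the double guess: an int column is widened.
    public static double doubleGuess(int column) {
        return column + 1.0;        // column = 41 -> 42.0, user wanted 42
    }
}
```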
>
>> 4) Pigs are friendly when treated nicely. In the cases where the
>> user or udf didn't tell pig the type, it isn't an error if the type
>> of the datum doesn't match the operation. Again, using the example
>> above, if $0 turns out (at least in some cases) to be a chararray
>> which cannot be cast to int, then a null datum plus a warning will be
>> emitted rather than an error.
>
>
> This looks more like incompatibility of the input data with the
> script/udf, no ?
> For example, if the script declares column 1 as integer, and it turns
> out in the file to be chararray, then either :
> a) it is a schema error in the script - and it is useless to continue
> the pipeline.
> b) it is an anomaly in input data.
> c) space for rent.
>
>
> Different usecases might want to handle (b) differently - (a) is
> universally something which should result in flagging the script as an
> error. Not really sure how you will make the distinction between (a)
> and (b) though ...
>
In case 4 here I'm not talking about the situation where the user gave
us a schema and it turns out to be wrong. That falls under case 1,
don't lie to the pig. I'm thinking here of situations where the user
doesn't tell us what the data is or where the data is different row to
row because of a union or just inconsistent data, which pig does allow.
Alan.
Re: Pig and missing metadata
Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
Alan Gates wrote:
> In the types branch we have changed pig to allow users to specify types
> (int, chararray, bag, etc.) when they load their data. We have also
> changed the backend to work with data in different types, and cast data
> when necessary. But we have sought to maintain the feature that if the
> user doesn't tell pig what type the data is, everything will still
> work. Given this, there are some semantics we need to clarify and a few
> changes that need to be made to support all possible cases.
>
> So now in pig, data can be handled in one of three ways:
>
> 1) The data is typed (that is, it's an integer or a chararray or
> whatever) and pig knows it because the user has told pig the type. Pig
> can be told of the type by the user as part of the script (a = load
> 'myfile' as (x:int, y:chararray);) or by the load function through
> determineSchema or by an eval function via outputSchema.
>
> 2) The data is not typed (that is, it's a bytearray). If pig needs to
> then convert the data to typed (for example, the user adds an integer to
> it) it will depend on the load function that loaded that data to provide
> a cast. Pig uses the load function in this case because it has no idea
> how data is represented inside a byte array.
>
> 3) The data is typed, but pig doesn't know about it. This might be
> because neither the user nor the load function told it. It could be
> because it's returned from an evaluation function that didn't implement
> outputSchema. It could be because there was an operation such as UNION
> that can co-mingle data of various types. It could also be because the
> data was contained in another datum that may not have been completely
> specified (such as a tuple or bag) or could not be completely specified
> (like a map). Note that it is legitimate for the user, load function,
> or eval function not to inform pig of the type. Perhaps the type
> changes from row to row and so it cannot be described in a schema.
>
> In addition, pig now attempts to guess types if the user does not
> provide them. So, for a script like
>
> a = load 'myfile' using MyLoader();
> b = foreach a generate $0 + 1;
>
> it appears that the user believes $0 to be an integer, so pig will
> attempt to convert it to be an integer (or if it happens to already be
> one leave it as one).
>
> Case 3 is not yet supported, and supporting it will require some changes
> to pig's backend implementation. Specifically it will need to be able
> to handle the case where pig guessed that a datum was of one type, but
> it turns out to be another. To use the example above, if MyLoader
> actually loaded $0 as a double, then pig needs to adapt to this.
union is quite common actually - so some way to handle it would be quite
useful.
>
> In order to handle all of this, we need some semantics that make clear
> to users, pig udf developers, and pig developers how pig interacts with
> these three types of data. I propose the following semantics:
>
> 1) Don't lie to the pig. If users or udf developers tell pig that a
> datum is of a certain type (via specification in a script, use of
> LoadFunc.determineSchema(), or EvalFunc.outputSchema()) then it is safe
> for pig to assume that datum is of that type. It need not check or
> place casts in the pipeline. If the datum is not of that type, then it
> is an error, and an error message will be emitted.
This makes sense. If the user (script/udf) declares it to be of some
type, then it can be expected to be of that type.
>
> 2) Pigs fly. We want to choose performance over generality. In the
> example above, it is safer to always convert $0 to double, because as
> long as $0 is some kind of number you can do the conversion. If $0
> really is a double and pig treats it as an int it will be truncating
> it. But treating it as an int is 10 times faster than treating it as a
> double. And the user can specify it as "$0 + 1.0" if they really want
> the double.
I disagree - it is better to treat it as a double, and warn the user
about the performance implications - than to treat it as an int and
generate incorrect results.
Correctness is more important than performance.
>
> 3) Pigs eat anything, with reasonable speed. Pig will be able to run
> faster in certain cases when it knows the data type. This is
> particularly true if the data coming in is typed. On the fly data
> conversion will be more expensive than up front knowing the right
> types. Plus pig may be able to make better optimization choices when it
> knows the types. But we cannot build the system in a way that
> punishes those who do not declare their types, or whose data does not
> lend itself to being declared.
It is more acceptable to punish the user with performance penalties
(with suitable warning messages, of course) when there is insufficient
info for pig to optimize ... than to be unusable to the user.
In general, the assumption that the udf author, the script snippet
author, and the script executor are all the same person is not really
valid in non-trivial cases ...
> 4) Pigs are friendly when treated nicely. In the cases where the user
> or udf didn't tell pig the type, it isn't an error if the type of the
> datum doesn't match the operation. Again, using the example above, if
> $0 turns out (at least in some cases) to be a chararray which cannot be
> cast to int, then a null datum plus a warning will be emitted rather
> than an error.
This looks more like incompatibility of the input data with the
script/udf, no ?
For example, if the script declares column 1 as integer, and it turns
out in the file to be chararray, then either :
a) it is a schema error in the script - and it is useless to continue
the pipeline.
b) it is an anomaly in input data.
c) space for rent.
Different use cases might want to handle (b) differently - (a) is
universally something which should result in flagging the script as an
error. Not really sure how you will make the distinction between (a)
and (b) though ...
Regards,
Mridul
>
> Thoughts?
>
> Alan.