Posted to dev@pig.apache.org by pi song <pi...@gmail.com> on 2008/05/09 16:40:50 UTC

Implicit casting on bag operators

Union is an example of a bag (relational) operator that can have more than
one input.

If the schemas from all the input ports are the same, there is no problem.
If the schemas from the input ports are not compatible, there is no problem
either, because we won't process it.
If the schemas from the input ports are not the same but are compatible,
we have a problem.

Example:

C = UNION A,B ;

Schema(A) = < Int, Chararray >
Schema(B) = < Double, Chararray >

The output schema will get resolved to < Double, Chararray >. Here is the
problem: the Union operator at the moment doesn't support casting in any
layer. If we don't cast, the binary data of the Int will get picked up as
a Double by the downstream operator! There are a few possible solutions
for this:

1) Implement LOUnion and POUnion to support type casting internally.
2) Add casting support in the LOUnion operator and let the LogicalToPhysical
compiler generate LOForEach for it.
3) Explicitly insert LOForEach to do the necessary casting between Union and
the problematic input. This is analogous to the way we implement implicit
casting for expression operators.
4) Don't support the "not same but compatible" case at all.

I will do (3) because it makes the most sense to me and has the least
impact on other modules. Does anyone have a problem with it?

Pi
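
Editorial sketch of option (3), written in Python for brevity rather than in Pig's own Java. It resolves the union's output schema by widening numeric types and records, for each input, which columns would need a cast projection (the role the post gives to an inserted LOForEach). The names `NUMERIC_WIDENING`, `widen` and `plan_union` are hypothetical, and the widening order is an assumption; none of this is Pig's actual type-checker API.

```python
# Hypothetical sketch: not Pig's classes or type-checker API.
# Numeric types from narrowest to widest (assumed order); a narrower type
# may be implicitly cast to any wider one.
NUMERIC_WIDENING = ["int", "long", "float", "double"]


def widen(t1, t2):
    """Return the common type of two column types, or None if incompatible."""
    if t1 == t2:
        return t1
    if t1 in NUMERIC_WIDENING and t2 in NUMERIC_WIDENING:
        return max(t1, t2, key=NUMERIC_WIDENING.index)
    return None


def plan_union(schemas):
    """Resolve a union's output schema and plan the casts for each input.

    Returns (output_schema, casts). casts[i] is None when input i already
    matches the output schema; otherwise it is a per-column list holding the
    target type for columns that need a cast and None for columns that don't.
    This sketch only covers the "not same but compatible" case.
    """
    if len({len(s) for s in schemas}) != 1:
        raise ValueError("inputs have different column counts")
    output = list(schemas[0])
    for schema in schemas[1:]:
        for col, col_type in enumerate(schema):
            common = widen(output[col], col_type)
            if common is None:
                raise ValueError("incompatible inputs are outside this sketch")
            output[col] = common
    casts = []
    for schema in schemas:
        per_column = [target if actual != target else None
                      for actual, target in zip(schema, output)]
        casts.append(per_column if any(per_column) else None)
    return output, casts


# The example from the post: A = <Int, Chararray>, B = <Double, Chararray>.
output_schema, casts = plan_union([["int", "chararray"],
                                   ["double", "chararray"]])
```

Only input A gets a cast plan here (its first column to double); input B is passed to the union untouched, which is why this option leaves the Union operator itself unchanged.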

Re: Implicit casting on bag operators

Posted by pi song <pi...@gmail.com>.
I see.

On 5/15/08, Alan Gates <ga...@yahoo-inc.com> wrote:
>
> I doubt you'll get the votes for turning it on by default.  Pig's founders
> have been fairly adamant that Pig continue to work in the no-metadata case.
> Turning this on by default would break that rule.
>
> Alan.

Re: Implicit casting on bag operators

Posted by Alan Gates <ga...@yahoo-inc.com>.
I doubt you'll get the votes for turning it on by default.  Pig's
founders have been fairly adamant that Pig continue to work in the
no-metadata case.  Turning this on by default would break that rule.

Alan.

pi song wrote:
> We can have that "strict typing" option in pig.properties and then make the
> type-checking validation consume that config key. However, I want to turn
> it on by default.
>
> Pi

Re: Implicit casting on bag operators

Posted by pi song <pi...@gmail.com>.
We can have that "strict typing" option in pig.properties and then make the
type-checking validation consume that config key. However, I want to turn
it on by default.

Pi
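
Editorial sketch of how a validation step might consume such a config key, in Python for brevity. The key name `pig.typecheck.strict` is hypothetical (the thread never names one), and the default is left as a parameter because the thread does not settle whether strict typing should be on or off by default. The parser handles only simple `key=value` lines, not the full Java properties format.

```python
STRICT_KEY = "pig.typecheck.strict"  # hypothetical key name, not a real Pig setting


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # or ! comments."""
    props = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or line.startswith("!"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def strict_typing_enabled(props, default=False):
    """Return the strict-typing flag, falling back to the given default."""
    raw = props.get(STRICT_KEY)
    if raw is None:
        return default
    return raw.lower() == "true"


example = parse_properties("# local overrides\npig.typecheck.strict = true\n")
strict = strict_typing_enabled(example)
```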


On 5/15/08, Alan Gates <ga...@yahoo-inc.com> wrote:
>
> I agree this will be somewhat surprising; perhaps we should give a warning.
> But we need to preserve our philosophy that "Pigs eat anything".  This
> would seem to dictate that we allow people to use union regardless of the
> schemas.  One open question in my mind is whether we should have a "strict
> mode" (similar to 'use strict' in Perl) where things like this cause errors
> instead of (possibly) warnings.
>
> Alan.

Re: Implicit casting on bag operators

Posted by Alan Gates <ga...@yahoo-inc.com>.
I agree this will be somewhat surprising; perhaps we should give a
warning.  But we need to preserve our philosophy that "Pigs eat
anything".  This would seem to dictate that we allow people to use union
regardless of the schemas.  One open question in my mind is whether we
should have a "strict mode" (similar to 'use strict' in Perl) where things
like this cause errors instead of (possibly) warnings.

Alan.
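
Editorial sketch of the warning-versus-error behavior described above, in Python for brevity. The `strict` flag, the function name `validate_union` and the exception `UnionSchemaError` are hypothetical; the thread only raises strict mode as an open question.

```python
import warnings


class UnionSchemaError(Exception):
    """Raised in strict mode when union inputs have incompatible schemas."""


def validate_union(schemas_compatible, strict=False):
    """Return True if the union keeps a schema, False if it proceeds without one.

    In the default (lenient) mode an incompatible union is still allowed and
    only produces a warning; in strict mode it is rejected with an error.
    """
    if schemas_compatible:
        return True
    message = "union inputs have incompatible schemas; output will have no schema"
    if strict:
        raise UnionSchemaError(message)
    warnings.warn(message, UserWarning)
    return False
```

With `strict=False` the script keeps running, which matches the "Pigs eat anything" philosophy; with `strict=True` the same condition stops it before any data is processed.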

pi song wrote:
> Alan,
>
> On second thought, a union of two incompatible data streams can leave
> downstream operators in an undefined state, resulting in a mix of good
> output and garbage. This seems to break the rule of least surprise. What
> do you think?
>
> Pi

Re: Implicit casting on bag operators

Posted by pi song <pi...@gmail.com>.
Alan,

On second thought, a union of two incompatible data streams can leave
downstream operators in an undefined state, resulting in a mix of good
output and garbage. This seems to break the rule of least surprise. What
do you think?

Pi
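
Editorial sketch of the "mix of good output and garbage" hazard, in Python for brevity. It assumes, purely for illustration, that rows travel as raw 8-byte big-endian values and that the downstream operator decodes every row as a double; this is not a description of how Pig stores tuples, and the helper names are hypothetical.

```python
import struct

# Rows from input A carry 64-bit integers; rows from input B carry doubles.
a_rows = [struct.pack(">q", v) for v in (1, 2, 3)]
b_rows = [struct.pack(">d", v) for v in (1.5, 2.5)]


def decode_as_double(raw):
    """What the downstream operator does: read 8 bytes as a double."""
    return struct.unpack(">d", raw)[0]


def cast_long_to_double(raw):
    """The cast that would sit between input A and the union."""
    return struct.pack(">d", float(struct.unpack(">q", raw)[0]))


# Without a cast, A's integer bytes are reinterpreted as doubles.
uncast = [decode_as_double(r) for r in a_rows + b_rows]

# With the cast applied to A only, every row decodes to the intended value.
cast = [decode_as_double(cast_long_to_double(r)) for r in a_rows] + \
       [decode_as_double(r) for r in b_rows]
```

In `uncast` the rows that came from B are correct while the rows that came from A are meaningless tiny values, and nothing in the output marks which is which.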

On Wed, May 14, 2008 at 9:06 AM, pi song <pi...@gmail.com> wrote:

> Ok, will follow that.

Re: Implicit casting on bag operators

Posted by pi song <pi...@gmail.com>.
Ok, will follow that.

On 5/14/08, Alan Gates <ga...@yahoo-inc.com> wrote:
>
> I agree that option 3 is the correct course.
>
> One note, you say:
>
> In case that schemas from all the input ports are not compatible, no
> problem
> because we won't process it.
>
> How do you mean "won't process it"?  We still have to allow a union
> operation between two non-compatible inputs (otherwise we can only use union
> when we have schemas).  But the resulting union will not have a schema
> (since the output no longer has a consistent schema).
>
> Alan.

Re: Implicit casting on bag operators

Posted by Alan Gates <ga...@yahoo-inc.com>.
I agree that option 3 is the correct course.

One note, you say:

In case that schemas from all the input ports are not compatible, no problem
because we won't process it.

How do you mean "won't process it"?  We still have to allow a union 
operation between two non-compatible inputs (otherwise we can only use 
union when we have schemas).  But the resulting union will not have a 
schema (since the output no longer has a consistent schema).

Alan.
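
Editorial sketch of the rule described above, in Python for brevity: the union is always allowed, but its output only has a schema when the inputs agree or can be widened to a common one. A schema of `None` stands for "no schema". The names `common_type` and `union_schema` and the widening order are hypothetical, not Pig's actual API.

```python
# Hypothetical sketch: not Pig's classes or type-checker API.
NUMERIC_WIDENING = ["int", "long", "float", "double"]  # assumed order


def common_type(t1, t2):
    """Return the common type of two column types, or None if incompatible."""
    if t1 == t2:
        return t1
    if t1 in NUMERIC_WIDENING and t2 in NUMERIC_WIDENING:
        return max(t1, t2, key=NUMERIC_WIDENING.index)
    return None


def union_schema(schemas):
    """Return the union's output schema, or None when there is no consistent one.

    The union itself is never rejected here; incompatible or schema-less
    inputs simply yield an output without a schema.
    """
    if any(s is None for s in schemas):
        return None
    if len({len(s) for s in schemas}) != 1:
        return None
    merged = []
    for column_types in zip(*schemas):
        common = column_types[0]
        for col_type in column_types[1:]:
            common = common_type(common, col_type)
            if common is None:
                return None
        merged.append(common)
    return merged
```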

