Posted to dev@pig.apache.org by "Olga Natkovich (JIRA)" <ji...@apache.org> on 2008/09/11 21:55:44 UTC

[jira] Created: (PIG-427) casting parameters of a UDF

casting parameters of a UDF
---------------------------

                 Key: PIG-427
                 URL: https://issues.apache.org/jira/browse/PIG-427
             Project: Pig
          Issue Type: Improvement
    Affects Versions: types_branch
            Reporter: Olga Natkovich
             Fix For: types_branch


Currently, if a UDF declares via getArgToFuncMapping that it can handle only a subset of types, passing any other type to the function results in an error. However, some types can be promoted to others, and it would be useful for the typechecker to perform a best-fit cast. For instance, if the input parameter has type Long and the UDF supports Int and Double, the code should cast the parameter to Double.

This would be very useful for converting the UDFs in the piggybank to the new code.
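
As a rough illustration of the Long example in the description, here is a minimal Python sketch (not the Pig implementation). The widening order and the `promote` helper are assumptions made up for this example; they are not taken from the Pig source.

```python
# Hypothetical numeric widening order, narrowest first (an assumption for this sketch).
WIDENING = ["int", "long", "float", "double"]


def promote(input_type, supported):
    """Return the narrowest supported type that input_type can widen to, or None."""
    if input_type in supported:
        return input_type  # exact match, no cast needed
    if input_type not in WIDENING:
        return None  # non-numeric types are not promoted in this sketch
    start = WIDENING.index(input_type)
    # Only look at wider types; narrowing (e.g. long -> int) is never chosen.
    for candidate in WIDENING[start + 1:]:
        if candidate in supported:
            return candidate
    return None


# A Long argument passed to a UDF that declares Int and Double versions.
print(promote("long", {"int", "double"}))  # double
```

Narrowing is deliberately excluded, which is why the Int version is not picked for a Long input.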

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-427) casting parameters of a UDF

Posted by "Shravan Matthur Narayanamurthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shravan Matthur Narayanamurthy updated PIG-427:
-----------------------------------------------

    Attachment: 427-1.patch

Patch with Olga's comments included




[jira] Updated: (PIG-427) casting parameters of a UDF

Posted by "Shravan Matthur Narayanamurthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shravan Matthur Narayanamurthy updated PIG-427:
-----------------------------------------------

    Status: Patch Available  (was: Open)

The patch implements the following logic:

It checks whether a fit is possible and, if so, returns a score. The lower the score, the better the fit.
A table of possible casts is maintained, and the table is ordered so as to produce a sensible heuristic for the fit score. The principle behind the heuristic is to prefer fewer casts and, when the number of casts is the same, to prefer conversions to a smaller type, where the ordering of types is:
        INTEGER, LONG, FLOAT, DOUBLE, CHARARRAY, TUPLE, BAG, MAP (from small to big)

Once the best fit is determined, casts are introduced to suit that fit. However, if the schema contains a schema embedded as a Tuple or a Bag, the bestFit function requires these schemas to match exactly. For example, if SUM provides a mapping to BAG(integers) and BAG(floats), and we have BAG(longs) as input, the best fit doesn't try to insert a cast, because the nesting can be arbitrary and finding the right project where the cast should be inserted is a bit complicated.
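
The heuristic above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the code in the patch: the `CASTS` table, the `fit_score` and `best_fit` names, and the use of a (cast count, target size) pair as the score are all hypothetical stand-ins for whatever the patch actually computes.

```python
# Type ordering from the message above, small to big.
TYPE_ORDER = ["int", "long", "float", "double", "chararray", "tuple", "bag", "map"]
RANK = {t: i for i, t in enumerate(TYPE_ORDER)}

# Assumed table of allowed casts (numeric widening only); the real table may differ.
CASTS = {
    "int": {"long", "float", "double"},
    "long": {"float", "double"},
    "float": {"double"},
}


def fit_score(inputs, signature):
    """Return (number of casts, total rank of cast targets), or None if no fit."""
    if len(inputs) != len(signature):
        return None
    casts, size = 0, 0
    for have, want in zip(inputs, signature):
        if have == want:
            continue  # no cast needed for this argument
        if want not in CASTS.get(have, ()):
            return None  # this argument cannot be cast, so the signature does not fit
        casts += 1
        size += RANK[want]
    return (casts, size)


def best_fit(inputs, signatures):
    """Pick the fitting signature with the lowest score, or None if nothing fits."""
    scored = [(fit_score(inputs, s), s) for s in signatures]
    scored = [(score, s) for score, s in scored if score is not None]
    if not scored:
        return None
    return min(scored, key=lambda pair: pair[0])[1]


signatures = [("double", "double"), ("float", "long"), ("long", "long")]
print(best_fit(("int", "long"), signatures))  # ('long', 'long')
```

Comparing the pairs lexicographically means fewer casts always wins, and the size of the target types only breaks ties between signatures needing the same number of casts.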

The patch also includes a test case which tests three scenarios for casting.




[jira] Updated: (PIG-427) casting parameters of a UDF

Posted by "Shravan Matthur Narayanamurthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shravan Matthur Narayanamurthy updated PIG-427:
-----------------------------------------------

    Attachment: 427-3.patch

With all the changes, I had left out the statement that sets the matching spec that was found as the new func spec in LOUserFunc.

One extraneous change is in src/org/apache/pig/pen/EquivalenceClasses.java, which had an unused import that was causing an error in Eclipse. This patch removes it.




[jira] Updated: (PIG-427) casting parameters of a UDF

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-427:
-------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Patch committed; thanks, Shravan!




[jira] Commented: (PIG-427) casting parameters of a UDF

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632763#action_12632763 ] 

Olga Natkovich commented on PIG-427:
------------------------------------

I want to update #4 of my original comments. We should cast bytearray if the UDF map consists of only a single function.

This would help with backward compatibility.




[jira] Commented: (PIG-427) casting parameters of a UDF

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635018#action_12635018 ] 

Olga Natkovich commented on PIG-427:
------------------------------------

I am seeing 2 problems with the patch while running some tests:

Script 1:

a = load 'studentnulltab10k' as (name:chararray, age:int, gpa:double);
b = group a ALL;
c = foreach b generate SUM(a.age);

the output looks like:

(449650.0)

Notice that the result was cast to double even though we have a version of SUM that accepts an int and produces a long.

Script 2:

a = load 'studentnulltab10k' as (name:chararray, age:int, gpa:double);
b = group a ALL;
c = foreach b generate MIN(a.name);

This query fails with the following error stack:

2008-09-26 13:54:28,374 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher - Error message from task (map) task_200809241441_1550_m_000000java.io.IOException: Received Error while processing the reduce plan.
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:166)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:56)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:904)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:785)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:228)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)





[jira] Updated: (PIG-427) casting parameters of a UDF

Posted by "Shravan Matthur Narayanamurthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shravan Matthur Narayanamurthy updated PIG-427:
-----------------------------------------------

    Attachment: 427-2.patch

I have addressed Olga's comments. There was a bug in 427-1.patch, which I think I have resolved now. The patch made the implicit assumption that if a bytearray is found and we have a single func defined, then the matching func is the defined one. However, this need not be the case, so we need to check whether a fit is possible. For example, the input might be (bytearray, int) while the defined func takes (long, tuple). In this case, we need to fail.
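
A rough Python sketch of the bytearray rule described above, together with the exact-match-first ordering requested in review. It is an illustration only: `castable`, `match_with_bytearray`, and the cast table are hypothetical names, and treating bytearray as castable to any declared type is an assumption of this sketch rather than something the patch is known to do.

```python
# Assumed numeric widening casts; not taken from the patch.
NUMERIC_CASTS = {
    "int": {"long", "float", "double"},
    "long": {"float", "double"},
    "float": {"double"},
}


def castable(have, want):
    """True if an argument of type `have` can be passed where `want` is declared."""
    if have == want:
        return True
    if have == "bytearray":
        return True  # assumption: bytearray can be cast to whatever is declared
    return want in NUMERIC_CASTS.get(have, ())


def match_with_bytearray(inputs, signatures):
    """Apply the exact-match and bytearray rules; return a signature or None.

    None means neither rule produced a match (the general best-fit search
    is outside the scope of this sketch).
    """
    inputs = tuple(inputs)
    # Rule 1: an exact schema match wins, however many functions are declared.
    for sig in signatures:
        if inputs == tuple(sig):
            return sig
    # Rule 2: with a bytearray input, only a single declared function is
    # acceptable, and every argument must still be castable to it.
    if "bytearray" in inputs and len(signatures) == 1:
        only = signatures[0]
        if len(only) == len(inputs) and all(
            castable(h, w) for h, w in zip(inputs, only)
        ):
            return only
    return None


# The failing case from the message: int cannot become a tuple.
print(match_with_bytearray(("bytearray", "int"), [("long", "tuple")]))  # None
```

The point of the second rule is that having a single declared function is necessary but not sufficient; the remaining arguments still have to fit.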

Also, hopefully the code is more readable now.




[jira] Updated: (PIG-427) casting parameters of a UDF

Posted by "Shravan Matthur Narayanamurthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shravan Matthur Narayanamurthy updated PIG-427:
-----------------------------------------------

    Attachment: 427.patch




[jira] Commented: (PIG-427) casting parameters of a UDF

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632422#action_12632422 ] 

Olga Natkovich commented on PIG-427:
------------------------------------

Hi Shravan,

Thanks for the patch. I have a couple of comments:

(1) We don't support BOOLEAN as a type in the language, so we don't need it in the mapping.
(2) I don't think we should cast numeric types to chararrays, because we don't know what encoding to use.
(3) I don't think we should cast bytearrays to anything implicitly, since we don't know what a safe cast is in this case.
(4) I think that if multiple functions get the same score, we should say that it is ambiguous and ask for an explicit cast. For instance, if we have 2 functions (int, float) and (float, int) and the input (int, int), we should say that we can't choose.
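
Point (4) could look roughly like this in Python. It is a sketch, not the patch: `AmbiguousFit` and `choose` are hypothetical names, and the scores are plain integers supplied by the caller.

```python
class AmbiguousFit(Exception):
    """Raised when two or more candidate functions share the best score."""


def choose(scored):
    """Pick the signature with the lowest score.

    `scored` is a list of (score, signature) pairs for the candidates
    where a fit is possible. Returns None if the list is empty.
    """
    if not scored:
        return None
    best = min(score for score, _ in scored)
    winners = [sig for score, sig in scored if score == best]
    if len(winners) > 1:
        # Refuse to guess; the user has to add an explicit cast.
        raise AmbiguousFit(f"cannot choose between {winners}; use an explicit cast")
    return winners[0]


# Input (int, int): each candidate needs exactly one int -> float cast.
candidates = [(1, ("int", "float")), (1, ("float", "int"))]
try:
    choose(candidates)
except AmbiguousFit as err:
    print("ambiguous:", err)
```

Raising an error here, instead of silently taking the first candidate, keeps the result independent of the order in which the functions were declared.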




[jira] Commented: (PIG-427) casting parameters of a UDF

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633477#action_12633477 ] 

Olga Natkovich commented on PIG-427:
------------------------------------

Please ignore (3) in my last comment.




[jira] Assigned: (PIG-427) casting parameters of a UDF

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich reassigned PIG-427:
----------------------------------

    Assignee: Shravan Matthur Narayanamurthy




[jira] Commented: (PIG-427) casting parameters of a UDF

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633395#action_12633395 ] 

Olga Natkovich commented on PIG-427:
------------------------------------

Looks good. A couple of comments:

(1) The exact schema match comparison should be made before the bytearray comparison. You can have a UDF that takes a bytearray as a parameter. In that case it does not matter how many functions are present in the table.

(2) It would be good to have a comment explaining what the rules are about bytearrays and also about multiple matches.

(3) It looks like the code is making an assumption that scores will be returned in score order. I was not quite sure why.



