You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by "Olga Natkovich (JIRA)" <ji...@apache.org> on 2010/06/04 01:03:55 UTC

[jira] Created: (PIG-1434) Allow casting relations to scalars

Allow casting relations to scalars
----------------------------------

                 Key: PIG-1434
                 URL: https://issues.apache.org/jira/browse/PIG-1434
             Project: Pig
          Issue Type: Improvement
            Reporter: Olga Natkovich
             Fix For: 0.8.0


This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.

The proposal is to allow casting relations to scalar types in foreach.

Example:

A = load 'data' as (x, y, z);
B = group A all;
C = foreach B generate COUNT(A);
.....
X = ....
Y = foreach X generate $1/(long) C;

Couple of additional comments:

(1) You can only cast relations including a single value or an error will be reported
(2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
(3) Y will look for C closest to it.

Implementation thoughts:

The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
(1) Store C
(2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1434) Allow casting relations to scalars

Posted by "Aniket Mokashi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aniket Mokashi updated PIG-1434:
--------------------------------

    Status: Patch Available  (was: Open)

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch, ScalarImpl1.patch, ScalarImpl5.patch, ScalarImplFinale.patch, ScalarImplFinale1.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1434) Allow casting relations to scalars

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886772#action_12886772 ] 

Daniel Dai commented on PIG-1434:
---------------------------------

We may also add some sanity check, instead of just doing a limit.
{code}
C = foreach C generate CheckSingular(*);
Z = join X by 1, C by 1 using 'replicated';
Y = foreach Z generate X::$1/(long) C.count, X::$2-(long) C.max;
{code}

CheckSingular will check if C only have one record.

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1434) Allow casting relations to scalars

Posted by "Aniket Mokashi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aniket Mokashi updated PIG-1434:
--------------------------------

    Attachment: ScalarImpl1.patch

LOScalar keeps track of the scalars in the logical plan along with the reference to the scalar alias.
During compilation, we add LOStores to respective scalars, we also merge plans as needed.
POScalar is later replaced by POUserFunc and appropriate dependency is added between the MROpers.
Tested with store, dump, explain.

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch, ScalarImpl1.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1434) Allow casting relations to scalars

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884501#action_12884501 ] 

Thejas M Nair commented on PIG-1434:
------------------------------------

bq. Do we fail if we find C to have more than one row, or do we just ignore it? 
I think we should fail in that case, rather than have a surprising/unpredictable behavior.  

bq. Should we try to detect, that C has one row, in frontend?
This will not be possible in all cases. Eg if input has only one row, or a filter is filtering out all except one row.

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1434) Allow casting relations to scalars

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884508#action_12884508 ] 

Thejas M Nair commented on PIG-1434:
------------------------------------

bq. We can try to detect the pattern that makes something as scalar (by marking B (group by all, limit 1) as scalar and then C as scalar etc) and fail upfront otherwise...
I don't think we should limit this feature to the query patterns where we can detect that in frontend. In some cases it will depend on the data.

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1434) Allow casting relations to scalars

Posted by "Aniket Mokashi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aniket Mokashi updated PIG-1434:
--------------------------------

    Attachment: scalarImpl.patch

Initial implemenation

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1434) Allow casting relations to scalars

Posted by "Aniket Mokashi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888073#action_12888073 ] 

Aniket Mokashi commented on PIG-1434:
-------------------------------------

I mean the cases where users type in some alias which are currently not possible to include inside a foreach statement and accidentally getting them treated as scalar by pig and then failing scripts at runtime (and may not fail in one-liner sample cases).
For example-
Y = foreach Z generate C.$0; where C is not a scalar. Currently, this would throw an error upfront, for erroneous usage (logical (do not know restrictions on foreach statement) or typing mistake) of C, But, after we add support of scalars, pig may conclude C to be used as a scalar and generate the plans accordingly.
By introducing square bracketed syntax we can make sure that user intended to use C as a scalar and it wasn't introduced by mistake.A cast would also work for this, but as we have introduced scalar projections (C.count, C.max etc), we already have cases wherein user may mean to cast fields(count, max) rather than scalars themselves.

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1434) Allow casting relations to scalars

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1434:
--------------------------------

    Release Note: 
PIG-1434 adds functionality that allows to cast elements of a single-tuple relation into a scalar value. The primary use case for this is using values of global aggregates in the follow up computations. For instance,

A = load 'mydata' as (userid, clicks);

B = group A all;

C = foreach B genertate SUM(A.clicks) as total;

D = foreach A generate userid,  clicks/(double)C.total;

dump D;

 

This example allows computing the % of the clicks belonging to a particular user. Note that if the SUM as not given a name, a position can be used as well (userid,  clicks/(double)C.$0); Also, note that if explicit cast is not used an implict cast would be inserted according to regular Pig rules. Also, please, note that when the schema can't be inferred chararray rather than bytearray is used.

 

The relation can be used in any place where an expression of the type would make sense. This includes FOREACH, FILTER, and SPLIT.

 

A multi field tuple can also be used:

 

A = load 'mydata' as (userid, clicks);

B = group A all;

C = foreach B genertate SUM(A.clicks) as total, COUNT(A) as cnt;

D = FILTER A by clicks > C.total/3

E = foreach D generate userid,  clicks/(double)C.total, cnt;

Dump E;

 

If a relation contains more than single tuple, a runtime error is generated: "Scalar has more than one row in the output"



  was:
PIG-1434 adds functionality that allows to cast elements of a single-tuple relation into a scalar value. The primary use case for this is using values of global aggregates in the follow up computations. For instance,

 

A = load 'mydata' as (userid, clicks);

B = group A all;

C = foreach B genertate SUM(A.clicks) as total;

D = foreach A generate userid,  clicks/(double)C.total;

dump D;

 

This example allows computing the % of the clicks belonging to a particular user. Note that if the SUM as not given a name, a position can be used as well (userid,  clicks/(double)C.$0); Also, note that if explicit cast is not used an implict cast would be inserted according to regular Pig rules.

 

The relation can be used in any place where an expression of the type would make sense. This includes FOREACH, FILTER, and SPLIT.

 

A multi field tuple can also be used:

 

A = load 'mydata' as (userid, clicks);

B = group A all;

C = foreach B genertate SUM(A.clicks) as total, COUNT(A) as cnt;

D = FILTER A by clicks > C.total/3

E = foreach D generate userid,  clicks/(double)C.total, cnt;

Dump E;

 

If a relation contains more than single tuple, a runtime error is generated: "Scalar has more than one row in the output"




> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch, ScalarImpl1.patch, ScalarImpl5.patch, ScalarImplFinale.patch, ScalarImplFinale1.patch, ScalarImplFinaleRebase.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1434) Allow casting relations to scalars

Posted by "Aniket Mokashi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899042#action_12899042 ] 

Aniket Mokashi commented on PIG-1434:
-------------------------------------

Comments on the finalized syntax--

With the above changes Pig now supports -
{code}
Y = foreach X generate $1/(long) C.count, $2-(long) C.max;
{code}
1. Casts are *optional* and the datatype of scalar depends on the schema of C (ie depending on the schema of C, we add the casts implicitly. So, typically, count is a long and max is a double). In case of undeclared(null) schema for C, default type of scalar is *chararray*.

2. Projections are mandatory. For example
{code}
Y = foreach X generate C; // is an *error*
{code}
We need to use-
{code}
Y = foreach X generate C.$0; 
{code}

3. Check if C is a scalar or not is not performed until runtime, thus it will fail at the time of execution of UDF with ExecException("Scalar has more than one row in the output").

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch, ScalarImpl1.patch, ScalarImpl5.patch, ScalarImplFinale.patch, ScalarImplFinale1.patch, ScalarImplFinaleRebase.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1434) Allow casting relations to scalars

Posted by "Richard Ding (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1434:
------------------------------

          Status: Resolved  (was: Patch Available)
    Hadoop Flags: [Reviewed]
      Resolution: Fixed
            Tags: documentation

The patch is committed to trunk. Thanks Aniket for contributing this feature.

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch, ScalarImpl1.patch, ScalarImpl5.patch, ScalarImplFinale.patch, ScalarImplFinale1.patch, ScalarImplFinaleRebase.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1434) Allow casting relations to scalars

Posted by "Aniket Mokashi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aniket Mokashi updated PIG-1434:
--------------------------------

    Status: Patch Available  (was: Open)

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch, ScalarImpl1.patch, ScalarImpl5.patch, ScalarImplFinale.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1434) Allow casting relations to scalars

Posted by "Aniket Mokashi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aniket Mokashi updated PIG-1434:
--------------------------------

    Status: Open  (was: Patch Available)

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch, ScalarImpl1.patch, ScalarImpl5.patch, ScalarImplFinale.patch, ScalarImplFinale1.patch, ScalarImplFinaleRebase.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1434) Allow casting relations to scalars

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881784#action_12881784 ] 

Daniel Dai commented on PIG-1434:
---------------------------------

We decide to change some implementation to solve the following problem:
1. To decide when to add store. Currently, we parse statement by statement, until we saw a store, we merge that branch into the integrated logical plan. If we add store too late, the merge algorithm cannot see the store and discard this branch. If we add store too early (during the parsing of Y, in the example), then later we do not store/dump Y, we get a redundant store for C
2. Implicit dependency between C -> Y. C will create a side file and Y will use it. However, this is not the normal data flow and should not be represented as a connection in logical plan

Now we are exploring the following implementation:
1. Add LOScalar, POScalar to represent scalar expression
2. When parsing Y, we put LOScalar as a placeholder in the ForEach inner plan
3. When parsing store (Y), we know we need to merge the store branch. In the mean time, we check the branch (Y) if it contains a scalar, if so, find what the scalar refers to (C), add a store to that branch, and merge that branch to the integrated logical plan
4. Add a map reduce layer optimizer ScalarOptimizer. It check for map-reduce job contains POScalar, and map-reduce job POScalar contains the operator POScalar refers to, create a dependency between these two map-reduce jobs. ScalarOptimizer should run before MultiQueryOptimizer

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1434) Allow casting relations to scalars

Posted by "Aniket Mokashi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aniket Mokashi updated PIG-1434:
--------------------------------

    Attachment: ScalarImpl5.patch

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch, ScalarImpl1.patch, ScalarImpl5.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1434) Allow casting relations to scalars

Posted by "Aniket Mokashi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aniket Mokashi updated PIG-1434:
--------------------------------

    Attachment:     (was: ScalarImpl1.patch)

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch, ScalarImpl1.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1434) Allow casting relations to scalars

Posted by "Richard Ding (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884794#action_12884794 ] 

Richard Ding commented on PIG-1434:
-----------------------------------


So all one needs to do is internally replace the line: 

{code}
Y = foreach X generate $1/(long) C.count, $2-(long) C.max;
{code}

with

{code}
Z = join X by 1, C by 1 using 'replicated';
Y = foreach Z generate X::$1/(long) C.count, X::$2-(long) C.max;
{code}



> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1434) Allow casting relations to scalars

Posted by "Aniket Mokashi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aniket Mokashi updated PIG-1434:
--------------------------------

    Status: Patch Available  (was: Open)

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch, ScalarImpl1.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1434) Allow casting relations to scalars

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884755#action_12884755 ] 

Dmitriy V. Ryaboy commented on PIG-1434:
----------------------------------------

Richard,
In my opinion -- Yes in principle, but not as a replacement for this. 
Cross is dangerous, I would rather have a constrained "implicit scalar tuple" that people use for the common case, and leave something like a replicated cross for power users (and then not limit it to number of rows in replicated relation).

Agreed with not constraining this feature on the frontend, failing at runtime instead. 

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1434) Allow casting relations to scalars

Posted by "Aniket Mokashi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aniket Mokashi updated PIG-1434:
--------------------------------

    Status: Patch Available  (was: Open)

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1434) Allow casting relations to scalars

Posted by "Aniket Mokashi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aniket Mokashi updated PIG-1434:
--------------------------------

    Attachment: ScalarImpl1.patch

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch, ScalarImpl1.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1434) Allow casting relations to scalars

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883958#action_12883958 ] 

Thejas M Nair commented on PIG-1434:
------------------------------------

Should we have a special syntax when relational alias is used as a scalar ? Something like "[C]" (instead of just "C") .
I think the special sytax is useful because -
1. Users might accidentally use relation-op alias in this scalar context and not figure out the problem until runtime (when the evaluation of the relation results in more than one row).
2. It will prevent surprises if a column with same name as alias is introduced in a new version of input data .  Assuming the load function implements LoadMetadata.getSchema() and the load statement does not specify a new schema. With current plan, if there is a new column by same name, the column gets used instead of relation-op value. This will give different results, not what the user expects.

 

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1434) Allow casting relations to scalars

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886825#action_12886825 ] 

Daniel Dai commented on PIG-1434:
---------------------------------

We also need to enforce C only have one part file to do the check (use limit to achieve it).
{code}
C = limit C 2;
C = foreach C generate CheckSingular(*);
Z = join X by 1, C by 1 using 'replicated';
Y = foreach Z generate X::$1/(long) C.count, X::$2-(long) C.max;
{code}

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1434) Allow casting relations to scalars

Posted by "Aniket Mokashi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aniket Mokashi updated PIG-1434:
--------------------------------

    Attachment: ScalarImplFinale.patch

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch, ScalarImpl1.patch, ScalarImpl5.patch, ScalarImplFinale.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1434) Allow casting relations to scalars

Posted by "Aniket Mokashi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882711#action_12882711 ] 

Aniket Mokashi commented on PIG-1434:
-------------------------------------

The proposal for scalars is as follows -
{code}
A = load '1.txt' as (a1, a2);
B = group A all;
C = foreach B generate COUNT(A);
Y = foreach A generate C;
store Y into 'Ystore';
{code}
Based on the schema of C, we detect that Y means to use C as a scalar and internally track it as scalar. Thus, operations like C * C are also allowed. The limitation is that C should have long convertible value (when stored into the file). Also (int) C would be allowed and will succeed if the cast operation succeeds.

As mentioned by Daniel earlier, there are two challenges in introducing scalars--
1. Addition of implicit store- We cannot do it too early (parsing), as we get redundant (implicit) store operation for rest of the commands in the script. If we do it too late, merge algorithm doesn't find the store and discards the branch that compiles and executes the store.
To solve this, whenever we process a store plan after the parsing stage, we detect the existence of scalars into the plan and add required branches that has those scalars into the current plan. We also attach LOStores for the scalars and merge the required plan.
2. Tracking of implicit dependency- Existence of scalar C needs to be converted into a implicit ReadScalar operation, but other than this it also needs to add dependency on the map-reduce job that generates this scalar value. We track this dependency by adding LOScalar, POScalar operators that carry the reference to the scalar they depend upon. When we compile the map reduce plan, we replace POScalar with POUserFunc to load the scalar value and mark the dependency between two map reduce jobs.

I am attaching the patch with above mentioned changes.

Few known issues-
To track the dependencies of scalars, we need access to map of operators from one type of plan to other, but this map is generated by visitors. The same visitors are responsible for converting LOScalar ->POScalar -> POUserFunc. So, if a visitor visits LOScalar before LO associated with scalar ( C in example) we do not find PO associated with C. 

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1434) Allow casting relations to scalars

Posted by "Aniket Mokashi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aniket Mokashi updated PIG-1434:
--------------------------------

    Status: Open  (was: Patch Available)

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch, ScalarImpl1.patch, ScalarImpl5.patch, ScalarImplFinale.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1434) Allow casting relations to scalars

Posted by "Aniket Mokashi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882725#action_12882725 ] 

Aniket Mokashi commented on PIG-1434:
-------------------------------------

Submitting to hudson to check for test failures

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1434) Allow casting relations to scalars

Posted by "Aniket Mokashi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aniket Mokashi updated PIG-1434:
--------------------------------

    Attachment: ScalarImplFinale1.patch

Removed the unused variable(findbug warning)

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch, ScalarImpl1.patch, ScalarImpl5.patch, ScalarImplFinale.patch, ScalarImplFinale1.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1434) Allow casting relations to scalars

Posted by "Aniket Mokashi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aniket Mokashi updated PIG-1434:
--------------------------------

    Status: Open  (was: Patch Available)

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1434) Allow casting relations to scalars

Posted by "Richard Ding (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884759#action_12884759 ] 

Richard Ding commented on PIG-1434:
-----------------------------------

I agree that we should use the right syntax. What I meant was that it can be implemented as a 'replicated' cross which seems to solve the problems of implicit dependency and using distributed cache.

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1434) Allow casting relations to scalars

Posted by "Aniket Mokashi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886914#action_12886914 ] 

Aniket Mokashi commented on PIG-1434:
-------------------------------------

Adding this support makes pig code complicated/hacky, because we conclude any not parsed alias (AliasFieldOrSpec) as scalar and try to resolve it as scalar at runtime.

To simplify, square bracketed syntax is a better idea, for example- 
{code}
Y = foreach Z generate X::$1/(long) [C].count, X::$2-(long) [C].max;
{code}
Otherwise, such queries (if typed by mistakes) can result into non-intuitive errors for users.

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1434) Allow casting relations to scalars

Posted by "Aniket Mokashi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aniket Mokashi updated PIG-1434:
--------------------------------

    Status: Patch Available  (was: Open)

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch, ScalarImpl1.patch, ScalarImpl5.patch, ScalarImplFinale.patch, ScalarImplFinale1.patch, ScalarImplFinaleRebase.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1434) Allow casting relations to scalars

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894837#action_12894837 ] 

Hadoop QA commented on PIG-1434:
--------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12451096/ScalarImplFinale1.patch
  against trunk revision 980930.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 1 new Findbugs warnings.

    -1 release audit.  The applied patch generated 409 release audit warnings (more than the trunk's current 403 warnings).

    +1 core tests.  The patch passed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/368/testReport/
Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/368/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/368/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/368/console

This message is automatically generated.

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch, ScalarImpl1.patch, ScalarImpl5.patch, ScalarImplFinale.patch, ScalarImplFinale1.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1434) Allow casting relations to scalars

Posted by "Aniket Mokashi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aniket Mokashi updated PIG-1434:
--------------------------------

    Attachment: ScalarImplFinale1.patch

Missing field in scalar file are handled by returning null. Empty scalar file/empty scalar directory tested.
ScalarPhyFinder is moved as local variable. Removed redundant comments and apis inside visitors.
Added a new testcase for multiquery.
Fixed findbugs, javac and javadoc warnings (needs findbugs exclusion since we throw an error when second line is found (not_null) in UDF).


> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch, ScalarImpl1.patch, ScalarImpl5.patch, ScalarImplFinale.patch, ScalarImplFinale1.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1434) Allow casting relations to scalars

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888051#action_12888051 ] 

Alan Gates commented on PIG-1434:
---------------------------------

bq. Adding this support makes pig code complicated/hacky, because we conclude any not parsed alias (AliasFieldOrSpec) as scalar and try to resolve it as scalar at runtime.

I don't understand when we'll see this issue.  Could you give me examples of errors users might make that we'll miss.  I definitely don't like the brackets.

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1434) Allow casting relations to scalars

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888449#action_12888449 ] 

Alan Gates commented on PIG-1434:
---------------------------------

Alright, I finally understand.  I think the potential confusion for the user and the Pig parser is caused by the proposed way to handle multi-columned input.  Rather than

{code}
Y = foreach Z generate X::$1/(long) C.count, X::$2-(long) C.max;
{code}

if we instead do
{code}
Y = foreach Z generate X::$1/((tuple)C).count, X::$2 - ((tuple)C).max;
{code}

then I believe it is clear for both user and parser what is happening.

In each case C is being cast to a tuple and then fields read out of it.  C is not being cast to a long.  Then the feature remains basically as originally proposed.  The relation being cast must have one record and one field.  That one field can be a tuple to handle the case where the record has multiple fields.  But Pig will still reads it as a single column which is a tuple, and the user will need to cast it accordingly.

This should also avoid accidental usage.  In the example above:
{code}
Y = foreach Z generate X::$1/C.count, X::$2 - C.max;
{code}
should still be an error because the type checker should not be able to find C as a tuple anywhere in its symbol table.


> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1434) Allow casting relations to scalars

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1434:
--------------------------------

    Release Note: 
PIG-1434 adds functionality that allows to cast elements of a single-tuple relation into a scalar value. The primary use case for this is using values of global aggregates in the follow up computations. For instance,

 

A = load 'mydata' as (userid, clicks);

B = group A all;

C = foreach B genertate SUM(A.clicks) as total;

D = foreach A generate userid,  clicks/(double)C.total;

dump D;

 

This example allows computing the % of the clicks belonging to a particular user. Note that if the SUM as not given a name, a position can be used as well (userid,  clicks/(double)C.$0); Also, note that if explicit cast is not used an implict cast would be inserted according to regular Pig rules.

 

The relation can be used in any place where an expression of the type would make sense. This includes FOREACH, FILTER, and SPLIT.

 

A multi field tuple can also be used:

 

A = load 'mydata' as (userid, clicks);

B = group A all;

C = foreach B genertate SUM(A.clicks) as total, COUNT(A) as cnt;

D = FILTER A by clicks > C.total/3

E = foreach D generate userid,  clicks/(double)C.total, cnt;

Dump E;

 

If a relation contains more than single tuple, a runtime error is generated: "Scalar has more than one row in the output"



> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch, ScalarImpl1.patch, ScalarImpl5.patch, ScalarImplFinale.patch, ScalarImplFinale1.patch, ScalarImplFinaleRebase.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1434) Allow casting relations to scalars

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882732#action_12882732 ] 

Hadoop QA commented on PIG-1434:
--------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12448098/scalarImpl.patch
  against trunk revision 958053.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 6 new or modified tests.

    -1 patch.  The patch command could not apply the patch.

Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/351/console

This message is automatically generated.

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1434) Allow casting relations to scalars

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900099#action_12900099 ] 

Thejas M Nair commented on PIG-1434:
------------------------------------

bq. 1. Casts are optional and the datatype of scalar depends on the schema of C (ie depending on the schema of C, we add the casts implicitly. So, typically, count is a long and max is a double). In case of undeclared(null) schema for C, default type of scalar is chararray.

The default type everywhere else in pig is bytearray. I think we should follow that convention here as well . Any reason not to do that ? (The change can be part of a separate jira.)


> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch, ScalarImpl1.patch, ScalarImpl5.patch, ScalarImplFinale.patch, ScalarImplFinale1.patch, ScalarImplFinaleRebase.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1434) Allow casting relations to scalars

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890159#action_12890159 ] 

Hadoop QA commented on PIG-1434:
--------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12449903/ScalarImpl1.patch
  against trunk revision 965559.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 6 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to cause Findbugs to fail.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/347/testReport/
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/347/console

This message is automatically generated.

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch, ScalarImpl1.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1434) Allow casting relations to scalars

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1434:
-------------------------------

    Release Note: 
PIG-1434 adds functionality that allows to cast elements of a single-tuple relation into a scalar value. The primary use case for this is using values of global aggregates in the follow up computations. For instance,

A = load 'mydata' as (userid, clicks);

B = group A all;

C = foreach B genertate SUM(A.clicks) as total;

D = foreach A generate userid,  clicks/(double)C.total;

dump D;

 

This example allows computing the % of the clicks belonging to a particular user. Note that if the SUM as not given a name, a position can be used as well (userid,  clicks/(double)C.$0); Also, note that if explicit cast is not used an implict cast would be inserted according to regular Pig rules. Also, please, note that when the schema can't be inferred bytearray is used.

 

The relation can be used in any place where an expression of the type would make sense. This includes FOREACH, FILTER, and SPLIT.

 

A multi field tuple can also be used:

 

A = load 'mydata' as (userid, clicks);

B = group A all;

C = foreach B genertate SUM(A.clicks) as total, COUNT(A) as cnt;

D = FILTER A by clicks > C.total/3

E = foreach D generate userid,  clicks/(double)C.total, cnt;

Dump E;

 

If a relation contains more than single tuple, a runtime error is generated: "Scalar has more than one row in the output"



  was:
PIG-1434 adds functionality that allows to cast elements of a single-tuple relation into a scalar value. The primary use case for this is using values of global aggregates in the follow up computations. For instance,

A = load 'mydata' as (userid, clicks);

B = group A all;

C = foreach B genertate SUM(A.clicks) as total;

D = foreach A generate userid,  clicks/(double)C.total;

dump D;

 

This example allows computing the % of the clicks belonging to a particular user. Note that if the SUM as not given a name, a position can be used as well (userid,  clicks/(double)C.$0); Also, note that if explicit cast is not used an implict cast would be inserted according to regular Pig rules. Also, please, note that when the schema can't be inferred chararray rather than bytearray is used.

 

The relation can be used in any place where an expression of the type would make sense. This includes FOREACH, FILTER, and SPLIT.

 

A multi field tuple can also be used:

 

A = load 'mydata' as (userid, clicks);

B = group A all;

C = foreach B genertate SUM(A.clicks) as total, COUNT(A) as cnt;

D = FILTER A by clicks > C.total/3

E = foreach D generate userid,  clicks/(double)C.total, cnt;

Dump E;

 

If a relation contains more than single tuple, a runtime error is generated: "Scalar has more than one row in the output"




Changed the release note to incorporate the change of default datatype to bytearray in PIG-1572

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch, ScalarImpl1.patch, ScalarImpl5.patch, ScalarImplFinale.patch, ScalarImplFinale1.patch, ScalarImplFinaleRebase.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1434) Allow casting relations to scalars

Posted by "Richard Ding (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894657#action_12894657 ] 

Richard Ding commented on PIG-1434:
-----------------------------------

It looks really good. Additional corner cases need to be considered:

* missing field in scalar file
* empty scalar file
* empty input directory

Some minor issues:

*  In PigServer. FileLocalizer.getTemporaryPath should be changed to use non-deprecated method.
*  In ScalarFinder, method isScalarPresent is not used.
*  In MRCompiler, variable scalarPhyFinder should be local so that ScalarPhyFinder can be simplified.

Also add a test case for using scalar in multi-query would be good.

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch, ScalarImpl1.patch, ScalarImpl5.patch, ScalarImplFinale.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1434) Allow casting relations to scalars

Posted by "Aniket Mokashi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aniket Mokashi updated PIG-1434:
--------------------------------

    Attachment:     (was: ScalarImplFinale1.patch)

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch, ScalarImpl1.patch, ScalarImpl5.patch, ScalarImplFinale.patch, ScalarImplFinale1.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1434) Allow casting relations to scalars

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888546#action_12888546 ] 

Dmitriy V. Ryaboy commented on PIG-1434:
----------------------------------------

+1 for casting as tuple.  Though it may have to look like

{code}
Y = foreach Z generate X::$1/(long) ((tuple)C).count, X::$2 - (long) ((tuple)C).max;
{code}

Definitely -1 on the bracket syntax.. it seems very non-intuitive. 

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1434) Allow casting relations to scalars

Posted by "Aniket Mokashi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884503#action_12884503 ] 

Aniket Mokashi commented on PIG-1434:
-------------------------------------

bq Should we try to detect, that C has one row, in frontend?
We can try to detect the pattern that makes something as scalar (by marking B (group by all, limit 1) as scalar and then C as scalar etc) and fail upfront otherwise...

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1434) Allow casting relations to scalars

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884807#action_12884807 ] 

Dmitriy V. Ryaboy commented on PIG-1434:
----------------------------------------

I think it's important in this particular case to ensure that C only contains one tuple, since multiple tuples will lead to very confusing output.
Maybe

{code}
C = LIMIT C 1;
Z = join X by 1, C by 1 using 'replicated';
Y = foreach Z generate X::$1/(long) C.count, X::$2-(long) C.max;
{code}

(this is where an optimization that notes that you don't need m/r boundary if the relation you are limiting only has one partition would come in handy).

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1434) Allow casting relations to scalars

Posted by "Richard Ding (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884750#action_12884750 ] 

Richard Ding commented on PIG-1434:
-----------------------------------


How about a "replicated" cross?

{code}
A = load 'data' as (x, y, z);
B = group A all;
C = foreach B generate COUNT(A) as count, MAX(A.y) as max;
.....
X = ....
Y = cross X, C using 'repl';
Z = foreach Y generate X::$1/(long) C.count, X::$2-(long) C.max;
{code}


> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1434) Allow casting relations to scalars

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884791#action_12884791 ] 

Thejas M Nair commented on PIG-1434:
------------------------------------

I think the replicated cross is a good alternative to this feature, though this feature is probably more friendly for a beginner pig user. But if this feature makes the pig code very complicated/hacky (the dependency order and stuff), I think it might not be a bad idea to encourage the use of replicated-join instead .

As a side note, we can actually get 'replicated cross' working using replicated join -
eg -
{code}
 j = join l1 by 1, l2 by 1 using 'replicated';
{code}

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1434) Allow casting relations to scalars

Posted by "Aniket Mokashi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aniket Mokashi updated PIG-1434:
--------------------------------

    Attachment:     (was: ScalarImpl1.patch)

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1434) Allow casting relations to scalars

Posted by "Aniket Mokashi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884498#action_12884498 ] 

Aniket Mokashi commented on PIG-1434:
-------------------------------------

I agree to Thejas that we should have a way for user to specify that he means to use C as scalar. This will avoid errors in pig code. 
Thus, we have, 
{code}
Y = foreach X generate $1/(int) [C].count, $2- [C],max, ([C].$1+2);
{code}
Do we fail if we find C to have more than one row, or do we just ignore it? Should we try to detect, that C has one row, in frontend?

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1434) Allow casting relations to scalars

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893939#action_12893939 ] 

Hadoop QA commented on PIG-1434:
--------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12450872/ScalarImplFinale.patch
  against trunk revision 980276.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    -1 javadoc.  The javadoc tool appears to have generated 1 warning messages.

    -1 javac.  The applied patch generated 146 javac compiler warnings (more than the trunk's current 145 warnings).

    -1 findbugs.  The patch appears to introduce 5 new Findbugs warnings.

    -1 release audit.  The applied patch generated 406 release audit warnings (more than the trunk's current 400 warnings).

    +1 core tests.  The patch passed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/366/testReport/
Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/366/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/366/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/366/console

This message is automatically generated.

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch, ScalarImpl1.patch, ScalarImpl5.patch, ScalarImplFinale.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1434) Allow casting relations to scalars

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884502#action_12884502 ] 

Dmitriy V. Ryaboy commented on PIG-1434:
----------------------------------------

SQL fails at runtime when executing queries that require a single row to be returned. So, oracle won't complain if you do this, for example:

{code}

SELECT foo.a, (SELECT c 
               FROM bar 
               WHERE foo.a = bar.a) 
from foo

{code}

unless the inner select produces more than one row. I think we should adopt the same approach -- assume the query is innocent until proven guilty.

-D

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Assigned: (PIG-1434) Allow casting relations to scalars

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich reassigned PIG-1434:
-----------------------------------

    Assignee: Aniket Mokashi

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1434) Allow casting relations to scalars

Posted by "Aniket Mokashi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aniket Mokashi updated PIG-1434:
--------------------------------

    Status: Open  (was: Patch Available)

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch, ScalarImpl1.patch, ScalarImpl5.patch, ScalarImplFinale.patch, ScalarImplFinale1.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1434) Allow casting relations to scalars

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884369#action_12884369 ] 

Dmitriy V. Ryaboy commented on PIG-1434:
----------------------------------------

A couple of thoughts that came out of the Pig conributor meeting:

1) rather than scalar, we should make this work for single-tuple relations. That way a user can do something like this: 

{code}
A = load 'data' as (x, y, z);
B = group A all;
C = foreach B generate COUNT(A) as count, MAX(A.y) as max;
.....
X = ....
Y = foreach X generate $1/(long) C.count, $2-(long) C.max;
{code}

2) Writing the intermediate relation to a file can cause hotspots. We should push this into the distributed cache. In cases when the dist. cache is turned off, we can at least increase the replication factor to some large-ish number (10, maybe, like the jobs?)

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1434) Allow casting relations to scalars

Posted by "Aniket Mokashi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aniket Mokashi updated PIG-1434:
--------------------------------

    Attachment: ScalarImpl1.patch

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch, ScalarImpl1.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1434) Allow casting relations to scalars

Posted by "Aniket Mokashi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aniket Mokashi updated PIG-1434:
--------------------------------

    Attachment: ScalarImplFinaleRebase.patch

Attaching rebased version of the patch...

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch, ScalarImpl1.patch, ScalarImpl5.patch, ScalarImplFinale.patch, ScalarImplFinale1.patch, ScalarImplFinaleRebase.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.