Posted to dev@pig.apache.org by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org> on 2010/07/01 20:13:50 UTC

[jira] Commented: (PIG-1434) Allow casting relations to scalars

    [ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884369#action_12884369 ] 

Dmitriy V. Ryaboy commented on PIG-1434:
----------------------------------------

A couple of thoughts that came out of the Pig contributor meeting:

1) Rather than a scalar, we should make this work for single-tuple relations. That way a user can do something like this:

{code}
A = load 'data' as (x, y, z);
B = group A all;
C = foreach B generate COUNT(A) as count, MAX(A.y) as max;
.....
X = ....
Y = foreach X generate $1/(long) C.count, $2-(long) C.max;
{code}
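
To spell out what the script is meant to compute, here is a rough model in plain Python. It is only a sketch of the intended semantics, not of how Pig would implement it: the sample rows, the stand-in for the elided relation X, and the helper name scalar_field are all made up for illustration, and the single-tuple check mirrors the "or an error will be reported" rule from the issue description.

```python
# Plain-Python model of the Pig script above; the sample rows are made up.

def scalar_field(relation, field):
    """Return `field` from a relation that must hold exactly one tuple."""
    if len(relation) != 1:
        raise ValueError(f"expected exactly one tuple, got {len(relation)}")
    return relation[0][field]

# A = load 'data' as (x, y, z);
A = [(1, 10, 100), (2, 20, 200), (3, 30, 300), (4, 40, 400)]

# B = group A all;
# C = foreach B generate COUNT(A) as count, MAX(A.y) as max;
C = [{"count": len(A), "max": max(y for _, y, _ in A)}]

# X stands in for the elided relation; here it is simply A.
X = A

# Y = foreach X generate $1/(long) C.count, $2-(long) C.max;
count = scalar_field(C, "count")
maximum = scalar_field(C, "max")
# Integer division, assuming $1 ends up being treated as a long.
Y = [(row[1] // count, row[2] - maximum) for row in X]

print(Y)
```

The point is that C.count and C.max are each resolved once, from the only tuple in C, and then applied to every row of X.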

2) Writing the intermediate relation to a file can cause hotspots, since every task ends up reading that same file. We should push it into the distributed cache instead. In cases where the distributed cache is turned off, we can at least increase the file's replication factor to some large-ish number (10, maybe, like the jobs?)
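
For a rough sense of what the replication factor buys, here is some back-of-the-envelope arithmetic in Python. It assumes readers spread evenly across replicas, which real clusters will not guarantee, and the task count is a made-up number. The function name is hypothetical. 3 is the usual HDFS default replication and 10 is the value floated above.

```python
import math

def readers_per_replica(num_tasks, replication):
    """Average number of tasks hitting each replica, assuming an even spread."""
    return math.ceil(num_tasks / replication)

num_tasks = 1000  # made-up job size, for illustration only

for replication in (3, 10):
    print(replication, readers_per_replica(num_tasks, replication))
```

Under those assumptions, going from 3 to 10 replicas cuts the per-replica load by roughly a factor of three. The distributed cache goes further, since each node fetches the file once rather than once per task.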

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> A couple of additional comments:
> (1) You can only cast relations that contain a single value; otherwise an error will be reported.
> (2) Name resolution is needed, since relation X might have a field named C, in which case that field takes precedence.
> (3) Y will look for the C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into a scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF
