Posted to dev@pig.apache.org by "David Ciemiewicz (Created) (JIRA)" <ji...@apache.org> on 2012/01/24 18:22:40 UTC

[jira] [Created] (PIG-2490) Add UDF function chaining syntax

Add UDF function chaining syntax
--------------------------------

                 Key: PIG-2490
                 URL: https://issues.apache.org/jira/browse/PIG-2490
             Project: Pig
          Issue Type: Improvement
            Reporter: David Ciemiewicz


Nested function/UDF calls make for very convoluted data transformations:

{code}
business1     9:00 AM - 4:00 PM
{code}

{code}
B = foreach A generate
    REGEXREPLACE(REGEXREPLACE(REGEXREPLACE(hours,' AM\\b','a'), ' PM\\b', 'p'), ' *- *', '-') as hours_normalized;
{code}

Yes, you could recast this as follows, but it's still rather convoluted.

{code}
B = foreach A {
    hours1 = REGEXREPLACE(hours,' AM\\b','a');
    hours2 = REGEXREPLACE(hours1,' PM\\b','p');
    hours3 = REGEXREPLACE(hours2,' *- *','-');
    generate
    hours3 as hours_normalized;
    };
{code}

I suggest an "object-style" function chaining enhancement to the grammar a la Java, JavaScript, etc.

{code}
B = foreach A generate
    REGEXREPLACE(hours,' AM\\b','a').REGEXREPLACE(' PM\\b','p').REGEXREPLACE(' *- *','-') as hours_normalized;
{code}

This chaining notation makes the sequence of actions much clearer, without the convoluted nesting.

In the case of the "object-method" style dot (.) notation, the result of the prior expression is just used as the first value in the tuple passed to the function call.

In other words, the following two expressions would be equivalent:

{code}
f(a,b)
a.f(b)
{code}

As such, I don't think there are any requirements to modify existing UDFs.
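
To illustrate why existing UDFs would not need to change, here is a minimal Python sketch of the equivalence. It is not Pig code: `regex_replace` is a hypothetical stand-in for the REGEXREPLACE UDF used above, and the `Chained` wrapper is an assumption made only to emulate the proposed dot notation.

```python
import re


def regex_replace(text, pattern, replacement):
    """Hypothetical stand-in for the REGEXREPLACE UDF used in this issue."""
    return re.sub(pattern, replacement, text)


class Chained:
    """Emulates a.f(b): the wrapped value is passed as the first argument of f."""

    def __init__(self, value):
        self.value = value

    def call(self, func, *args):
        # The result of the prior expression becomes the first argument.
        return Chained(func(self.value, *args))


hours = "9:00 AM - 4:00 PM"

# Nested form: f(f(f(a, ...), ...), ...)
nested = regex_replace(
    regex_replace(regex_replace(hours, r" AM\b", "a"), r" PM\b", "p"),
    r" *- *",
    "-",
)

# Chained form: the same unmodified function, applied left to right.
chained = (
    Chained(hours)
    .call(regex_replace, r" AM\b", "a")
    .call(regex_replace, r" PM\b", "p")
    .call(regex_replace, r" *- *", "-")
    .value
)

print(nested)   # 9:00a-4:00p
print(chained)  # 9:00a-4:00p
```

Both forms call the same function with the same arguments; only the way the first argument is supplied differs.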

I think this is just a syntactic "sugar" enhancement that should be fairly trivial to implement, yet would make coding complex data transformations with Pig UDFs "cleaner".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (PIG-2490) Add UDF function chaining syntax

Posted by "David Ciemiewicz (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Ciemiewicz updated PIG-2490:
----------------------------------

    Description: 
Nested function/UDF calls can make for very convoluted data transformations.

For example, given the following sample data:
{code}
business1     9:00 AM - 4:00 PM
{code}

Transforming it with Pig UDFs to normalize the hours to "9:00a-4:00p" might look like the following:
{code}
B = foreach A generate
    REGEXREPLACE(REGEXREPLACE(REGEXREPLACE(hours,' AM\\b','a'), ' PM\\b', 'p'), ' *- *', '-')
        as hours_normalized;
{code}

Yes, you could recast this as follows, but it's still rather convoluted.

{code}
B = foreach A {
    hours1 = REGEXREPLACE(hours,' AM\\b','a');
    hours2 = REGEXREPLACE(hours1,' PM\\b','p');
    hours3 = REGEXREPLACE(hours2,' *- *','-');
    generate
    hours3 as hours_normalized;
    };
{code}

I suggest an "object-style" function chaining enhancement to the grammar a la Java, JavaScript, etc.

{code}
B = foreach A generate
    REGEXREPLACE(hours,' AM\\b','a').REGEXREPLACE(' PM\\b','p').REGEXREPLACE(' *- *','-')
        as hours_normalized;
{code}

This chaining notation makes the sequence of actions much clearer, without the convoluted nesting.

In the case of the "object-method" style dot (.) notation, the result of the prior expression is just used as the first value in the tuple passed to the function call.

In other words, the following two expressions would be equivalent:

{code}
f(a,b)
a.f(b)
{code}

As such, I don't think there are any requirements to modify existing UDFs.

I think this is just a syntactic "sugar" enhancement that should be fairly trivial to implement, yet would make coding complex data transformations with Pig UDFs "cleaner".
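
As a rough sketch of how small the rewrite could be, the following Python function turns a chained expression into the equivalent nested form by plain text manipulation. This is an illustration only, not Pig's parser: `split_chain` and `desugar` are hypothetical names, escaped quotes inside string literals are not handled, and a dot that is not followed by a call (such as a projection like `A.hours`) is left as-is.

```python
def split_chain(expr):
    """Split 'f(a,b).g(c)' into ['f(a,b)', 'g(c)'] at top-level dots only."""
    parts, depth, in_quote, start = [], 0, False, 0
    for i, ch in enumerate(expr):
        if in_quote:
            # Simplification: escaped quotes inside a literal are not handled.
            if ch == "'":
                in_quote = False
        elif ch == "'":
            in_quote = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "." and depth == 0:
            parts.append(expr[start:i])
            start = i + 1
    parts.append(expr[start:])
    return parts


def desugar(expr):
    """Rewrite a.f(b) as f(a,b), applying the rule left to right."""
    head, *rest = split_chain(expr)
    for part in rest:
        if not part.endswith(")"):
            # Not a call (e.g. a field projection): keep the dot untouched.
            head = head + "." + part
            continue
        name, tail = part.split("(", 1)
        args = tail[:-1]  # drop the closing parenthesis
        head = f"{name}({head},{args})" if args.strip() else f"{name}({head})"
    return head


print(desugar("a.f(b)"))  # f(a,b)
```

Because the output is an ordinary nested call, everything after this rewrite would see the same expression it sees today.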


    


        

[jira] [Commented] (PIG-2490) Add UDF function chaining syntax

Posted by "Russell Jurney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13192327#comment-13192327 ] 

Russell Jurney commented on PIG-2490:
-------------------------------------

I think this is fantastic.
                


        

[jira] [Updated] (PIG-2490) Add UDF function chaining syntax

Posted by "David Ciemiewicz (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Ciemiewicz updated PIG-2490:
----------------------------------

    Description: 
Nested function/UDF calls make for very convoluted data transformations:

{code}
business1     9:00 AM - 4:00 PM
{code}

{code}
B = foreach A generate
    REGEXREPLACE(REGEXREPLACE(REGEXREPLACE(hours,' AM\\b','a'), ' PM\\b', 'p'), ' *- *', '-')
        as hours_normalized;
{code}

Yes, you could recast this as follows, but it's still rather convoluted.

{code}
B = foreach A {
    hours1 = REGEXREPLACE(hours,' AM\\b','a');
    hours2 = REGEXREPLACE(hours1,' PM\\b','p');
    hours3 = REGEXREPLACE(hours2,' *- *','-');
    generate
    hours3 as hours_normalized;
    };
{code}

I suggest an "object-style" function chaining enhancement to the grammar a la Java, JavaScript, etc.

{code}
B = foreach A generate
    REGEXREPLACE(hours,' AM\\b','a').REGEXREPLACE(' PM\\b','p').REGEXREPLACE(' *- *','-')
        as hours_normalized;
{code}

This chaining notation makes the sequence of actions much clearer, without the convoluted nesting.

In the case of the "object-method" style dot (.) notation, the result of the prior expression is just used as the first value in the tuple passed to the function call.

In other words, the following two expressions would be equivalent:

{code}
f(a,b)
a.f(b)
{code}

As such, I don't think there are any requirements to modify existing UDFs.

I think this is just a syntactic "sugar" enhancement that should be fairly trivial to implement, yet would make coding complex data transformations with Pig UDFs "cleaner".


    
