Posted to dev@pig.apache.org by "Olga Natkovich (JIRA)" <ji...@apache.org> on 2011/03/16 22:06:30 UTC

[jira] Created: (PIG-1911) Infinite loop with accumulator function in nested foreach

Infinite loop with accumulator function in nested foreach
---------------------------------------------------------

                 Key: PIG-1911
                 URL: https://issues.apache.org/jira/browse/PIG-1911
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.8.0
            Reporter: Olga Natkovich
            Assignee: Thejas M Nair
             Fix For: 0.8.0


Sample script:

register v_udf.jar;
a = load '2records' as (f1:chararray,f2:chararray);
b = group a by f1;
d = foreach b { sort = order a by f1; 
  generate org.udfs.MyCOUNT(sort) as something ; }
dump d;

This causes an infinite loop if MyCOUNT implements the Accumulator interface.

The workaround is to move the function out of the nested foreach into a separate foreach statement.


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Resolved] (PIG-1911) Infinite loop with accumulator function in nested foreach

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair resolved PIG-1911.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.9.0

Patch committed to trunk and 0.8 branch.

> Infinite loop with accumulator function in nested foreach
> ---------------------------------------------------------
>
>                 Key: PIG-1911
>                 URL: https://issues.apache.org/jira/browse/PIG-1911
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.8.0
>            Reporter: Olga Natkovich
>            Assignee: Thejas M Nair
>             Fix For: 0.9.0, 0.8.0
>
>         Attachments: PIG-1911.08.1.patch, PIG-1911.trunk.1.patch
>
>
> Sample script:
> register v_udf.jar;
> a = load '2records' as (f1:chararray,f2:chararray);
> b = group a by f1;
> d = foreach b { sort = order a by f1; 
>   generate org.udfs.MyCOUNT(sort) as something ; }
> dump d;
> This causes infinite loop if MyCOUNT implements Accumulator interface.
> The workaround is to take the function out of nested foreach into a separate foreach statement.


[jira] [Commented] (PIG-1911) Infinite loop with accumulator function in nested foreach

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017334#comment-13017334 ] 

Daniel Dai commented on PIG-1911:
---------------------------------

+1, this is definitely a fix. The accumulator will only be used if there is an accumulator UDF in the nested plan, so a fix inside the UDF should be fine.

Just to help me understand better: I think fixing PORelationToExprProject is also possible, since the accumulator only needs one extra bag in order for the UDF to invoke getValue(). So after exhausting all batches, sending one extra bag and then sending EOP would solve the problem as well. Is that right?


[jira] [Updated] (PIG-1911) Infinite loop with accumulator function in nested foreach

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1911:
-------------------------------

    Attachment: PIG-1911.trunk.1.patch


[jira] [Updated] (PIG-1911) Infinite loop with accumulator function in nested foreach

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1911:
-------------------------------

    Attachment: PIG-1911.08.1.patch

PIG-1911.08.1.patch  - patch for 0.8 branch.
Unit tests and test-patch passed.



[jira] [Commented] (PIG-1911) Infinite loop with accumulator function in nested foreach

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017485#comment-13017485 ] 

Thejas M Nair commented on PIG-1911:
------------------------------------

bq. Just to help me understand better: I think fixing PORelationToExprProject is also possible, since the accumulator only needs one extra bag in order for the UDF to invoke getValue(). So after exhausting all batches, sending one extra bag and then sending EOP would solve the problem as well. Is that right?

I looked at that option first, but the problem is that POUserFunc is expected to be called with isAccumStarted() == false and result.returnStatus == STATUS_OK. Consider a relation like this:
F = foreach IN { SBCOL = order BCOL by $1; FBCOL = filter SBCOL by 1 == 2; generate COUNT(FBCOL.$0);}
Here FBCOL will have nothing to return. With the approach you mention, the first call to the plan will be made with isAccumStarted() == true, and PORelationToExprProject will return an empty bag. Another call will be made with isAccumStarted() == false, and this time it will return STATUS_EOP. This would mean that udf.cleanup() will not get called. To avoid this, we would need to handle STATUS_EOP differently in POUserFunc.processInput() in accumulative mode. That seemed a little less clean than the approach I finally took.
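
The two call sequences described in this comment can be sketched with a toy Java model. This is not Pig code: AccumEopSketch, ToyUserFunc, Status, and cleanupReached are hypothetical names, and the only behaviour modelled is the one stated above, under the assumption that a final call whose input status is EOP does no work at all.

```java
public class AccumEopSketch {

    /** Hypothetical stand-in for the input status codes named in the comment. */
    public enum Status { OK, EOP }

    /** Toy stand-in for a UDF wrapper; this is NOT Pig's POUserFunc. */
    static class ToyUserFunc {
        long count = 0;
        Long emitted = null;
        boolean cleanedUp = false;

        /** One call into the wrapper for the current key. */
        void call(Status inputStatus, boolean accumStarted, int batchSize) {
            if (inputStatus == Status.EOP) {
                return; // assumption: an EOP input means nothing is processed
            }
            if (accumStarted) {
                count += batchSize;   // models accumulate() on one batch
            } else {
                emitted = count;      // models getValue()
                count = 0;            // models cleanup()
                cleanedUp = true;
            }
        }
    }

    /** Runs the empty-bag case and reports whether the cleanup step ran. */
    public static boolean cleanupReached(Status finalCallStatus) {
        ToyUserFunc udf = new ToyUserFunc();
        udf.call(Status.OK, true, 0);         // empty bag while accumulating
        udf.call(finalCallStatus, false, 0);  // final call for the key
        return udf.cleanedUp;
    }

    public static void main(String[] args) {
        System.out.println("final call OK:  cleanup reached = " + cleanupReached(Status.OK));
        System.out.println("final call EOP: cleanup reached = " + cleanupReached(Status.EOP));
    }
}
```

In this model the cleanup step runs only when the final call arrives with an OK status, which is the reason given above for not sending EOP on that call.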



[jira] Commented: (PIG-1911) Infinite loop with accumulator function in nested foreach

Posted by "Vivek Padmanabhan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13007815#comment-13007815 ] 

Vivek Padmanabhan commented on PIG-1911:
----------------------------------------

In this case Pig calls the getValue() and cleanup() methods in an infinite loop. The UDF source is below, just in case:
{code}
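// Package inferred from the script's org.udfs.MyCOUNT reference.
package org.udfs;

import java.io.IOException;

import org.apache.pig.Accumulator;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;

// Counts the tuples in a bag whose first field is non-null.
public class MyCOUNT extends EvalFunc<Long> implements Accumulator<Long> {
    // Running count across accumulate() calls for the current key.
    private long intermediateCount = 0L;

    // Non-accumulative path: the whole bag arrives in one call.
    @Override
    public Long exec(Tuple input) throws IOException {
        DataBag bag = (DataBag) input.get(0);
        long cnt = 0;
        for (Tuple t : bag) {
            if (t != null && t.size() > 0 && t.get(0) != null) {
                cnt++;
            }
        }
        return cnt;
    }

    @Override
    public Schema outputSchema(Schema input) {
        return new Schema(new Schema.FieldSchema(null, DataType.LONG));
    }

    // Accumulative path: called once per batch of the bag.
    @Override
    public void accumulate(Tuple b) throws IOException {
        DataBag bag = (DataBag) b.get(0);
        for (Tuple t : bag) {
            if (t != null && t.size() > 0 && t.get(0) != null) {
                intermediateCount++;
            }
        }
    }

    // Resets the state so the instance can be reused for the next key.
    @Override
    public void cleanup() {
        intermediateCount = 0L;
    }

    // Returns the count accumulated so far for the current key.
    @Override
    public Long getValue() {
        return intermediateCount;
    }
}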
{code}
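
For comparison with the infinite getValue()/cleanup() calls reported here, the following is a minimal sketch of the per-key lifecycle an accumulator UDF is written for: accumulate() once per batch, then getValue() once, then cleanup() once. It does not use Pig classes. SimpleAccumulator, CountAccumulator, and runOneKey are hypothetical stand-ins, and each batch is reduced to a list of first-field values (null meaning a null first field).

```java
import java.util.Arrays;
import java.util.List;

public class AccumulatorLifecycleSketch {

    /** Hypothetical stand-in for an accumulator contract; not Pig's interface. */
    public interface SimpleAccumulator<T> {
        void accumulate(List<String> batch);
        T getValue();
        void cleanup();
    }

    /** Counts non-null first-field values, mirroring the MyCOUNT logic above. */
    public static class CountAccumulator implements SimpleAccumulator<Long> {
        private long intermediateCount = 0L;

        @Override
        public void accumulate(List<String> batch) {
            for (String firstField : batch) {
                if (firstField != null) {
                    intermediateCount++;
                }
            }
        }

        @Override
        public Long getValue() {
            return intermediateCount;
        }

        @Override
        public void cleanup() {
            intermediateCount = 0L;
        }
    }

    /** Feeds every batch for one key, reads the result once, then resets. */
    public static long runOneKey(SimpleAccumulator<Long> udf, List<List<String>> batches) {
        for (List<String> batch : batches) {
            udf.accumulate(batch);
        }
        long result = udf.getValue();
        udf.cleanup();
        return result;
    }

    public static void main(String[] args) {
        CountAccumulator udf = new CountAccumulator();
        List<List<String>> keyA = Arrays.asList(
                Arrays.asList("a", "a"),
                Arrays.asList("a", null));
        List<List<String>> keyB = Arrays.asList(Arrays.asList("b"));
        System.out.println(runOneKey(udf, keyA)); // prints 3
        System.out.println(runOneKey(udf, keyB)); // prints 1, since cleanup() reset the count
    }
}
```

The driver loop terminates because it iterates over a finite list of batches; the bug in this issue is about Pig's own driver never reaching that end condition, which this sketch does not attempt to reproduce.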
