Posted to dev@pig.apache.org by "Olga Natkovich (JIRA)" <ji...@apache.org> on 2011/03/16 22:06:30 UTC
[jira] Created: (PIG-1911) Infinite loop with accumulator function
in nested foreach
Infinite loop with accumulator function in nested foreach
---------------------------------------------------------
Key: PIG-1911
URL: https://issues.apache.org/jira/browse/PIG-1911
Project: Pig
Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Olga Natkovich
Assignee: Thejas M Nair
Fix For: 0.8.0
Sample script:
register v_udf.jar;
a = load '2records' as (f1:chararray,f2:chararray);
b = group a by f1;
d = foreach b {
    sort = order a by f1;
    generate org.udfs.MyCOUNT(sort) as something;
}
dump d;
This causes an infinite loop if MyCOUNT implements the Accumulator interface.
The workaround is to move the function call out of the nested foreach into a separate foreach statement.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (PIG-1911) Infinite loop with accumulator
function in nested foreach
Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Thejas M Nair resolved PIG-1911.
--------------------------------
Resolution: Fixed
Fix Version/s: 0.9.0
Patch committed to trunk and 0.8 branch.
[jira] [Commented] (PIG-1911) Infinite loop with accumulator
function in nested foreach
Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017334#comment-13017334 ]
Daniel Dai commented on PIG-1911:
---------------------------------
+1, this is definitely a fix. The accumulator is only used if there is an accumulator UDF in the nested plan, so a fix inside the UDF should be fine.
Just to help me understand better: I think fixing PORelationToExprProject is also possible, since the accumulator only needs one extra bag in order for the UDF to invoke getValue(). So after exhausting all batches, sending one extra bag and then sending EOP would solve the problem as well. Is that right?
[jira] [Updated] (PIG-1911) Infinite loop with accumulator function
in nested foreach
Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Thejas M Nair updated PIG-1911:
-------------------------------
Attachment: PIG-1911.trunk.1.patch
[jira] [Updated] (PIG-1911) Infinite loop with accumulator function
in nested foreach
Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Thejas M Nair updated PIG-1911:
-------------------------------
Attachment: PIG-1911.08.1.patch
PIG-1911.08.1.patch - patch for 0.8 branch.
Unit tests and test-patch passed.
[jira] [Commented] (PIG-1911) Infinite loop with accumulator
function in nested foreach
Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017485#comment-13017485 ]
Thejas M Nair commented on PIG-1911:
------------------------------------
bq. Just help me to understand better, I think fix PORelationToExprProject is also possible. Since accumulator only need one extra bag to in order for UDF to invoke getValue(). So after exhaust all batch, send one extra bag, then send EOP, will solve the problem as well. Is that right?
I looked at that option first, but the problem is that POUserFunc is expected to be called with isAccumStarted() == false and result.returnStatus == STATUS_OK. Consider a relation like:
F = foreach IN { SBCOL = order BCOL by $1; FBCOL = filter SBCOL by 1 == 2; generate COUNT(FBCOL.$0);}
FBCOL will have nothing to return. With the approach you mention, the first call to the plan will be made with isAccumStarted() == true, and PORelationToExprProject will return an empty bag. Another call will be made with isAccumStarted() == false, and this time it will return STATUS_EOP. This would mean that udf.cleanup() will not get called. To avoid this, we would need to handle STATUS_EOP differently in POUserFunc.processInput() in accumulative mode. That seemed a little less clean than the approach I finally took.
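To illustrate why a skipped cleanup() call matters in general, here is a toy sketch in plain Java. It is not Pig code and does not model POUserFunc or PORelationToExprProject; the class SkippedCleanupSketch, the Counter class, and the run() helper are hypothetical names invented for this example. It only assumes a UDF that, like MyCOUNT in this issue, resets its running state in cleanup() and nowhere else.

```java
import java.util.Arrays;
import java.util.List;

public class SkippedCleanupSketch {

    /** Hypothetical counter whose state is reset only by cleanup(), as in MyCOUNT. */
    static class Counter {
        private long intermediateCount = 0L;

        void accumulate(List<String> batch) {
            intermediateCount += batch.size();
        }

        long getValue() {
            return intermediateCount;
        }

        void cleanup() {
            intermediateCount = 0L;
        }
    }

    /**
     * Processes the groups in order with a single Counter instance.
     * When callCleanup is false, the reset after getValue() is skipped.
     */
    static long[] run(List<List<String>> groups, boolean callCleanup) {
        Counter udf = new Counter();
        long[] results = new long[groups.size()];
        for (int i = 0; i < groups.size(); i++) {
            udf.accumulate(groups.get(i));
            results[i] = udf.getValue();
            if (callCleanup) {
                udf.cleanup();
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<List<String>> groups = Arrays.asList(
                Arrays.asList("x", "y"),
                Arrays.asList("z"));
        System.out.println(Arrays.toString(run(groups, true)));  // prints [2, 1]
        System.out.println(Arrays.toString(run(groups, false))); // prints [2, 3]
    }
}
```

Without the reset, the count from the first group leaks into the second, which is the kind of side effect that any path bypassing cleanup() would have to rule out.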
[jira] Commented: (PIG-1911) Infinite loop with accumulator
function in nested foreach
Posted by "Vivek Padmanabhan (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13007815#comment-13007815 ]
Vivek Padmanabhan commented on PIG-1911:
----------------------------------------
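In this case Pig calls the getValue() and cleanup() methods indefinitely. The UDF source is below, just in case:
{code}
package org.udfs;

import java.io.IOException;
import java.util.Iterator;

import org.apache.pig.Accumulator;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class MyCOUNT extends EvalFunc<Long> implements Accumulator<Long> {

    // Running count for the current key when Pig uses accumulative mode.
    private long intermediateCount = 0L;

    // Regular (non-accumulative) path: count tuples whose first field is non-null.
    @Override
    public Long exec(Tuple input) throws IOException {
        DataBag bag = (DataBag) input.get(0);
        Iterator<Tuple> it = bag.iterator();
        long cnt = 0;
        while (it.hasNext()) {
            Tuple t = it.next();
            if (t != null && t.size() > 0 && t.get(0) != null) {
                cnt++;
            }
        }
        return cnt;
    }

    @Override
    public Schema outputSchema(Schema input) {
        return new Schema(new Schema.FieldSchema(null, DataType.LONG));
    }

    // Accumulative path: called once per batch of tuples for the current key.
    @Override
    public void accumulate(Tuple b) throws IOException {
        DataBag bag = (DataBag) b.get(0);
        Iterator<Tuple> it = bag.iterator();
        while (it.hasNext()) {
            Tuple t = it.next();
            if (t != null && t.size() > 0 && t.get(0) != null) {
                intermediateCount += 1;
            }
        }
    }

    // Resets the state so the next key starts from zero.
    @Override
    public void cleanup() {
        intermediateCount = 0L;
    }

    // Returns the count accumulated for the current key.
    @Override
    public Long getValue() {
        return intermediateCount;
    }
}
{code}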
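For readers unfamiliar with the Accumulator contract, here is a minimal standalone sketch of the call sequence the UDF above is written for: accumulate() once per batch, then getValue() once, then cleanup() once per key. It does not depend on Pig; SimpleAccumulator, CountAccumulator, and runGroup() are hypothetical stand-ins that only borrow the three method names, and plain lists stand in for DataBags.

```java
import java.util.Arrays;
import java.util.List;

public class AccumulatorLifecycleSketch {

    /** Hypothetical stand-in for an accumulator interface; not Pig's actual type. */
    interface SimpleAccumulator<T> {
        void accumulate(List<Object> batch);
        T getValue();
        void cleanup();
    }

    /** Counts non-null values, similar in spirit to MyCOUNT. */
    static class CountAccumulator implements SimpleAccumulator<Long> {
        private long intermediateCount = 0L;
        int getValueCalls = 0; // instrumentation for the demo only
        int cleanupCalls = 0;  // instrumentation for the demo only

        @Override
        public void accumulate(List<Object> batch) {
            for (Object value : batch) {
                if (value != null) {
                    intermediateCount++;
                }
            }
        }

        @Override
        public Long getValue() {
            getValueCalls++;
            return intermediateCount;
        }

        @Override
        public void cleanup() {
            cleanupCalls++;
            intermediateCount = 0L;
        }
    }

    /** Drives one key: feed every batch, read the result once, reset once. */
    static <T> T runGroup(SimpleAccumulator<T> acc, List<List<Object>> batches) {
        for (List<Object> batch : batches) {
            acc.accumulate(batch);
        }
        T result = acc.getValue();
        acc.cleanup();
        return result;
    }

    public static void main(String[] args) {
        CountAccumulator acc = new CountAccumulator();
        List<List<Object>> batches = Arrays.asList(
                Arrays.<Object>asList("a", "b"),
                Arrays.<Object>asList("c", null));
        System.out.println(runGroup(acc, batches));                     // prints 3
        System.out.println(acc.getValueCalls + " " + acc.cleanupCalls); // prints 1 1
    }
}
```

In the behaviour reported here, getValue() and cleanup() are instead invoked over and over for the same key, so the two counters in this sketch would keep growing rather than stopping at 1.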