You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by "Santhosh Srinivasan (JIRA)" <ji...@apache.org> on 2009/03/16 22:05:50 UTC

[jira] Commented: (PIG-685) Distinct UDF progress reports

    [ https://issues.apache.org/jira/browse/PIG-685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682463#action_12682463 ] 

Santhosh Srinivasan commented on PIG-685:
-----------------------------------------

I am not able to reproduce this issue. My tests resulted in Distinct.Intermediate calling progress and the progress was reported.

> Distinct UDF progress reports
> -----------------------------
>
>                 Key: PIG-685
>                 URL: https://issues.apache.org/jira/browse/PIG-685
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 1.0.0
>         Environment: Hadoop 0.18.3 on redhat, PIG svn from feb-01
>            Reporter: Tamir Kamara
>
> When using the DISTINCT function many of the map tasks are being killed because of failure to report for 600 seconds. It seems that PIG-646 should have addressed this but I'm still seeing many errors like this:
> 2009-02-21 11:41:53,916 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output
> 2009-02-21 11:41:57,727 WARN org.apache.pig.builtin.Distinct$Intermediate: No reporter object provided to UDF org.apache.pig.builtin.Distinct$Intermediate
> 2009-02-21 11:41:57,730 WARN org.apache.pig.builtin.Distinct$Intermediate: No reporter object provided to UDF org.apache.pig.builtin.Distinct$Intermediate
> My query:
> r0 = load 'domain-org/*' as (domain:chararray, org:chararray);
> r3 = GROUP r0 BY org parallel 18;
> r4 = FOREACH r3 {
>        r5 = r0.domain;
>        r6 = distinct r5;
>        GENERATE group as org, COUNT(r6) as domains;
> }
> store r4 into 'org-domain-count';
> the source files are 21GB in total with some 800M lines, 60M distinct domains and 80K distinct orgs. Some orgs have 50M domains in them.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.