You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-dev@hadoop.apache.org by "Alejandro Abdelnur (JIRA)" <ji...@apache.org> on 2008/04/10 12:30:12 UTC

[jira] Created: (HADOOP-3229) Map OutputCollector does not report progress on writes

Map OutputCollector does not report progress on writes
------------------------------------------------------

                 Key: HADOOP-3229
                 URL: https://issues.apache.org/jira/browse/HADOOP-3229
             Project: Hadoop Core
          Issue Type: Bug
          Components: mapred
         Environment: all
            Reporter: Alejandro Abdelnur
             Fix For: 0.17.0


It seem that the collector implementation used during the map phase does not report progress on writing.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3229) Map OutputCollector does not report progress on writes

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12588105#action_12588105 ] 

Doug Cutting commented on HADOOP-3229:
--------------------------------------

> Do we need to call it for each record emitted?

Yes.  The goal is to note progress whenever user code calls collect().  If a slow-running mapper or reducer only outputs one tiny record every minute or so, and the task timeout is ten minutes, tasks should not time out.  User code should only have to explicitly report progress when they consume inputs or generate outputs at a rate lower than the task timeout.

> As demonstrated in HADOOP-2284, the overhead of setting this flag- as you assert- is slight, but not free.

That was worse, since sorting calls compare log(N) times per entry, not just once, but it's still a valid point.  If we find that setting the flag here significantly impacts performance, then we should explore changing the above contract.  But that's the contract we've advertised in the past, that either consuming or emitting an entry counted as task progress, no?

> Map OutputCollector does not report progress on writes
> ------------------------------------------------------
>
>                 Key: HADOOP-3229
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3229
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.17.0
>
>         Attachments: 3229-0.patch, HADOOP-3229.patch
>
>
> It seem that the collector implementation used during the map phase does not report progress on writing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3229) Map OutputCollector does not report progress on writes

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3229:
----------------------------------

    Attachment: 3229-1.patch

> Map OutputCollector does not report progress on writes
> ------------------------------------------------------
>
>                 Key: HADOOP-3229
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3229
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.17.0
>
>         Attachments: 3229-0.patch, 3229-1.patch, HADOOP-3229.patch
>
>
> It seem that the collector implementation used during the map phase does not report progress on writing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3229) Map OutputCollector does not report progress on writes

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12588140#action_12588140 ] 

Chris Douglas commented on HADOOP-3229:
---------------------------------------

*sigh* My mistake. I thought the Progressable was retained in FSDataOutputStream and progress was signaled with writes. It's not even signaled during the create, so adding it is completely pointless.

The reporting in CombineOutputCollector is fine, then. We still don't have a story for combiner-less spills; reporting progress for each partition written wouldn't be a bad idea, but I doubt it's required. +1 for your original patch.

> Map OutputCollector does not report progress on writes
> ------------------------------------------------------
>
>                 Key: HADOOP-3229
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3229
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.17.0
>
>         Attachments: 3229-0.patch, 3229-1.patch, 3229-1.patch, HADOOP-3229.patch
>
>
> It seem that the collector implementation used during the map phase does not report progress on writing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3229) Map OutputCollector does not report progress on writes

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3229:
----------------------------------

      Resolution: Fixed
        Assignee: Doug Cutting
    Hadoop Flags: [Reviewed]
          Status: Resolved  (was: Patch Available)

I just committed this. Thanks, Doug

> Map OutputCollector does not report progress on writes
> ------------------------------------------------------
>
>                 Key: HADOOP-3229
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3229
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Doug Cutting
>             Fix For: 0.17.0
>
>         Attachments: 3229-0.patch, 3229-1.patch, 3229-1.patch, HADOOP-3229.patch
>
>
> It seem that the collector implementation used during the map phase does not report progress on writing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3229) Map OutputCollector does not report progress on writes

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12588149#action_12588149 ] 

Hadoop QA commented on HADOOP-3229:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
http://issues.apache.org/jira/secure/attachment/12379947/HADOOP-3229.patch
against trunk revision 645773.

    @author +1.  The patch does not contain any @author tags.

    tests included -1.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no tests are needed for this patch.

    javadoc +1.  The javadoc tool did not generate any warning messages.

    javac +1.  The applied patch does not generate any new javac compiler warnings.

    release audit +1.  The applied patch does not generate any new release audit warnings.

    findbugs +1.  The patch does not introduce any new Findbugs warnings.

    core tests -1.  The patch failed core unit tests.

    contrib tests +1.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2212/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2212/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2212/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2212/console

This message is automatically generated.

> Map OutputCollector does not report progress on writes
> ------------------------------------------------------
>
>                 Key: HADOOP-3229
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3229
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.17.0
>
>         Attachments: 3229-0.patch, 3229-1.patch, 3229-1.patch, HADOOP-3229.patch
>
>
> It seem that the collector implementation used during the map phase does not report progress on writing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3229) Map OutputCollector does not report progress on writes

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-3229:
---------------------------------

    Attachment: HADOOP-3229.patch

Reporter#progress() should be cheap enough that we can afford to call it for each entry, as in the attached patch.  All it does is set a flag.  Does this measurably hurt performance?

> Map OutputCollector does not report progress on writes
> ------------------------------------------------------
>
>                 Key: HADOOP-3229
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3229
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.17.0
>
>         Attachments: 3229-0.patch, HADOOP-3229.patch
>
>
> It seem that the collector implementation used during the map phase does not report progress on writing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3229) Map OutputCollector does not report progress on writes

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3229:
----------------------------------

    Attachment: 3229-1.patch

Missed DirectMapOutputCollector

> Map OutputCollector does not report progress on writes
> ------------------------------------------------------
>
>                 Key: HADOOP-3229
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3229
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.17.0
>
>         Attachments: 3229-0.patch, 3229-1.patch, 3229-1.patch, HADOOP-3229.patch
>
>
> It seem that the collector implementation used during the map phase does not report progress on writing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3229) Map OutputCollector does not report progress on writes

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12589051#action_12589051 ] 

Hudson commented on HADOOP-3229:
--------------------------------

Integrated in Hadoop-trunk #461 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/461/])

> Map OutputCollector does not report progress on writes
> ------------------------------------------------------
>
>                 Key: HADOOP-3229
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3229
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Doug Cutting
>             Fix For: 0.17.0
>
>         Attachments: 3229-0.patch, 3229-1.patch, 3229-1.patch, HADOOP-3229.patch
>
>
> It seem that the collector implementation used during the map phase does not report progress on writing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3229) Map OutputCollector does not report progress on writes

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3229:
----------------------------------

    Attachment: 3229-0.patch

Is this causing a problem? If you're running a combiner, then progress is reported for each record. Progress is also reported during the sort. It's also reported as records are read, which should be concurrent to writing. Are you seeing timeouts?

The attached patch reports progress once per partition, and at for every 10000 records written to a partition when not using a combiner.

> Map OutputCollector does not report progress on writes
> ------------------------------------------------------
>
>                 Key: HADOOP-3229
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3229
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.17.0
>
>         Attachments: 3229-0.patch
>
>
> It seem that the collector implementation used during the map phase does not report progress on writing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3229) Map OutputCollector does not report progress on writes

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12588082#action_12588082 ] 

Devaraj Das commented on HADOOP-3229:
-------------------------------------

+1 to Doug's patch

> Map OutputCollector does not report progress on writes
> ------------------------------------------------------
>
>                 Key: HADOOP-3229
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3229
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.17.0
>
>         Attachments: 3229-0.patch, HADOOP-3229.patch
>
>
> It seem that the collector implementation used during the map phase does not report progress on writing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3229) Map OutputCollector does not report progress on writes

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12588098#action_12588098 ] 

Chris Douglas commented on HADOOP-3229:
---------------------------------------

The flag is set by TrackedRecordReader for each record given to the map. Do we need to call it for each record emitted? It's also set after each call to reduce in the combiner (not each record; my mistake). I thought the problem was that the spill doesn't report progress after the sort, without a combiner ("on writing"). As demonstrated in HADOOP-2284, the overhead of setting this flag- as you assert- is slight, but not free.

If we wanted to set this flag after each record written, then we might as well add a SequenceFile::createWriter method that takes a Progressable object and a FSDataOutputStream, and cover both cases.

> Map OutputCollector does not report progress on writes
> ------------------------------------------------------
>
>                 Key: HADOOP-3229
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3229
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.17.0
>
>         Attachments: 3229-0.patch, HADOOP-3229.patch
>
>
> It seem that the collector implementation used during the map phase does not report progress on writing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3229) Map OutputCollector does not report progress on writes

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12588113#action_12588113 ] 

Chris Douglas commented on HADOOP-3229:
---------------------------------------

bq. If a slow-running mapper or reducer only outputs one tiny record every minute or so...

Shouldn't the progress update from TrackedRecordReader take care of that? I suppose there could be a map that reads a single record and emits several, small records over several minutes... OK, sold. For most cases, that sounds like a dead/dying node that probably _should_ get killed, but even if that were true this would be the wrong way to address it.

bq. that's the contract we've advertised in the past, that either consuming or emitting an entry counted as task progress

Setting the flag on collect from the map, as above, I agree. We're missing a case by only following collect(), though: should we signal progress while we're spilling to disk, without a combiner? Passing the Reporter to the fs::create would be address tasks with/without a combiner, right?

> Map OutputCollector does not report progress on writes
> ------------------------------------------------------
>
>                 Key: HADOOP-3229
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3229
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.17.0
>
>         Attachments: 3229-0.patch, HADOOP-3229.patch
>
>
> It seem that the collector implementation used during the map phase does not report progress on writing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3229) Map OutputCollector does not report progress on writes

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12588132#action_12588132 ] 

Doug Cutting commented on HADOOP-3229:
--------------------------------------

The reporter passed to FileSystem#create() is to permit filesystems that might sometimes take a long time to open a file, like HDFS, to report progress during this time.  I don't know that it's worth passing it to the local filesystem in this case, but if we do, shouldn't we do it everywhere and not in just one place?

You dropped the reporting I added in CombineOutputCollector.  Why?


> Map OutputCollector does not report progress on writes
> ------------------------------------------------------
>
>                 Key: HADOOP-3229
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3229
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.17.0
>
>         Attachments: 3229-0.patch, 3229-1.patch, 3229-1.patch, HADOOP-3229.patch
>
>
> It seem that the collector implementation used during the map phase does not report progress on writing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3229) Map OutputCollector does not report progress on writes

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3229:
----------------------------------

    Status: Patch Available  (was: Open)

> Map OutputCollector does not report progress on writes
> ------------------------------------------------------
>
>                 Key: HADOOP-3229
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3229
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.17.0
>
>         Attachments: 3229-0.patch
>
>
> It seem that the collector implementation used during the map phase does not report progress on writing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.