Posted to common-dev@hadoop.apache.org by "Ravi Gummadi (JIRA)" <ji...@apache.org> on 2009/06/17 21:54:07 UTC

[jira] Created: (HADOOP-6072) distcp should place the file distcp_src_files in distributed cache

distcp should place the file distcp_src_files in distributed cache
------------------------------------------------------------------

                 Key: HADOOP-6072
                 URL: https://issues.apache.org/jira/browse/HADOOP-6072
             Project: Hadoop Core
          Issue Type: Improvement
          Components: tools/distcp
    Affects Versions: 0.21.0
            Reporter: Ravi Gummadi
             Fix For: 0.21.0


When a large number of files are being copied by distcp, accessing distcp_src_files becomes an issue, since all map tasks read this file. The error message seen is:

09/06/16 10:13:16 INFO mapred.JobClient: Task Id : attempt_200906040559_0110_m_003348_0, Status : FAILED
java.io.IOException: Could not obtain block: blk_-4229860619941366534_1500174
file=/mapredsystem/hadoop/mapredsystem/distcp_7fiyvq/_distcp_src_files
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1757)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1585)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1712)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at java.io.DataInputStream.readFully(DataInputStream.java:152)
        at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1450)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:43)
        at org.apache.hadoop.tools.DistCp$CopyInputFormat.getRecordReader(DistCp.java:299)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:336)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)


This could be because of HADOOP-6038 and/or HADOOP-4681.

If distcp placed this special file, distcp_src_files, in the distributed cache, that could solve the problem.
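
For illustration, a minimal sketch of the proposed approach, using the old org.apache.hadoop.filecache.DistributedCache API (the job and srcFileList names here are illustrative, not actual distcp code):

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    // Driver side: ship the source file list through the distributed
    // cache so each task reads a node-local copy instead of every map
    // hitting the same few HDFS block replicas.
    JobConf job = new JobConf(DistCp.class);
    Path srcFileList = new Path("_distcp_src_files");  // illustrative path
    DistributedCache.addCacheFile(srcFileList.toUri(), job);

    // Task side (e.g., in configure()): read the local copy.
    Path[] localCopies = DistributedCache.getLocalCacheFiles(job);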

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6072) distcp should place the file distcp_src_files in distributed cache

Posted by "Ravi Gummadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721120#action_12721120 ] 

Ravi Gummadi commented on HADOOP-6072:
--------------------------------------

Yes. Increasing replication is another solution. But since we don't know the number of parallel maps that can run, setting replication to a fixed value like 10 may not be enough in some cases, and it also adds overhead on the namenode. So the distributed cache could be better?



[jira] Commented: (HADOOP-6072) distcp should place the file distcp_src_files in distributed cache

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721437#action_12721437 ] 

Doug Cutting commented on HADOOP-6072:
--------------------------------------

> Is that still OK for the namenode's perf?

This should not be a problem for the namenode.  It would be best to write the file first with normal replication, then increase its replication, to avoid an overly-long HDFS write pipeline.

The rationale for sqrt is that a two-stage fanout is done: first from the original to the replicas, then from the replicas to the maps.  Sqrt(maps) uses approximately the same fanout factor at each stage, minimizing the number of datanode clients (the presumed bottleneck here).
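
For example, with 8,100 map tasks and replication sqrt(8100) = 90, the first stage fans out from the original copy to 90 replicas, and the second stage fans out from each replica to about 8100 / 90 = 90 maps, so each datanode serves roughly the same number of clients in both stages (the numbers here are illustrative).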



[jira] Commented: (HADOOP-6072) distcp should place the file distcp_src_files in distributed cache

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721902#action_12721902 ] 

Doug Cutting commented on HADOOP-6072:
--------------------------------------

Some comments on the patch:
 - FSShell should not be used to increase replication; just call FileSystem#setReplication() directly.
 - The replication should be set immediately after the file is closed, so that it has a chance to be replicated while duplicates are checked, before the job is submitted.





[jira] Commented: (HADOOP-6072) distcp should place the file distcp_src_files in distributed cache

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721394#action_12721394 ] 

Doug Cutting commented on HADOOP-6072:
--------------------------------------

The distributed cache would give every mapper a full copy of the file, but each mapper reads only a portion of it, so I think increased replication is more appropriate than the distributed cache.

We can find the number of map slots with JobClient#getClusterStatus().getMaxMapTasks().  We might set the replication to the square root of that.  This should not load the namenode any more heavily than any other job.
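
A minimal sketch of that approach, using the old org.apache.hadoop.mapred API (the method and the srcFileList name are illustrative, not taken from any patch):

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    // Sketch: called right after the src file list is written and closed,
    // so the file is created with normal replication first and the factor
    // is only raised afterwards.
    static void raiseSrcListReplication(JobConf job, Path srcFileList)
        throws IOException {
      // Total map slots across the cluster.
      int maxMaps = new JobClient(job).getClusterStatus().getMaxMapTasks();
      // Replication ~ sqrt(slots), never below the usual default of 3.
      short replication = (short) Math.max(3, Math.round(Math.sqrt(maxMaps)));
      srcFileList.getFileSystem(job).setReplication(srcFileList, replication);
    }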




[jira] Assigned: (HADOOP-6072) distcp should place the file distcp_src_files in distributed cache

Posted by "Ravi Gummadi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ravi Gummadi reassigned HADOOP-6072:
------------------------------------

    Assignee: Ravi Gummadi



[jira] Commented: (HADOOP-6072) distcp should place the file distcp_src_files in distributed cache

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720869#action_12720869 ] 

Doug Cutting commented on HADOOP-6072:
--------------------------------------

Another option might be to increase its replication, no?



[jira] Updated: (HADOOP-6072) distcp should place the file distcp_src_files in distributed cache

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-6072:
---------------------------------

    Comment: was deleted

(was: Some comments:
 - 'sleeptime' should be 'getSleeptime()' to be thread-safe, no?  Or maybe use an int as the sleep time, since updates to an int are atomic.
 - getNumRunningMaps() is expensive to call from each node at each interval, since reports for all tasks must be retrieved from the JT.  Better would be to just fetch the job's counters each time, since they're constant-sized, not proportional to the number of tasks.  You'd need to add a maps_completed counter, then use the difference between that and TOTAL_LAUNCHED_MAPS to calculate the number running.
 - The interval to contact the JT might be randomized a bit, so that not all tasks hit it at the same time, e.g., by adding a random value that's 10% of the specified value.
 - When InterruptedException is caught, a thread should generally exit, not simply log a warning.  If things will no longer work correctly without the thread, it should somehow cause dependent threads to fail too.
 - getNumRunningMaps() should either return a correct value or throw an exception.  If it cannot contact the JT or if the task does not know its ID, it should fail, no?)



[jira] Commented: (HADOOP-6072) distcp should place the file distcp_src_files in distributed cache

Posted by "Ravi Gummadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721411#action_12721411 ] 

Ravi Gummadi commented on HADOOP-6072:
--------------------------------------

In general, I think this file, distcp_src_files, would not consume many HDFS blocks.

With thousands of nodes in the cluster (say 4000), even the sqrt of getMaxMapTasks() would be 89 (i.e., sqrt(8000)), which is a big number for replication. Is that still OK for the namenode's perf with many distcp jobs running in parallel, each creating this file with that many replicas?




[jira] Commented: (HADOOP-6072) distcp should place the file distcp_src_files in distributed cache

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721897#action_12721897 ] 

Doug Cutting commented on HADOOP-6072:
--------------------------------------

> Would it be better if distcp set dfs.client.max.block.acquire.failures to sqrt(maxMapsOnCluster)

The hope is that, by increasing the replication, there won't be many failures: a client shouldn't have to try every replica.  So the default of 3 should probably still be fine, I think.



[jira] Updated: (HADOOP-6072) distcp should place the file distcp_src_files in distributed cache

Posted by "Ravi Gummadi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ravi Gummadi updated HADOOP-6072:
---------------------------------

    Attachment: d_replica_srcfilelist.patch

Attaching patch for increasing the replication of _distcp_src_files to sqrt(maxMapsOnCluster).

Please review and provide your comments.



[jira] Commented: (HADOOP-6072) distcp should place the file distcp_src_files in distributed cache

Posted by "Ravi Gummadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721883#action_12721883 ] 

Ravi Gummadi commented on HADOOP-6072:
--------------------------------------

Doug, would it be better if distcp set dfs.client.max.block.acquire.failures to sqrt(maxMapsOnCluster), since DFSClient.DFSInputStream.chooseDataNode() compares the number of failures against this config property (default value of 3)? But we need that only for this file, _distcp_src_files.
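
For context, roughly what that alternative would look like (a hypothetical sketch; distcp does not do this, and the setting would affect every block read by the job's DFS clients, not just _distcp_src_files):

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    JobConf job = new JobConf();
    int maxMaps = new JobClient(job).getClusterStatus().getMaxMapTasks();
    // Default is 3; DFSClient.DFSInputStream.chooseDataNode() gives up
    // once a block fetch has failed this many times.
    job.setInt("dfs.client.max.block.acquire.failures",
               (int) Math.ceil(Math.sqrt(maxMaps)));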



[jira] Commented: (HADOOP-6072) distcp should place the file distcp_src_files in distributed cache

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721896#action_12721896 ] 

Doug Cutting commented on HADOOP-6072:
--------------------------------------

Some comments:
 - 'sleeptime' should be 'getSleeptime()' to be thread-safe, no?  Or maybe use an int as the sleep time, since updates to an int are atomic.
 - getNumRunningMaps() is expensive to call from each node at each interval, since reports for all tasks must be retrieved from the JT.  Better would be to just fetch the job's counters each time, since they're constant-sized, not proportional to the number of tasks.  You'd need to add a maps_completed counter, then use the difference between that and TOTAL_LAUNCHED_MAPS to calculate the number running.
 - The interval to contact the JT might be randomized a bit, so that not all tasks hit it at the same time, e.g., by adding a random value that's 10% of the specified value (see the sketch below).
 - When InterruptedException is caught, a thread should generally exit, not simply log a warning.  If things will no longer work correctly without the thread, it should somehow cause dependent threads to fail too.
 - getNumRunningMaps() should either return a correct value or throw an exception.  If it cannot contact the JT or if the task does not know its ID, it should fail, no?
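
For the randomized-interval point, a minimal sketch (the names and the base interval are illustrative):

    import java.util.Random;

    long interval = 10000L;       // configured JT polling interval, in ms
    Random random = new Random();
    // Add up to 10% random slack so the maps don't all poll the
    // JobTracker at the same instant.
    long jitter = (long) (interval * 0.1 * random.nextDouble());
    Thread.sleep(interval + jitter);  // InterruptedException handling omitted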



[jira] Updated: (HADOOP-6072) distcp should place the file distcp_src_files in distributed cache

Posted by "Ravi Gummadi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ravi Gummadi updated HADOOP-6072:
---------------------------------

    Attachment: d_replica_srcfilelist_v1.patch

Attaching patch with suggested changes.
