Posted to common-user@hadoop.apache.org by Matt K <ma...@gmail.com> on 2015/06/16 07:11:57 UTC

tasks stuck in UNASSIGNED state

Hi all,

I'm dealing with a production issue; any help would be appreciated. I am
seeing very strange behavior in the TaskTrackers. After a TaskTracker picks up a
task, the task never comes out of the UNASSIGNED state, and it just gets
killed 10 minutes later.

2015-06-16 02:42:21,114 INFO org.apache.hadoop.mapred.TaskTracker:
LaunchTaskAction (registerTask): attempt_201506152116_0046_m_000286_0
task's state:UNASSIGNED
2015-06-16 02:52:21,805 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201506152116_0046_m_000286_0: Task
attempt_201506152116_0046_m_000286_0 failed to report status for 600
seconds. Killing!
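
For reference, the "600 seconds" in that message is the MRv1 task-status
timeout. Assuming stock defaults, it comes from the mapred.task.timeout
property (600000 ms); a quick way to confirm the effective value is a sketch
like this:

    // Sketch only -- assumes the stock MRv1 property name and default.
    // import org.apache.hadoop.mapred.JobConf;
    JobConf conf = new JobConf();
    long taskTimeoutMs = conf.getLong("mapred.task.timeout", 600000L);
    System.out.println("task timeout (ms): " + taskTimeoutMs); // 600000 ms = the 600 s above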

Normally, I would see the following in the logs:

2015-06-16 04:30:32,328 INFO org.apache.hadoop.mapred.TaskTracker: Trying
to launch : attempt_201506152116_0062_r_000004_0 which needs 1 slots

However, it doesn't get this far for these particular tasks. I am perusing
the source code here, and this doesn't seem to be possible:
http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-737/org/apache/hadoop/mapred/TaskTracker.java#TaskTracker.TaskLauncher.0tasksToLaunch

The code does something like this:

    // Producer: register the new task and queue it for the TaskLauncher thread.
    public void addToTaskQueue(LaunchTaskAction action) {
      synchronized (tasksToLaunch) {
        TaskInProgress tip = registerTask(action, this);
        tasksToLaunch.add(tip);
        tasksToLaunch.notifyAll();   // wake the TaskLauncher thread waiting below
      }
    }

The following should pick it up:

    public void run() {
      while (!Thread.interrupted()) {
        try {
          TaskInProgress tip;
          Task task;
          synchronized (tasksToLaunch) {
            // Consumer: block here until addToTaskQueue() above queues a task and notifies.
            while (tasksToLaunch.isEmpty()) {
              tasksToLaunch.wait();
            }
            //get the TIP
            tip = tasksToLaunch.remove(0);
            task = tip.getTask();
            LOG.info("Trying to launch : " + tip.getTask().getTaskID() +
                     " which needs " + task.getNumSlotsRequired() + " slots");
          }

What's even stranger is that this is happening for Map tasks only.
Reduce tasks are fine.

This is only happening on a handful of the nodes, but enough to either
slow down jobs or cause them to fail.

We're running Hadoop 2.3.0-cdh5.0.2

Thanks,

-Matt

Re: tasks stuck in UNASSIGNED state

Posted by Matt K <ma...@gmail.com>.
Looks like I'm hitting this issue:
https://issues.apache.org/jira/browse/HDFS-7005

It seems we had a networking hiccup and Hadoop didn't handle it well, with a
thread getting stuck on a socket.
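
To illustrate the failure mode (a plain java.net sketch, not the actual
DFSClient / RemoteBlockReader2 code path; the host and port below are made
up): if the remote end goes away silently mid-transfer and the reader has no
effective timeout, the read can block forever, whereas a socket timeout makes
it fail fast instead.

    // Illustration only -- not Hadoop code. Assumes a made-up DataNode address.
    // import java.io.InputStream;
    // import java.net.Socket;
    Socket s = new Socket("datanode.example.com", 50010); // hypothetical host:port
    // Without a read timeout, read() below can block indefinitely if packets
    // simply stop arriving (e.g. a network hiccup drops the peer without a FIN/RST).
    s.setSoTimeout(10000); // 10 s: read() now throws SocketTimeoutException instead of hanging
    InputStream in = s.getInputStream();
    byte[] buf = new byte[4096];
    int n = in.read(buf); // bounded by the SO_TIMEOUT above; unbounded without it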

On Tue, Jun 16, 2015 at 12:31 PM, Matt K <ma...@gmail.com> wrote:

> It's getting stuck while initializing distributed cache. Why I'm not sure.
>
> On Tue, Jun 16, 2015 at 12:18 PM, Matt K <ma...@gmail.com> wrote:
>
>> I made a little bit of progress with investigation.
>>
>> TaskTracker has 2 consumer threads to kick off tasks - one for mappers
>> and another for reducers. The mapper thread got stuck, so the consumer
>> isn't picking up any more tasks, and they are timing out after 10 minutes.
>>
>> Looks like it got stuck here:
>>
>> https://github.com/matvey14/cdh5.0.2/blob/master/src/hadoop-mapreduce1-project/src/mapred/org/apache/hadoop/mapred/TaskTracker.java#L2754
>>
>> And here's the stack-trace. From that it looks like it got stuck
>> listening on a socket.
>>
>> Thread 9335: (state = IN_NATIVE)
>>
>>  - sun.nio.ch.EPollArrayWrapper.epollWait(long, int, long, int) @bci=0
>> (Compiled frame; information may be imprecise)
>>
>>  - sun.nio.ch.EPollArrayWrapper.poll(long) @bci=18, line=269 (Compiled
>> frame)
>>
>>  - sun.nio.ch.EPollSelectorImpl.doSelect(long) @bci=28, line=79
>> (Interpreted frame)
>>
>>  - sun.nio.ch.SelectorImpl.lockAndDoSelect(long) @bci=37, line=87
>> (Interpreted frame)
>>
>>  - sun.nio.ch.SelectorImpl.select(long) @bci=30, line=98 (Compiled frame)
>>
>>  -
>> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(java.nio.channels.SelectableChannel,
>> int, long) @bci=46, line=335 (Interpreted frame)
>>
>>  - org.apache.hadoop.net.SocketIOWithTimeout.doIO(java.nio.ByteBuffer,
>> int) @bci=80, line=157 (Interpreted frame)
>>
>>  - org.apache.hadoop.net.SocketInputStream.read(java.nio.ByteBuffer)
>> @bci=6, line=161 (Interpreted frame)
>>
>>  -
>> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.readChannelFully(java.nio.channels.ReadableByteChannel,
>> java.nio.ByteBuffer) @bci=9, line=258 (Interpreted frame)
>>
>>  -
>> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(java.nio.channels.ReadableByteChannel,
>> java.io.InputStream, java.nio.ByteBuffer) @bci=6, line=209 (Interpreted
>> frame)
>>
>>  -
>> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(java.nio.channels.ReadableByteChannel,
>> java.io.InputStream) @bci=293, line=171 (Interpreted frame)
>>
>>  -
>> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(java.nio.channels.ReadableByteChannel)
>> @bci=3, line=102 (Interpreted frame)
>>
>>  - org.apache.hadoop.hdfs.RemoteBlockReader2.read(byte[], int, int)
>> @bci=27, line=138 (Interpreted frame)
>>
>>  -
>> org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(org.apache.hadoop.hdfs.BlockReader,
>> int, int, org.apache.hadoop.hdfs.DFSInputStream$ReadStatistics) @bci=7,
>> line=686 (Interpreted frame)
>>
>>  -
>> org.apache.hadoop.hdfs.DFSInputStream.readBuffer(org.apache.hadoop.hdfs.DFSInputStream$ReaderStrategy,
>> int, int, java.util.Map) @bci=14, line=742 (Interpreted frame)
>>
>>  -
>> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(org.apache.hadoop.hdfs.DFSInputStream$ReaderStrategy,
>> int, int) @bci=141, line=799 (Interpreted frame)
>>
>>  - org.apache.hadoop.hdfs.DFSInputStream.read(byte[], int, int) @bci=15,
>> line=840 (Interpreted frame)
>>
>>  - java.io.DataInputStream.read(byte[]) @bci=8, line=100 (Interpreted
>> frame)
>>
>>  - org.apache.hadoop.io.IOUtils.copyBytes(java.io.InputStream,
>> java.io.OutputStream, int) @bci=67, line=84 (Interpreted frame)
>>
>>  - org.apache.hadoop.io.IOUtils.copyBytes(java.io.InputStream,
>> java.io.OutputStream, int, boolean) @bci=3, line=52 (Interpreted frame)
>>
>>  - org.apache.hadoop.io.IOUtils.copyBytes(java.io.InputStream,
>> java.io.OutputStream, org.apache.hadoop.conf.Configuration, boolean)
>> @bci=12, line=112 (Interpreted frame)
>>
>>  - org.apache.hadoop.fs.FileUtil.copy(org.apache.hadoop.fs.FileSystem,
>> org.apache.hadoop.fs.FileStatus, org.apache.hadoop.fs.FileSystem,
>> org.apache.hadoop.fs.Path, boolean, boolean,
>> org.apache.hadoop.conf.Configuration) @bci=138, line=366 (Interpreted frame)
>>
>>  - org.apache.hadoop.fs.FileUtil.copy(org.apache.hadoop.fs.FileSystem,
>> org.apache.hadoop.fs.FileStatus, org.apache.hadoop.fs.FileSystem,
>> org.apache.hadoop.fs.Path, boolean, boolean,
>> org.apache.hadoop.conf.Configuration) @bci=95, line=356 (Interpreted frame)
>>
>>  - org.apache.hadoop.fs.FileUtil.copy(org.apache.hadoop.fs.FileSystem,
>> org.apache.hadoop.fs.FileStatus, org.apache.hadoop.fs.FileSystem,
>> org.apache.hadoop.fs.Path, boolean, boolean,
>> org.apache.hadoop.conf.Configuration) @bci=95, line=356 (Interpreted frame)
>>
>>  - org.apache.hadoop.fs.FileUtil.copy(org.apache.hadoop.fs.FileSystem,
>> org.apache.hadoop.fs.Path, org.apache.hadoop.fs.FileSystem,
>> org.apache.hadoop.fs.Path, boolean, boolean,
>> org.apache.hadoop.conf.Configuration) @bci=18, line=338 (Interpreted frame)
>>
>>  - org.apache.hadoop.fs.FileUtil.copy(org.apache.hadoop.fs.FileSystem,
>> org.apache.hadoop.fs.Path, org.apache.hadoop.fs.FileSystem,
>> org.apache.hadoop.fs.Path, boolean, org.apache.hadoop.conf.Configuration)
>> @bci=9, line=289 (Interpreted frame)
>>
>>  - org.apache.hadoop.fs.FileSystem.copyToLocalFile(boolean,
>> org.apache.hadoop.fs.Path, org.apache.hadoop.fs.Path, boolean) @bci=42,
>> line=1968 (Interpreted frame)
>>
>>  - org.apache.hadoop.fs.FileSystem.copyToLocalFile(boolean,
>> org.apache.hadoop.fs.Path, org.apache.hadoop.fs.Path) @bci=5, line=1937
>> (Interpreted frame)
>>
>>  -
>> org.apache.hadoop.fs.FileSystem.copyToLocalFile(org.apache.hadoop.fs.Path,
>> org.apache.hadoop.fs.Path) @bci=4, line=1913 (Interpreted frame)
>>
>>  -
>> org.apache.hadoop.filecache.TrackerDistributedCacheManager.downloadCacheObject(org.apache.hadoop.conf.Configuration,
>> java.net.URI, org.apache.hadoop.fs.Path, long, boolean,
>> org.apache.hadoop.fs.permission.FsPermission) @bci=288, line=445
>> (Interpreted frame)
>>
>>  -
>> org.apache.hadoop.mapred.JobLocalizer.downloadPrivateCacheObjects(org.apache.hadoop.conf.Configuration,
>> java.net.URI[], org.apache.hadoop.fs.Path[], long[], boolean[], boolean)
>> @bci=150, line=328 (Interpreted frame)
>>
>>  -
>> org.apache.hadoop.mapred.JobLocalizer.downloadPrivateCache(org.apache.hadoop.conf.Configuration)
>> @bci=18, line=346 (Interpreted frame)
>>
>>  -
>> org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(org.apache.hadoop.mapreduce.JobID,
>> org.apache.hadoop.mapred.JobConf, org.apache.hadoop.fs.Path,
>> org.apache.hadoop.fs.Path, org.apache.hadoop.mapred.TaskUmbilicalProtocol)
>> @bci=51, line=391 (Interpreted frame)
>>
>>  -
>> org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(org.apache.hadoop.mapreduce.JobID,
>> org.apache.hadoop.mapred.JobConf, org.apache.hadoop.fs.Path,
>> org.apache.hadoop.mapred.TaskUmbilicalProtocol) @bci=21, line=370
>> (Interpreted frame)
>>
>>  -
>> org.apache.hadoop.mapred.DefaultTaskController.initializeJob(java.lang.String,
>> java.lang.String, org.apache.hadoop.fs.Path, org.apache.hadoop.fs.Path,
>> org.apache.hadoop.mapred.TaskUmbilicalProtocol, java.net.InetSocketAddress)
>> @bci=115, line=231 (Interpreted frame)
>>
>>  - org.apache.hadoop.mapred.TaskTracker$4.run() @bci=163, line=1472
>> (Interpreted frame)
>>
>>  -
>> java.security.AccessController.doPrivileged(java.security.PrivilegedExceptionAction,
>> java.security.AccessControlContext) @bci=0 (Interpreted frame)
>>
>>  - javax.security.auth.Subject.doAs(javax.security.auth.Subject,
>> java.security.PrivilegedExceptionAction) @bci=42, line=415 (Interpreted
>> frame)
>>
>>  -
>> org.apache.hadoop.security.UserGroupInformation.doAs(java.security.PrivilegedExceptionAction)
>> @bci=14, line=1548 (Interpreted frame)
>>
>>  -
>> org.apache.hadoop.mapred.TaskTracker.initializeJob(org.apache.hadoop.mapred.Task,
>> org.apache.hadoop.mapred.TaskTracker$RunningJob,
>> java.net.InetSocketAddress) @bci=215, line=1447 (Interpreted frame)
>>
>>  -
>> org.apache.hadoop.mapred.TaskTracker.localizeJob(org.apache.hadoop.mapred.TaskTracker$TaskInProgress)
>> @bci=96, line=1362 (Interpreted frame)
>>
>>  -
>> org.apache.hadoop.mapred.TaskTracker.startNewTask(org.apache.hadoop.mapred.TaskTracker$TaskInProgress)
>> @bci=2, line=2790 (Interpreted frame)
>>
>>  - org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run() @bci=491,
>> line=2754 (Interpreted frame)
>> There aren't any tasks running on that node when the thread-dump was
>> taken, so I'm pretty sure it's stuck here. However, there's a socket
>> timeout of 10 seconds, so I don't understand why it didn't time out.
>>
>>
>> On Tue, Jun 16, 2015 at 7:51 AM, Matt K <ma...@gmail.com> wrote:
>>
>>> Hi Nat, I should've mentioned this before. We're running MRv1.
>>>
>>> On Tue, Jun 16, 2015 at 2:24 AM, nataraj jonnalagadda <
>>> nataraj.jonnalagadda@gmail.com> wrote:
>>>
>>>> Hey Matt,
>>>>
>>>> It's possibly due to your YARN config: YARN/Mapred ACLs, the YARN
>>>> scheduler config, or cgroups (if enabled) not being set up correctly.
>>>> We could dig in more if we have the yarn-site.xml and scheduler conf files.
>>>>
>>>>
>>>> Thanks,
>>>> Nat.
>>>>
>>>>
>>>>
>>>> On Mon, Jun 15, 2015 at 10:39 PM, Matt K <ma...@gmail.com> wrote:
>>>>
>>>>> I see there's 2 threads - one that kicks off the mappers, and another
>>>>> that kicks off reducers. The one that kicks off the mappers got stuck. It's
>>>>> not yet clear to me where it got stuck exactly.
>>>>>
>>>>> On Tue, Jun 16, 2015 at 1:11 AM, Matt K <ma...@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I'm dealing with a production issue, any help would be appreciated. I
>>>>>> am seeing very strange behavior in the TaskTrackers. After they pick up the
>>>>>> task, it never comes out of the UNASSIGNED state, and the task just gets
>>>>>> killed 10 minutes later.
>>>>>>
>>>>>> 2015-06-16 02:42:21,114 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>> LaunchTaskAction (registerTask): attempt_201506152116_0046_m_000286_0
>>>>>> task's state:UNASSIGNED
>>>>>> 2015-06-16 02:52:21,805 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>> attempt_201506152116_0046_m_000286_0: Task
>>>>>> attempt_201506152116_0046_m_000286_0 failed to report status for 600
>>>>>> seconds. Killing!
>>>>>>
>>>>>> Normally, I would see the following in the logs:
>>>>>>
>>>>>> 2015-06-16 04:30:32,328 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>> Trying to launch : attempt_201506152116_0062_r_000004_0 which needs 1 slots
>>>>>>
>>>>>> However, it doesn't get this far for these particular tasks. I am
>>>>>> perusing the source code here, and this doesn't seem to be possible:
>>>>>>
>>>>>> http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-737/org/apache/hadoop/mapred/TaskTracker.java#TaskTracker.TaskLauncher.0tasksToLaunch
>>>>>>
>>>>>> The code does something like this:
>>>>>>
>>>>>>     public void addToTaskQueue(LaunchTaskAction action) {
>>>>>>       synchronized (tasksToLaunch) {
>>>>>>         TaskInProgress tip = registerTask(action, this);
>>>>>>         tasksToLaunch.add(tip);
>>>>>>         tasksToLaunch.notifyAll();
>>>>>>       }
>>>>>>     }
>>>>>>
>>>>>> The following should pick it up:
>>>>>>
>>>>>>     public void run() {
>>>>>>       while (!Thread.interrupted()) {
>>>>>>         try {
>>>>>>           TaskInProgress tip;
>>>>>>           Task task;
>>>>>>           synchronized (tasksToLaunch) {
>>>>>>             while (tasksToLaunch.isEmpty()) {
>>>>>>               tasksToLaunch.wait();
>>>>>>             }
>>>>>>             //get the TIP
>>>>>>             tip = tasksToLaunch.remove(0);
>>>>>>             task = tip.getTask();
>>>>>>             LOG.info("Trying to launch : " + tip.getTask().getTaskID() +
>>>>>>                      " which needs " + task.getNumSlotsRequired() + " slots");
>>>>>>           }
>>>>>>
>>>>>> What's even stranger is that this is happening for Map tasks only. Reduce tasks are fine.
>>>>>>
>>>>>> This is only happening on a handful of the nodes, but enough to either slow down jobs or cause them to fail.
>>>>>>
>>>>>> We're running Hadoop 2.3.0-cdh5.0.2
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> -Matt
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> www.calcmachine.com - easy online calculator.
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> www.calcmachine.com - easy online calculator.
>>>
>>
>>
>>
>> --
>> www.calcmachine.com - easy online calculator.
>>
>
>
>
> --
> www.calcmachine.com - easy online calculator.
>



-- 
www.calcmachine.com - easy online calculator.

Re: tasks stuck in UNASSIGNED state

Posted by Matt K <ma...@gmail.com>.
It's getting stuck while initializing the distributed cache. Why, I'm not sure.
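
For context, the localization step in the stack trace quoted below boils down
to copying each cache file from HDFS to the TaskTracker's local disk, roughly
like this sketch (plain FileSystem API, not the actual JobLocalizer /
TrackerDistributedCacheManager code; the paths are made up):

    // Rough sketch of distributed-cache localization -- not the actual Hadoop code.
    // import org.apache.hadoop.conf.Configuration;
    // import org.apache.hadoop.fs.FileSystem;
    // import org.apache.hadoop.fs.Path;
    Path cacheFile = new Path("hdfs:///user/someuser/cache/lookup.dat");   // hypothetical cache entry
    Path localFile = new Path("/data/mapred/local/distcache/lookup.dat");  // hypothetical local target
    Configuration conf = new Configuration();
    FileSystem fs = cacheFile.getFileSystem(conf);    // the HDFS the cache entry lives on
    fs.copyToLocalFile(cacheFile, localFile);         // this copy is what blocks on the DFS socket read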

On Tue, Jun 16, 2015 at 12:18 PM, Matt K <ma...@gmail.com> wrote:

> I made a little bit of progress with investigation.
>
> TaskTracker has 2 consumer threads to kick off tasks - one for mappers and
> another for reducers. The mapper thread got stuck, so the consumer isn't
> picking up any more tasks, and they are timing out after 10 minutes.
>
> Looks like it got stuck here:
>
> https://github.com/matvey14/cdh5.0.2/blob/master/src/hadoop-mapreduce1-project/src/mapred/org/apache/hadoop/mapred/TaskTracker.java#L2754
>
> And here's the stack-trace. From that it looks like it got stuck listening
> on a socket.
>
> Thread 9335: (state = IN_NATIVE)
>
>  - sun.nio.ch.EPollArrayWrapper.epollWait(long, int, long, int) @bci=0
> (Compiled frame; information may be imprecise)
>
>  - sun.nio.ch.EPollArrayWrapper.poll(long) @bci=18, line=269 (Compiled
> frame)
>
>  - sun.nio.ch.EPollSelectorImpl.doSelect(long) @bci=28, line=79
> (Interpreted frame)
>
>  - sun.nio.ch.SelectorImpl.lockAndDoSelect(long) @bci=37, line=87
> (Interpreted frame)
>
>  - sun.nio.ch.SelectorImpl.select(long) @bci=30, line=98 (Compiled frame)
>
>  -
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(java.nio.channels.SelectableChannel,
> int, long) @bci=46, line=335 (Interpreted frame)
>
>  - org.apache.hadoop.net.SocketIOWithTimeout.doIO(java.nio.ByteBuffer,
> int) @bci=80, line=157 (Interpreted frame)
>
>  - org.apache.hadoop.net.SocketInputStream.read(java.nio.ByteBuffer)
> @bci=6, line=161 (Interpreted frame)
>
>  -
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.readChannelFully(java.nio.channels.ReadableByteChannel,
> java.nio.ByteBuffer) @bci=9, line=258 (Interpreted frame)
>
>  -
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(java.nio.channels.ReadableByteChannel,
> java.io.InputStream, java.nio.ByteBuffer) @bci=6, line=209 (Interpreted
> frame)
>
>  -
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(java.nio.channels.ReadableByteChannel,
> java.io.InputStream) @bci=293, line=171 (Interpreted frame)
>
>  -
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(java.nio.channels.ReadableByteChannel)
> @bci=3, line=102 (Interpreted frame)
>
>  - org.apache.hadoop.hdfs.RemoteBlockReader2.read(byte[], int, int)
> @bci=27, line=138 (Interpreted frame)
>
>  -
> org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(org.apache.hadoop.hdfs.BlockReader,
> int, int, org.apache.hadoop.hdfs.DFSInputStream$ReadStatistics) @bci=7,
> line=686 (Interpreted frame)
>
>  -
> org.apache.hadoop.hdfs.DFSInputStream.readBuffer(org.apache.hadoop.hdfs.DFSInputStream$ReaderStrategy,
> int, int, java.util.Map) @bci=14, line=742 (Interpreted frame)
>
>  -
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(org.apache.hadoop.hdfs.DFSInputStream$ReaderStrategy,
> int, int) @bci=141, line=799 (Interpreted frame)
>
>  - org.apache.hadoop.hdfs.DFSInputStream.read(byte[], int, int) @bci=15,
> line=840 (Interpreted frame)
>
>  - java.io.DataInputStream.read(byte[]) @bci=8, line=100 (Interpreted
> frame)
>
>  - org.apache.hadoop.io.IOUtils.copyBytes(java.io.InputStream,
> java.io.OutputStream, int) @bci=67, line=84 (Interpreted frame)
>
>  - org.apache.hadoop.io.IOUtils.copyBytes(java.io.InputStream,
> java.io.OutputStream, int, boolean) @bci=3, line=52 (Interpreted frame)
>
>  - org.apache.hadoop.io.IOUtils.copyBytes(java.io.InputStream,
> java.io.OutputStream, org.apache.hadoop.conf.Configuration, boolean)
> @bci=12, line=112 (Interpreted frame)
>
>  - org.apache.hadoop.fs.FileUtil.copy(org.apache.hadoop.fs.FileSystem,
> org.apache.hadoop.fs.FileStatus, org.apache.hadoop.fs.FileSystem,
> org.apache.hadoop.fs.Path, boolean, boolean,
> org.apache.hadoop.conf.Configuration) @bci=138, line=366 (Interpreted frame)
>
>  - org.apache.hadoop.fs.FileUtil.copy(org.apache.hadoop.fs.FileSystem,
> org.apache.hadoop.fs.FileStatus, org.apache.hadoop.fs.FileSystem,
> org.apache.hadoop.fs.Path, boolean, boolean,
> org.apache.hadoop.conf.Configuration) @bci=95, line=356 (Interpreted frame)
>
>  - org.apache.hadoop.fs.FileUtil.copy(org.apache.hadoop.fs.FileSystem,
> org.apache.hadoop.fs.FileStatus, org.apache.hadoop.fs.FileSystem,
> org.apache.hadoop.fs.Path, boolean, boolean,
> org.apache.hadoop.conf.Configuration) @bci=95, line=356 (Interpreted frame)
>
>  - org.apache.hadoop.fs.FileUtil.copy(org.apache.hadoop.fs.FileSystem,
> org.apache.hadoop.fs.Path, org.apache.hadoop.fs.FileSystem,
> org.apache.hadoop.fs.Path, boolean, boolean,
> org.apache.hadoop.conf.Configuration) @bci=18, line=338 (Interpreted frame)
>
>  - org.apache.hadoop.fs.FileUtil.copy(org.apache.hadoop.fs.FileSystem,
> org.apache.hadoop.fs.Path, org.apache.hadoop.fs.FileSystem,
> org.apache.hadoop.fs.Path, boolean, org.apache.hadoop.conf.Configuration)
> @bci=9, line=289 (Interpreted frame)
>
>  - org.apache.hadoop.fs.FileSystem.copyToLocalFile(boolean,
> org.apache.hadoop.fs.Path, org.apache.hadoop.fs.Path, boolean) @bci=42,
> line=1968 (Interpreted frame)
>
>  - org.apache.hadoop.fs.FileSystem.copyToLocalFile(boolean,
> org.apache.hadoop.fs.Path, org.apache.hadoop.fs.Path) @bci=5, line=1937
> (Interpreted frame)
>
>  -
> org.apache.hadoop.fs.FileSystem.copyToLocalFile(org.apache.hadoop.fs.Path,
> org.apache.hadoop.fs.Path) @bci=4, line=1913 (Interpreted frame)
>
>  -
> org.apache.hadoop.filecache.TrackerDistributedCacheManager.downloadCacheObject(org.apache.hadoop.conf.Configuration,
> java.net.URI, org.apache.hadoop.fs.Path, long, boolean,
> org.apache.hadoop.fs.permission.FsPermission) @bci=288, line=445
> (Interpreted frame)
>
>  -
> org.apache.hadoop.mapred.JobLocalizer.downloadPrivateCacheObjects(org.apache.hadoop.conf.Configuration,
> java.net.URI[], org.apache.hadoop.fs.Path[], long[], boolean[], boolean)
> @bci=150, line=328 (Interpreted frame)
>
>  -
> org.apache.hadoop.mapred.JobLocalizer.downloadPrivateCache(org.apache.hadoop.conf.Configuration)
> @bci=18, line=346 (Interpreted frame)
>
>  -
> org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(org.apache.hadoop.mapreduce.JobID,
> org.apache.hadoop.mapred.JobConf, org.apache.hadoop.fs.Path,
> org.apache.hadoop.fs.Path, org.apache.hadoop.mapred.TaskUmbilicalProtocol)
> @bci=51, line=391 (Interpreted frame)
>
>  -
> org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(org.apache.hadoop.mapreduce.JobID,
> org.apache.hadoop.mapred.JobConf, org.apache.hadoop.fs.Path,
> org.apache.hadoop.mapred.TaskUmbilicalProtocol) @bci=21, line=370
> (Interpreted frame)
>
>  -
> org.apache.hadoop.mapred.DefaultTaskController.initializeJob(java.lang.String,
> java.lang.String, org.apache.hadoop.fs.Path, org.apache.hadoop.fs.Path,
> org.apache.hadoop.mapred.TaskUmbilicalProtocol, java.net.InetSocketAddress)
> @bci=115, line=231 (Interpreted frame)
>
>  - org.apache.hadoop.mapred.TaskTracker$4.run() @bci=163, line=1472
> (Interpreted frame)
>
>  -
> java.security.AccessController.doPrivileged(java.security.PrivilegedExceptionAction,
> java.security.AccessControlContext) @bci=0 (Interpreted frame)
>
>  - javax.security.auth.Subject.doAs(javax.security.auth.Subject,
> java.security.PrivilegedExceptionAction) @bci=42, line=415 (Interpreted
> frame)
>
>  -
> org.apache.hadoop.security.UserGroupInformation.doAs(java.security.PrivilegedExceptionAction)
> @bci=14, line=1548 (Interpreted frame)
>
>  -
> org.apache.hadoop.mapred.TaskTracker.initializeJob(org.apache.hadoop.mapred.Task,
> org.apache.hadoop.mapred.TaskTracker$RunningJob,
> java.net.InetSocketAddress) @bci=215, line=1447 (Interpreted frame)
>
>  -
> org.apache.hadoop.mapred.TaskTracker.localizeJob(org.apache.hadoop.mapred.TaskTracker$TaskInProgress)
> @bci=96, line=1362 (Interpreted frame)
>
>  -
> org.apache.hadoop.mapred.TaskTracker.startNewTask(org.apache.hadoop.mapred.TaskTracker$TaskInProgress)
> @bci=2, line=2790 (Interpreted frame)
>
>  - org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run() @bci=491,
> line=2754 (Interpreted frame)
> There weren't any tasks running on that node when the thread dump was
> taken, so I'm pretty sure it's stuck here. However, there's a socket
> timeout of 10 seconds, so I don't understand why it didn't time out.
>
>
> On Tue, Jun 16, 2015 at 7:51 AM, Matt K <ma...@gmail.com> wrote:
>
>> Hi Nat, I should've mentioned this before. We're running MRv1.
>>
>> On Tue, Jun 16, 2015 at 2:24 AM, nataraj jonnalagadda <
>> nataraj.jonnalagadda@gmail.com> wrote:
>>
>>> Hey Matt,
>>>
>>> It's possibly due to your YARN config: YARN/Mapred ACLs, the YARN scheduler
>>> config, or Cgroups (if enabled) not set up correctly. We could dig in more
>>> if we have the yarn-site.xml and scheduler conf files.
>>>
>>>
>>> Thanks,
>>> Nat.
>>>
>>>
>>>
>>> On Mon, Jun 15, 2015 at 10:39 PM, Matt K <ma...@gmail.com> wrote:
>>>
>>>> I see there's 2 threads - one that kicks off the mappers, and another
>>>> that kicks off reducers. The one that kicks off the mappers got stuck. It's
>>>> not yet clear to me where it got stuck exactly.
>>>>
>>>> On Tue, Jun 16, 2015 at 1:11 AM, Matt K <ma...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I'm dealing with a production issue, any help would be appreciated. I
>>>>> am seeing very strange behavior in the TaskTrackers. After they pick up the
>>>>> task, it never comes out of the UNASSIGNED state, and the task just gets
>>>>> killed 10 minutes later.
>>>>>
>>>>> 2015-06-16 02:42:21,114 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> LaunchTaskAction (registerTask): attempt_201506152116_0046_m_000286_0
>>>>> task's state:UNASSIGNED
>>>>> 2015-06-16 02:52:21,805 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> attempt_201506152116_0046_m_000286_0: Task
>>>>> attempt_201506152116_0046_m_000286_0 failed to report status for 600
>>>>> seconds. Killing!
>>>>>
>>>>> Normally, I would see the following in the logs:
>>>>>
>>>>> 2015-06-16 04:30:32,328 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> Trying to launch : attempt_201506152116_0062_r_000004_0 which needs 1 slots
>>>>>
>>>>> However, it doesn't get this far for these particular tasks. I am
>>>>> perusing the source code here, and this doesn't seem to be possible:
>>>>>
>>>>> http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-737/org/apache/hadoop/mapred/TaskTracker.java#TaskTracker.TaskLauncher.0tasksToLaunch
>>>>>
>>>>> The code does something like this:
>>>>>
>>>>>     public void addToTaskQueue(LaunchTaskAction action) {
>>>>>       synchronized (tasksToLaunch) {
>>>>>         TaskInProgress tip = registerTask(action, this);
>>>>>         tasksToLaunch.add(tip);
>>>>>         tasksToLaunch.notifyAll();
>>>>>       }
>>>>>     }
>>>>>
>>>>> The following should pick it up:
>>>>>
>>>>>     public void run() {
>>>>>       while (!Thread.interrupted()) {
>>>>>         try {
>>>>>           TaskInProgress tip;
>>>>>           Task task;
>>>>>           synchronized (tasksToLaunch) {
>>>>>             while (tasksToLaunch.isEmpty()) {
>>>>>               tasksToLaunch.wait();
>>>>>             }
>>>>>             //get the TIP
>>>>>             tip = tasksToLaunch.remove(0);
>>>>>             task = tip.getTask();
>>>>>             LOG.info("Trying to launch : " + tip.getTask().getTaskID() +
>>>>>                      " which needs " + task.getNumSlotsRequired() + " slots");
>>>>>           }
>>>>>
>>>>> What's even stranger is that this is happening for Map tasks only. Reduce tasks are fine.
>>>>>
>>>>> This is only happening on a handful of the nodes, but enough to either slow down jobs or cause them to fail.
>>>>>
>>>>> We're running Hadoop 2.3.0-cdh5.0.2
>>>>>
>>>>> Thanks,
>>>>>
>>>>> -Matt
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> www.calcmachine.com - easy online calculator.
>>>>
>>>
>>>
>>
>>
>> --
>> www.calcmachine.com - easy online calculator.
>>
>
>
>
> --
> www.calcmachine.com - easy online calculator.
>



-- 
www.calcmachine.com - easy online calculator.

Re: tasks stuck in UNASSIGNED state

Posted by Matt K <ma...@gmail.com>.
I made a little bit of progress with the investigation.

TaskTracker has 2 consumer threads to kick off tasks - one for mappers and
another for reducers. The mapper-side thread got stuck, so it isn't picking
up any more map tasks, and those tasks time out after 10 minutes.

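In other words, each launcher behaves like a single-threaded queue consumer, so
one stuck launch blocks everything queued behind it. A rough sketch of that
pattern (illustrative code only, not the actual TaskTracker source; the names
mirror the snippet quoted further down):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Illustrative one-consumer launcher: a single thread drains the queue.
    class TaskLauncherSketch implements Runnable {
      private final BlockingQueue<Runnable> tasksToLaunch = new LinkedBlockingQueue<>();

      void addToTaskQueue(Runnable launch) {
        tasksToLaunch.add(launch);   // producer (heartbeat handling) returns immediately
      }

      @Override
      public void run() {
        while (!Thread.interrupted()) {
          try {
            Runnable next = tasksToLaunch.take();  // single consumer
            // If this call blocks (e.g. on a socket read during job
            // localization), every task queued behind it sits in UNASSIGNED
            // until it times out.
            next.run();
          } catch (InterruptedException e) {
            return;
          }
        }
      }
    }
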
Looks like it got stuck here:
https://github.com/matvey14/cdh5.0.2/blob/master/src/hadoop-mapreduce1-project/src/mapred/org/apache/hadoop/mapred/TaskTracker.java#L2754

And here's the stack trace. From that, it looks like it got stuck waiting on
a socket read.

Thread 9335: (state = IN_NATIVE)

 - sun.nio.ch.EPollArrayWrapper.epollWait(long, int, long, int) @bci=0
(Compiled frame; information may be imprecise)

 - sun.nio.ch.EPollArrayWrapper.poll(long) @bci=18, line=269 (Compiled
frame)

 - sun.nio.ch.EPollSelectorImpl.doSelect(long) @bci=28, line=79
(Interpreted frame)

 - sun.nio.ch.SelectorImpl.lockAndDoSelect(long) @bci=37, line=87
(Interpreted frame)

 - sun.nio.ch.SelectorImpl.select(long) @bci=30, line=98 (Compiled frame)

 -
org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(java.nio.channels.SelectableChannel,
int, long) @bci=46, line=335 (Interpreted frame)

 - org.apache.hadoop.net.SocketIOWithTimeout.doIO(java.nio.ByteBuffer, int)
@bci=80, line=157 (Interpreted frame)

 - org.apache.hadoop.net.SocketInputStream.read(java.nio.ByteBuffer)
@bci=6, line=161 (Interpreted frame)

 -
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.readChannelFully(java.nio.channels.ReadableByteChannel,
java.nio.ByteBuffer) @bci=9, line=258 (Interpreted frame)

 -
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(java.nio.channels.ReadableByteChannel,
java.io.InputStream, java.nio.ByteBuffer) @bci=6, line=209 (Interpreted
frame)

 -
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(java.nio.channels.ReadableByteChannel,
java.io.InputStream) @bci=293, line=171 (Interpreted frame)

 -
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(java.nio.channels.ReadableByteChannel)
@bci=3, line=102 (Interpreted frame)

 - org.apache.hadoop.hdfs.RemoteBlockReader2.read(byte[], int, int)
@bci=27, line=138 (Interpreted frame)

 -
org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(org.apache.hadoop.hdfs.BlockReader,
int, int, org.apache.hadoop.hdfs.DFSInputStream$ReadStatistics) @bci=7,
line=686 (Interpreted frame)

 -
org.apache.hadoop.hdfs.DFSInputStream.readBuffer(org.apache.hadoop.hdfs.DFSInputStream$ReaderStrategy,
int, int, java.util.Map) @bci=14, line=742 (Interpreted frame)

 -
org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(org.apache.hadoop.hdfs.DFSInputStream$ReaderStrategy,
int, int) @bci=141, line=799 (Interpreted frame)

 - org.apache.hadoop.hdfs.DFSInputStream.read(byte[], int, int) @bci=15,
line=840 (Interpreted frame)

 - java.io.DataInputStream.read(byte[]) @bci=8, line=100 (Interpreted frame)

 - org.apache.hadoop.io.IOUtils.copyBytes(java.io.InputStream,
java.io.OutputStream, int) @bci=67, line=84 (Interpreted frame)

 - org.apache.hadoop.io.IOUtils.copyBytes(java.io.InputStream,
java.io.OutputStream, int, boolean) @bci=3, line=52 (Interpreted frame)

 - org.apache.hadoop.io.IOUtils.copyBytes(java.io.InputStream,
java.io.OutputStream, org.apache.hadoop.conf.Configuration, boolean)
@bci=12, line=112 (Interpreted frame)

 - org.apache.hadoop.fs.FileUtil.copy(org.apache.hadoop.fs.FileSystem,
org.apache.hadoop.fs.FileStatus, org.apache.hadoop.fs.FileSystem,
org.apache.hadoop.fs.Path, boolean, boolean,
org.apache.hadoop.conf.Configuration) @bci=138, line=366 (Interpreted frame)

 - org.apache.hadoop.fs.FileUtil.copy(org.apache.hadoop.fs.FileSystem,
org.apache.hadoop.fs.FileStatus, org.apache.hadoop.fs.FileSystem,
org.apache.hadoop.fs.Path, boolean, boolean,
org.apache.hadoop.conf.Configuration) @bci=95, line=356 (Interpreted frame)

 - org.apache.hadoop.fs.FileUtil.copy(org.apache.hadoop.fs.FileSystem,
org.apache.hadoop.fs.FileStatus, org.apache.hadoop.fs.FileSystem,
org.apache.hadoop.fs.Path, boolean, boolean,
org.apache.hadoop.conf.Configuration) @bci=95, line=356 (Interpreted frame)

 - org.apache.hadoop.fs.FileUtil.copy(org.apache.hadoop.fs.FileSystem,
org.apache.hadoop.fs.Path, org.apache.hadoop.fs.FileSystem,
org.apache.hadoop.fs.Path, boolean, boolean,
org.apache.hadoop.conf.Configuration) @bci=18, line=338 (Interpreted frame)

 - org.apache.hadoop.fs.FileUtil.copy(org.apache.hadoop.fs.FileSystem,
org.apache.hadoop.fs.Path, org.apache.hadoop.fs.FileSystem,
org.apache.hadoop.fs.Path, boolean, org.apache.hadoop.conf.Configuration)
@bci=9, line=289 (Interpreted frame)

 - org.apache.hadoop.fs.FileSystem.copyToLocalFile(boolean,
org.apache.hadoop.fs.Path, org.apache.hadoop.fs.Path, boolean) @bci=42,
line=1968 (Interpreted frame)

 - org.apache.hadoop.fs.FileSystem.copyToLocalFile(boolean,
org.apache.hadoop.fs.Path, org.apache.hadoop.fs.Path) @bci=5, line=1937
(Interpreted frame)

 -
org.apache.hadoop.fs.FileSystem.copyToLocalFile(org.apache.hadoop.fs.Path,
org.apache.hadoop.fs.Path) @bci=4, line=1913 (Interpreted frame)

 -
org.apache.hadoop.filecache.TrackerDistributedCacheManager.downloadCacheObject(org.apache.hadoop.conf.Configuration,
java.net.URI, org.apache.hadoop.fs.Path, long, boolean,
org.apache.hadoop.fs.permission.FsPermission) @bci=288, line=445
(Interpreted frame)

 -
org.apache.hadoop.mapred.JobLocalizer.downloadPrivateCacheObjects(org.apache.hadoop.conf.Configuration,
java.net.URI[], org.apache.hadoop.fs.Path[], long[], boolean[], boolean)
@bci=150, line=328 (Interpreted frame)

 -
org.apache.hadoop.mapred.JobLocalizer.downloadPrivateCache(org.apache.hadoop.conf.Configuration)
@bci=18, line=346 (Interpreted frame)

 -
org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(org.apache.hadoop.mapreduce.JobID,
org.apache.hadoop.mapred.JobConf, org.apache.hadoop.fs.Path,
org.apache.hadoop.fs.Path, org.apache.hadoop.mapred.TaskUmbilicalProtocol)
@bci=51, line=391 (Interpreted frame)

 -
org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(org.apache.hadoop.mapreduce.JobID,
org.apache.hadoop.mapred.JobConf, org.apache.hadoop.fs.Path,
org.apache.hadoop.mapred.TaskUmbilicalProtocol) @bci=21, line=370
(Interpreted frame)

 -
org.apache.hadoop.mapred.DefaultTaskController.initializeJob(java.lang.String,
java.lang.String, org.apache.hadoop.fs.Path, org.apache.hadoop.fs.Path,
org.apache.hadoop.mapred.TaskUmbilicalProtocol, java.net.InetSocketAddress)
@bci=115, line=231 (Interpreted frame)

 - org.apache.hadoop.mapred.TaskTracker$4.run() @bci=163, line=1472
(Interpreted frame)

 -
java.security.AccessController.doPrivileged(java.security.PrivilegedExceptionAction,
java.security.AccessControlContext) @bci=0 (Interpreted frame)

 - javax.security.auth.Subject.doAs(javax.security.auth.Subject,
java.security.PrivilegedExceptionAction) @bci=42, line=415 (Interpreted
frame)

 -
org.apache.hadoop.security.UserGroupInformation.doAs(java.security.PrivilegedExceptionAction)
@bci=14, line=1548 (Interpreted frame)

 -
org.apache.hadoop.mapred.TaskTracker.initializeJob(org.apache.hadoop.mapred.Task,
org.apache.hadoop.mapred.TaskTracker$RunningJob,
java.net.InetSocketAddress) @bci=215, line=1447 (Interpreted frame)

 -
org.apache.hadoop.mapred.TaskTracker.localizeJob(org.apache.hadoop.mapred.TaskTracker$TaskInProgress)
@bci=96, line=1362 (Interpreted frame)

 -
org.apache.hadoop.mapred.TaskTracker.startNewTask(org.apache.hadoop.mapred.TaskTracker$TaskInProgress)
@bci=2, line=2790 (Interpreted frame)

 - org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run() @bci=491,
line=2754 (Interpreted frame)
There weren't any tasks running on that node when the thread dump was taken,
so I'm pretty sure it's stuck here. However, there's a socket timeout of 10
seconds, so I don't understand why it didn't time out.

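For what it's worth, here is a plain-JDK illustration (not Hadoop code) of how a
per-read socket timeout behaves; whether the 10-second value was actually applied
to this particular DataNode connection can't be told from the stack alone:

    import java.io.InputStream;
    import java.net.ServerSocket;
    import java.net.Socket;
    import java.net.SocketTimeoutException;

    // Standalone demo: the server accepts a connection but never writes,
    // and the client read uses a 10s timeout. Purely illustrative.
    public class ReadTimeoutDemo {
      public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket("localhost", server.getLocalPort())) {
          server.accept();               // connection established, no data ever sent
          client.setSoTimeout(10_000);   // applies to each individual blocking read
          InputStream in = client.getInputStream();
          try {
            in.read();                   // no bytes arrive within 10s...
          } catch (SocketTimeoutException e) {
            System.out.println("read timed out: " + e.getMessage());
          }
          // With setSoTimeout(0) the same read would block indefinitely, and a
          // thread dump of it would look just like a thread parked in epollWait.
        }
      }
    }

The timeout only bounds a single blocking read that receives no data at all, so a
read on a stream whose timeout was never set (or is 0) can hang indefinitely.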

On Tue, Jun 16, 2015 at 7:51 AM, Matt K <ma...@gmail.com> wrote:

> Hi Nat, I should've mentioned this before. We're running MRv1.
>
> On Tue, Jun 16, 2015 at 2:24 AM, nataraj jonnalagadda <
> nataraj.jonnalagadda@gmail.com> wrote:
>
>> Hey Matt,
>>
>> It's possibly due to your YARN config: YARN/Mapred ACLs, the YARN scheduler
>> config, or Cgroups (if enabled) not set up correctly. We could dig in more
>> if we have the yarn-site.xml and scheduler conf files.
>>
>>
>> Thanks,
>> Nat.
>>
>>
>>
>> On Mon, Jun 15, 2015 at 10:39 PM, Matt K <ma...@gmail.com> wrote:
>>
>>> I see there's 2 threads - one that kicks off the mappers, and another
>>> that kicks off reducers. The one that kicks off the mappers got stuck. It's
>>> not yet clear to me where it got stuck exactly.
>>>
>>> On Tue, Jun 16, 2015 at 1:11 AM, Matt K <ma...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'm dealing with a production issue, any help would be appreciated. I
>>>> am seeing very strange behavior in the TaskTrackers. After they pick up the
>>>> task, it never comes out of the UNASSIGNED state, and the task just gets
>>>> killed 10 minutes later.
>>>>
>>>> 2015-06-16 02:42:21,114 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> LaunchTaskAction (registerTask): attempt_201506152116_0046_m_000286_0
>>>> task's state:UNASSIGNED
>>>> 2015-06-16 02:52:21,805 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> attempt_201506152116_0046_m_000286_0: Task
>>>> attempt_201506152116_0046_m_000286_0 failed to report status for 600
>>>> seconds. Killing!
>>>>
>>>> Normally, I would see the following in the logs:
>>>>
>>>> 2015-06-16 04:30:32,328 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> Trying to launch : attempt_201506152116_0062_r_000004_0 which needs 1 slots
>>>>
>>>> However, it doesn't get this far for these particular tasks. I am
>>>> perusing the source code here, and this doesn't seem to be possible:
>>>>
>>>> http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-737/org/apache/hadoop/mapred/TaskTracker.java#TaskTracker.TaskLauncher.0tasksToLaunch
>>>>
>>>> The code does something like this:
>>>>
>>>>     public void addToTaskQueue(LaunchTaskAction action) {
>>>>       synchronized (tasksToLaunch) {
>>>>         TaskInProgress tip = registerTask(action, this);
>>>>         tasksToLaunch.add(tip);
>>>>         tasksToLaunch.notifyAll();
>>>>       }
>>>>     }
>>>>
>>>> The following should pick it up:
>>>>
>>>>     public void run() {
>>>>       while (!Thread.interrupted()) {
>>>>         try {
>>>>           TaskInProgress tip;
>>>>           Task task;
>>>>           synchronized (tasksToLaunch) {
>>>>             while (tasksToLaunch.isEmpty()) {
>>>>               tasksToLaunch.wait();
>>>>             }
>>>>             //get the TIP
>>>>             tip = tasksToLaunch.remove(0);
>>>>             task = tip.getTask();
>>>>             LOG.info("Trying to launch : " + tip.getTask().getTaskID() +
>>>>                      " which needs " + task.getNumSlotsRequired() + " slots");
>>>>           }
>>>>
>>>> What's even stranger is that this is happening for Map tasks only. Reduce tasks are fine.
>>>>
>>>> This is only happening on a handful of the nodes, but enough to either slow down jobs or cause them to fail.
>>>>
>>>> We're running Hadoop 2.3.0-cdh5.0.2
>>>>
>>>> Thanks,
>>>>
>>>> -Matt
>>>>
>>>>
>>>
>>>
>>> --
>>> www.calcmachine.com - easy online calculator.
>>>
>>
>>
>
>
> --
> www.calcmachine.com - easy online calculator.
>



-- 
www.calcmachine.com - easy online calculator.

Re: tasks stuck in UNASSIGNED state

Posted by Matt K <ma...@gmail.com>.
I made a little bit of progress with investigation.

TaskTracker has 2 consumer threads to kick off tasks - one for mappers and
another for reducers. The mapper thread got stuck, so the consumer isn't
picking up any more tasks, and they are timing out after 10 minutes.

Looks like it got stuck here:
https://github.com/matvey14/cdh5.0.2/blob/master/src/hadoop-mapreduce1-project/src/mapred/org/apache/hadoop/mapred/TaskTracker.java#L2754

And here's the stack trace. From that, it looks like it's blocked waiting on a
socket read.

Thread 9335: (state = IN_NATIVE)

 - sun.nio.ch.EPollArrayWrapper.epollWait(long, int, long, int) @bci=0
(Compiled frame; information may be imprecise)

 - sun.nio.ch.EPollArrayWrapper.poll(long) @bci=18, line=269 (Compiled
frame)

 - sun.nio.ch.EPollSelectorImpl.doSelect(long) @bci=28, line=79
(Interpreted frame)

 - sun.nio.ch.SelectorImpl.lockAndDoSelect(long) @bci=37, line=87
(Interpreted frame)

 - sun.nio.ch.SelectorImpl.select(long) @bci=30, line=98 (Compiled frame)

 -
org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(java.nio.channels.SelectableChannel,
int, long) @bci=46, line=335 (Interpreted frame)

 - org.apache.hadoop.net.SocketIOWithTimeout.doIO(java.nio.ByteBuffer, int)
@bci=80, line=157 (Interpreted frame)

 - org.apache.hadoop.net.SocketInputStream.read(java.nio.ByteBuffer)
@bci=6, line=161 (Interpreted frame)

 -
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.readChannelFully(java.nio.channels.ReadableByteChannel,
java.nio.ByteBuffer) @bci=9, line=258 (Interpreted frame)

 -
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(java.nio.channels.ReadableByteChannel,
java.io.InputStream, java.nio.ByteBuffer) @bci=6, line=209 (Interpreted
frame)

 -
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(java.nio.channels.ReadableByteChannel,
java.io.InputStream) @bci=293, line=171 (Interpreted frame)

 -
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(java.nio.channels.ReadableByteChannel)
@bci=3, line=102 (Interpreted frame)

 - org.apache.hadoop.hdfs.RemoteBlockReader2.read(byte[], int, int)
@bci=27, line=138 (Interpreted frame)

 -
org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(org.apache.hadoop.hdfs.BlockReader,
int, int, org.apache.hadoop.hdfs.DFSInputStream$ReadStatistics) @bci=7,
line=686 (Interpreted frame)

 -
org.apache.hadoop.hdfs.DFSInputStream.readBuffer(org.apache.hadoop.hdfs.DFSInputStream$ReaderStrategy,
int, int, java.util.Map) @bci=14, line=742 (Interpreted frame)

 -
org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(org.apache.hadoop.hdfs.DFSInputStream$ReaderStrategy,
int, int) @bci=141, line=799 (Interpreted frame)

 - org.apache.hadoop.hdfs.DFSInputStream.read(byte[], int, int) @bci=15,
line=840 (Interpreted frame)

 - java.io.DataInputStream.read(byte[]) @bci=8, line=100 (Interpreted frame)

 - org.apache.hadoop.io.IOUtils.copyBytes(java.io.InputStream,
java.io.OutputStream, int) @bci=67, line=84 (Interpreted frame)

 - org.apache.hadoop.io.IOUtils.copyBytes(java.io.InputStream,
java.io.OutputStream, int, boolean) @bci=3, line=52 (Interpreted frame)

 - org.apache.hadoop.io.IOUtils.copyBytes(java.io.InputStream,
java.io.OutputStream, org.apache.hadoop.conf.Configuration, boolean)
@bci=12, line=112 (Interpreted frame)

 - org.apache.hadoop.fs.FileUtil.copy(org.apache.hadoop.fs.FileSystem,
org.apache.hadoop.fs.FileStatus, org.apache.hadoop.fs.FileSystem,
org.apache.hadoop.fs.Path, boolean, boolean,
org.apache.hadoop.conf.Configuration) @bci=138, line=366 (Interpreted frame)

 - org.apache.hadoop.fs.FileUtil.copy(org.apache.hadoop.fs.FileSystem,
org.apache.hadoop.fs.FileStatus, org.apache.hadoop.fs.FileSystem,
org.apache.hadoop.fs.Path, boolean, boolean,
org.apache.hadoop.conf.Configuration) @bci=95, line=356 (Interpreted frame)

 - org.apache.hadoop.fs.FileUtil.copy(org.apache.hadoop.fs.FileSystem,
org.apache.hadoop.fs.FileStatus, org.apache.hadoop.fs.FileSystem,
org.apache.hadoop.fs.Path, boolean, boolean,
org.apache.hadoop.conf.Configuration) @bci=95, line=356 (Interpreted frame)

 - org.apache.hadoop.fs.FileUtil.copy(org.apache.hadoop.fs.FileSystem,
org.apache.hadoop.fs.Path, org.apache.hadoop.fs.FileSystem,
org.apache.hadoop.fs.Path, boolean, boolean,
org.apache.hadoop.conf.Configuration) @bci=18, line=338 (Interpreted frame)

 - org.apache.hadoop.fs.FileUtil.copy(org.apache.hadoop.fs.FileSystem,
org.apache.hadoop.fs.Path, org.apache.hadoop.fs.FileSystem,
org.apache.hadoop.fs.Path, boolean, org.apache.hadoop.conf.Configuration)
@bci=9, line=289 (Interpreted frame)

 - org.apache.hadoop.fs.FileSystem.copyToLocalFile(boolean,
org.apache.hadoop.fs.Path, org.apache.hadoop.fs.Path, boolean) @bci=42,
line=1968 (Interpreted frame)

 - org.apache.hadoop.fs.FileSystem.copyToLocalFile(boolean,
org.apache.hadoop.fs.Path, org.apache.hadoop.fs.Path) @bci=5, line=1937
(Interpreted frame)

 -
org.apache.hadoop.fs.FileSystem.copyToLocalFile(org.apache.hadoop.fs.Path,
org.apache.hadoop.fs.Path) @bci=4, line=1913 (Interpreted frame)

 -
org.apache.hadoop.filecache.TrackerDistributedCacheManager.downloadCacheObject(org.apache.hadoop.conf.Configuration,
java.net.URI, org.apache.hadoop.fs.Path, long, boolean,
org.apache.hadoop.fs.permission.FsPermission) @bci=288, line=445
(Interpreted frame)

 -
org.apache.hadoop.mapred.JobLocalizer.downloadPrivateCacheObjects(org.apache.hadoop.conf.Configuration,
java.net.URI[], org.apache.hadoop.fs.Path[], long[], boolean[], boolean)
@bci=150, line=328 (Interpreted frame)

 -
org.apache.hadoop.mapred.JobLocalizer.downloadPrivateCache(org.apache.hadoop.conf.Configuration)
@bci=18, line=346 (Interpreted frame)

 -
org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(org.apache.hadoop.mapreduce.JobID,
org.apache.hadoop.mapred.JobConf, org.apache.hadoop.fs.Path,
org.apache.hadoop.fs.Path, org.apache.hadoop.mapred.TaskUmbilicalProtocol)
@bci=51, line=391 (Interpreted frame)

 -
org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(org.apache.hadoop.mapreduce.JobID,
org.apache.hadoop.mapred.JobConf, org.apache.hadoop.fs.Path,
org.apache.hadoop.mapred.TaskUmbilicalProtocol) @bci=21, line=370
(Interpreted frame)

 -
org.apache.hadoop.mapred.DefaultTaskController.initializeJob(java.lang.String,
java.lang.String, org.apache.hadoop.fs.Path, org.apache.hadoop.fs.Path,
org.apache.hadoop.mapred.TaskUmbilicalProtocol, java.net.InetSocketAddress)
@bci=115, line=231 (Interpreted frame)

 - org.apache.hadoop.mapred.TaskTracker$4.run() @bci=163, line=1472
(Interpreted frame)

 -
java.security.AccessController.doPrivileged(java.security.PrivilegedExceptionAction,
java.security.AccessControlContext) @bci=0 (Interpreted frame)

 - javax.security.auth.Subject.doAs(javax.security.auth.Subject,
java.security.PrivilegedExceptionAction) @bci=42, line=415 (Interpreted
frame)

 -
org.apache.hadoop.security.UserGroupInformation.doAs(java.security.PrivilegedExceptionAction)
@bci=14, line=1548 (Interpreted frame)

 -
org.apache.hadoop.mapred.TaskTracker.initializeJob(org.apache.hadoop.mapred.Task,
org.apache.hadoop.mapred.TaskTracker$RunningJob,
java.net.InetSocketAddress) @bci=215, line=1447 (Interpreted frame)

 -
org.apache.hadoop.mapred.TaskTracker.localizeJob(org.apache.hadoop.mapred.TaskTracker$TaskInProgress)
@bci=96, line=1362 (Interpreted frame)

 -
org.apache.hadoop.mapred.TaskTracker.startNewTask(org.apache.hadoop.mapred.TaskTracker$TaskInProgress)
@bci=2, line=2790 (Interpreted frame)

 - org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run() @bci=491,
line=2754 (Interpreted frame)
There weren't any tasks running on that node when the thread dump was taken,
so I'm pretty sure it's stuck here. However, there's a socket timeout of 10
seconds, so I don't understand why the read didn't time out.
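
For comparison, here's a tiny standalone sketch (plain java.net sockets, not
the HDFS client code) of what a read looks like when no effective timeout is
applied to the connection; if the 10-second timeout never made it onto this
particular socket, the thread would sit in that epollWait frame indefinitely,
just like the trace above. The host and port here are only placeholders.

    import java.io.InputStream;
    import java.net.Socket;

    public class BlockingReadSketch {
        public static void main(String[] args) throws Exception {
            try (Socket s = new Socket("example.com", 80)) {
                s.setSoTimeout(0);        // 0 means wait forever; 10_000 would make
                                          // the read throw SocketTimeoutException
                InputStream in = s.getInputStream();
                byte[] buf = new byte[4096];
                int n = in.read(buf);     // blocks indefinitely if the peer never sends data
                System.out.println("read " + n + " bytes");
            }
        }
    }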


On Tue, Jun 16, 2015 at 7:51 AM, Matt K <ma...@gmail.com> wrote:

> Hi Nat, I should've mentioned this before. We're running MRv1.
>
> On Tue, Jun 16, 2015 at 2:24 AM, nataraj jonnalagadda <
> nataraj.jonnalagadda@gmail.com> wrote:
>
>> Hey Matt,
>>
>> Its possibly due to your YARN config... Possibly, YARN/Mapred ACLs / YARN
>> scheduler config or Cgroups not (incase enabled) set up not correctly. We
>> could dig in more if we have the yarn-site.xml and scheduler conf files.
>>
>>
>> Thanks,
>> Nat.
>>
>>
>>
>> On Mon, Jun 15, 2015 at 10:39 PM, Matt K <ma...@gmail.com> wrote:
>>
>>> I see there's 2 threads - one that kicks off the mappers, and another
>>> that kicks off reducers. The one that kicks off the mappers got stuck. It's
>>> not yet clear to me where it got stuck exactly.
>>>
>>> On Tue, Jun 16, 2015 at 1:11 AM, Matt K <ma...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'm dealing with a production issue, any help would be appreciated. I
>>>> am seeing very strange behavior in the TaskTrackers. After they pick up the
>>>> task, it never comes out of the UNASSIGNED state, and the task just gets
>>>> killed 10 minutes later.
>>>>
>>>> 2015-06-16 02:42:21,114 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> LaunchTaskAction (registerTask): attempt_201506152116_0046_m_000286_0
>>>> task's state:UNASSIGNED
>>>> 2015-06-16 02:52:21,805 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> attempt_201506152116_0046_m_000286_0: Task
>>>> attempt_201506152116_0046_m_000286_0 failed to report status for 600
>>>> seconds. Killing!
>>>>
>>>> Normally, I would see the following in the logs:
>>>>
>>>> 2015-06-16 04:30:32,328 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> Trying to launch : attempt_201506152116_0062_r_000004_0 which needs 1 slots
>>>>
>>>> However, it doesn't get this far for these particular tasks. I am
>>>> perusing the source code here, and this doesn't seem to be possible:
>>>>
>>>> http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-737/org/apache/hadoop/mapred/TaskTracker.java#TaskTracker.TaskLauncher.0tasksToLaunch
>>>>
>>>> The code does something like this:
>>>>
>>>>     public void addToTaskQueue(LaunchTaskAction action) {
>>>>       synchronized (tasksToLaunch) {
>>>>         TaskInProgress tip = registerTask(action, this);
>>>>         tasksToLaunch.add(tip);
>>>>         tasksToLaunch.notifyAll();
>>>>       }
>>>>     }
>>>>
>>>> The following should pick it up:
>>>>
>>>>     public void run() {
>>>>       while (!Thread.interrupted()) {
>>>>         try {
>>>>           TaskInProgress tip;
>>>>           Task task;
>>>>           synchronized (tasksToLaunch) {
>>>>             while (tasksToLaunch.isEmpty()) {
>>>>               tasksToLaunch.wait();
>>>>             }
>>>>             //get the TIP
>>>>             tip = tasksToLaunch.remove(0);
>>>>             task = tip.getTask();
>>>>             LOG.info("Trying to launch : " + tip.getTask().getTaskID() +
>>>>                      " which needs " + task.getNumSlotsRequired() + " slots");
>>>>           }
>>>>
>>>> What's even stranger is that this is happening for Map tasks only. Reduce tasks are fine.
>>>>
>>>> This is only happening on a handful of the nodes, but enough to either slow down jobs or cause them to fail.
>>>>
>>>> We're running Hadoop 2.3.0-cdh5.0.2
>>>>
>>>> Thanks,
>>>>
>>>> -Matt
>>>>
>>>>
>>>
>>>
>>> --
>>> www.calcmachine.com - easy online calculator.
>>>
>>
>>
>
>
> --
> www.calcmachine.com - easy online calculator.
>



-- 
www.calcmachine.com - easy online calculator.

Re: tasks stuck in UNASSIGNED state

Posted by Matt K <ma...@gmail.com>.
Hi Nat, I should've mentioned this before. We're running MRv1.

On Tue, Jun 16, 2015 at 2:24 AM, nataraj jonnalagadda <
nataraj.jonnalagadda@gmail.com> wrote:

> Hey Matt,
>
> Its possibly due to your YARN config... Possibly, YARN/Mapred ACLs / YARN
> scheduler config or Cgroups not (incase enabled) set up not correctly. We
> could dig in more if we have the yarn-site.xml and scheduler conf files.
>
>
> Thanks,
> Nat.
>
>
>
> On Mon, Jun 15, 2015 at 10:39 PM, Matt K <ma...@gmail.com> wrote:
>
>> I see there's 2 threads - one that kicks off the mappers, and another
>> that kicks off reducers. The one that kicks off the mappers got stuck. It's
>> not yet clear to me where it got stuck exactly.
>>
>> On Tue, Jun 16, 2015 at 1:11 AM, Matt K <ma...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I'm dealing with a production issue, any help would be appreciated. I am
>>> seeing very strange behavior in the TaskTrackers. After they pick up the
>>> task, it never comes out of the UNASSIGNED state, and the task just gets
>>> killed 10 minutes later.
>>>
>>> 2015-06-16 02:42:21,114 INFO org.apache.hadoop.mapred.TaskTracker:
>>> LaunchTaskAction (registerTask): attempt_201506152116_0046_m_000286_0
>>> task's state:UNASSIGNED
>>> 2015-06-16 02:52:21,805 INFO org.apache.hadoop.mapred.TaskTracker:
>>> attempt_201506152116_0046_m_000286_0: Task
>>> attempt_201506152116_0046_m_000286_0 failed to report status for 600
>>> seconds. Killing!
>>>
>>> Normally, I would see the following in the logs:
>>>
>>> 2015-06-16 04:30:32,328 INFO org.apache.hadoop.mapred.TaskTracker:
>>> Trying to launch : attempt_201506152116_0062_r_000004_0 which needs 1 slots
>>>
>>> However, it doesn't get this far for these particular tasks. I am
>>> perusing the source code here, and this doesn't seem to be possible:
>>>
>>> http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-737/org/apache/hadoop/mapred/TaskTracker.java#TaskTracker.TaskLauncher.0tasksToLaunch
>>>
>>> The code does something like this:
>>>
>>>     public void addToTaskQueue(LaunchTaskAction action) {
>>>       synchronized (tasksToLaunch) {
>>>         TaskInProgress tip = registerTask(action, this);
>>>         tasksToLaunch.add(tip);
>>>         tasksToLaunch.notifyAll();
>>>       }
>>>     }
>>>
>>> The following should pick it up:
>>>
>>>     public void run() {
>>>       while (!Thread.interrupted()) {
>>>         try {
>>>           TaskInProgress tip;
>>>           Task task;
>>>           synchronized (tasksToLaunch) {
>>>             while (tasksToLaunch.isEmpty()) {
>>>               tasksToLaunch.wait();
>>>             }
>>>             //get the TIP
>>>             tip = tasksToLaunch.remove(0);
>>>             task = tip.getTask();
>>>             LOG.info("Trying to launch : " + tip.getTask().getTaskID() +
>>>                      " which needs " + task.getNumSlotsRequired() + " slots");
>>>           }
>>>
>>> What's even stranger is that this is happening for Map tasks only. Reduce tasks are fine.
>>>
>>> This is only happening on a handful of the nodes, but enough to either slow down jobs or cause them to fail.
>>>
>>> We're running Hadoop 2.3.0-cdh5.0.2
>>>
>>> Thanks,
>>>
>>> -Matt
>>>
>>>
>>
>>
>> --
>> www.calcmachine.com - easy online calculator.
>>
>
>


-- 
www.calcmachine.com - easy online calculator.

Re: tasks stuck in UNASSIGNED state

Posted by nataraj jonnalagadda <na...@gmail.com>.
Hey Matt,

It's possibly due to your YARN config. Perhaps the YARN/Mapred ACLs, the YARN
scheduler config, or cgroups (if enabled) are not set up correctly. We
could dig in more if we had the yarn-site.xml and scheduler conf files.


Thanks,
Nat.



On Mon, Jun 15, 2015 at 10:39 PM, Matt K <ma...@gmail.com> wrote:

> I see there's 2 threads - one that kicks off the mappers, and another that
> kicks off reducers. The one that kicks off the mappers got stuck. It's not
> yet clear to me where it got stuck exactly.
>
> On Tue, Jun 16, 2015 at 1:11 AM, Matt K <ma...@gmail.com> wrote:
>
>> Hi all,
>>
>> I'm dealing with a production issue, any help would be appreciated. I am
>> seeing very strange behavior in the TaskTrackers. After they pick up the
>> task, it never comes out of the UNASSIGNED state, and the task just gets
>> killed 10 minutes later.
>>
>> 2015-06-16 02:42:21,114 INFO org.apache.hadoop.mapred.TaskTracker:
>> LaunchTaskAction (registerTask): attempt_201506152116_0046_m_000286_0
>> task's state:UNASSIGNED
>> 2015-06-16 02:52:21,805 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201506152116_0046_m_000286_0: Task
>> attempt_201506152116_0046_m_000286_0 failed to report status for 600
>> seconds. Killing!
>>
>> Normally, I would see the following in the logs:
>>
>> 2015-06-16 04:30:32,328 INFO org.apache.hadoop.mapred.TaskTracker: Trying
>> to launch : attempt_201506152116_0062_r_000004_0 which needs 1 slots
>>
>> However, it doesn't get this far for these particular tasks. I am
>> perusing the source code here, and this doesn't seem to be possible:
>>
>> http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-737/org/apache/hadoop/mapred/TaskTracker.java#TaskTracker.TaskLauncher.0tasksToLaunch
>>
>> The code does something like this:
>>
>>     public void addToTaskQueue(LaunchTaskAction action) {
>>       synchronized (tasksToLaunch) {
>>         TaskInProgress tip = registerTask(action, this);
>>         tasksToLaunch.add(tip);
>>         tasksToLaunch.notifyAll();
>>       }
>>     }
>>
>> The following should pick it up:
>>
>>     public void run() {
>>       while (!Thread.interrupted()) {
>>         try {
>>           TaskInProgress tip;
>>           Task task;
>>           synchronized (tasksToLaunch) {
>>             while (tasksToLaunch.isEmpty()) {
>>               tasksToLaunch.wait();
>>             }
>>             //get the TIP
>>             tip = tasksToLaunch.remove(0);
>>             task = tip.getTask();
>>             LOG.info("Trying to launch : " + tip.getTask().getTaskID() +
>>                      " which needs " + task.getNumSlotsRequired() + " slots");
>>           }
>>
>> What's even stranger is that this is happening for Map tasks only. Reduce tasks are fine.
>>
>> This is only happening on a handful of the nodes, but enough to either slow down jobs or cause them to fail.
>>
>> We're running Hadoop 2.3.0-cdh5.0.2
>>
>> Thanks,
>>
>> -Matt
>>
>>
>
>
> --
> www.calcmachine.com - easy online calculator.
>

Re: tasks stuck in UNASSIGNED state

Posted by nataraj jonnalagadda <na...@gmail.com>.
Hey Matt,

Its possibly due to your YARN config... Possibly, YARN/Mapred ACLs / YARN
scheduler config or Cgroups not (incase enabled) set up not correctly. We
could dig in more if we have the yarn-site.xml and scheduler conf files.


Thanks,
Nat.



On Mon, Jun 15, 2015 at 10:39 PM, Matt K <ma...@gmail.com> wrote:

> I see there's 2 threads - one that kicks off the mappers, and another that
> kicks off reducers. The one that kicks off the mappers got stuck. It's not
> yet clear to me where it got stuck exactly.
>
> On Tue, Jun 16, 2015 at 1:11 AM, Matt K <ma...@gmail.com> wrote:
>
>> Hi all,
>>
>> I'm dealing with a production issue, any help would be appreciated. I am
>> seeing very strange behavior in the TaskTrackers. After they pick up the
>> task, it never comes out of the UNASSIGNED state, and the task just gets
>> killed 10 minutes later.
>>
>> 2015-06-16 02:42:21,114 INFO org.apache.hadoop.mapred.TaskTracker:
>> LaunchTaskAction (registerTask): attempt_201506152116_0046_m_000286_0
>> task's state:UNASSIGNED
>> 2015-06-16 02:52:21,805 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201506152116_0046_m_000286_0: Task
>> attempt_201506152116_0046_m_000286_0 failed to report status for 600
>> seconds. Killing!
>>
>> Normally, I would see the following in the logs:
>>
>> 2015-06-16 04:30:32,328 INFO org.apache.hadoop.mapred.TaskTracker: Trying
>> to launch : attempt_201506152116_0062_r_000004_0 which needs 1 slots
>>
>> However, it doesn't get this far for these particular tasks. I am
>> perusing the source code here, and this doesn't seem to be possible:
>>
>> http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-737/org/apache/hadoop/mapred/TaskTracker.java#TaskTracker.TaskLauncher.0tasksToLaunch
>>
>> The code does something like this:
>>
>>     public void addToTaskQueue(LaunchTaskAction action) {
>>       synchronized (tasksToLaunch) {
>>         TaskInProgress tip = registerTask(action, this);
>>         tasksToLaunch.add(tip);
>>         tasksToLaunch.notifyAll();
>>       }
>>     }
>>
>> The following should pick it up:
>>
>>     public void run() {
>>       while (!Thread.interrupted()) {
>>         try {
>>           TaskInProgress tip;
>>           Task task;
>>           synchronized (tasksToLaunch) {
>>             while (tasksToLaunch.isEmpty()) {
>>               tasksToLaunch.wait();
>>             }
>>             //get the TIP
>>             tip = tasksToLaunch.remove(0);
>>             task = tip.getTask();
>>             LOG.info("Trying to launch : " + tip.getTask().getTaskID() +
>>                      " which needs " + task.getNumSlotsRequired() + " slots");
>>           }
>>
>> What's even stranger is that this is happening for Map tasks only. Reduce tasks are fine.
>>
>> This is only happening on a handful of the nodes, but enough to either slow down jobs or cause them to fail.
>>
>> We're running Hadoop 2.3.0-cdh5.0.2
>>
>> Thanks,
>>
>> -Matt
>>
>>
>
>
> --
> www.calcmachine.com - easy online calculator.
>

Re: tasks stuck in UNASSIGNED state

Posted by nataraj jonnalagadda <na...@gmail.com>.
Hey Matt,

Its possibly due to your YARN config... Possibly, YARN/Mapred ACLs / YARN
scheduler config or Cgroups not (incase enabled) set up not correctly. We
could dig in more if we have the yarn-site.xml and scheduler conf files.


Thanks,
Nat.



On Mon, Jun 15, 2015 at 10:39 PM, Matt K <ma...@gmail.com> wrote:

> I see there's 2 threads - one that kicks off the mappers, and another that
> kicks off reducers. The one that kicks off the mappers got stuck. It's not
> yet clear to me where it got stuck exactly.
>
> On Tue, Jun 16, 2015 at 1:11 AM, Matt K <ma...@gmail.com> wrote:
>
>> Hi all,
>>
>> I'm dealing with a production issue, any help would be appreciated. I am
>> seeing very strange behavior in the TaskTrackers. After they pick up the
>> task, it never comes out of the UNASSIGNED state, and the task just gets
>> killed 10 minutes later.
>>
>> 2015-06-16 02:42:21,114 INFO org.apache.hadoop.mapred.TaskTracker:
>> LaunchTaskAction (registerTask): attempt_201506152116_0046_m_000286_0
>> task's state:UNASSIGNED
>> 2015-06-16 02:52:21,805 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201506152116_0046_m_000286_0: Task
>> attempt_201506152116_0046_m_000286_0 failed to report status for 600
>> seconds. Killing!
>>
>> Normally, I would see the following in the logs:
>>
>> 2015-06-16 04:30:32,328 INFO org.apache.hadoop.mapred.TaskTracker: Trying
>> to launch : attempt_201506152116_0062_r_000004_0 which needs 1 slots
>>
>> However, it doesn't get this far for these particular tasks. I am
>> perusing the source code here, and this doesn't seem to be possible:
>>
>> http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-737/org/apache/hadoop/mapred/TaskTracker.java#TaskTracker.TaskLauncher.0tasksToLaunch
>>
>> The code does something like this:
>>
>>     public void addToTaskQueue(LaunchTaskAction action) {
>>       synchronized (tasksToLaunch) {
>>         TaskInProgress tip = registerTask(action, this);
>>         tasksToLaunch.add(tip);
>>         tasksToLaunch.notifyAll();
>>       }
>>     }
>>
>> The following should pick it up:
>>
>>     public void run() {
>>       while (!Thread.interrupted()) {
>>         try {
>>           TaskInProgress tip;
>>           Task task;
>>           synchronized (tasksToLaunch) {
>>             while (tasksToLaunch.isEmpty()) {
>>               tasksToLaunch.wait();
>>             }
>>             //get the TIP
>>             tip = tasksToLaunch.remove(0);
>>             task = tip.getTask();
>>             LOG.info("Trying to launch : " + tip.getTask().getTaskID() +
>>                      " which needs " + task.getNumSlotsRequired() + " slots");
>>           }
>>
>> What's even stranger is that this is happening for Map tasks only. Reduce tasks are fine.
>>
>> This is only happening on a handful of the nodes, but enough to either slow down jobs or cause them to fail.
>>
>> We're running Hadoop 2.3.0-cdh5.0.2
>>
>> Thanks,
>>
>> -Matt
>>
>>
>
>
> --
> www.calcmachine.com - easy online calculator.
>

Re: tasks stuck in UNASSIGNED state

Posted by Matt K <ma...@gmail.com>.
I see there are 2 threads - one that kicks off the mappers, and another that
kicks off the reducers. The one that kicks off the mappers got stuck. It's not
yet clear to me exactly where it got stuck.
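
To make the failure mode concrete, here is a small, self-contained toy model
of that pattern - not the real TaskTracker code, just a sketch with one queue
and one launcher thread per task type (the class and method names below are my
own). If whatever the map-side launcher does after dequeuing a task blocks
forever, every map task queued behind it stays UNASSIGNED until the 600-second
timeout kills it, while the reduce-side launcher keeps draining its own queue:

    import java.util.LinkedList;
    import java.util.List;

    // Toy model of the per-task-type launcher pattern, not the real code.
    public class LauncherSketch {

        static class TaskLauncher extends Thread {
            private final List<String> tasksToLaunch = new LinkedList<>();

            TaskLauncher(String taskType) {
                setName("TaskLauncher for " + taskType + " tasks");
            }

            void addToTaskQueue(String taskId) {
                synchronized (tasksToLaunch) {
                    tasksToLaunch.add(taskId);
                    tasksToLaunch.notifyAll();
                }
            }

            @Override
            public void run() {
                try {
                    while (!Thread.interrupted()) {
                        String task;
                        synchronized (tasksToLaunch) {
                            while (tasksToLaunch.isEmpty()) {
                                tasksToLaunch.wait();
                            }
                            task = tasksToLaunch.remove(0);
                        }
                        System.out.println(getName() + " launching " + task);
                        // In the real tracker, localization and process launch
                        // happen after the dequeue; if that blocks indefinitely,
                        // nothing else on this queue is ever picked up.
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }

        public static void main(String[] args) throws InterruptedException {
            TaskLauncher mapLauncher = new TaskLauncher("MAP");
            TaskLauncher reduceLauncher = new TaskLauncher("REDUCE");
            mapLauncher.start();
            reduceLauncher.start();

            // Attempt IDs taken from the logs earlier in this thread.
            mapLauncher.addToTaskQueue("attempt_201506152116_0046_m_000286_0");
            reduceLauncher.addToTaskQueue("attempt_201506152116_0062_r_000004_0");

            Thread.sleep(1000);
            mapLauncher.interrupt();
            reduceLauncher.interrupt();
        }
    }

A jstack of the stuck TaskTracker process should show which of the two
launcher threads is blocked, and in what call.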

On Tue, Jun 16, 2015 at 1:11 AM, Matt K <ma...@gmail.com> wrote:

> Hi all,
>
> I'm dealing with a production issue, any help would be appreciated. I am
> seeing very strange behavior in the TaskTrackers. After they pick up the
> task, it never comes out of the UNASSIGNED state, and the task just gets
> killed 10 minutes later.
>
> 2015-06-16 02:42:21,114 INFO org.apache.hadoop.mapred.TaskTracker:
> LaunchTaskAction (registerTask): attempt_201506152116_0046_m_000286_0
> task's state:UNASSIGNED
> 2015-06-16 02:52:21,805 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201506152116_0046_m_000286_0: Task
> attempt_201506152116_0046_m_000286_0 failed to report status for 600
> seconds. Killing!
>
> Normally, I would see the following in the logs:
>
> 2015-06-16 04:30:32,328 INFO org.apache.hadoop.mapred.TaskTracker: Trying
> to launch : attempt_201506152116_0062_r_000004_0 which needs 1 slots
>
> However, it doesn't get this far for these particular tasks. I am perusing
> the source code here, and this doesn't seem to be possible:
>
> http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-737/org/apache/hadoop/mapred/TaskTracker.java#TaskTracker.TaskLauncher.0tasksToLaunch
>
> The code does something like this:
>
>     public void addToTaskQueue(LaunchTaskAction action) {
>       synchronized (tasksToLaunch) {
>         TaskInProgress tip = registerTask(action, this);
>         tasksToLaunch.add(tip);
>         tasksToLaunch.notifyAll();
>       }
>     }
>
> The following should pick it up:
>
>     public void run() {
>       while (!Thread.interrupted()) {
>         try {
>           TaskInProgress tip;
>           Task task;
>           synchronized (tasksToLaunch) {
>             while (tasksToLaunch.isEmpty()) {
>               tasksToLaunch.wait();
>             }
>             //get the TIP
>             tip = tasksToLaunch.remove(0);
>             task = tip.getTask();
>             LOG.info("Trying to launch : " + tip.getTask().getTaskID() +
>                      " which needs " + task.getNumSlotsRequired() + " slots");
>           }
>
> What's even stranger is that this is happening for Map tasks only. Reduce tasks are fine.
>
> This is only happening on a handful of the nodes, but enough to either slow down jobs or cause them to fail.
>
> We're running Hadoop 2.3.0-cdh5.0.2
>
> Thanks,
>
> -Matt
>
>


-- 
www.calcmachine.com - easy online calculator.
