Posted to user@hadoop.apache.org by hadoop hive <ha...@gmail.com> on 2014/08/02 17:34:22 UTC

Re: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)

Can you check the ulimit for your user? That might be causing this.
On Aug 2, 2014 8:54 PM, "Ana Gillan" <an...@gmail.com> wrote:

> Hi everyone,
>
> I am having an issue with MapReduce jobs running through Hive being killed
> after 600s timeouts and with very simple jobs taking over 3 hours (or just
> failing) for a set of files with a compressed size of only 1-2gb. I will
> try and provide as much information as I can here, so if someone can help,
> that would be really great.
>
> I have a cluster of 7 nodes (1 master, 6 slaves) with the following config:
>
> • Master node:
>
> – 2 x Intel Xeon 6-core E5-2620v2 @ 2.1GHz
>
> – 64GB DDR3 SDRAM
>
> – 8 x 2TB SAS 600 hard drive (arranged as RAID 1 and RAID 5)
>
> • Slave nodes (each):
>
> – Intel Xeon 4-core E3-1220v3 @ 3.1GHz
>
> – 32GB DDR3 SDRAM
>
> – 4 x 2TB SATA-3 hard drive
>
> • Operating system on all nodes: openSUSE Linux 13.1
>
>  We have the Apache BigTop package version 0.7, with Hadoop version
> 2.0.6-alpha and Hive version 0.11.
> YARN has been configured as per these recommendations:
> http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/
>
> I also set the following additional settings before running jobs:
> set yarn.nodemanager.resource.cpu-vcores=4;
> set mapred.tasktracker.map.tasks.maximum=4;
> set hive.hadoop.supports.splittable.combineinputformat=true;
> set hive.merge.mapredfiles=true;
>
> No one else uses this cluster while I am working.
>
> What I’m trying to do:
> I have a bunch of XML files on HDFS, which I am reading into Hive using
> this SerDe https://github.com/dvasilen/Hive-XML-SerDe. I then want to
> create a series of tables from these files and finally run a Python script
> on one of them to perform some scientific calculations. The files are
> .xml.gz format and (uncompressed) are only about 4mb in size each. hive.input.format
> is set to org.apache.hadoop.hive.ql.io.CombineHiveInputFormat so as to
> avoid the “small files problem.”
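>
> For reference, those settings can also be passed on the command line when
> launching the Hive CLI. This is only a rough sketch: the split-size values
> are illustrative guesses and my_query.hql is a placeholder for my actual
> script.
>
> # pass the combine input format plus (illustrative) split-size bounds
> hive --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat \
>      --hiveconf mapred.max.split.size=268435456 \
>      --hiveconf mapred.min.split.size.per.node=134217728 \
>      -f my_query.hql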
>
> Problems:
> My HQL statements work perfectly for up to 1000 of these files. Even for
> much larger numbers, doing select * works fine, which means the files are
> being read properly, but if I do something as simple as selecting just one
> column from the whole table for a larger number of files, containers start
> being killed and jobs fail with this error in the container logs:
>
> 2014-08-02 14:51:45,137 ERROR [Thread-3] org.apache.hadoop.hdfs.DFSClient:
> Failed to close file
> /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-ext-10001/_tmp.000000_0
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> No lease on
> /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-ext-10001/_tmp.000000_0:
> File does not exist. Holder
> DFSClient_attempt_1403771939632_0402_m_000000_0_-1627633686_1 does not have
> any open files.
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2398)
>
> Killed jobs show the above and also the following message:
> AttemptID:attempt_1403771939632_0402_m_000000_0 Timed out after 600
> secsContainer killed by the ApplicationMaster.
>
> Also, in the node logs, I get a lot of pings like this:
> INFO [IPC Server handler 17 on 40961]
> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
> attempt_1403771939632_0362_m_000002_0
>
> For 5000 files (1gb compressed), the selection of a single column
> finishes, but takes over 3 hours. For 10,000 files, the job hangs on about
> 4% map and then errors out.
>
> While the jobs are running, I notice that the containers are not evenly
> distributed across the cluster. Some nodes lie idle, while the application
> master node runs 7 containers, maxing out the 28gb of RAM allocated
> to Hadoop on each slave node.
>
> This is the output of netstat -i while the column selection is running:
>
> Kernel Interface table
>
> Iface   MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
> eth0   1500   0 79515196      0 2265807     0 45694758      0      0      0 BMRU
> eth1   1500   0 77410508      0      0      0 40815746      0      0      0 BMRU
> lo    65536   0 16593808      0      0      0 16593808      0      0      0 LRU
>
>
>
>
> Are there some settings I am missing that mean the cluster isn’t
> processing this data as efficiently as it can?
>
> I am very new to Hadoop and there are so many logs, etc, that
> troubleshooting can be a bit overwhelming. Where else should I be looking
> to try and diagnose what is wrong?
>
> Thanks in advance for any help you can give!
>
> Kind regards,
> Ana
>
>

Re: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)

Posted by Ana Gillan <an...@gmail.com>.
Hi,

Unfortunately, after I set my user's ulimit -n to 65536, I still get the
same bad performance, killed containers and errors as before.

I collected together a bunch of logs around the moment when the containers
are being killed (application master log, killed container log, hadoop-hdfs
logs, hadoop-yarn logs) and this is what's happening, in order. Is the
PrivilegedActionException significant? Is my user missing from some
required group?

Thanks in advance!
Ana

2014-08-04 12:21:47,868 INFO [IPC Server handler 4 on 49784]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
attempt_1403771939632_0413_m_000006_0
2014-08-04 12:21:47,979 INFO [IPC Server handler 5 on 49784]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
attempt_1403771939632_0413_m_000000_0
2014-08-04 12:21:48,102 INFO [IPC Server handler 6 on 49784]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
attempt_1403771939632_0413_m_000002_0 is : 0.061734322
2014-08-04 12:21:48,102 INFO [IPC Server handler 6 on 49784]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from
attempt_1403771939632_0413_m_000002_0
2014-08-04 12:21:48,110 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
allocateBlock: 
/tmp/hive-zslf023/hive_2014-08-04_12-16-12_475_8240516609553931504/_task_tmp
.-ext-10001/_tmp.000006_1. BP-1198908146-10.0.0.11-1399888065184
blk_-300655159790361685_67008{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1,
replicas=[ReplicaUnderConstruction[10.0.0.44:50010|RBW],
ReplicaUnderConstruction[10.0.0.43:50010|RBW],
ReplicaUnderConstruction[10.0.0.40:50010|RBW]]}
2014-08-04 12:21:48,145 INFO BlockStateChange: BLOCK* addStoredBlock:
blockMap updated: 10.0.0.40:50010 is added to
blk_-300655159790361685_67008{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1,
replicas=[ReplicaUnderConstruction[10.0.0.44:50010|RBW],
ReplicaUnderConstruction[10.0.0.43:50010|RBW],
ReplicaUnderConstruction[10.0.0.40:50010|RBW]]} size 0
2014-08-04 12:21:48,145 INFO BlockStateChange: BLOCK* addStoredBlock:
blockMap updated: 10.0.0.43:50010 is added to
blk_-300655159790361685_67008{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1,
replicas=[ReplicaUnderConstruction[10.0.0.44:50010|RBW],
ReplicaUnderConstruction[10.0.0.43:50010|RBW],
ReplicaUnderConstruction[10.0.0.40:50010|RBW]]} size 0
2014-08-04 12:21:48,146 INFO BlockStateChange: BLOCK* addStoredBlock:
blockMap updated: 10.0.0.44:50010 is added to
blk_-300655159790361685_67008{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1,
replicas=[ReplicaUnderConstruction[10.0.0.44:50010|RBW],
ReplicaUnderConstruction[10.0.0.43:50010|RBW],
ReplicaUnderConstruction[10.0.0.40:50010|RBW]]} size 0
2014-08-04 12:21:48,147 INFO org.apache.hadoop.hdfs.StateChange: DIR*
completeFile: 
/tmp/hive-zslf023/hive_2014-08-04_12-16-12_475_8240516609553931504/_task_tmp
.-ext-10001/_tmp.000006_1 is closed by
DFSClient_attempt_1403771939632_0413_m_000006_1_-1142346950_1
2014-08-04 12:21:48,147 INFO
org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions:
422 Total time for transactions(ms): 5 Number of transactions batched in
Syncs: 0 Number of syncs: 186 SyncTimes(ms): 204
2014-08-04 12:21:48,155 INFO [IPC Server handler 7 on 49784]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
attempt_1403771939632_0413_m_000006_1 is : 0.93497133
2014-08-04 12:21:48,155 INFO [IPC Server handler 7 on 49784]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from
attempt_1403771939632_0413_m_000006_1
2014-08-04 12:21:48,176 INFO [IPC Server handler 8 on 49784]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
attempt_1403771939632_0413_m_000006_1 is : 1.0
2014-08-04 12:21:48,176 INFO [IPC Server handler 8 on 49784]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from
attempt_1403771939632_0413_m_000006_1
2014-08-04 12:21:48,177 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
attempt_1403771939632_0413_m_000006_1 TaskAttempt Transitioned from RUNNING
to SUCCESS_CONTAINER_CLEANUP
2014-08-04 12:21:48,177 INFO [IPC Server handler 9 on 49784]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Done acknowledgement from
attempt_1403771939632_0413_m_000006_1
2014-08-04 12:21:48,178 INFO [ContainerLauncher #4]
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING
attempt_1403771939632_0413_m_000006_1
2014-08-04 12:21:48,178 INFO [ContainerLauncher #4]
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl:
Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container
container_1403771939632_0413_01_000016 taskAttempt
attempt_1403771939632_0413_m_000006_1
2014-08-04 12:21:48,195 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
attempt_1403771939632_0413_m_000006_1 TaskAttempt Transitioned from
SUCCESS_CONTAINER_CLEANUP to SUCCEEDED
2014-08-04 12:21:48,196 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 2
2014-08-04 12:21:48,196 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
attempt_1403771939632_0413_m_000006_0 TaskAttempt Transitioned from RUNNING
to KILL_CONTAINER_CLEANUP
2014-08-04 12:21:48,196 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Issuing kill to other
attempt attempt_1403771939632_0413_m_000006_0
2014-08-04 12:21:48,196 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with
attempt attempt_1403771939632_0413_m_000006_1
2014-08-04 12:21:48,196 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl:
task_1403771939632_0413_m_000006 Task Transitioned from RUNNING to SUCCEEDED
2014-08-04 12:21:48,197 INFO [ContainerLauncher #7]
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING
attempt_1403771939632_0413_m_000006_0
2014-08-04 12:21:48,197 INFO [ContainerLauncher #7]
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl:
Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container
container_1403771939632_0413_01_000009 taskAttempt
attempt_1403771939632_0413_m_000006_0
2014-08-04 12:21:48,219 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
attempt_1403771939632_0413_m_000006_0 TaskAttempt Transitioned from
KILL_CONTAINER_CLEANUP to KILL_TASK_CLEANUP
2014-08-04 12:21:48,230 INFO [CommitterEvent Processor #1]
org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing
the event EventType: TASK_ABORT
2014-08-04 12:21:48,231 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
attempt_1403771939632_0413_m_000006_0 TaskAttempt Transitioned from
KILL_TASK_CLEANUP to KILLED
2014-08-04 12:21:48,318 INFO [IPC Server handler 10 on 49784]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
attempt_1403771939632_0413_m_000002_1 is : 0.047582902
2014-08-04 12:21:48,318 INFO [IPC Server handler 10 on 49784]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from
attempt_1403771939632_0413_m_000002_1
2014-08-04 12:21:48,370 ERROR
org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException
as:zslf023 (auth:SIMPLE)
cause:org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease
on 
/tmp/hive-zslf023/hive_2014-08-04_12-16-12_475_8240516609553931504/_task_tmp
.-ext-10001/_tmp.000006_0: File does not exist. Holder
DFSClient_attempt_1403771939632_0413_m_000006_0_-820182675_1 does not have
any open files.
2014-08-04 12:21:48,370 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 9 on 8020, call
org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from
10.0.0.42:41375: error:
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on
/tmp/hive-zslf023/hive_2014-08-04_12-16-12_475_8240516609553931504/_task_tmp
.-ext-10001/_tmp.000006_0: File does not exist. Holder
DFSClient_attempt_1403771939632_0413_m_000006_0_-820182675_1 does not have
any open files.
2014-08-04 12:21:48,378 WARN [Thread-8] org.apache.hadoop.hdfs.DFSClient:
DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode
.LeaseExpiredException): No lease on
/tmp/hive-zslf023/hive_2014-08-04_12-16-12_475_8240516609553931504/_task_tmp
.-ext-10001/_tmp.000006_0: File does not exist. Holder
DFSClient_attempt_1403771939632_0413_m_000006_0_-820182675_1 does not have
any open files.
2014-08-04 12:21:48,379 ERROR [Thread-3] org.apache.hadoop.hdfs.DFSClient:
Failed to close file
/tmp/hive-zslf023/hive_2014-08-04_12-16-12_475_8240516609553931504/_task_tmp
.-ext-10001/_tmp.000006_0
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode
.LeaseExpiredException): No lease on
/tmp/hive-zslf023/hive_2014-08-04_12-16-12_475_8240516609553931504/_task_tmp
.-ext-10001/_tmp.000006_0: File does not exist. Holder
DFSClient_attempt_1403771939632_0413_m_000006_0_-820182675_1 does not have
any open files.
2014-08-04 12:21:48,383 INFO [IPC Server handler 11 on 49784]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
attempt_1403771939632_0413_m_000000_1
2014-08-04 12:21:48,620 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before
Scheduling: PendingReds:0 ScheduledMaps:0 ScheduledReds:0 AssignedMaps:14
AssignedReds:0 CompletedMaps:2 CompletedReds:0 ContAlloc:15 ContRel:0
HostLocal:5 RackLocal:10
2014-08-04 12:21:49,198 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_1403771939632_0413_01_000016 Container Transitioned from RUNNING
to COMPLETED
2014-08-04 12:21:49,199 INFO
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=zslf023
OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS
APPID=application_1403771939632_0413
CONTAINERID=container_1403771939632_0413_01_000016
2014-08-04 12:21:49,199 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySch
eduler: Application appattempt_1403771939632_0413_000001 released container
container_1403771939632_0413_01_000016 on node: host:
slave4.hadoop.local:57340 #containers=1 available=24576 used=4096 with
event: FINISHED
2014-08-04 12:21:49,199 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
completedContainer container=Container: [ContainerId:
container_1403771939632_0413_01_000016, NodeId: slave4.hadoop.local:57340,
NodeHttpAddress: slave4.hadoop.local:8042, Resource: <memory:4096,
vCores:1>, Priority: 20, State: NEW, Token: null, Status: container_id {,
app_attempt_id {, application_id {, id: 413, cluster_timestamp:
1403771939632, }, attemptId: 1, }, id: 16, }, state: C_COMPLETE,
diagnostics: "Container killed by the ApplicationMaster.\n\n", exit_status:
143, ] resource=<memory:4096, vCores:1> queue=default: capacity=1.0,
absoluteCapacity=1.0, usedResources=<memory:57344,
vCores:14>usedCapacity=0.2857143, absoluteUsedCapacity=0.2857143, numApps=1,
numContainers=14 usedCapacity=0.2857143 absoluteUsedCapacity=0.2857143
used=<memory:57344, vCores:14> cluster=<memory:200704, vCores:112>
2014-08-04 12:21:49,199 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
default used=<memory:57344, vCores:14> numContainers=14 user=zslf023
user-resources=<memory:57344, vCores:14>
2014-08-04 12:21:49,199 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue
: completedContainer queue=root usedCapacity=0.2857143
absoluteUsedCapacity=0.2857143 used=<memory:57344, vCores:14>
cluster=<memory:200704, vCores:112>
2014-08-04 12:21:49,199 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSche
dulerApp: Completed container: container_1403771939632_0413_01_000016 in
state: COMPLETED event:FINISHED
2014-08-04 12:21:49,199 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSche
dulerNode: Released container container_1403771939632_0413_01_000016 of
capacity <memory:4096, vCores:1> on host slave4.hadoop.local:57340, which
currently has 1 containers, <memory:4096, vCores:1> used and <memory:24576,
vCores:15> available, release resources=true
2014-08-04 12:21:49,215 INFO
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=zslf023
OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS
APPID=application_1403771939632_0413
CONTAINERID=container_1403771939632_0413_01_000009
2014-08-04 12:21:49,215 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_1403771939632_0413_01_000009 Container Transitioned from RUNNING
to COMPLETED
2014-08-04 12:21:49,215 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
default used=<memory:53248, vCores:13> numContainers=13 user=zslf023
user-resources=<memory:53248, vCores:13>
2014-08-04 12:21:49,215 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSche
dulerApp: Completed container: container_1403771939632_0413_01_000009 in
state: COMPLETED event:FINISHED
2014-08-04 12:21:49,215 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSche
dulerNode: Released container container_1403771939632_0413_01_000009 of
capacity <memory:4096, vCores:1> on host slave2.hadoop.local:51269, which
currently has 6 containers, <memory:24576, vCores:6> used and <memory:4096,
vCores:10> available, release resources=true
2014-08-04 12:21:49,216 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySch
eduler: Application appattempt_1403771939632_0413_000001 released container
container_1403771939632_0413_01_000009 on node: host:
slave2.hadoop.local:51269 #containers=6 available=4096 used=24576 with
event: FINISHED
2014-08-04 12:21:49,216 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
completedContainer container=Container: [ContainerId:
container_1403771939632_0413_01_000009, NodeId: slave2.hadoop.local:51269,
NodeHttpAddress: slave2.hadoop.local:8042, Resource: <memory:4096,
vCores:1>, Priority: 20, State: NEW, Token: null, Status: container_id {,
app_attempt_id {, application_id {, id: 413, cluster_timestamp:
1403771939632, }, attemptId: 1, }, id: 9, }, state: C_COMPLETE, diagnostics:
"Container killed by the ApplicationMaster.\n\n", exit_status: 143, ]
resource=<memory:4096, vCores:1> queue=default: capacity=1.0,
absoluteCapacity=1.0, usedResources=<memory:53248,
vCores:13>usedCapacity=0.26530612, absoluteUsedCapacity=0.26530612,
numApps=1, numContainers=13 usedCapacity=0.26530612
absoluteUsedCapacity=0.26530612 used=<memory:53248, vCores:13>
cluster=<memory:200704, vCores:112>
2014-08-04 12:21:49,216 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue
: completedContainer queue=root usedCapacity=0.26530612
absoluteUsedCapacity=0.26530612 used=<memory:53248, vCores:13>
cluster=<memory:200704, vCores:112>
2014-08-04 12:21:49,363 INFO [IPC Server handler 12 on 49784]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
attempt_1403771939632_0413_m_000001_1
2014-08-04 12:21:49,502 INFO [IPC Server handler 13 on 49784]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
attempt_1403771939632_0413_m_000003_1
2014-08-04 12:21:49,623 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received
completed container container_1403771939632_0413_01_000016
2014-08-04 12:21:49,624 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics
report from attempt_1403771939632_0413_m_000006_0: Container killed by the
ApplicationMaster.
2014-08-04 12:21:49,624 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics
report from attempt_1403771939632_0413_m_000006_1: Container killed by the
ApplicationMaster.
2014-08-04 12:21:49,624 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After
Scheduling: PendingReds:0 ScheduledMaps:0 ScheduledReds:0 AssignedMaps:12
AssignedReds:0 CompletedMaps:2 CompletedReds:0 ContAlloc:15 ContRel:0
HostLocal:5 RackLocal:10
2014-08-04 12:21:49,624 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received
completed container container_1403771939632_0413_01_000009
2014-08-04 12:21:49,702 INFO [IPC Server handler 14 on 49784]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
attempt_1403771939632_0413_m_000004_1



From:  Ana Gillan <an...@gmail.com>
Date:  Saturday, 2 August 2014 20:02
To:  <us...@hadoop.apache.org>
Subject:  Re: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode
.LeaseExpiredException)

Ok, I will request this to be done, as I'm not an admin, and then get back
to this thread on Monday. Thank you!

From:  hadoop hive <ha...@gmail.com>
Reply-To:  <us...@hadoop.apache.org>
Date:  Saturday, 2 August 2014 18:50
To:  <us...@hadoop.apache.org>
Subject:  Re: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode
.LeaseExpiredException)


Hey, try changing the ulimit to 64k for the user running the query, and
change the timeout in the scheduler, which is set to 600 sec.

Also check the JT logs for any further issues.

Thanks
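
Something like this should do it, assuming a standard
/etc/security/limits.conf setup (zslf023 is the username from the logs in
this thread, and 65536 is 64k):

# as root: raise the open-files (nofile) limit for the user running the query
echo 'zslf023  -  nofile  65536' >> /etc/security/limits.conf
# log in again as that user and check the new limit took effect
ulimit -n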


From: Ana Gillan <an...@gmail.com>
Date: Saturday, 2 August 2014 18:38
To: <us...@hadoop.apache.org>
Subject: Re: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode
.LeaseExpiredException)

I'm not sure which user is fetching the data, but I'm assuming no one
changed that from the default. The data isn't huge in size, just in number,
so I suppose the open files limit is not the issue?

I'm running the job again with mapred.task.timeout=1200000, but containers
are still being killed in the same way… just without the timeout message.
It also somehow massively slowed down the machine, so even typing commands
took a long time (?).
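
For reference, one way to pass that longer timeout is on the command line
when launching the Hive CLI (query.hql here is just a placeholder for my
actual script):

# launch the job with the 20-minute task timeout
hive --hiveconf mapred.task.timeout=1200000 -f query.hql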

I'm not sure what you mean by which stage it's getting killed on. If you
mean in the command line progress counters, it's always on Stage-1.
Also, this is the end of the container log for the killed container. Failed
and killed jobs always start fine with lots of these "processing file" and
"processing alias" statements, but then suddenly warn about a DataStreamer
Exception and then are killed with an error, which is the same as the
warning. Not sure if this exception is the actual issue or if it's just a
knock-on effect of something else.

2014-08-02 17:47:38,618 INFO [main]
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file
hdfs://clustnm:8020/user/usnm123/foldernm/fivek/2w63.xml.gz
2014-08-02 17:47:38,641 INFO [main]
org.apache.hadoop.hive.ql.exec.MapOperator: Processing alias
foldernm_xml_load for file hdfs://clustnm:8020/user/usnm123/foldernm/fivek
2014-08-02 17:47:38,932 INFO [main]
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file
hdfs://clustnm:8020/user/usnm123/foldernm/fivek/2w67.xml.gz
2014-08-02 17:47:38,989 INFO [main]
org.apache.hadoop.hive.ql.exec.MapOperator: Processing alias
foldernm_xml_load for file hdfs://clustnm:8020/user/usnm123/foldernm/fivek
2014-08-02 17:47:42,675 INFO [main]
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file
hdfs://clustnm:8020/user/usnm123/foldernm/fivek/2w6i.xml.gz
2014-08-02 17:47:42,888 INFO [main]
org.apache.hadoop.hive.ql.exec.MapOperator: Processing alias
foldernm_xml_load for file hdfs://clustnm:8020/user/usnm123/foldernm/fivek
2014-08-02 17:47:45,416 WARN [Thread-8] org.apache.hadoop.hdfs.DFSClient:
DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode
.LeaseExpiredException): No lease on
/tmp/hive-usnm123/hive_2014-08-02_17-41-52_914_251548734850890001/_task_tmp.
-ext-10001/_tmp.000006_0: File does not exist. Holder
DFSClient_attempt_1403771939632_0409_m_000006_0_303479000_1 does not have
any open files.
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.
java:2398)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNames
ystem.java:2217)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNam
esystem.java:2137)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRp
cServer.java:491)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslator
PB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:351)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNam
enodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:40744)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(Proto
bufRpcEngine.java:454)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.ja
va:1478)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)

at org.apache.hadoop.ipc.Client.call(Client.java:1240)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.jav
a:202)
at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39
)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl
.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocati
onHandler.java:164)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHand
ler.java:83)
at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBloc
k(ClientNamenodeProtocolTranslatorPB.java:311)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFS
OutputStream.java:1156)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DF
SOutputStream.java:1009)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java
:464)
2014-08-02 17:47:45,417 ERROR [Thread-3] org.apache.hadoop.hdfs.DFSClient:
Failed to close file
/tmp/hive-usnm123/hive_2014-08-02_17-41-52_914_251548734850890001/_task_tmp.
-ext-10001/_tmp.000006_0
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode
.LeaseExpiredException): No lease on
/tmp/hive-usnm123/hive_2014-08-02_17-41-52_914_251548734850890001/_task_tmp.
-ext-10001/_tmp.000006_0: File does not exist. Holder
DFSClient_attempt_1403771939632_0409_m_000006_0_303479000_1 does not have
any open files.
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.
java:2398)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNames
ystem.java:2217)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNam
esystem.java:2137)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRp
cServer.java:491)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslator
PB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:351)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNam
enodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:40744)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(Proto
bufRpcEngine.java:454)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.ja
va:1478)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)

at org.apache.hadoop.ipc.Client.call(Client.java:1240)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.jav
a:202)
at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39
)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl
.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocati
onHandler.java:164)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHand
ler.java:83)
at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBloc
k(ClientNamenodeProtocolTranslatorPB.java:311)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFS
OutputStream.java:1156)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DF
SOutputStream.java:1009)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java
:464)


Thanks a lot for your attention!	

From: hadoop hive <ha...@gmail.com>
Reply-To: <us...@hadoop.apache.org>
Date: Saturday, 2 August 2014 17:36
To: <us...@hadoop.apache.org>
Subject: Re: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode
.LeaseExpiredException)


32k seems fine for the mapred user (hoping you are using that user to fetch
your data), but if you have huge data on your system you can try 64k.

Did you try increasing your timeout from 600 sec to something like 20 mins?

Can you also check at which stage it's getting hung or killed?

Thanks

On Aug 2, 2014 9:38 PM, "Ana Gillan" <an...@gmail.com> wrote:
> Filemax across the cluster is set to over 6 million. I've checked the open
> file limits for the accounts used by the Hadoop daemons and they have an open
> file limit of 32K. This is confirmed by the various .out files, e.g.
> 
> /var/log/hadoop-hdfs/hadoop-hdfs-datanode-slave1.out
> 
> Contains open files (-n) 32768. Is this too low? What is the recommended value
> for open files on all nodes? Also does my own user need to have the same
> value?
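>
> Another quick way I can compare them (assuming hdfs is one of the daemon
> account names on this cluster and zslf023 is my own account; needs root):
>
> su - hdfs -c 'ulimit -n'       # effective limit for the HDFS daemon account
> su - zslf023 -c 'ulimit -n'    # effective limit for my own account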
> 
> I've also tried running the same column selection on files crushed by the
> filecrush program https://github.com/edwardcapriolo/filecrush/
> This created 5 large files out of the 10,000 small files (still totalling 2gb
> compressed), but this job won't progress past 0% map.
> 
> From: Ana Gillan <an...@gmail.com>
> Date: Saturday, 2 August 2014 16:36
> To: <us...@hadoop.apache.org>
> Subject: Re: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.L
> easeExpiredException)
> 
> For my own user? It is as follows:
> 
> core file size          (blocks, -c) 0
> 
> data seg size           (kbytes, -d) unlimited
> 
> scheduling priority             (-e) 0
> 
> file size               (blocks, -f) unlimited
> 
> pending signals                 (-i) 483941
> 
> max locked memory       (kbytes, -l) 64
> 
> max memory size         (kbytes, -m) unlimited
> 
> open files                      (-n) 1024
> 
> pipe size            (512 bytes, -p) 8
> 
> POSIX message queues     (bytes, -q) 819200
> 
> real-time priority              (-r) 0
> 
> stack size              (kbytes, -s) 8192
> 
> cpu time               (seconds, -t) unlimited
> 
> max user processes              (-u) 800
> 
> virtual memory          (kbytes, -v) unlimited
> 
> file locks                      (-x) unlimited
> 
> 
> From: hadoop hive <ha...@gmail.com>
> Reply-To: <us...@hadoop.apache.org>
> Date: Saturday, 2 August 2014 16:34
> To: <us...@hadoop.apache.org>
> Subject: Re: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.L
> easeExpiredException)
> 
> 
> Can you check the ulimit for your user? That might be causing this.
> 
> On Aug 2, 2014 8:54 PM, "Ana Gillan" <an...@gmail.com> wrote:
>> Hi everyone,
>> 
>> I am having an issue with MapReduce jobs running through Hive being killed
>> after 600s timeouts and with very simple jobs taking over 3 hours (or just
>> failing) for a set of files with a compressed size of only 1-2gb. I will try
>> and provide as much information as I can here, so if someone can help, that
>> would be really great.
>> 
>> I have a cluster of 7 nodes (1 master, 6 slaves) with the following config:
>>> • Master node:
>>>
>>> – 2 x Intel Xeon 6-core E5-2620v2 @ 2.1GHz
>>>
>>> – 64GB DDR3 SDRAM
>>>
>>> – 8 x 2TB SAS 600 hard drive (arranged as RAID 1 and RAID 5)
>>>
>>> • Slave nodes (each):
>>>
>>> – Intel Xeon 4-core E3-1220v3 @ 3.1GHz
>>>
>>> – 32GB DDR3 SDRAM
>>>
>>> – 4 x 2TB SATA-3 hard drive
>>>
>>> • Operating system on all nodes: openSUSE Linux 13.1
>> 
>> We have the Apache BigTop package version 0.7, with Hadoop version
>> 2.0.6-alpha and Hive version 0.11.
>> YARN has been configured as per these recommendations:
>> http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/
>> 
>> I also set the following additional settings before running jobs:
>> set yarn.nodemanager.resource.cpu-vcores=4;
>> set mapred.tasktracker.map.tasks.maximum=4;
>> set hive.hadoop.supports.splittable.combineinputformat=true;
>> set hive.merge.mapredfiles=true;
>> 
>> No one else uses this cluster while I am working.
>> 
>> What I’m trying to do:
>> I have a bunch of XML files on HDFS, which I am reading into Hive using this
>> SerDe https://github.com/dvasilen/Hive-XML-SerDe. I then want to create a
>> series of tables from these files and finally run a Python script on one of
>> them to perform some scientific calculations. The files are .xml.gz format
>> and (uncompressed) are only about 4mb in size each. hive.input.format is set
>> to org.apache.hadoop.hive.ql.io.CombineHiveInputFormat so as to avoid the
>> “small files problem.”
>> 
>> Problems:
>> My HQL statements work perfectly for up to 1000 of these files. Even for much
>> larger numbers, doing select * works fine, which means the files are being
>> read properly, but if I do something as simple as selecting just one column
>> from the whole table for a larger number of files, containers start being
>> killed and jobs fail with this error in the container logs:
>> 
>> 2014-08-02 14:51:45,137 ERROR [Thread-3] org.apache.hadoop.hdfs.DFSClient:
>> Failed to close file
>> /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.
>> -ext-10001/_tmp.000000_0
>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.
>> LeaseExpiredException): No lease on
>> /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.
>> -ext-10001/_tmp.000000_0: File does not exist. Holder
>> DFSClient_attempt_1403771939632_0402_m_000000_0_-1627633686_1 does not have
>> any open files.
>> at 
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.j
>> ava:2398)
>> 
>> Killed jobs show the above and also the following message:
>> AttemptID:attempt_1403771939632_0402_m_000000_0 Timed out after 600
>> secsContainer killed by the ApplicationMaster.
>> 
>> Also, in the node logs, I get a lot of pings like this:
>> INFO [IPC Server handler 17 on 40961]
>> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
>> attempt_1403771939632_0362_m_000002_0
>> 
>> For 5000 files (1gb compressed), the selection of a single column finishes,
>> but takes over 3 hours. For 10,000 files, the job hangs on about 4% map and
>> then errors out.
>> 
>> While the jobs are running, I notice that the containers are not evenly
>> distributed across the cluster. Some nodes lie idle, while the application
>> master node runs 7 containers, maxing out the 28gb of RAM allocated to Hadoop
>> on each slave node.
>> 
>> This is the output of netstat -i while the column selection is running:
>> Kernel Interface table
>> 
>> Iface   MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
>> eth0   1500   0 79515196      0 2265807     0 45694758      0      0      0 BMRU
>> eth1   1500   0 77410508      0      0      0 40815746      0      0      0 BMRU
>> lo    65536   0 16593808      0      0      0 16593808      0      0      0 LRU
>> 
>> 
>> 
>> 
>> 
>> Are there some settings I am missing that mean the cluster isn’t processing
>> this data as efficiently as it can?
>> 
>> I am very new to Hadoop and there are so many logs, etc, that troubleshooting 
>> can be a bit overwhelming. Where else should I be looking to try and diagnose 
>> what is wrong?
>> 
>> Thanks in advance for any help you can give!
>> 
>> Kind regards,
>> Ana 
>> 




Re: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)

Posted by Ana Gillan <an...@gmail.com>.
Hi,

Unfortunately, after I set my user¹s ulimit ­n to  65536, I still get the
same bad performance, killed containers and errors as before.

I collected together a bunch of logs around the moment when the containers
are being killed (application master log, killed container log, hadoop-hdfs
logs, hadoop-yarn logs) and this is what¹s happening in order. The
PrivilegedActionException, is that something? Is my user not in some sort of
correct group?

Thanks in advance!
Ana

2014-08-04 12:21:47,868 INFO [IPC Server handler 4 on 49784]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
attempt_1403771939632_0413_m_000006_0
2014-08-04 12:21:47,979 INFO [IPC Server handler 5 on 49784]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
attempt_1403771939632_0413_m_000000_0
2014-08-04 12:21:48,102 INFO [IPC Server handler 6 on 49784]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
attempt_1403771939632_0413_m_000002_0 is : 0.061734322
2014-08-04 12:21:48,102 INFO [IPC Server handler 6 on 49784]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from
attempt_1403771939632_0413_m_000002_0
2014-08-04 12:21:48,110 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
allocateBlock: 
/tmp/hive-zslf023/hive_2014-08-04_12-16-12_475_8240516609553931504/_task_tmp
.-ext-10001/_tmp.000006_1. BP-1198908146-10.0.0.11-1399888065184
blk_-300655159790361685_67008{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1,
replicas=[ReplicaUnderConstruction[10.0.0.44:50010|RBW],
ReplicaUnderConstruction[10.0.0.43:50010|RBW],
ReplicaUnderConstruction[10.0.0.40:50010|RBW]]}
2014-08-04 12:21:48,145 INFO BlockStateChange: BLOCK* addStoredBlock:
blockMap updated: 10.0.0.40:50010 is added to
blk_-300655159790361685_67008{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1,
replicas=[ReplicaUnderConstruction[10.0.0.44:50010|RBW],
ReplicaUnderConstruction[10.0.0.43:50010|RBW],
ReplicaUnderConstruction[10.0.0.40:50010|RBW]]} size 0
2014-08-04 12:21:48,145 INFO BlockStateChange: BLOCK* addStoredBlock:
blockMap updated: 10.0.0.43:50010 is added to
blk_-300655159790361685_67008{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1,
replicas=[ReplicaUnderConstruction[10.0.0.44:50010|RBW],
ReplicaUnderConstruction[10.0.0.43:50010|RBW],
ReplicaUnderConstruction[10.0.0.40:50010|RBW]]} size 0
2014-08-04 12:21:48,146 INFO BlockStateChange: BLOCK* addStoredBlock:
blockMap updated: 10.0.0.44:50010 is added to
blk_-300655159790361685_67008{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1,
replicas=[ReplicaUnderConstruction[10.0.0.44:50010|RBW],
ReplicaUnderConstruction[10.0.0.43:50010|RBW],
ReplicaUnderConstruction[10.0.0.40:50010|RBW]]} size 0
2014-08-04 12:21:48,147 INFO org.apache.hadoop.hdfs.StateChange: DIR*
completeFile: 
/tmp/hive-zslf023/hive_2014-08-04_12-16-12_475_8240516609553931504/_task_tmp
.-ext-10001/_tmp.000006_1 is closed by
DFSClient_attempt_1403771939632_0413_m_000006_1_-1142346950_1
2014-08-04 12:21:48,147 INFO
org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions:
422 Total time for transactions(ms): 5 Number of transactions batched in
Syncs: 0 Number of syncs: 186 SyncTimes(ms): 204
2014-08-04 12:21:48,155 INFO [IPC Server handler 7 on 49784]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
attempt_1403771939632_0413_m_000006_1 is : 0.93497133
2014-08-04 12:21:48,155 INFO [IPC Server handler 7 on 49784]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from
attempt_1403771939632_0413_m_000006_1
2014-08-04 12:21:48,176 INFO [IPC Server handler 8 on 49784]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
attempt_1403771939632_0413_m_000006_1 is : 1.0
2014-08-04 12:21:48,176 INFO [IPC Server handler 8 on 49784]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from
attempt_1403771939632_0413_m_000006_1
2014-08-04 12:21:48,177 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
attempt_1403771939632_0413_m_000006_1 TaskAttempt Transitioned from RUNNING
to SUCCESS_CONTAINER_CLEANUP
2014-08-04 12:21:48,177 INFO [IPC Server handler 9 on 49784]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Done acknowledgement from
attempt_1403771939632_0413_m_000006_1
2014-08-04 12:21:48,178 INFO [ContainerLauncher #4]
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING
attempt_1403771939632_0413_m_000006_1
2014-08-04 12:21:48,178 INFO [ContainerLauncher #4]
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl:
Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container
container_1403771939632_0413_01_000016 taskAttempt
attempt_1403771939632_0413_m_000006_1
2014-08-04 12:21:48,195 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
attempt_1403771939632_0413_m_000006_1 TaskAttempt Transitioned from
SUCCESS_CONTAINER_CLEANUP to SUCCEEDED
2014-08-04 12:21:48,196 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 2
2014-08-04 12:21:48,196 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
attempt_1403771939632_0413_m_000006_0 TaskAttempt Transitioned from RUNNING
to KILL_CONTAINER_CLEANUP
2014-08-04 12:21:48,196 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Issuing kill to other
attempt attempt_1403771939632_0413_m_000006_0
2014-08-04 12:21:48,196 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with
attempt attempt_1403771939632_0413_m_000006_1
2014-08-04 12:21:48,196 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl:
task_1403771939632_0413_m_000006 Task Transitioned from RUNNING to SUCCEEDED
2014-08-04 12:21:48,197 INFO [ContainerLauncher #7]
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING
attempt_1403771939632_0413_m_000006_0
2014-08-04 12:21:48,197 INFO [ContainerLauncher #7]
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl:
Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container
container_1403771939632_0413_01_000009 taskAttempt
attempt_1403771939632_0413_m_000006_0
2014-08-04 12:21:48,219 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
attempt_1403771939632_0413_m_000006_0 TaskAttempt Transitioned from
KILL_CONTAINER_CLEANUP to KILL_TASK_CLEANUP
2014-08-04 12:21:48,230 INFO [CommitterEvent Processor #1]
org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing
the event EventType: TASK_ABORT
2014-08-04 12:21:48,231 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
attempt_1403771939632_0413_m_000006_0 TaskAttempt Transitioned from
KILL_TASK_CLEANUP to KILLED
2014-08-04 12:21:48,318 INFO [IPC Server handler 10 on 49784]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
attempt_1403771939632_0413_m_000002_1 is : 0.047582902
2014-08-04 12:21:48,318 INFO [IPC Server handler 10 on 49784]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from
attempt_1403771939632_0413_m_000002_1
2014-08-04 12:21:48,370 ERROR
org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException
as:zslf023 (auth:SIMPLE)
cause:org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease
on 
/tmp/hive-zslf023/hive_2014-08-04_12-16-12_475_8240516609553931504/_task_tmp
.-ext-10001/_tmp.000006_0: File does not exist. Holder
DFSClient_attempt_1403771939632_0413_m_000006_0_-820182675_1 does not have
any open files.
2014-08-04 12:21:48,370 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 9 on 8020, call
org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from
10.0.0.42:41375: error:
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on
/tmp/hive-zslf023/hive_2014-08-04_12-16-12_475_8240516609553931504/_task_tmp
.-ext-10001/_tmp.000006_0: File does not exist. Holder
DFSClient_attempt_1403771939632_0413_m_000006_0_-820182675_1 does not have
any open files.
2014-08-04 12:21:48,378 WARN [Thread-8] org.apache.hadoop.hdfs.DFSClient:
DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode
.LeaseExpiredException): No lease on
/tmp/hive-zslf023/hive_2014-08-04_12-16-12_475_8240516609553931504/_task_tmp
.-ext-10001/_tmp.000006_0: File does not exist. Holder
DFSClient_attempt_1403771939632_0413_m_000006_0_-820182675_1 does not have
any open files.
2014-08-04 12:21:48,379 ERROR [Thread-3] org.apache.hadoop.hdfs.DFSClient:
Failed to close file
/tmp/hive-zslf023/hive_2014-08-04_12-16-12_475_8240516609553931504/_task_tmp
.-ext-10001/_tmp.000006_0
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode
.LeaseExpiredException): No lease on
/tmp/hive-zslf023/hive_2014-08-04_12-16-12_475_8240516609553931504/_task_tmp
.-ext-10001/_tmp.000006_0: File does not exist. Holder
DFSClient_attempt_1403771939632_0413_m_000006_0_-820182675_1 does not have
any open files.
2014-08-04 12:21:48,383 INFO [IPC Server handler 11 on 49784]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
attempt_1403771939632_0413_m_000000_1
2014-08-04 12:21:48,620 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before
Scheduling: PendingReds:0 ScheduledMaps:0 ScheduledReds:0 AssignedMaps:14
AssignedReds:0 CompletedMaps:2 CompletedReds:0 ContAlloc:15 ContRel:0
HostLocal:5 RackLocal:10
2014-08-04 12:21:49,198 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_1403771939632_0413_01_000016 Container Transitioned from RUNNING
to COMPLETED
2014-08-04 12:21:49,199 INFO
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=zslf023
OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS
APPID=application_1403771939632_0413
CONTAINERID=container_1403771939632_0413_01_000016
2014-08-04 12:21:49,199 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySch
eduler: Application appattempt_1403771939632_0413_000001 released container
container_1403771939632_0413_01_000016 on node: host:
slave4.hadoop.local:57340 #containers=1 available=24576 used=4096 with
event: FINISHED
2014-08-04 12:21:49,199 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
completedContainer container=Container: [ContainerId:
container_1403771939632_0413_01_000016, NodeId: slave4.hadoop.local:57340,
NodeHttpAddress: slave4.hadoop.local:8042, Resource: <memory:4096,
vCores:1>, Priority: 20, State: NEW, Token: null, Status: container_id {,
app_attempt_id {, application_id {, id: 413, cluster_timestamp:
1403771939632, }, attemptId: 1, }, id: 16, }, state: C_COMPLETE,
diagnostics: "Container killed by the ApplicationMaster.\n\n", exit_status:
143, ] resource=<memory:4096, vCores:1> queue=default: capacity=1.0,
absoluteCapacity=1.0, usedResources=<memory:57344,
vCores:14>usedCapacity=0.2857143, absoluteUsedCapacity=0.2857143, numApps=1,
numContainers=14 usedCapacity=0.2857143 absoluteUsedCapacity=0.2857143
used=<memory:57344, vCores:14> cluster=<memory:200704, vCores:112>
2014-08-04 12:21:49,199 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
default used=<memory:57344, vCores:14> numContainers=14 user=zslf023
user-resources=<memory:57344, vCores:14>
2014-08-04 12:21:49,199 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue
: completedContainer queue=root usedCapacity=0.2857143
absoluteUsedCapacity=0.2857143 used=<memory:57344, vCores:14>
cluster=<memory:200704, vCores:112>
2014-08-04 12:21:49,199 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSche
dulerApp: Completed container: container_1403771939632_0413_01_000016 in
state: COMPLETED event:FINISHED
2014-08-04 12:21:49,199 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSche
dulerNode: Released container container_1403771939632_0413_01_000016 of
capacity <memory:4096, vCores:1> on host slave4.hadoop.local:57340, which
currently has 1 containers, <memory:4096, vCores:1> used and <memory:24576,
vCores:15> available, release resources=true
2014-08-04 12:21:49,215 INFO
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=zslf023
OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS
APPID=application_1403771939632_0413
CONTAINERID=container_1403771939632_0413_01_000009
2014-08-04 12:21:49,215 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_1403771939632_0413_01_000009 Container Transitioned from RUNNING
to COMPLETED
2014-08-04 12:21:49,215 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
default used=<memory:53248, vCores:13> numContainers=13 user=zslf023
user-resources=<memory:53248, vCores:13>
2014-08-04 12:21:49,215 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSche
dulerApp: Completed container: container_1403771939632_0413_01_000009 in
state: COMPLETED event:FINISHED
2014-08-04 12:21:49,215 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSche
dulerNode: Released container container_1403771939632_0413_01_000009 of
capacity <memory:4096, vCores:1> on host slave2.hadoop.local:51269, which
currently has 6 containers, <memory:24576, vCores:6> used and <memory:4096,
vCores:10> available, release resources=true
2014-08-04 12:21:49,216 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySch
eduler: Application appattempt_1403771939632_0413_000001 released container
container_1403771939632_0413_01_000009 on node: host:
slave2.hadoop.local:51269 #containers=6 available=4096 used=24576 with
event: FINISHED
2014-08-04 12:21:49,216 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
completedContainer container=Container: [ContainerId:
container_1403771939632_0413_01_000009, NodeId: slave2.hadoop.local:51269,
NodeHttpAddress: slave2.hadoop.local:8042, Resource: <memory:4096,
vCores:1>, Priority: 20, State: NEW, Token: null, Status: container_id {,
app_attempt_id {, application_id {, id: 413, cluster_timestamp:
1403771939632, }, attemptId: 1, }, id: 9, }, state: C_COMPLETE, diagnostics:
"Container killed by the ApplicationMaster.\n\n", exit_status: 143, ]
resource=<memory:4096, vCores:1> queue=default: capacity=1.0,
absoluteCapacity=1.0, usedResources=<memory:53248,
vCores:13>usedCapacity=0.26530612, absoluteUsedCapacity=0.26530612,
numApps=1, numContainers=13 usedCapacity=0.26530612
absoluteUsedCapacity=0.26530612 used=<memory:53248, vCores:13>
cluster=<memory:200704, vCores:112>
2014-08-04 12:21:49,216 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue
: completedContainer queue=root usedCapacity=0.26530612
absoluteUsedCapacity=0.26530612 used=<memory:53248, vCores:13>
cluster=<memory:200704, vCores:112>
2014-08-04 12:21:49,363 INFO [IPC Server handler 12 on 49784]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
attempt_1403771939632_0413_m_000001_1
2014-08-04 12:21:49,502 INFO [IPC Server handler 13 on 49784]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
attempt_1403771939632_0413_m_000003_1
2014-08-04 12:21:49,623 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received
completed container container_1403771939632_0413_01_000016
2014-08-04 12:21:49,624 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics
report from attempt_1403771939632_0413_m_000006_0: Container killed by the
ApplicationMaster.
2014-08-04 12:21:49,624 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics
report from attempt_1403771939632_0413_m_000006_1: Container killed by the
ApplicationMaster.
2014-08-04 12:21:49,624 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After
Scheduling: PendingReds:0 ScheduledMaps:0 ScheduledReds:0 AssignedMaps:12
AssignedReds:0 CompletedMaps:2 CompletedReds:0 ContAlloc:15 ContRel:0
HostLocal:5 RackLocal:10
2014-08-04 12:21:49,624 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received
completed container container_1403771939632_0413_01_000009
2014-08-04 12:21:49,702 INFO [IPC Server handler 14 on 49784]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
attempt_1403771939632_0413_m_000004_1



From:  Ana Gillan <an...@gmail.com>
Date:  Saturday, 2 August 2014 20:02
To:  <us...@hadoop.apache.org>
Subject:  Re: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode
.LeaseExpiredException)

Ok, I will request this to be done, as I¹m not an admin, and then get back
to this thread on Monday. Thank you!

From:  hadoop hive <ha...@gmail.com>
Reply-To:  <us...@hadoop.apache.org>
Date:  Saturday, 2 August 2014 18:50
To:  <us...@hadoop.apache.org>
Subject:  Re: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode
.LeaseExpiredException)


Hey try change ulimit to 64k for user which running query and change time
from scheduler which should be set to 600sec.

Check the jt logs also for further issues.

Thanks


From: Ana Gillan <an...@gmail.com>
Date: Saturday, 2 August 2014 18:38
To: <us...@hadoop.apache.org>
Subject: Re: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode
.LeaseExpiredException)

I’m not sure which user is fetching the data, but I’m assuming no one
changed that from the default. The data isn’t huge in size, just in number,
so I suppose the open files limit is not the issue?

I’m running the job again with mapred.task.timeout=1200000, but containers
are still being killed in the same way… Just without the timeout message.
And it somehow massively slowed down the machine as well, so even typing
commands took a long time (???)
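
For reference, the only thing I changed was this, in the Hive session before
re-running the query (I believe the same property is also called
mapreduce.task.timeout in Hadoop 2):

-- 20 minutes, in milliseconds
set mapred.task.timeout=1200000;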

I’m not sure what you mean by which stage it’s getting killed on. If you
mean in the command line progress counters, it’s always on Stage-1.
Also, this is the end of the container log for the killed container. Failed
and killed jobs always start fine with lots of these “processing file” and
“processing alias” statements, but then suddenly warn about a DataStreamer
Exception and then are killed with an error, which is the same as the
warning. Not sure if this exception is the actual issue or if it’s just a
knock-on effect of something else.

2014-08-02 17:47:38,618 INFO [main]
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file
hdfs://clustnm:8020/user/usnm123/foldernm/fivek/2w63.xml.gz
2014-08-02 17:47:38,641 INFO [main]
org.apache.hadoop.hive.ql.exec.MapOperator: Processing alias
foldernm_xml_load for file hdfs://clustnm:8020/user/usnm123/foldernm/fivek
2014-08-02 17:47:38,932 INFO [main]
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file
hdfs://clustnm:8020/user/usnm123/foldernm/fivek/2w67.xml.gz
2014-08-02 17:47:38,989 INFO [main]
org.apache.hadoop.hive.ql.exec.MapOperator: Processing alias
foldernm_xml_load for file hdfs://clustnm:8020/user/usnm123/foldernm/fivek
2014-08-02 17:47:42,675 INFO [main]
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file
hdfs://clustnm:8020/user/usnm123/foldernm/fivek/2w6i.xml.gz
2014-08-02 17:47:42,888 INFO [main]
org.apache.hadoop.hive.ql.exec.MapOperator: Processing alias
foldernm_xml_load for file hdfs://clustnm:8020/user/usnm123/foldernm/fivek
2014-08-02 17:47:45,416 WARN [Thread-8] org.apache.hadoop.hdfs.DFSClient:
DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode
.LeaseExpiredException): No lease on
/tmp/hive-usnm123/hive_2014-08-02_17-41-52_914_251548734850890001/_task_tmp.
-ext-10001/_tmp.000006_0: File does not exist. Holder
DFSClient_attempt_1403771939632_0409_m_000006_0_303479000_1 does not have
any open files.
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.
java:2398)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNames
ystem.java:2217)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNam
esystem.java:2137)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRp
cServer.java:491)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslator
PB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:351)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNam
enodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:40744)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(Proto
bufRpcEngine.java:454)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.ja
va:1478)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)

at org.apache.hadoop.ipc.Client.call(Client.java:1240)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.jav
a:202)
at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39
)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl
.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocati
onHandler.java:164)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHand
ler.java:83)
at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBloc
k(ClientNamenodeProtocolTranslatorPB.java:311)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFS
OutputStream.java:1156)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DF
SOutputStream.java:1009)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java
:464)
2014-08-02 17:47:45,417 ERROR [Thread-3] org.apache.hadoop.hdfs.DFSClient:
Failed to close file
/tmp/hive-usnm123/hive_2014-08-02_17-41-52_914_251548734850890001/_task_tmp.
-ext-10001/_tmp.000006_0
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode
.LeaseExpiredException): No lease on
/tmp/hive-usnm123/hive_2014-08-02_17-41-52_914_251548734850890001/_task_tmp.
-ext-10001/_tmp.000006_0: File does not exist. Holder
DFSClient_attempt_1403771939632_0409_m_000006_0_303479000_1 does not have
any open files.
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.
java:2398)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNames
ystem.java:2217)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNam
esystem.java:2137)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRp
cServer.java:491)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslator
PB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:351)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNam
enodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:40744)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(Proto
bufRpcEngine.java:454)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.ja
va:1478)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)

at org.apache.hadoop.ipc.Client.call(Client.java:1240)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.jav
a:202)
at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39
)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl
.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocati
onHandler.java:164)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHand
ler.java:83)
at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBloc
k(ClientNamenodeProtocolTranslatorPB.java:311)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFS
OutputStream.java:1156)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DF
SOutputStream.java:1009)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java
:464)


Thanks a lot for your attention!	

From: hadoop hive <ha...@gmail.com>
Reply-To: <us...@hadoop.apache.org>
Date: Saturday, 2 August 2014 17:36
To: <us...@hadoop.apache.org>
Subject: Re: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode
.LeaseExpiredException)


32k seems fine for the mapred user (hoping you are using this one for
fetching your data), but if you have huge data on your system you can try 64k.

Did you try increasing your timeout from 600 sec to something like 20 mins?

Can you also check at which stage it is getting hung or killed?
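
Also, to double-check which limit your own user and the Hadoop daemons are
actually running with (rather than what limits.conf says), something like
this should show it (assuming you run it on a node where a DataNode is up):

# limits of the current shell (your own user)
ulimit -Sn   # soft open-file limit
ulimit -Hn   # hard open-file limit

# effective limit of a running daemon, e.g. the DataNode
cat /proc/$(pgrep -f -n DataNode)/limits | grep -i 'open files'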

Thanks

On Aug 2, 2014 9:38 PM, "Ana Gillan" <an...@gmail.com> wrote:
> Filemax across the cluster is set to over 6 million. I’ve checked the open
> file limits for the accounts used by the Hadoop daemons and they have an open
> file limit of 32K. This is confirmed by the various .out files, e.g.
> 
> /var/log/hadoop-hdfs/hadoop-hdfs-datanode-slave1.out
> 
> contains "open files (-n) 32768". Is this too low? What is the recommended
> value for open files on all nodes? Also, does my own user need to have the
> same value?
> 
> I’ve also tried running the same column selection on files crushed by the
> filecrush program https://github.com/edwardcapriolo/filecrush/
> This created 5 large files out of the 10,000 small files (still totalling 2gb
> compressed), but this job won’t progress past 0% map.
> 
> From: Ana Gillan <an...@gmail.com>
> Date: Saturday, 2 August 2014 16:36
> To: <us...@hadoop.apache.org>
> Subject: Re: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.L
> easeExpiredException)
> 
> For my own user? It is as follows:
> 
> core file size          (blocks, -c) 0
> 
> data seg size           (kbytes, -d) unlimited
> 
> scheduling priority             (-e) 0
> 
> file size               (blocks, -f) unlimited
> 
> pending signals                 (-i) 483941
> 
> max locked memory       (kbytes, -l) 64
> 
> max memory size         (kbytes, -m) unlimited
> 
> open files                      (-n) 1024
> 
> pipe size            (512 bytes, -p) 8
> 
> POSIX message queues     (bytes, -q) 819200
> 
> real-time priority              (-r) 0
> 
> stack size              (kbytes, -s) 8192
> 
> cpu time               (seconds, -t) unlimited
> 
> max user processes              (-u) 800
> 
> virtual memory          (kbytes, -v) unlimited
> 
> file locks                      (-x) unlimited
> 
> 
> From: hadoop hive <ha...@gmail.com>
> Reply-To: <us...@hadoop.apache.org>
> Date: Saturday, 2 August 2014 16:34
> To: <us...@hadoop.apache.org>
> Subject: Re: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.L
> easeExpiredException)
> 
> 
> Can you check the ulimit for tour user. Which might be causing this.
> 
> On Aug 2, 2014 8:54 PM, "Ana Gillan" <an...@gmail.com> wrote:
>> Hi everyone,
>> 
>> I am having an issue with MapReduce jobs running through Hive being killed
>> after 600s timeouts and with very simple jobs taking over 3 hours (or just
>> failing) for a set of files with a compressed size of only 1-2gb. I will try
>> and provide as much information as I can here, so if someone can help, that
>> would be really great.
>> 
>> I have a cluster of 7 nodes (1 master, 6 slaves) with the following config:
>>> € Master node:
>>> 
>>> ­ 2 x Intel Xeon 6-core E5-2620v2 @ 2.1GHz
>>> 
>>> ­ 64GB DDR3 SDRAM
>>> 
>>> ­ 8 x 2TB SAS 600 hard drive (arranged as RAID 1 and RAID 5)
>>> 
>>> € Slave nodes (each):
>>> 
>>> ­ Intel Xeon 4-core E3-1220v3 @ 3.1GHz
>>> 
>>> ­ 32GB DDR3 SDRAM
>>> 
>>> ­ 4 x 2TB SATA-3 hard drive
>>> 
>>> € Operating system on all nodes: openSUSE Linux 13.1
>> 
>> We have the Apache BigTop package version 0.7, with Hadoop version
>> 2.0.6-alpha and Hive version 0.11.
>> YARN has been configured as per these recommendations:
>> http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/
>> 
>> I also set the following additional settings before running jobs:
>> set yarn.nodemanager.resource.cpu-vcores=4;
>> set mapred.tasktracker.map.tasks.maximum=4;
>> set hive.hadoop.supports.splittable.combineinputformat=true;
>> set hive.merge.mapredfiles=true;
>> 
>> No one else uses this cluster while I am working.
>> 
>> What I¹m trying to do:
>> I have a bunch of XML files on HDFS, which I am reading into Hive using this
>> SerDe https://github.com/dvasilen/Hive-XML-SerDe. I then want to create a
>> series of tables from these files and finally run a Python script on one of
>> them to perform some scientific calculations. The files are .xml.gz format
>> and (uncompressed) are only about 4mb in size each. hive.input.format is set
>> to org.apache.hadoop.hive.ql.io.CombineHiveInputFormat so as to avoid the
>> ³small files problem.²
>> 
>> Problems:
>> My HQL statements work perfectly for up to 1000 of these files. Even for much
>> larger numbers, doing select * works fine, which means the files are being
>> read properly, but if I do something as simple as selecting just one column
>> from the whole table for a larger number of files, containers start being
>> killed and jobs fail with this error in the container logs:
>> 
>> 2014-08-02 14:51:45,137 ERROR [Thread-3] org.apache.hadoop.hdfs.DFSClient:
>> Failed to close file
>> /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.
>> -ext-10001/_tmp.000000_0
>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.
>> LeaseExpiredException): No lease on
>> /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.
>> -ext-10001/_tmp.000000_0: File does not exist. Holder
>> DFSClient_attempt_1403771939632_0402_m_000000_0_-1627633686_1 does not have
>> any open files.
>> at 
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.j
>> ava:2398)
>> 
>> Killed jobs show the above and also the following message:
>> AttemptID:attempt_1403771939632_0402_m_000000_0 Timed out after 600
>> secsContainer killed by the ApplicationMaster.
>> 
>> Also, in the node logs, I get a lot of pings like this:
>> INFO [IPC Server handler 17 on 40961]
>> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
>> attempt_1403771939632_0362_m_000002_0
>> 
>> For 5000 files (1gb compressed), the selection of a single column finishes,
>> but takes over 3 hours. For 10,000 files, the job hangs on about 4% map and
>> then errors out.
>> 
>> While the jobs are running, I notice that the containers are not evenly
>> distributed across the cluster. Some nodes lie idle, while the application
>> master node runs 7 containers, maxing out the 28gb of RAM allocated to Hadoop
>> on each slave node.
>> 
>> This is the output of netstat ­i while the column selection is running:
>> Kernel Interface table
>> 
>> Iface   MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR 
>> Flg
>> 
>> eth0   1500   0 79515196      0 2265807     0 45694758      0      0      0 
>> BMRU
>> 
>> eth1   1500   0 77410508      0      0      0 40815746      0      0      0 
>> BMRU
>> 
>> lo    65536   0 16593808      0      0      0 16593808      0      0      0 
>> LRU
>> 
>> 
>> 
>> 
>> 
>> Are there some settings I am missing that mean the cluster isn¹t processing 
>> this data as efficiently as it can?
>> 
>> I am very new to Hadoop and there are so many logs, etc, that troubleshooting 
>> can be a bit overwhelming. Where else should I be looking to try and diagnose 
>> what is wrong?
>> 
>> Thanks in advance for any help you can give!
>> 
>> Kind regards,
>> Ana 
>> 




used=<memory:57344, vCores:14> cluster=<memory:200704, vCores:112>
2014-08-04 12:21:49,199 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
default used=<memory:57344, vCores:14> numContainers=14 user=zslf023
user-resources=<memory:57344, vCores:14>
2014-08-04 12:21:49,199 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue
: completedContainer queue=root usedCapacity=0.2857143
absoluteUsedCapacity=0.2857143 used=<memory:57344, vCores:14>
cluster=<memory:200704, vCores:112>
2014-08-04 12:21:49,199 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSche
dulerApp: Completed container: container_1403771939632_0413_01_000016 in
state: COMPLETED event:FINISHED
2014-08-04 12:21:49,199 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSche
dulerNode: Released container container_1403771939632_0413_01_000016 of
capacity <memory:4096, vCores:1> on host slave4.hadoop.local:57340, which
currently has 1 containers, <memory:4096, vCores:1> used and <memory:24576,
vCores:15> available, release resources=true
2014-08-04 12:21:49,215 INFO
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=zslf023
OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS
APPID=application_1403771939632_0413
CONTAINERID=container_1403771939632_0413_01_000009
2014-08-04 12:21:49,215 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_1403771939632_0413_01_000009 Container Transitioned from RUNNING
to COMPLETED
2014-08-04 12:21:49,215 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
default used=<memory:53248, vCores:13> numContainers=13 user=zslf023
user-resources=<memory:53248, vCores:13>
2014-08-04 12:21:49,215 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSche
dulerApp: Completed container: container_1403771939632_0413_01_000009 in
state: COMPLETED event:FINISHED
2014-08-04 12:21:49,215 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSche
dulerNode: Released container container_1403771939632_0413_01_000009 of
capacity <memory:4096, vCores:1> on host slave2.hadoop.local:51269, which
currently has 6 containers, <memory:24576, vCores:6> used and <memory:4096,
vCores:10> available, release resources=true
2014-08-04 12:21:49,216 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySch
eduler: Application appattempt_1403771939632_0413_000001 released container
container_1403771939632_0413_01_000009 on node: host:
slave2.hadoop.local:51269 #containers=6 available=4096 used=24576 with
event: FINISHED
2014-08-04 12:21:49,216 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
completedContainer container=Container: [ContainerId:
container_1403771939632_0413_01_000009, NodeId: slave2.hadoop.local:51269,
NodeHttpAddress: slave2.hadoop.local:8042, Resource: <memory:4096,
vCores:1>, Priority: 20, State: NEW, Token: null, Status: container_id {,
app_attempt_id {, application_id {, id: 413, cluster_timestamp:
1403771939632, }, attemptId: 1, }, id: 9, }, state: C_COMPLETE, diagnostics:
"Container killed by the ApplicationMaster.\n\n", exit_status: 143, ]
resource=<memory:4096, vCores:1> queue=default: capacity=1.0,
absoluteCapacity=1.0, usedResources=<memory:53248,
vCores:13>usedCapacity=0.26530612, absoluteUsedCapacity=0.26530612,
numApps=1, numContainers=13 usedCapacity=0.26530612
absoluteUsedCapacity=0.26530612 used=<memory:53248, vCores:13>
cluster=<memory:200704, vCores:112>
2014-08-04 12:21:49,216 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue
: completedContainer queue=root usedCapacity=0.26530612
absoluteUsedCapacity=0.26530612 used=<memory:53248, vCores:13>
cluster=<memory:200704, vCores:112>
2014-08-04 12:21:49,363 INFO [IPC Server handler 12 on 49784]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
attempt_1403771939632_0413_m_000001_1
2014-08-04 12:21:49,502 INFO [IPC Server handler 13 on 49784]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
attempt_1403771939632_0413_m_000003_1
2014-08-04 12:21:49,623 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received
completed container container_1403771939632_0413_01_000016
2014-08-04 12:21:49,624 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics
report from attempt_1403771939632_0413_m_000006_0: Container killed by the
ApplicationMaster.
2014-08-04 12:21:49,624 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics
report from attempt_1403771939632_0413_m_000006_1: Container killed by the
ApplicationMaster.
2014-08-04 12:21:49,624 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After
Scheduling: PendingReds:0 ScheduledMaps:0 ScheduledReds:0 AssignedMaps:12
AssignedReds:0 CompletedMaps:2 CompletedReds:0 ContAlloc:15 ContRel:0
HostLocal:5 RackLocal:10
2014-08-04 12:21:49,624 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received
completed container container_1403771939632_0413_01_000009
2014-08-04 12:21:49,702 INFO [IPC Server handler 14 on 49784]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
attempt_1403771939632_0413_m_000004_1



From:  Ana Gillan <an...@gmail.com>
Date:  Saturday, 2 August 2014 20:02
To:  <us...@hadoop.apache.org>
Subject:  Re: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode
.LeaseExpiredException)

Ok, I will request this to be done, as I'm not an admin, and then get back
to this thread on Monday. Thank you!

From:  hadoop hive <ha...@gmail.com>
Reply-To:  <us...@hadoop.apache.org>
Date:  Saturday, 2 August 2014 18:50
To:  <us...@hadoop.apache.org>
Subject:  Re: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode
.LeaseExpiredException)


Hey, try changing the ulimit to 64k for the user running the query, and
increase the scheduler timeout, which is currently set to 600 sec.
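
For instance, a minimal sketch, assuming the query runs as the zslf023 user
seen in the logs and that limits are managed via /etc/security/limits.conf on
each node (a re-login is needed for the change to take effect):

    zslf023  soft  nofile  65536
    zslf023  hard  nofile  65536

and the task timeout could be raised to 20 minutes for the Hive session with:

    set mapred.task.timeout=1200000;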

Also check the JT logs for further issues.

Thanks


From: Ana Gillan <an...@gmail.com>
Date: Saturday, 2 August 2014 18:38
To: <us...@hadoop.apache.org>
Subject: Re: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode
.LeaseExpiredException)

I'm not sure which user is fetching the data, but I'm assuming no one
changed that from the default. The data isn't huge in size, just in number,
so I suppose the open files limit is not the issue?

I'm running the job again with mapred.task.timeout=1200000, but containers
are still being killed in the same way… just without the timeout message.
And it somehow massively slowed down the machine as well, so even typing
commands took a long time (???)

I'm not sure what you mean by which stage it's getting killed on. If you
mean in the command line progress counters, it's always on Stage-1.
Also, this is the end of the container log for the killed container. Failed
and killed jobs always start fine with lots of these "processing file" and
"processing alias" statements, but then suddenly warn about a DataStreamer
Exception and then are killed with an error, which is the same as the
warning. Not sure if this exception is the actual issue or if it's just a
knock-on effect of something else.

2014-08-02 17:47:38,618 INFO [main]
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file
hdfs://clustnm:8020/user/usnm123/foldernm/fivek/2w63.xml.gz
2014-08-02 17:47:38,641 INFO [main]
org.apache.hadoop.hive.ql.exec.MapOperator: Processing alias
foldernm_xml_load for file hdfs://clustnm:8020/user/usnm123/foldernm/fivek
2014-08-02 17:47:38,932 INFO [main]
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file
hdfs://clustnm:8020/user/usnm123/foldernm/fivek/2w67.xml.gz
2014-08-02 17:47:38,989 INFO [main]
org.apache.hadoop.hive.ql.exec.MapOperator: Processing alias
foldernm_xml_load for file hdfs://clustnm:8020/user/usnm123/foldernm/fivek
2014-08-02 17:47:42,675 INFO [main]
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file
hdfs://clustnm:8020/user/usnm123/foldernm/fivek/2w6i.xml.gz
2014-08-02 17:47:42,888 INFO [main]
org.apache.hadoop.hive.ql.exec.MapOperator: Processing alias
foldernm_xml_load for file hdfs://clustnm:8020/user/usnm123/foldernm/fivek
2014-08-02 17:47:45,416 WARN [Thread-8] org.apache.hadoop.hdfs.DFSClient:
DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode
.LeaseExpiredException): No lease on
/tmp/hive-usnm123/hive_2014-08-02_17-41-52_914_251548734850890001/_task_tmp.
-ext-10001/_tmp.000006_0: File does not exist. Holder
DFSClient_attempt_1403771939632_0409_m_000006_0_303479000_1 does not have
any open files.
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.
java:2398)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNames
ystem.java:2217)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNam
esystem.java:2137)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRp
cServer.java:491)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslator
PB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:351)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNam
enodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:40744)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(Proto
bufRpcEngine.java:454)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.ja
va:1478)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)

at org.apache.hadoop.ipc.Client.call(Client.java:1240)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.jav
a:202)
at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39
)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl
.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocati
onHandler.java:164)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHand
ler.java:83)
at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBloc
k(ClientNamenodeProtocolTranslatorPB.java:311)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFS
OutputStream.java:1156)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DF
SOutputStream.java:1009)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java
:464)
2014-08-02 17:47:45,417 ERROR [Thread-3] org.apache.hadoop.hdfs.DFSClient:
Failed to close file
/tmp/hive-usnm123/hive_2014-08-02_17-41-52_914_251548734850890001/_task_tmp.
-ext-10001/_tmp.000006_0
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode
.LeaseExpiredException): No lease on
/tmp/hive-usnm123/hive_2014-08-02_17-41-52_914_251548734850890001/_task_tmp.
-ext-10001/_tmp.000006_0: File does not exist. Holder
DFSClient_attempt_1403771939632_0409_m_000006_0_303479000_1 does not have
any open files.
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.
java:2398)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNames
ystem.java:2217)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNam
esystem.java:2137)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRp
cServer.java:491)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslator
PB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:351)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNam
enodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:40744)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(Proto
bufRpcEngine.java:454)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.ja
va:1478)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)

at org.apache.hadoop.ipc.Client.call(Client.java:1240)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.jav
a:202)
at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39
)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl
.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocati
onHandler.java:164)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHand
ler.java:83)
at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBloc
k(ClientNamenodeProtocolTranslatorPB.java:311)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFS
OutputStream.java:1156)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DF
SOutputStream.java:1009)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java
:464)


Thanks a lot for your attention!	

From: hadoop hive <ha...@gmail.com>
Reply-To: <us...@hadoop.apache.org>
Date: Saturday, 2 August 2014 17:36
To: <us...@hadoop.apache.org>
Subject: Re: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode
.LeaseExpiredException)


32k seems fine for the mapred user (I hope this is the user fetching your
data), but if you have huge data on your system you can try 64k.

Did you try increasing your timeout from 600 sec to something like 20 mins?

Can you also check at which stage it is getting hung or killed?

Thanks

On Aug 2, 2014 9:38 PM, "Ana Gillan" <an...@gmail.com> wrote:
> Filemax across the cluster is set to over 6 million. I've checked the open
> file limits for the accounts used by the Hadoop daemons and they have an open
> file limit of 32K. This is confirmed by the various .out files, e.g.
> 
> /var/log/hadoop-hdfs/hadoop-hdfs-datanode-slave1.out
> 
> Contains open files (-n) 32768. Is this too low? What is the recommended value
> for open files on all nodes? Also does my own user need to have the same
> value?
> 
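> (As a rough illustrative check, the effective limit can be read both for the
> submitting user and for a running daemon, for example:
> 
>     su - zslf023 -c 'ulimit -n'
>     grep 'open files' /proc/$(pgrep -f DataNode)/limits
> 
> where zslf023 is only a stand-in for whichever account actually submits the
> query.)
> 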
> I've also tried running the same column selection on files crushed by the
> filecrush program https://github.com/edwardcapriolo/filecrush/
> This created 5 large files out of the 10,000 small files (still totalling 2gb
> compressed), but this job won't progress past 0% map.
> 
> From: Ana Gillan <an...@gmail.com>
> Date: Saturday, 2 August 2014 16:36
> To: <us...@hadoop.apache.org>
> Subject: Re: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.L
> easeExpiredException)
> 
> For my own user? It is as follows:
> 
> core file size          (blocks, -c) 0
> 
> data seg size           (kbytes, -d) unlimited
> 
> scheduling priority             (-e) 0
> 
> file size               (blocks, -f) unlimited
> 
> pending signals                 (-i) 483941
> 
> max locked memory       (kbytes, -l) 64
> 
> max memory size         (kbytes, -m) unlimited
> 
> open files                      (-n) 1024
> 
> pipe size            (512 bytes, -p) 8
> 
> POSIX message queues     (bytes, -q) 819200
> 
> real-time priority              (-r) 0
> 
> stack size              (kbytes, -s) 8192
> 
> cpu time               (seconds, -t) unlimited
> 
> max user processes              (-u) 800
> 
> virtual memory          (kbytes, -v) unlimited
> 
> file locks                      (-x) unlimited
> 
> 
> From: hadoop hive <ha...@gmail.com>
> Reply-To: <us...@hadoop.apache.org>
> Date: Saturday, 2 August 2014 16:34
> To: <us...@hadoop.apache.org>
> Subject: Re: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.L
> easeExpiredException)
> 
> 
> Can you check the ulimit for your user? That might be causing this.
> 
> On Aug 2, 2014 8:54 PM, "Ana Gillan" <an...@gmail.com> wrote:
>> Hi everyone,
>> 
>> I am having an issue with MapReduce jobs running through Hive being killed
>> after 600s timeouts and with very simple jobs taking over 3 hours (or just
>> failing) for a set of files with a compressed size of only 1-2gb. I will try
>> and provide as much information as I can here, so if someone can help, that
>> would be really great.
>> 
>> I have a cluster of 7 nodes (1 master, 6 slaves) with the following config:
>>> • Master node:
>>> 
>>> – 2 x Intel Xeon 6-core E5-2620v2 @ 2.1GHz
>>> 
>>> – 64GB DDR3 SDRAM
>>> 
>>> – 8 x 2TB SAS 600 hard drive (arranged as RAID 1 and RAID 5)
>>> 
>>> • Slave nodes (each):
>>> 
>>> – Intel Xeon 4-core E3-1220v3 @ 3.1GHz
>>> 
>>> – 32GB DDR3 SDRAM
>>> 
>>> – 4 x 2TB SATA-3 hard drive
>>> 
>>> • Operating system on all nodes: openSUSE Linux 13.1
>> 
>> We have the Apache BigTop package version 0.7, with Hadoop version
>> 2.0.6-alpha and Hive version 0.11.
>> YARN has been configured as per these recommendations:
>> http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/
>> 
>> I also set the following additional settings before running jobs:
>> set yarn.nodemanager.resource.cpu-vcores=4;
>> set mapred.tasktracker.map.tasks.maximum=4;
>> set hive.hadoop.supports.splittable.combineinputformat=true;
>> set hive.merge.mapredfiles=true;
>> 
>> No one else uses this cluster while I am working.
>> 
>> What I'm trying to do:
>> I have a bunch of XML files on HDFS, which I am reading into Hive using this
>> SerDe https://github.com/dvasilen/Hive-XML-SerDe. I then want to create a
>> series of tables from these files and finally run a Python script on one of
>> them to perform some scientific calculations. The files are .xml.gz format
>> and (uncompressed) are only about 4mb in size each. hive.input.format is set
>> to org.apache.hadoop.hive.ql.io.CombineHiveInputFormat so as to avoid the
>> "small files problem."
>> 
>> Problems:
>> My HQL statements work perfectly for up to 1000 of these files. Even for much
>> larger numbers, doing select * works fine, which means the files are being
>> read properly, but if I do something as simple as selecting just one column
>> from the whole table for a larger number of files, containers start being
>> killed and jobs fail with this error in the container logs:
>> 
>> 2014-08-02 14:51:45,137 ERROR [Thread-3] org.apache.hadoop.hdfs.DFSClient:
>> Failed to close file
>> /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.
>> -ext-10001/_tmp.000000_0
>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.
>> LeaseExpiredException): No lease on
>> /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.
>> -ext-10001/_tmp.000000_0: File does not exist. Holder
>> DFSClient_attempt_1403771939632_0402_m_000000_0_-1627633686_1 does not have
>> any open files.
>> at 
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.j
>> ava:2398)
>> 
>> Killed jobs show the above and also the following message:
>> AttemptID:attempt_1403771939632_0402_m_000000_0 Timed out after 600
>> secsContainer killed by the ApplicationMaster.
>> 
>> Also, in the node logs, I get a lot of pings like this:
>> INFO [IPC Server handler 17 on 40961]
>> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
>> attempt_1403771939632_0362_m_000002_0
>> 
>> For 5000 files (1gb compressed), the selection of a single column finishes,
>> but takes over 3 hours. For 10,000 files, the job hangs on about 4% map and
>> then errors out.
>> 
>> While the jobs are running, I notice that the containers are not evenly
>> distributed across the cluster. Some nodes lie idle, while the application
>> master node runs 7 containers, maxing out the 28gb of RAM allocated to Hadoop
>> on each slave node.
>> 
>> This is the output of netstat -i while the column selection is running:
>> Kernel Interface table
>> 
>> Iface   MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR 
>> Flg
>> 
>> eth0   1500   0 79515196      0 2265807     0 45694758      0      0      0 
>> BMRU
>> 
>> eth1   1500   0 77410508      0      0      0 40815746      0      0      0 
>> BMRU
>> 
>> lo    65536   0 16593808      0      0      0 16593808      0      0      0 
>> LRU
>> 
>> 
>> 
>> 
>> 
>> Are there some settings I am missing that mean the cluster isn't processing 
>> this data as efficiently as it can?
>> 
>> I am very new to Hadoop and there are so many logs, etc, that troubleshooting 
>> can be a bit overwhelming. Where else should I be looking to try and diagnose 
>> what is wrong?
>> 
>> Thanks in advance for any help you can give!
>> 
>> Kind regards,
>> Ana 
>> 




Re: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)

Posted by Ana Gillan <an...@gmail.com>.
Ok, I will request this to be done, as I'm not an admin, and then get back
to this thread on Monday. Thank you!

From:  hadoop hive <ha...@gmail.com>
Reply-To:  <us...@hadoop.apache.org>
Date:  Saturday, 2 August 2014 18:50
To:  <us...@hadoop.apache.org>
Subject:  Re: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode
.LeaseExpiredException)


Hey, try changing the ulimit to 64k for the user running the query, and
increase the scheduler timeout, which is currently set to 600 sec.

Also check the JT logs for further issues.

Thanks



Re: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)

Posted by hadoop hive <ha...@gmail.com>.
Hey, try changing the ulimit to 64k for the user running the query, and
increase the scheduler timeout, which is currently set to 600 sec.

Also check the JT logs for further issues.

Thanks
 On Aug 2, 2014 11:09 PM, "Ana Gillan" <an...@gmail.com> wrote:

> I’m not sure which user is fetching the data, but I’m assuming no one
> changed that from the default. The data isn’t huge in size, just in number,
> so I suppose the open files limit is not the issue?
>
> I’m running the job again with mapred.task.timeout=1200000, but containers
> are still being killed in the same way… Just without the timeout message.
> And it somehow massively slowed down the machine as well, so even typing
> commands took a long time (???)
>
> I’m not sure what you mean by which stage it’s getting killed on. If you
> mean in the command line progress counters, it's always on Stage-1.
> Also, this is the end of the container log for the killed container.
> Failed and killed jobs always start fine with lots of these “processing
> file” and “processing alias” statements, but then suddenly warn about a
> DataStreamer Exception and then are killed with an error, which is the same
> as the warning. Not sure if this exception is the actual issue or if it’s
> just a knock-on effect of something else.
>
> 2014-08-02 17:47:38,618 INFO [main]
> org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file
> hdfs://clustnm:8020/user/usnm123/foldernm/fivek/2w63.xml.gz
> 2014-08-02 17:47:38,641 INFO [main]
> org.apache.hadoop.hive.ql.exec.MapOperator: Processing alias
> foldernm_xml_load for file hdfs://clustnm:8020/user/usnm123/foldernm/fivek
> 2014-08-02 17:47:38,932 INFO [main]
> org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file
> hdfs://clustnm:8020/user/usnm123/foldernm/fivek/2w67.xml.gz
> 2014-08-02 17:47:38,989 INFO [main]
> org.apache.hadoop.hive.ql.exec.MapOperator: Processing alias
> foldernm_xml_load for file hdfs://clustnm:8020/user/usnm123/foldernm/fivek
> 2014-08-02 17:47:42,675 INFO [main]
> org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file
> hdfs://clustnm:8020/user/usnm123/foldernm/fivek/2w6i.xml.gz
> 2014-08-02 17:47:42,888 INFO [main]
> org.apache.hadoop.hive.ql.exec.MapOperator: Processing alias
> foldernm_xml_load for file hdfs://clustnm:8020/user/usnm123/foldernm/fivek
> 2014-08-02 17:47:45,416 WARN [Thread-8] org.apache.hadoop.hdfs.DFSClient:
> DataStreamer Exception
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> No lease on
> /tmp/hive-usnm123/hive_2014-08-02_17-41-52_914_251548734850890001/_task_tmp.-ext-10001/_tmp.000006_0:
> File does not exist. Holder
> DFSClient_attempt_1403771939632_0409_m_000006_0_303479000_1 does not have
> any open files.
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2398)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2217)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2137)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:491)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:351)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:40744)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)
>
> at org.apache.hadoop.ipc.Client.call(Client.java:1240)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
> at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
> at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:311)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1156)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1009)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:464)
> 2014-08-02 17:47:45,417 ERROR [Thread-3] org.apache.hadoop.hdfs.DFSClient:
> Failed to close file
> /tmp/hive-usnm123/hive_2014-08-02_17-41-52_914_251548734850890001/_task_tmp.-ext-10001/_tmp.000006_0
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> No lease on
> /tmp/hive-usnm123/hive_2014-08-02_17-41-52_914_251548734850890001/_task_tmp.-ext-10001/_tmp.000006_0:
> File does not exist. Holder
> DFSClient_attempt_1403771939632_0409_m_000006_0_303479000_1 does not have
> any open files.
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2398)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2217)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2137)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:491)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:351)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:40744)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)
>
> at org.apache.hadoop.ipc.Client.call(Client.java:1240)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
> at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
> at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:311)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1156)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1009)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:464)
>
>
>
> Thanks a lot for your attention!	
>
>
> From: hadoop hive <ha...@gmail.com>
> Reply-To: <us...@hadoop.apache.org>
> Date: Saturday, 2 August 2014 17:36
> To: <us...@hadoop.apache.org>
> Subject: Re:
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)
>
> 32k seems fine for mapred user(hope you using this for fetching you data)
> but if you have huge data on your system you can try 64k.
>
> Did you try increasing you time from 600 sec to like 20 mins.
>
> Can you also check on which stage its getting hanged or killed.
>
> Thanks
>

Re: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)

Posted by Ana Gillan <an...@gmail.com>.
I’m not sure which user is fetching the data, but I’m assuming no one
changed that from the default. The data isn’t huge in size, just in number,
so I suppose the open files limit is not the issue?

I’m running the job again with mapred.task.timeout=1200000, but containers
are still being killed in the same way… Just without the timeout message.
And it somehow massively slowed down the machine as well, so even typing
commands took a long time (???)

I’m not sure what you mean by which stage it’s getting killed on. If you
mean in the command line progress counters, it's always on Stage-1.
Also, this is the end of the container log for the killed container. Failed
and killed jobs always start fine with lots of these “processing file” and
“processing alias” statements, but then suddenly warn about a DataStreamer
Exception and then are killed with an error, which is the same as the
warning. Not sure if this exception is the actual issue or if it’s just a
knock-on effect of something else.
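
For reference, all the container logs for an application can be collected in one
step with something like the following, assuming YARN log aggregation is enabled
and this Hadoop version ships the yarn logs command (the application id here is
inferred from the attempt id in the trace below, so treat it as a placeholder):

# Fetch every container log for the application and scan for the lease error
yarn logs -applicationId application_1403771939632_0409 > app_0409_logs.txt
grep -n "LeaseExpiredException" app_0409_logs.txt | head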

2014-08-02 17:47:38,618 INFO [main] org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://clustnm:8020/user/usnm123/foldernm/fivek/2w63.xml.gz
2014-08-02 17:47:38,641 INFO [main] org.apache.hadoop.hive.ql.exec.MapOperator: Processing alias foldernm_xml_load for file hdfs://clustnm:8020/user/usnm123/foldernm/fivek
2014-08-02 17:47:38,932 INFO [main] org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://clustnm:8020/user/usnm123/foldernm/fivek/2w67.xml.gz
2014-08-02 17:47:38,989 INFO [main] org.apache.hadoop.hive.ql.exec.MapOperator: Processing alias foldernm_xml_load for file hdfs://clustnm:8020/user/usnm123/foldernm/fivek
2014-08-02 17:47:42,675 INFO [main] org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://clustnm:8020/user/usnm123/foldernm/fivek/2w6i.xml.gz
2014-08-02 17:47:42,888 INFO [main] org.apache.hadoop.hive.ql.exec.MapOperator: Processing alias foldernm_xml_load for file hdfs://clustnm:8020/user/usnm123/foldernm/fivek
2014-08-02 17:47:45,416 WARN [Thread-8] org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/hive-usnm123/hive_2014-08-02_17-41-52_914_251548734850890001/_task_tmp.-ext-10001/_tmp.000006_0: File does not exist. Holder DFSClient_attempt_1403771939632_0409_m_000006_0_303479000_1 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2398)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2217)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2137)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:491)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:351)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:40744)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)

at org.apache.hadoop.ipc.Client.call(Client.java:1240)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:311)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1156)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1009)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:464)
2014-08-02 17:47:45,417 ERROR [Thread-3] org.apache.hadoop.hdfs.DFSClient: Failed to close file /tmp/hive-usnm123/hive_2014-08-02_17-41-52_914_251548734850890001/_task_tmp.-ext-10001/_tmp.000006_0
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/hive-usnm123/hive_2014-08-02_17-41-52_914_251548734850890001/_task_tmp.-ext-10001/_tmp.000006_0: File does not exist. Holder DFSClient_attempt_1403771939632_0409_m_000006_0_303479000_1 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2398)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2217)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2137)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:491)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:351)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:40744)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)

at org.apache.hadoop.ipc.Client.call(Client.java:1240)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:311)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1156)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1009)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:464)


Thanks a lot for your attention!	

From:  hadoop hive <ha...@gmail.com>
Reply-To:  <us...@hadoop.apache.org>
Date:  Saturday, 2 August 2014 17:36
To:  <us...@hadoop.apache.org>
Subject:  Re: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode
.LeaseExpiredException)


32k seems fine for mapred user(hope you using this for fetching you data)
but if you have huge data on your system you can try 64k.

Did you try increasing you time from 600 sec to like 20 mins.

Can you also check on which stage its getting hanged or killed.

Thanks



Re: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)

Posted by hadoop hive <ha...@gmail.com>.
32k seems fine for the mapred user (I hope that is the user you are using to
fetch your data), but if you have huge data on your system you can try 64k.

Did you try increasing your timeout from 600 sec to something like 20 mins?

Can you also check at which stage it is getting hung or killed?
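
For example, something like this (the account names are common BigTop defaults
and the column/table are placeholders, so substitute whichever user actually
reads the data):

# Open-files limit as seen by the accounts that touch the data
sudo -u mapred bash -c 'ulimit -n'
sudo -u hdfs bash -c 'ulimit -n'
# The Hive CLI prints "Stage-N" as each MapReduce job runs; EXPLAIN lists the
# stages up front, so a hung stage can be matched back to an operator
hive -e "EXPLAIN SELECT one_column FROM your_table;"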

Thanks
 On Aug 2, 2014 9:38 PM, "Ana Gillan" <an...@gmail.com> wrote:

> Filemax across the cluster is set to over 6 million. I’ve checked the
> open file limits for the accounts used by the Hadoop daemons
>  and they have an open file limit of 32K. This is confirmed by the various
> .out files, e.g.
>
> /var/log/hadoop-hdfs/hadoop-hdfs-datanode-slave1.out
>
> Contains open files (-n) 32768. Is this too low? What is the recommended
> value for open files on all nodes? Also does my own user need to have the
> same value?
>
> I’ve also tried running the same column selection on files crushed by the
> filecrush program https://github.com/edwardcapriolo/filecrush/
> This created 5 large files out of the 10,000 small files (still totally
> 2gb compressed), but this job won’t progress past 0% map.
>
> From: Ana Gillan <an...@gmail.com>
> Date: Saturday, 2 August 2014 16:36
> To: <us...@hadoop.apache.org>
> Subject: Re:
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)
>
> For my own user? It is as follows:
>
> core file size          (blocks, -c) 0
>
> data seg size           (kbytes, -d) unlimited
>
> scheduling priority             (-e) 0
>
> file size               (blocks, -f) unlimited
>
> pending signals                 (-i) 483941
>
> max locked memory       (kbytes, -l) 64
>
> max memory size         (kbytes, -m) unlimited
>
> open files                      (-n) 1024
>
> pipe size            (512 bytes, -p) 8
>
> POSIX message queues     (bytes, -q) 819200
>
> real-time priority              (-r) 0
>
> stack size              (kbytes, -s) 8192
>
> cpu time               (seconds, -t) unlimited
>
> max user processes              (-u) 800
>
> virtual memory          (kbytes, -v) unlimited
>
> file locks                      (-x) unlimited
>
> From: hadoop hive <ha...@gmail.com>
> Reply-To: <us...@hadoop.apache.org>
> Date: Saturday, 2 August 2014 16:34
> To: <us...@hadoop.apache.org>
> Subject: Re:
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)
>
> Can you check the ulimit for tour user. Which might be causing this.
> On Aug 2, 2014 8:54 PM, "Ana Gillan" <an...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> I am having an issue with MapReduce jobs running through Hive being
>> killed after 600s timeouts and with very simple jobs taking over 3 hours
>> (or just failing) for a set of files with a compressed size of only 1-2gb.
>> I will try and provide as much information as I can here, so if someone can
>> help, that would be really great.
>>
>> I have a cluster of 7 nodes (1 master, 6 slaves) with the following
>> config:
>>
>> • Master node:
>>
>> – 2 x Intel Xeon 6-core E5-2620v2 @ 2.1GHz
>>
>> – 64GB DDR3 SDRAM
>>
>> – 8 x 2TB SAS 600 hard drive (arranged as RAID 1 and RAID 5)
>>
>> • Slave nodes (each):
>>
>> – Intel Xeon 4-core E3-1220v3 @ 3.1GHz
>>
>> – 32GB DDR3 SDRAM
>>
>> – 4 x 2TB SATA-3 hard drive
>>
>> • Operating system on all nodes: openSUSE Linux 13.1
>>
>>  We have the Apache BigTop package version 0.7, with Hadoop version
>> 2.0.6-alpha and Hive version 0.11.
>> YARN has been configured as per these recommendations:
>> http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/
>>
>> I also set the following additional settings before running jobs:
>> set yarn.nodemanager.resource.cpu-vcores=4;
>> set mapred.tasktracker.map.tasks.maximum=4;
>> set hive.hadoop.supports.splittable.combineinputformat=true;
>> set hive.merge.mapredfiles=true;
>>
>> No one else uses this cluster while I am working.
>>
>> What I’m trying to do:
>> I have a bunch of XML files on HDFS, which I am reading into Hive using
>> this SerDe https://github.com/dvasilen/Hive-XML-SerDe. I then want to
>> create a series of tables from these files and finally run a Python script
>> on one of them to perform some scientific calculations. The files are
>> .xml.gz format and (uncompressed) are only about 4mb in size each. hive.input.format
>> is set to org.apache.hadoop.hive.ql.io.CombineHiveInputFormat so as to
>> avoid the “small files problem.”
>>
>> Problems:
>> My HQL statements work perfectly for up to 1000 of these files. Even for
>> much larger numbers, doing select * works fine, which means the files are
>> being read properly, but if I do something as simple as selecting just one
>> column from the whole table for a larger number of files, containers start
>> being killed and jobs fail with this error in the container logs:
>>
>> 2014-08-02 14:51:45,137 ERROR [Thread-3]
>> org.apache.hadoop.hdfs.DFSClient: Failed to close file
>> /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-ext-10001/_tmp.000000_0
>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>> No lease on
>> /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-ext-10001/_tmp.000000_0:
>> File does not exist. Holder
>> DFSClient_attempt_1403771939632_0402_m_000000_0_-1627633686_1 does not have
>> any open files.
>> at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2398)
>>
>> Killed jobs show the above and also the following message:
>> AttemptID:attempt_1403771939632_0402_m_000000_0 Timed out after 600
>> secsContainer killed by the ApplicationMaster.
>>
>> Also, in the node logs, I get a lot of pings like this:
>> INFO [IPC Server handler 17 on 40961]
>> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
>> attempt_1403771939632_0362_m_000002_0
>>
>> For 5000 files (1gb compressed), the selection of a single column
>> finishes, but takes over 3 hours. For 10,000 files, the job hangs on about
>> 4% map and then errors out.
>>
>> While the jobs are running, I notice that the containers are not evenly
>> distributed across the cluster. Some nodes lie idle, while the application
>> master node runs 7 containers, maxing out the 28gb of RAM allocated
>> to Hadoop on each slave node.
>>
>> This is the output of netstat –i while the column selection is running:
>>
>> Kernel Interface table
>>
>> Iface   MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP
>> TX-OVR Flg
>>
>> eth0   1500   0 79515196      0 2265807     0 45694758      0      0
>>   0 BMRU
>>
>> eth1   1500   0 77410508      0      0      0 40815746      0      0
>> 0 BMRU
>>
>> lo    65536   0 16593808      0      0      0 16593808      0      0
>> 0 LRU
>>
>>
>>
>>
>> Are there some settings I am missing that mean the cluster isn’t
>> processing this data as efficiently as it can?
>>
>> I am very new to Hadoop and there are so many logs, etc, that
>> troubleshooting can be a bit overwhelming. Where else should I be looking
>> to try and diagnose what is wrong?
>>
>> Thanks in advance for any help you can give!
>>
>> Kind regards,
>> Ana
>>
>>

Re: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)

Posted by hadoop hive <ha...@gmail.com>.
32k seems fine for mapred user(hope you using this for fetching you data)
but if you have huge data on your system you can try 64k.

Did you try increasing you time from 600 sec to like 20 mins.

Can you also check on which stage its getting hanged or killed.

Thanks
 On Aug 2, 2014 9:38 PM, "Ana Gillan" <an...@gmail.com> wrote:

> Filemax across the cluster is set to over 6 million. I’ve checked the
> open file limits for the accounts used by the Hadoop daemons
>  and they have an open file limit of 32K. This is confirmed by the various
> .out files, e.g.
>
> /var/log/hadoop-hdfs/hadoop-hdfs-datanode-slave1.out
>
> Contains open files (-n) 32768. Is this too low? What is the recommended
> value for open files on all nodes? Also does my own user need to have the
> same value?
>
> I’ve also tried running the same column selection on files crushed by the
> filecrush program https://github.com/edwardcapriolo/filecrush/
> This created 5 large files out of the 10,000 small files (still totally
> 2gb compressed), but this job won’t progress past 0% map.
>
> From: Ana Gillan <an...@gmail.com>
> Date: Saturday, 2 August 2014 16:36
> To: <us...@hadoop.apache.org>
> Subject: Re:
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)
>
> For my own user? It is as follows:
>
> core file size          (blocks, -c) 0
>
> data seg size           (kbytes, -d) unlimited
>
> scheduling priority             (-e) 0
>
> file size               (blocks, -f) unlimited
>
> pending signals                 (-i) 483941
>
> max locked memory       (kbytes, -l) 64
>
> max memory size         (kbytes, -m) unlimited
>
> open files                      (-n) 1024
>
> pipe size            (512 bytes, -p) 8
>
> POSIX message queues     (bytes, -q) 819200
>
> real-time priority              (-r) 0
>
> stack size              (kbytes, -s) 8192
>
> cpu time               (seconds, -t) unlimited
>
> max user processes              (-u) 800
>
> virtual memory          (kbytes, -v) unlimited
>
> file locks                      (-x) unlimited
>
> From: hadoop hive <ha...@gmail.com>
> Reply-To: <us...@hadoop.apache.org>
> Date: Saturday, 2 August 2014 16:34
> To: <us...@hadoop.apache.org>
> Subject: Re:
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)
>
> Can you check the ulimit for tour user. Which might be causing this.
> On Aug 2, 2014 8:54 PM, "Ana Gillan" <an...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> I am having an issue with MapReduce jobs running through Hive being
>> killed after 600s timeouts and with very simple jobs taking over 3 hours
>> (or just failing) for a set of files with a compressed size of only 1-2gb.
>> I will try and provide as much information as I can here, so if someone can
>> help, that would be really great.
>>
>> I have a cluster of 7 nodes (1 master, 6 slaves) with the following
>> config:
>>
>> • Master node:
>>
>> – 2 x Intel Xeon 6-core E5-2620v2 @ 2.1GHz
>>
>> – 64GB DDR3 SDRAM
>>
>> – 8 x 2TB SAS 600 hard drive (arranged as RAID 1 and RAID 5)
>>
>> • Slave nodes (each):
>>
>> – Intel Xeon 4-core E3-1220v3 @ 3.1GHz
>>
>> – 32GB DDR3 SDRAM
>>
>> – 4 x 2TB SATA-3 hard drive
>>
>> • Operating system on all nodes: openSUSE Linux 13.1
>>
>>  We have the Apache BigTop package version 0.7, with Hadoop version
>> 2.0.6-alpha and Hive version 0.11.
>> YARN has been configured as per these recommendations:
>> http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/
>>
>> I also set the following additional settings before running jobs:
>> set yarn.nodemanager.resource.cpu-vcores=4;
>> set mapred.tasktracker.map.tasks.maximum=4;
>> set hive.hadoop.supports.splittable.combineinputformat=true;
>> set hive.merge.mapredfiles=true;
>>
>> No one else uses this cluster while I am working.
>>
>> What I’m trying to do:
>> I have a bunch of XML files on HDFS, which I am reading into Hive using
>> this SerDe https://github.com/dvasilen/Hive-XML-SerDe. I then want to
>> create a series of tables from these files and finally run a Python script
>> on one of them to perform some scientific calculations. The files are
>> .xml.gz format and (uncompressed) are only about 4mb in size each. hive.input.format
>> is set to org.apache.hadoop.hive.ql.io.CombineHiveInputFormat so as to
>> avoid the “small files problem.”
>>
>> Problems:
>> My HQL statements work perfectly for up to 1000 of these files. Even for
>> much larger numbers, doing select * works fine, which means the files are
>> being read properly, but if I do something as simple as selecting just one
>> column from the whole table for a larger number of files, containers start
>> being killed and jobs fail with this error in the container logs:
>>
>> 2014-08-02 14:51:45,137 ERROR [Thread-3]
>> org.apache.hadoop.hdfs.DFSClient: Failed to close file
>> /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-ext-10001/_tmp.000000_0
>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>> No lease on
>> /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-ext-10001/_tmp.000000_0:
>> File does not exist. Holder
>> DFSClient_attempt_1403771939632_0402_m_000000_0_-1627633686_1 does not have
>> any open files.
>> at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2398)
>>
>> Killed jobs show the above and also the following message:
>> AttemptID:attempt_1403771939632_0402_m_000000_0 Timed out after 600
>> secsContainer killed by the ApplicationMaster.
>>
>> Also, in the node logs, I get a lot of pings like this:
>> INFO [IPC Server handler 17 on 40961]
>> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
>> attempt_1403771939632_0362_m_000002_0
>>
>> For 5000 files (1gb compressed), the selection of a single column
>> finishes, but takes over 3 hours. For 10,000 files, the job hangs on about
>> 4% map and then errors out.
>>
>> While the jobs are running, I notice that the containers are not evenly
>> distributed across the cluster. Some nodes lie idle, while the application
>> master node runs 7 containers, maxing out the 28gb of RAM allocated
>> to Hadoop on each slave node.
>>
>> This is the output of netstat –i while the column selection is running:
>>
>> Kernel Interface table
>>
>> Iface   MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP
>> TX-OVR Flg
>>
>> eth0   1500   0 79515196      0 2265807     0 45694758      0      0
>>   0 BMRU
>>
>> eth1   1500   0 77410508      0      0      0 40815746      0      0
>> 0 BMRU
>>
>> lo    65536   0 16593808      0      0      0 16593808      0      0
>> 0 LRU
>>
>>
>>
>>
>> Are there some settings I am missing that mean the cluster isn’t
>> processing this data as efficiently as it can?
>>
>> I am very new to Hadoop and there are so many logs, etc, that
>> troubleshooting can be a bit overwhelming. Where else should I be looking
>> to try and diagnose what is wrong?
>>
>> Thanks in advance for any help you can give!
>>
>> Kind regards,
>> Ana
>>
>>

Re: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)

Posted by hadoop hive <ha...@gmail.com>.
32k seems fine for mapred user(hope you using this for fetching you data)
but if you have huge data on your system you can try 64k.

Did you try increasing you time from 600 sec to like 20 mins.

Can you also check on which stage its getting hanged or killed.

Thanks
 On Aug 2, 2014 9:38 PM, "Ana Gillan" <an...@gmail.com> wrote:

> Filemax across the cluster is set to over 6 million. I’ve checked the
> open file limits for the accounts used by the Hadoop daemons
>  and they have an open file limit of 32K. This is confirmed by the various
> .out files, e.g.
>
> /var/log/hadoop-hdfs/hadoop-hdfs-datanode-slave1.out
>
> Contains open files (-n) 32768. Is this too low? What is the recommended
> value for open files on all nodes? Also does my own user need to have the
> same value?
>
> I’ve also tried running the same column selection on files crushed by the
> filecrush program https://github.com/edwardcapriolo/filecrush/
> This created 5 large files out of the 10,000 small files (still totally
> 2gb compressed), but this job won’t progress past 0% map.
>
> From: Ana Gillan <an...@gmail.com>
> Date: Saturday, 2 August 2014 16:36
> To: <us...@hadoop.apache.org>
> Subject: Re:
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)
>
> For my own user? It is as follows:
>
> core file size          (blocks, -c) 0
>
> data seg size           (kbytes, -d) unlimited
>
> scheduling priority             (-e) 0
>
> file size               (blocks, -f) unlimited
>
> pending signals                 (-i) 483941
>
> max locked memory       (kbytes, -l) 64
>
> max memory size         (kbytes, -m) unlimited
>
> open files                      (-n) 1024
>
> pipe size            (512 bytes, -p) 8
>
> POSIX message queues     (bytes, -q) 819200
>
> real-time priority              (-r) 0
>
> stack size              (kbytes, -s) 8192
>
> cpu time               (seconds, -t) unlimited
>
> max user processes              (-u) 800
>
> virtual memory          (kbytes, -v) unlimited
>
> file locks                      (-x) unlimited
>
> From: hadoop hive <ha...@gmail.com>
> Reply-To: <us...@hadoop.apache.org>
> Date: Saturday, 2 August 2014 16:34
> To: <us...@hadoop.apache.org>
> Subject: Re:
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)
>
> Can you check the ulimit for your user? That might be causing this.
> On Aug 2, 2014 8:54 PM, "Ana Gillan" <an...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> I am having an issue with MapReduce jobs running through Hive being
>> killed after 600s timeouts and with very simple jobs taking over 3 hours
>> (or just failing) for a set of files with a compressed size of only 1-2gb.
>> I will try and provide as much information as I can here, so if someone can
>> help, that would be really great.
>>
>> I have a cluster of 7 nodes (1 master, 6 slaves) with the following
>> config:
>>
>> • Master node:
>>
>> – 2 x Intel Xeon 6-core E5-2620v2 @ 2.1GHz
>>
>> – 64GB DDR3 SDRAM
>>
>> – 8 x 2TB SAS 600 hard drive (arranged as RAID 1 and RAID 5)
>>
>> • Slave nodes (each):
>>
>> – Intel Xeon 4-core E3-1220v3 @ 3.1GHz
>>
>> – 32GB DDR3 SDRAM
>>
>> – 4 x 2TB SATA-3 hard drive
>>
>> • Operating system on all nodes: openSUSE Linux 13.1
>>
>>  We have the Apache BigTop package version 0.7, with Hadoop version
>> 2.0.6-alpha and Hive version 0.11.
>> YARN has been configured as per these recommendations:
>> http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/
>>
>> I also set the following additional settings before running jobs:
>> set yarn.nodemanager.resource.cpu-vcores=4;
>> set mapred.tasktracker.map.tasks.maximum=4;
>> set hive.hadoop.supports.splittable.combineinputformat=true;
>> set hive.merge.mapredfiles=true;
>>
>> No one else uses this cluster while I am working.
>>
>> What I’m trying to do:
>> I have a bunch of XML files on HDFS, which I am reading into Hive using
>> this SerDe https://github.com/dvasilen/Hive-XML-SerDe. I then want to
>> create a series of tables from these files and finally run a Python script
>> on one of them to perform some scientific calculations. The files are
>> .xml.gz format and (uncompressed) are only about 4mb in size each. hive.input.format
>> is set to org.apache.hadoop.hive.ql.io.CombineHiveInputFormat so as to
>> avoid the “small files problem.”
>>
>> Problems:
>> My HQL statements work perfectly for up to 1000 of these files. Even for
>> much larger numbers, doing select * works fine, which means the files are
>> being read properly, but if I do something as simple as selecting just one
>> column from the whole table for a larger number of files, containers start
>> being killed and jobs fail with this error in the container logs:
>>
>> 2014-08-02 14:51:45,137 ERROR [Thread-3]
>> org.apache.hadoop.hdfs.DFSClient: Failed to close file
>> /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-ext-10001/_tmp.000000_0
>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>> No lease on
>> /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-ext-10001/_tmp.000000_0:
>> File does not exist. Holder
>> DFSClient_attempt_1403771939632_0402_m_000000_0_-1627633686_1 does not have
>> any open files.
>> at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2398)
>>
>> Killed jobs show the above and also the following message:
>> AttemptID:attempt_1403771939632_0402_m_000000_0 Timed out after 600
>> secsContainer killed by the ApplicationMaster.
>>
>> Also, in the node logs, I get a lot of pings like this:
>> INFO [IPC Server handler 17 on 40961]
>> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
>> attempt_1403771939632_0362_m_000002_0
>>
>> For 5000 files (1gb compressed), the selection of a single column
>> finishes, but takes over 3 hours. For 10,000 files, the job hangs on about
>> 4% map and then errors out.
>>
>> While the jobs are running, I notice that the containers are not evenly
>> distributed across the cluster. Some nodes lie idle, while the application
>> master node runs 7 containers, maxing out the 28gb of RAM allocated
>> to Hadoop on each slave node.
>>
>> This is the output of netstat -i while the column selection is running:
>>
>> Kernel Interface table
>>
>> Iface   MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP
>> TX-OVR Flg
>>
>> eth0   1500   0 79515196      0 2265807     0 45694758      0      0
>>   0 BMRU
>>
>> eth1   1500   0 77410508      0      0      0 40815746      0      0
>> 0 BMRU
>>
>> lo    65536   0 16593808      0      0      0 16593808      0      0
>> 0 LRU
>>
>>
>>
>>
>> Are there some settings I am missing that mean the cluster isn’t
>> processing this data as efficiently as it can?
>>
>> I am very new to Hadoop and there are so many logs, etc, that
>> troubleshooting can be a bit overwhelming. Where else should I be looking
>> to try and diagnose what is wrong?
>>
>> Thanks in advance for any help you can give!
>>
>> Kind regards,
>> Ana
>>
>>

Re: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)

Posted by Ana Gillan <an...@gmail.com>.
Filemax across the cluster is set to over 6 million. I've checked the open
file limits for the accounts used by the Hadoop daemons  and they have an
open file limit of 32K. This is confirmed by the various .out files, e.g.

/var/log/hadoop-hdfs/hadoop-hdfs-datanode-slave1.out

Contains open files (-n) 32768. Is this too low? What is the recommended
value for open files on all nodes? Also does my own user need to have the
same value?
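
In case it helps, this is roughly how I checked (plain shell on one of the slave
nodes; the hdfs account name is just what our packages create, so adjust as needed):

# system-wide limit
sysctl fs.file-max

# per-user limits: daemon account vs. my own account
su - hdfs -c 'ulimit -n'
ulimit -n

# value the datanode recorded at startup
grep 'open files' /var/log/hadoop-hdfs/hadoop-hdfs-datanode-slave1.out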

I've also tried running the same column selection on files crushed by the
filecrush program https://github.com/edwardcapriolo/filecrush/
This created 5 large files out of the 10,000 small files (still totally 2gb
compressed), but this job won't progress past 0% map.

From:  Ana Gillan <an...@gmail.com>
Date:  Saturday, 2 August 2014 16:36
To:  <us...@hadoop.apache.org>
Subject:  Re: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode
.LeaseExpiredException)

For my own user? It is as follows:

core file size          (blocks, -c) 0

data seg size           (kbytes, -d) unlimited

scheduling priority             (-e) 0

file size               (blocks, -f) unlimited

pending signals                 (-i) 483941

max locked memory       (kbytes, -l) 64

max memory size         (kbytes, -m) unlimited

open files                      (-n) 1024

pipe size            (512 bytes, -p) 8

POSIX message queues     (bytes, -q) 819200

real-time priority              (-r) 0

stack size              (kbytes, -s) 8192

cpu time               (seconds, -t) unlimited

max user processes              (-u) 800

virtual memory          (kbytes, -v) unlimited

file locks                      (-x) unlimited


From:  hadoop hive <ha...@gmail.com>
Reply-To:  <us...@hadoop.apache.org>
Date:  Saturday, 2 August 2014 16:34
To:  <us...@hadoop.apache.org>
Subject:  Re: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode
.LeaseExpiredException)


Can you check the ulimit for your user? That might be causing this.

On Aug 2, 2014 8:54 PM, "Ana Gillan" <an...@gmail.com> wrote:
> Hi everyone,
> 
> I am having an issue with MapReduce jobs running through Hive being killed
> after 600s timeouts and with very simple jobs taking over 3 hours (or just
> failing) for a set of files with a compressed size of only 1-2gb. I will try
> and provide as much information as I can here, so if someone can help, that
> would be really great.
> 
> I have a cluster of 7 nodes (1 master, 6 slaves) with the following config:
>> • Master node:
>> 
>> – 2 x Intel Xeon 6-core E5-2620v2 @ 2.1GHz
>> 
>> – 64GB DDR3 SDRAM
>> 
>> – 8 x 2TB SAS 600 hard drive (arranged as RAID 1 and RAID 5)
>> 
>> • Slave nodes (each):
>> 
>> – Intel Xeon 4-core E3-1220v3 @ 3.1GHz
>> 
>> – 32GB DDR3 SDRAM
>> 
>> – 4 x 2TB SATA-3 hard drive
>> 
>> • Operating system on all nodes: openSUSE Linux 13.1
> 
> We have the Apache BigTop package version 0.7, with Hadoop version 2.0.6-alpha
> and Hive version 0.11.
> YARN has been configured as per these recommendations:
> http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/
> 
> I also set the following additional settings before running jobs:
> set yarn.nodemanager.resource.cpu-vcores=4;
> set mapred.tasktracker.map.tasks.maximum=4;
> set hive.hadoop.supports.splittable.combineinputformat=true;
> set hive.merge.mapredfiles=true;
> 
> No one else uses this cluster while I am working.
> 
> What I'm trying to do:
> I have a bunch of XML files on HDFS, which I am reading into Hive using this
> SerDe https://github.com/dvasilen/Hive-XML-SerDe. I then want to create a
> series of tables from these files and finally run a Python script on one of
> them to perform some scientific calculations. The files are .xml.gz format and
> (uncompressed) are only about 4mb in size each. hive.input.format is set to
> org.apache.hadoop.hive.ql.io.CombineHiveInputFormat so as to avoid the “small
> files problem.”
> 
> Problems:
> My HQL statements work perfectly for up to 1000 of these files. Even for much
> larger numbers, doing select * works fine, which means the files are being
> read properly, but if I do something as simple as selecting just one column
> from the whole table for a larger number of files, containers start being
> killed and jobs fail with this error in the container logs:
> 
> 2014-08-02 14:51:45,137 ERROR [Thread-3] org.apache.hadoop.hdfs.DFSClient:
> Failed to close file
> /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-
> ext-10001/_tmp.000000_0
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.L
> easeExpiredException): No lease on
> /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-
> ext-10001/_tmp.000000_0: File does not exist. Holder
> DFSClient_attempt_1403771939632_0402_m_000000_0_-1627633686_1 does not have
> any open files.
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.ja
> va:2398)
> 
> Killed jobs show the above and also the following message:
> AttemptID:attempt_1403771939632_0402_m_000000_0 Timed out after 600
> secsContainer killed by the ApplicationMaster.
> 
> Also, in the node logs, I get a lot of pings like this:
> INFO [IPC Server handler 17 on 40961]
> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
> attempt_1403771939632_0362_m_000002_0
> 
> For 5000 files (1gb compressed), the selection of a single column finishes,
> but takes over 3 hours. For 10,000 files, the job hangs on about 4% map and
> then errors out.
> 
> While the jobs are running, I notice that the containers are not evenly
> distributed across the cluster. Some nodes lie idle, while the application
> master node runs 7 containers, maxing out the 28gb of RAM allocated to Hadoop
> on each slave node.
> 
> This is the output of netstat -i while the column selection is running:
> Kernel Interface table
> 
> Iface   MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR
> Flg
> 
> eth0   1500   0 79515196      0 2265807     0 45694758      0      0      0
> BMRU
> 
> eth1   1500   0 77410508      0      0      0 40815746      0      0      0
> BMRU
> 
> lo    65536   0 16593808      0      0      0 16593808      0      0      0
> LRU
> 
> 
> 
> 
> 
> Are there some settings I am missing that mean the cluster isn't processing
> this data as efficiently as it can?
> 
> I am very new to Hadoop and there are so many logs, etc, that troubleshooting
> can be a bit overwhelming. Where else should I be looking to try and diagnose
> what is wrong?
> 
> Thanks in advance for any help you can give!
> 
> Kind regards,
> Ana 
> 



Re: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)

Posted by Ana Gillan <an...@gmail.com>.
For my own user? It is as follows:

core file size          (blocks, -c) 0

data seg size           (kbytes, -d) unlimited

scheduling priority             (-e) 0

file size               (blocks, -f) unlimited

pending signals                 (-i) 483941

max locked memory       (kbytes, -l) 64

max memory size         (kbytes, -m) unlimited

open files                      (-n) 1024

pipe size            (512 bytes, -p) 8

POSIX message queues     (bytes, -q) 819200

real-time priority              (-r) 0

stack size              (kbytes, -s) 8192

cpu time               (seconds, -t) unlimited

max user processes              (-u) 800

virtual memory          (kbytes, -v) unlimited

file locks                      (-x) unlimited


From:  hadoop hive <ha...@gmail.com>
Reply-To:  <us...@hadoop.apache.org>
Date:  Saturday, 2 August 2014 16:34
To:  <us...@hadoop.apache.org>
Subject:  Re: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode
.LeaseExpiredException)


Can you check the ulimit for your user? That might be causing this.

On Aug 2, 2014 8:54 PM, "Ana Gillan" <an...@gmail.com> wrote:
> Hi everyone,
> 
> I am having an issue with MapReduce jobs running through Hive being killed
> after 600s timeouts and with very simple jobs taking over 3 hours (or just
> failing) for a set of files with a compressed size of only 1-2gb. I will try
> and provide as much information as I can here, so if someone can help, that
> would be really great.
> 
> I have a cluster of 7 nodes (1 master, 6 slaves) with the following config:
>> • Master node:
>> 
>> – 2 x Intel Xeon 6-core E5-2620v2 @ 2.1GHz
>> 
>> – 64GB DDR3 SDRAM
>> 
>> – 8 x 2TB SAS 600 hard drive (arranged as RAID 1 and RAID 5)
>> 
>> • Slave nodes (each):
>> 
>> – Intel Xeon 4-core E3-1220v3 @ 3.1GHz
>> 
>> – 32GB DDR3 SDRAM
>> 
>> – 4 x 2TB SATA-3 hard drive
>> 
>> • Operating system on all nodes: openSUSE Linux 13.1
> 
> We have the Apache BigTop package version 0.7, with Hadoop version 2.0.6-alpha
> and Hive version 0.11.
> YARN has been configured as per these recommendations:
> http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/
> 
> I also set the following additional settings before running jobs:
> set yarn.nodemanager.resource.cpu-vcores=4;
> set mapred.tasktracker.map.tasks.maximum=4;
> set hive.hadoop.supports.splittable.combineinputformat=true;
> set hive.merge.mapredfiles=true;
> 
> No one else uses this cluster while I am working.
> 
> What I'm trying to do:
> I have a bunch of XML files on HDFS, which I am reading into Hive using this
> SerDe https://github.com/dvasilen/Hive-XML-SerDe. I then want to create a
> series of tables from these files and finally run a Python script on one of
> them to perform some scientific calculations. The files are .xml.gz format and
> (uncompressed) are only about 4mb in size each. hive.input.format is set to
> org.apache.hadoop.hive.ql.io.CombineHiveInputFormat so as to avoid the “small
> files problem.”
> 
> Problems:
> My HQL statements work perfectly for up to 1000 of these files. Even for much
> larger numbers, doing select * works fine, which means the files are being
> read properly, but if I do something as simple as selecting just one column
> from the whole table for a larger number of files, containers start being
> killed and jobs fail with this error in the container logs:
> 
> 2014-08-02 14:51:45,137 ERROR [Thread-3] org.apache.hadoop.hdfs.DFSClient:
> Failed to close file
> /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-
> ext-10001/_tmp.000000_0
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.L
> easeExpiredException): No lease on
> /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-
> ext-10001/_tmp.000000_0: File does not exist. Holder
> DFSClient_attempt_1403771939632_0402_m_000000_0_-1627633686_1 does not have
> any open files.
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.ja
> va:2398)
> 
> Killed jobs show the above and also the following message:
> AttemptID:attempt_1403771939632_0402_m_000000_0 Timed out after 600
> secsContainer killed by the ApplicationMaster.
> 
> Also, in the node logs, I get a lot of pings like this:
> INFO [IPC Server handler 17 on 40961]
> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
> attempt_1403771939632_0362_m_000002_0
> 
> For 5000 files (1gb compressed), the selection of a single column finishes,
> but takes over 3 hours. For 10,000 files, the job hangs on about 4% map and
> then errors out.
> 
> While the jobs are running, I notice that the containers are not evenly
> distributed across the cluster. Some nodes lie idle, while the application
> master node runs 7 containers, maxing out the 28gb of RAM allocated to Hadoop
> on each slave node.
> 
> This is the output of netstat -i while the column selection is running:
> Kernel Interface table
> 
> Iface   MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR
> Flg
> 
> eth0   1500   0 79515196      0 2265807     0 45694758      0      0      0
> BMRU
> 
> eth1   1500   0 77410508      0      0      0 40815746      0      0      0
> BMRU
> 
> lo    65536   0 16593808      0      0      0 16593808      0      0      0
> LRU
> 
> 
> 
> 
> 
> Are there some settings I am missing that mean the cluster isn't processing
> this data as efficiently as it can?
> 
> I am very new to Hadoop and there are so many logs, etc, that troubleshooting
> can be a bit overwhelming. Where else should I be looking to try and diagnose
> what is wrong?
> 
> Thanks in advance for any help you can give!
> 
> Kind regards,
> Ana 
> 



Re: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)

Posted by Ana Gillan <an...@gmail.com>.
For my own user? It is as follows:

core file size          (blocks, -c) 0

data seg size           (kbytes, -d) unlimited

scheduling priority             (-e) 0

file size               (blocks, -f) unlimited

pending signals                 (-i) 483941

max locked memory       (kbytes, -l) 64

max memory size         (kbytes, -m) unlimited

open files                      (-n) 1024

pipe size            (512 bytes, -p) 8

POSIX message queues     (bytes, -q) 819200

real-time priority              (-r) 0

stack size              (kbytes, -s) 8192

cpu time               (seconds, -t) unlimited

max user processes              (-u) 800

virtual memory          (kbytes, -v) unlimited

file locks                      (-x) unlimited


From:  hadoop hive <ha...@gmail.com>
Reply-To:  <us...@hadoop.apache.org>
Date:  Saturday, 2 August 2014 16:34
To:  <us...@hadoop.apache.org>
Subject:  Re: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode
.LeaseExpiredException)


Can you check the ulimit for tour user. Which might be causing this.

On Aug 2, 2014 8:54 PM, "Ana Gillan" <an...@gmail.com> wrote:
> Hi everyone,
> 
> I am having an issue with MapReduce jobs running through Hive being killed
> after 600s timeouts and with very simple jobs taking over 3 hours (or just
> failing) for a set of files with a compressed size of only 1-2gb. I will try
> and provide as much information as I can here, so if someone can help, that
> would be really great.
> 
> I have a cluster of 7 nodes (1 master, 6 slaves) with the following config:
>> € Master node:
>> 
>> ­ 2 x Intel Xeon 6-core E5-2620v2 @ 2.1GHz
>> 
>> ­ 64GB DDR3 SDRAM
>> 
>> ­ 8 x 2TB SAS 600 hard drive (arranged as RAID 1 and RAID 5)
>> 
>> € Slave nodes (each):
>> 
>> ­ Intel Xeon 4-core E3-1220v3 @ 3.1GHz
>> 
>> ­ 32GB DDR3 SDRAM
>> 
>> ­ 4 x 2TB SATA-3 hard drive
>> 
>> € Operating system on all nodes: openSUSE Linux 13.1
> 
> We have the Apache BigTop package version 0.7, with Hadoop version 2.0.6-alpha
> and Hive version 0.11.
> YARN has been configured as per these recommendations:
> http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/
> 
> I also set the following additional settings before running jobs:
> set yarn.nodemanager.resource.cpu-vcores=4;
> set mapred.tasktracker.map.tasks.maximum=4;
> set hive.hadoop.supports.splittable.combineinputformat=true;
> set hive.merge.mapredfiles=true;
> 
> No one else uses this cluster while I am working.
> 
> What I¹m trying to do:
> I have a bunch of XML files on HDFS, which I am reading into Hive using this
> SerDe https://github.com/dvasilen/Hive-XML-SerDe. I then want to create a
> series of tables from these files and finally run a Python script on one of
> them to perform some scientific calculations. The files are .xml.gz format and
> (uncompressed) are only about 4mb in size each. hive.input.format is set to
> org.apache.hadoop.hive.ql.io.CombineHiveInputFormat so as to avoid the ³small
> files problem.² 
> 
> Problems:
> My HQL statements work perfectly for up to 1000 of these files. Even for much
> larger numbers, doing select * works fine, which means the files are being
> read properly, but if I do something as simple as selecting just one column
> from the whole table for a larger number of files, containers start being
> killed and jobs fail with this error in the container logs:
> 
> 2014-08-02 14:51:45,137 ERROR [Thread-3] org.apache.hadoop.hdfs.DFSClient:
> Failed to close file
> /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-
> ext-10001/_tmp.000000_0
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.L
> easeExpiredException): No lease on
> /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-
> ext-10001/_tmp.000000_0: File does not exist. Holder
> DFSClient_attempt_1403771939632_0402_m_000000_0_-1627633686_1 does not have
> any open files.
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.ja
> va:2398)
> 
> Killed jobs show the above and also the following message:
> AttemptID:attempt_1403771939632_0402_m_000000_0 Timed out after 600
> secsContainer killed by the ApplicationMaster.
> 
> Also, in the node logs, I get a lot of pings like this:
> INFO [IPC Server handler 17 on 40961]
> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
> attempt_1403771939632_0362_m_000002_0
> 
> For 5000 files (1gb compressed), the selection of a single column finishes,
> but takes over 3 hours. For 10,000 files, the job hangs on about 4% map and
> then errors out.
> 
> While the jobs are running, I notice that the containers are not evenly
> distributed across the cluster. Some nodes lie idle, while the application
> master node runs 7 containers, maxing out the 28gb of RAM allocated to Hadoop
> on each slave node.
> 
> This is the output of netstat ­i while the column selection is running:
> Kernel Interface table
> 
> Iface   MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR
> Flg
> 
> eth0   1500   0 79515196      0 2265807     0 45694758      0      0      0
> BMRU
> 
> eth1   1500   0 77410508      0      0      0 40815746      0      0      0
> BMRU
> 
> lo    65536   0 16593808      0      0      0 16593808      0      0      0
> LRU
> 
> 
> 
> 
> 
> Are there some settings I am missing that mean the cluster isn¹t processing
> this data as efficiently as it can?
> 
> I am very new to Hadoop and there are so many logs, etc, that troubleshooting
> can be a bit overwhelming. Where else should I be looking to try and diagnose
> what is wrong?
> 
> Thanks in advance for any help you can give!
> 
> Kind regards,
> Ana 
> 



Re: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)

Posted by Ana Gillan <an...@gmail.com>.
For my own user? It is as follows:

core file size          (blocks, -c) 0

data seg size           (kbytes, -d) unlimited

scheduling priority             (-e) 0

file size               (blocks, -f) unlimited

pending signals                 (-i) 483941

max locked memory       (kbytes, -l) 64

max memory size         (kbytes, -m) unlimited

open files                      (-n) 1024

pipe size            (512 bytes, -p) 8

POSIX message queues     (bytes, -q) 819200

real-time priority              (-r) 0

stack size              (kbytes, -s) 8192

cpu time               (seconds, -t) unlimited

max user processes              (-u) 800

virtual memory          (kbytes, -v) unlimited

file locks                      (-x) unlimited


From:  hadoop hive <ha...@gmail.com>
Reply-To:  <us...@hadoop.apache.org>
Date:  Saturday, 2 August 2014 16:34
To:  <us...@hadoop.apache.org>
Subject:  Re: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode
.LeaseExpiredException)


Can you check the ulimit for tour user. Which might be causing this.

On Aug 2, 2014 8:54 PM, "Ana Gillan" <an...@gmail.com> wrote:
> Hi everyone,
> 
> I am having an issue with MapReduce jobs running through Hive being killed
> after 600s timeouts and with very simple jobs taking over 3 hours (or just
> failing) for a set of files with a compressed size of only 1-2gb. I will try
> and provide as much information as I can here, so if someone can help, that
> would be really great.
> 
> I have a cluster of 7 nodes (1 master, 6 slaves) with the following config:
>> € Master node:
>> 
>> ­ 2 x Intel Xeon 6-core E5-2620v2 @ 2.1GHz
>> 
>> ­ 64GB DDR3 SDRAM
>> 
>> ­ 8 x 2TB SAS 600 hard drive (arranged as RAID 1 and RAID 5)
>> 
>> € Slave nodes (each):
>> 
>> ­ Intel Xeon 4-core E3-1220v3 @ 3.1GHz
>> 
>> ­ 32GB DDR3 SDRAM
>> 
>> ­ 4 x 2TB SATA-3 hard drive
>> 
>> € Operating system on all nodes: openSUSE Linux 13.1
> 
> We have the Apache BigTop package version 0.7, with Hadoop version 2.0.6-alpha
> and Hive version 0.11.
> YARN has been configured as per these recommendations:
> http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/
> 
> I also set the following additional settings before running jobs:
> set yarn.nodemanager.resource.cpu-vcores=4;
> set mapred.tasktracker.map.tasks.maximum=4;
> set hive.hadoop.supports.splittable.combineinputformat=true;
> set hive.merge.mapredfiles=true;
> 
> No one else uses this cluster while I am working.
> 
> What I¹m trying to do:
> I have a bunch of XML files on HDFS, which I am reading into Hive using this
> SerDe https://github.com/dvasilen/Hive-XML-SerDe. I then want to create a
> series of tables from these files and finally run a Python script on one of
> them to perform some scientific calculations. The files are .xml.gz format and
> (uncompressed) are only about 4mb in size each. hive.input.format is set to
> org.apache.hadoop.hive.ql.io.CombineHiveInputFormat so as to avoid the ³small
> files problem.² 
> 
> Problems:
> My HQL statements work perfectly for up to 1000 of these files. Even for much
> larger numbers, doing select * works fine, which means the files are being
> read properly, but if I do something as simple as selecting just one column
> from the whole table for a larger number of files, containers start being
> killed and jobs fail with this error in the container logs:
> 
> 2014-08-02 14:51:45,137 ERROR [Thread-3] org.apache.hadoop.hdfs.DFSClient:
> Failed to close file
> /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-
> ext-10001/_tmp.000000_0
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.L
> easeExpiredException): No lease on
> /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-
> ext-10001/_tmp.000000_0: File does not exist. Holder
> DFSClient_attempt_1403771939632_0402_m_000000_0_-1627633686_1 does not have
> any open files.
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.ja
> va:2398)
> 
> Killed jobs show the above and also the following message:
> AttemptID:attempt_1403771939632_0402_m_000000_0 Timed out after 600
> secsContainer killed by the ApplicationMaster.
> 
> Also, in the node logs, I get a lot of pings like this:
> INFO [IPC Server handler 17 on 40961]
> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
> attempt_1403771939632_0362_m_000002_0
> 
> For 5000 files (1gb compressed), the selection of a single column finishes,
> but takes over 3 hours. For 10,000 files, the job hangs on about 4% map and
> then errors out.
> 
> While the jobs are running, I notice that the containers are not evenly
> distributed across the cluster. Some nodes lie idle, while the application
> master node runs 7 containers, maxing out the 28gb of RAM allocated to Hadoop
> on each slave node.
> 
> This is the output of netstat -i while the column selection is running:
> Kernel Interface table
> 
> Iface   MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR
> Flg
> 
> eth0   1500   0 79515196      0 2265807     0 45694758      0      0      0
> BMRU
> 
> eth1   1500   0 77410508      0      0      0 40815746      0      0      0
> BMRU
> 
> lo    65536   0 16593808      0      0      0 16593808      0      0      0
> LRU
> 
> 
> 
> 
> 
> Are there some settings I am missing that mean the cluster isn’t processing
> this data as efficiently as it can?
> 
> I am very new to Hadoop and there are so many logs, etc, that troubleshooting
> can be a bit overwhelming. Where else should I be looking to try and diagnose
> what is wrong?
> 
> Thanks in advance for any help you can give!
> 
> Kind regards,
> Ana 
> 
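
On the closing question of where else to look: the application id is the
prefix of the attempt ids quoted above (here application_1403771939632_0402),
and, assuming yarn.log-aggregation-enable is turned on, the aggregated
container logs for a failed job can be pulled in one place:

# collects stdout/stderr/syslog of every container of the job into one file
yarn logs -applicationId application_1403771939632_0402 > app_0402.log

# the ResourceManager web UI (port 8088 by default) and the NodeManager UIs
# (port 8042) also show which nodes the containers were placed on, which is
# useful for the uneven-distribution issue described above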


