Posted to user@crunch.apache.org by Everett Anderson <ev...@nuna.com> on 2015/08/14 23:10:28 UTC

LeaseExpiredExceptions and temp side effect files

Hi,

I recently started trying to run our Crunch pipeline on more data and have
been trying out different AWS instance types in anticipation of our storage
and compute needs.

I was using EMR 3.8 (so Hadoop 2.4.0) with Crunch 0.12 (patched with the
CRUNCH-553 <https://issues.apache.org/jira/browse/CRUNCH-553> fix).

Our pipeline finishes fine in these cluster configurations:

   - 50 c3.4xlarge Core, 0 Task
   - 10 c3.8xlarge Core, 0 Task
   - 25 c3.8xlarge Core, 0 Task

However, it always fails on the same data when using 10 cc2.8xlarge Core
instances.

The biggest obvious hardware difference is that the cc2.8xlarges use hard
disks instead of SSDs.

While it's a little hard to track down the exact originating failure, I
think it's from errors like:

2015-08-13 21:34:38,379 ERROR [IPC Server handler 24 on 45711]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task:
attempt_1439499407003_0028_r_000153_1 - exited :
org.apache.crunch.CrunchRuntimeException:
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
No lease on
/tmp/crunch-970849245/p662/output/_temporary/1/_temporary/attempt_1439499407003_out7_0028_r_000153_1/out7-r-00153:
File does not exist. Holder
DFSClient_attempt_1439499407003_0028_r_000153_1_609888542_1 does not have
any open files.

Those paths look like these side effect files
<https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapred/FileOutputFormat.html#getWorkOutputPath(org.apache.hadoop.mapred.JobConf)>
.
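
For reference, a side effect file in that sense is one a task writes under its
per-attempt work output path, which the output committer promotes to the final
output directory only if the attempt commits. A minimal illustration of the
pattern (not code from our pipeline; the file name is made up), using the
mapred API from that link:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

public class SideEffectFileExample {
  public static Path sideEffectFile(JobConf conf, String name) throws IOException {
    // Resolves to <output>/_temporary/<attempt-id>/ while the attempt runs; if
    // that attempt directory disappears early, writers see errors like the
    // "No lease ... File does not exist" one above.
    Path workDir = FileOutputFormat.getWorkOutputPath(conf);
    return new Path(workDir, name);
  }
}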

Would Crunch have generated applications that depend on side effect paths
as input across MapReduce applications, with something in HDFS cleaning up
those paths, unaware of the higher level dependencies? AWS configures
Hadoop differently for each instance type, and might have more aggressive
cleanup settings on HDs, though this is a very uninformed hypothesis.

A sample full log is attached.

Thanks for any guidance!

- Everett

-- 
*DISCLAIMER:* The contents of this email, including any attachments, may 
contain information that is confidential, proprietary in nature, protected 
health information (PHI), or otherwise protected by law from disclosure, 
and is solely for the use of the intended recipient(s). If you are not the 
intended recipient, you are hereby notified that any use, disclosure or 
copying of this email, including any attachments, is unauthorized and 
strictly prohibited. If you have received this email in error, please 
notify the sender of this email. Please delete this and all copies of this 
email from your system. Any opinions either expressed or implied in this 
email and all attachments, are those of its author only, and do not 
necessarily reflect those of Nuna Health, Inc.

Re: LeaseExpiredExceptions and temp side effect files

Posted by Josh Wills <jw...@cloudera.com>.
Yeah, that makes sense to me-- not totally trivial to do, but it should be
possible.
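
Roughly: the planner would have to track, for each temp output it creates,
which jobs still need to read it, and delete the path once that set drains
rather than waiting for Pipeline.done(). A purely hypothetical sketch of that
bookkeeping (none of these names are existing Crunch code):

import java.io.IOException;
import java.util.Map;
import java.util.Set;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper, not existing Crunch API: pendingReadersByTempPath maps
// each temporary output path to the jobs that still have to read it.
public class EagerTempCleanup {
  public static void deleteUnreferenced(FileSystem fs,
      Map<Path, Set<String>> pendingReadersByTempPath) throws IOException {
    for (Map.Entry<Path, Set<String>> entry : pendingReadersByTempPath.entrySet()) {
      if (entry.getValue().isEmpty()) {
        // Recursively delete e.g. /tmp/crunch-XXXXXXX/pNNN once nothing reads it.
        fs.delete(entry.getKey(), true);
      }
    }
  }
}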

J

On Tue, Sep 29, 2015 at 4:42 PM, Everett Anderson <ev...@nuna.com> wrote:

> Hey,
>
> We have some leads. Increasing the datanode memory seems to help the
> immediate issue.
>
> However, we need a solution to our buildup of temporary outputs. We're
> exploring segmenting our pipeline with run()/cleanup() calls.
>
> I'm curious, though --
>
> Do you think it'd be possible for us to make a Crunch modification to
> optionally actively cleanup temporary outputs? It seems like the planner
> would know what those are.
>
> A temporary output would be any PCollection that isn't referenced
> outside of Crunch (or perhaps ones that aren't explicitly marked as cached).
>
>
> On Thu, Sep 24, 2015 at 5:46 PM, Josh Wills <jw...@cloudera.com> wrote:
>
>> Hrm. If you never call Pipeline.done, you should never cleanup the
>> temporary files for the job...
>>
>> On Thu, Sep 24, 2015 at 5:44 PM, Everett Anderson <ev...@nuna.com>
>> wrote:
>>
>>> While we tried to take comfort in the fact that we'd only seen this on
>>> HD-based cc2.8xlarges, I'm afraid we're now seeing it when processing
>>> larger amounts of data on SSD-based c3.4x8larges.
>>>
>>> My two hypotheses are
>>>
>>> 1) Somehow these temp files are getting cleaned up before they're
>>> accessed for the last time. Perhaps either something in HDFS or Hadoop
>>> cleans up these temp directories, or perhaps there's a bug in Crunch's
>>> planner.
>>>
>>> 2) HDFS has chosen 3 machines to replicate data to, but it is performing
>>> a very lopsided replication. While the cluster overall looks like it has
>>> HDFS capacity, perhaps a small subset of the machines is actually at
>>> capacity. Things seem to fail in obscure ways when running out of disk.
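>>>
>>> (A quick way to check the second hypothesis is per-datanode usage -- e.g.
>>> hdfs dfsadmin -report, or a small client like the sketch below, which only
>>> uses public HDFS client APIs; it isn't part of our pipeline.)
>>>
>>> import java.io.IOException;
>>> import org.apache.hadoop.conf.Configuration;
>>> import org.apache.hadoop.fs.FileSystem;
>>> import org.apache.hadoop.hdfs.DistributedFileSystem;
>>> import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
>>>
>>> public class DatanodeUsageCheck {
>>>   public static void main(String[] args) throws IOException {
>>>     FileSystem fs = FileSystem.get(new Configuration());
>>>     if (fs instanceof DistributedFileSystem) {
>>>       // Print DFS usage per datanode to spot lopsided replication.
>>>       for (DatanodeInfo dn : ((DistributedFileSystem) fs).getDataNodeStats()) {
>>>         long pct = dn.getCapacity() == 0 ? 0 : 100L * dn.getDfsUsed() / dn.getCapacity();
>>>         System.out.println(dn.getHostName() + ": " + pct + "% used, "
>>>             + dn.getRemaining() + " bytes remaining");
>>>       }
>>>     }
>>>   }
>>> }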
>>>
>>>
>>> 2015-09-24 23:28:58,850 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.crunch.CrunchRuntimeException: Could not read runtime node information
>>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:48)
>>> 	at org.apache.crunch.impl.mr.run.CrunchReducer.setup(CrunchReducer.java:40)
>>> 	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:172)
>>> 	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>> 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>> 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>>> Caused by: java.io.FileNotFoundException: File does not exist: /tmp/crunch-2031291770/p567/REDUCE
>>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
>>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
>>> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
>>> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
>>> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>>
>>> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>> 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>> 	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>> 	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>>> 	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>>> 	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1147)
>>> 	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1135)
>>> 	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1125)
>>> 	at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:273)
>>> 	at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:240)
>>> 	at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:233)
>>> 	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1298)
>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:300)
>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:296)
>>> 	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:296)
>>> 	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
>>> 	at org.apache.crunch.util.DistCache.read(DistCache.java:72)
>>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:46)
>>> 	... 9 more
>>> Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /tmp/crunch-2031291770/p567/REDUCE
>>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
>>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
>>> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
>>> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
>>> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>>
>>> 	at org.apache.hadoop.ipc.Client.call(Client.java:1410)
>>> 	at org.apache.hadoop.ipc.Client.call(Client.java:1363)
>>> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
>>> 	at com.sun.proxy.$Proxy13.getBlockLocations(Unknown Source)
>>> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> 	at java.lang.reflect.Method.invoke(Method.java:606)
>>> 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
>>> 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
>>> 	at com.sun.proxy.$Proxy13.getBlockLocations(Unknown Source)
>>> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:219)
>>> 	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1145)
>>> 	... 22 more
>>>
>>>
>>> On Fri, Aug 21, 2015 at 3:52 PM, Jeff Quinn <je...@nuna.com> wrote:
>>>
>>>> Also worth noting, we inspected the hadoop configuration defaults that
>>>> the AWS EMR service populates for the two different instance types, for
>>>> mapred-site.xml, core-site.xml, and hdfs-site.xml all settings were
>>>> identical, with the exception of slight differences in JVM memory allotted.
>>>> Further investigated the max number of file descriptors for each instance
>>>> type via ulimit, and saw no differences there either.
>>>>
>>>> So not sure what the main difference is between these two clusters that
>>>> would cause these very different outcomes, other than cc2.8xlarge having
>>>> spinning disks and c3.8xlarge having SSDs.
>>>>
>>>> On Fri, Aug 21, 2015 at 1:03 PM, Everett Anderson <ev...@nuna.com>
>>>> wrote:
>>>>
>>>>> Hey,
>>>>>
>>>>> Jeff graciously agreed to try it out.
>>>>>
>>>>> I'm afraid we're still getting failures on that instance type, though
>>>>> with 0.11 with the patches, the cluster ended up in a state that no new
>>>>> applications could be submitted afterwards.
>>>>>
>>>>> The errors when running the pipeline seem to be similarly HDFS
>>>>> related. It's quite odd.
>>>>>
>>>>> Examples when using 0.11 + the patches:
>>>>>
>>>>>
>>>>> 2015-08-20 23:17:50,455 WARN [Thread-38]
>>>>> org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source
>>>>> file
>>>>> "/tmp/crunch-274499863/p504/output/_temporary/1/_temporary/attempt_1440102643297_out0_0107_r_000001_0/out0-r-00001"
>>>>> - Aborting...
>>>>>
>>>>>
>>>>> 2015-08-20 22:39:42,184 WARN [Thread-51]
>>>>> org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>> No lease on
>>>>> /tmp/crunch-274499863/p510/output/_temporary/1/_temporary/attempt_1440102643297_out12_0103_r_000167_2/out12-r-00167
>>>>> (inode 83784): File does not exist. [Lease.  Holder:
>>>>> DFSClient_attempt_1440102643297_0103_r_000167_2_964529009_1,
>>>>> pendingcreates: 24]
>>>>> at
>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3516)
>>>>> at
>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.abandonBlock(FSNamesystem.java:3486)
>>>>> at
>>>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.abandonBlock(NameNodeRpcServer.java:687)
>>>>> at
>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.abandonBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:467)
>>>>> at
>>>>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>> at
>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:635)
>>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>>>>> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
>>>>> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>> at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>> at
>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
>>>>>
>>>>> at org.apache.hadoop.ipc.Client.call(Client.java:1468)
>>>>> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>>>>> at
>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:241)
>>>>> at com.sun.proxy.$Proxy13.abandonBlock(Unknown Source)
>>>>> at
>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.abandonBlock(ClientNamenodeProtocolTranslatorPB.java:376)
>>>>> at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>>>>> at
>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>> at java.lang.reflect.Method.invoke(Method.java:606)
>>>>> at
>>>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>>>>> at
>>>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>>>>> at com.sun.proxy.$Proxy14.abandonBlock(Unknown Source)
>>>>> at
>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1377)
>>>>> at
>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
>>>>> 2015-08-20 22:39:42,184 WARN [Thread-51]
>>>>> org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source
>>>>> file
>>>>> "/tmp/crunch-274499863/p510/output/_temporary/1/_temporary/attempt_1440102643297_out12_0103_r_000167_2/out12-r-00167"
>>>>> - Aborting...
>>>>>
>>>>>
>>>>>
>>>>> 2015-08-20 23:34:59,276 INFO [Thread-37]
>>>>> org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
>>>>> java.io.IOException: Bad connect ack with firstBadLink as
>>>>> 10.55.1.103:50010
>>>>> at
>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
>>>>> at
>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
>>>>> at
>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
>>>>> 2015-08-20 23:34:59,276 INFO [Thread-37]
>>>>> org.apache.hadoop.hdfs.DFSClient: Abandoning
>>>>> BP-835517662-10.55.1.32-1440102626965:blk_1073828261_95268
>>>>> 2015-08-20 23:34:59,278 INFO [Thread-37]
>>>>> org.apache.hadoop.hdfs.DFSClient: Excluding datanode 10.55.1.103:50010
>>>>> 2015-08-20 23:34:59,278 WARN [Thread-37]
>>>>> org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
>>>>> java.io.IOException: Unable to create new block.
>>>>> at
>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1386)
>>>>> at
>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
>>>>> 2015-08-20 23:34:59,278 WARN [Thread-37]
>>>>> org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source
>>>>> file
>>>>> "/tmp/crunch-274499863/p504/output/_temporary/1/_temporary/attempt_1440102643297_out0_0107_r_000001_2/out0-r-00001"
>>>>> - Aborting...
>>>>> 2015-08-20 23:34:59,279 WARN [main]
>>>>> org.apache.hadoop.mapred.YarnChild: Exception running child :
>>>>> org.apache.crunch.CrunchRuntimeException: java.io.IOException: Bad connect
>>>>> ack with firstBadLink as 10.55.1.103:50010
>>>>> at
>>>>> org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>>>>> at
>>>>> org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>>>>> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
>>>>> at
>>>>> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>>>> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>>>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:171)
>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>> at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>> at
>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>>>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:166)
>>>>> Caused by: java.io.IOException: Bad connect ack with firstBadLink as
>>>>> 10.55.1.103:50010
>>>>> at
>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
>>>>> at
>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
>>>>> at
>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Aug 21, 2015 at 11:59 AM, Josh Wills <jw...@cloudera.com>
>>>>> wrote:
>>>>>
>>>>>> Curious how this went. :)
>>>>>>
>>>>>> On Tue, Aug 18, 2015 at 4:26 PM, Everett Anderson <ev...@nuna.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Sure, let me give it a try. I'm going to take 0.11 and patch it with
>>>>>>>
>>>>>>> https://issues.apache.org/jira/browse/CRUNCH-553
>>>>>>> https://issues.apache.org/jira/browse/CRUNCH-517
>>>>>>>
>>>>>>> as we also rely on 517.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Aug 18, 2015 at 4:09 PM, Josh Wills <jw...@cloudera.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> (In particular, I'm wondering if something in CRUNCH-481 is related
>>>>>>>> to this problem.)
>>>>>>>>
>>>>>>>> On Tue, Aug 18, 2015 at 4:07 PM, Josh Wills <jw...@cloudera.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hey Everett,
>>>>>>>>>
>>>>>>>>> Shot in the dark-- would you mind trying it w/0.11.0-hadoop2 w/the
>>>>>>>>> 553 patch? Is that easy to do?
>>>>>>>>>
>>>>>>>>> J
>>>>>>>>>
>>>>>>>>> On Tue, Aug 18, 2015 at 3:18 PM, Everett Anderson <
>>>>>>>>> everett@nuna.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I verified that the pipeline succeeds on the same cc2.8xlarge
>>>>>>>>>> hardware when setting crunch.max.running.jobs to 1. I generally
>>>>>>>>>> feel like the pipeline application logic itself is sound, at this point. It
>>>>>>>>>> could be that this is just taxing these machines too hard and we need to
>>>>>>>>>> increase the number of retries?
>>>>>>>>>>
>>>>>>>>>> It reliably fails on this hardware when crunch.max.running.jobs
>>>>>>>>>> is set to its default.
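>>>>>>>>>>
>>>>>>>>>> (For reference, we set that property on the job Configuration before
>>>>>>>>>> building the pipeline -- roughly the snippet below, with a made-up
>>>>>>>>>> driver class name; -Dcrunch.max.running.jobs=1 also works if the
>>>>>>>>>> driver picks up generic options via ToolRunner.)
>>>>>>>>>>
>>>>>>>>>> import org.apache.crunch.impl.mr.MRPipeline;
>>>>>>>>>> import org.apache.hadoop.conf.Configuration;
>>>>>>>>>>
>>>>>>>>>> public class SerializedJobsDriver {
>>>>>>>>>>   public static void main(String[] args) {
>>>>>>>>>>     Configuration conf = new Configuration();
>>>>>>>>>>     // Run the planned MapReduce jobs one at a time instead of in parallel.
>>>>>>>>>>     conf.setInt("crunch.max.running.jobs", 1);
>>>>>>>>>>     MRPipeline pipeline = new MRPipeline(SerializedJobsDriver.class, conf);
>>>>>>>>>>     // ... construct PCollections and call pipeline.done() as usual ...
>>>>>>>>>>   }
>>>>>>>>>> }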
>>>>>>>>>>
>>>>>>>>>> Can you explain a little what the /tmp/crunch-XXXXXXX files are
>>>>>>>>>> as well as how Crunch uses side effect files? Do you know if HDFS would
>>>>>>>>>> clean up those directories from underneath Crunch?
>>>>>>>>>>
>>>>>>>>>> There are usually 4 failed applications, failing due to reduces.
>>>>>>>>>> The failures seem to be one of the following three kinds -- (1) No lease on
>>>>>>>>>> <side effect file>, (2) File not found </tmp/crunch-XXXXXXX> file, (3)
>>>>>>>>>> SocketTimeoutException.
>>>>>>>>>>
>>>>>>>>>> Examples:
>>>>>>>>>>
>>>>>>>>>> [1] No lease exception
>>>>>>>>>>
>>>>>>>>>> Error: org.apache.crunch.CrunchRuntimeException:
>>>>>>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>>>>>>> No lease on
>>>>>>>>>> /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003:
>>>>>>>>>> File does not exist. Holder
>>>>>>>>>> DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have
>>>>>>>>>> any open files. at
>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>>>>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at
>>>>>>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at
>>>>>>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at
>>>>>>>>>> java.security.AccessController.doPrivileged(Native Method) at
>>>>>>>>>> javax.security.auth.Subject.doAs(Subject.java:415) at
>>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at
>>>>>>>>>> org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>>>>>>>>>> at
>>>>>>>>>> org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>>>>>>>>>> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195) at
>>>>>>>>>> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656) at
>>>>>>>>>> org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394) at
>>>>>>>>>> org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175) at
>>>>>>>>>> java.security.AccessController.doPrivileged(Native Method) at
>>>>>>>>>> javax.security.auth.Subject.doAs(Subject.java:415) at
>>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170) Caused by:
>>>>>>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>>>>>>> No lease on
>>>>>>>>>> /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003:
>>>>>>>>>> File does not exist. Holder
>>>>>>>>>> DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have
>>>>>>>>>> any open files. at
>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>>>>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at
>>>>>>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at
>>>>>>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at
>>>>>>>>>> java.security.AccessController.doPrivileged(Native Method) at
>>>>>>>>>> javax.security.auth.Subject.doAs(Subject.java:415) at
>>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at
>>>>>>>>>> org.apache.hadoop.ipc.Client.call(Client.java:1410) at
>>>>>>>>>> org.apache.hadoop.ipc.Client.call(Client.java:1363) at
>>>>>>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
>>>>>>>>>> at com.sun.proxy.$Proxy13.complete(Unknown Source) at
>>>>>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
>>>>>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>>>>>> at
>>>>>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>>>> at java.lang.reflect.Method.invoke(Method.java:606) at
>>>>>>>>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
>>>>>>>>>> at com.sun.proxy.$Proxy13.complete(Unknown Source) at
>>>>>>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.complete(ClientNamenodeProtocolTranslatorPB.java:404)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2130)
>>>>>>>>>> at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2114)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
>>>>>>>>>> at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:1289)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.close(SequenceFileOutputFormat.java:87)
>>>>>>>>>> at
>>>>>>>>>> org.apache.crunch.io.CrunchOutputs$OutputState.close(CrunchOutputs.java:300)
>>>>>>>>>> at org.apache.crunch.io.CrunchOutputs.close(CrunchOutputs.java:180) at
>>>>>>>>>> org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:72)
>>>>>>>>>> ... 9 more
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> [2] File does not exist
>>>>>>>>>>
>>>>>>>>>> 2015-08-18 17:36:10,195 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1439917295505_0034_r_000004_1: Error: org.apache.crunch.CrunchRuntimeException: Could not read runtime node information
>>>>>>>>>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:48)
>>>>>>>>>> 	at org.apache.crunch.impl.mr.run.CrunchReducer.setup(CrunchReducer.java:40)
>>>>>>>>>> 	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:172)
>>>>>>>>>> 	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>>>>>>>>> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>>>>>>>>> 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>>>>>>>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>>>>>>>>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>>>>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>>> 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>>>>>>>>>> Caused by: java.io.FileNotFoundException: File does not exist: /tmp/crunch-4694113/p470/REDUCE
>>>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>>>>>>> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>>>>>>> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>>>>>>>>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>>>>>>>>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>>>>>>>>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>>>>>>>>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>>>>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>>> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>>>>>>>>>
>>>>>>>>>> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>>>>>>>> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>>>>>>>>> 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>>>>>>>> 	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>>>>>>>>> 	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>>>>>>>>>> 	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1147)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1135)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1125)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:273)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:240)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:233)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1298)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:300)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:296)
>>>>>>>>>> 	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:296)
>>>>>>>>>> 	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
>>>>>>>>>> 	at org.apache.crunch.util.DistCache.read(DistCache.java:72)
>>>>>>>>>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:46)
>>>>>>>>>> 	... 9 more
>>>>>>>>>>
>>>>>>>>>> [3] SocketTimeoutException
>>>>>>>>>>
>>>>>>>>>> Error: org.apache.crunch.CrunchRuntimeException: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200]
>>>>>>>>>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>>>>>>>>>> 	at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>>>>>>>>>> 	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
>>>>>>>>>> 	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>>>>>>>>> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>>>>>>>>> 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>>>>>>>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>>>>>>>>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>>>>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>>> 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>>>>>>>>>> Caused by: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200]
>>>>>>>>>> 	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>>>>>>>>>> 	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>>>>>>>>>> 	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>>>>>>>>>> 	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
>>>>>>>>>> 	at java.io.FilterInputStream.read(FilterInputStream.java:83)
>>>>>>>>>> 	at java.io.FilterInputStream.read(FilterInputStream.java:83)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1985)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1075)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1042)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1186)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:935)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:491)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Aug 14, 2015 at 3:54 PM, Everett Anderson <
>>>>>>>>>> everett@nuna.com> wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Aug 14, 2015 at 3:26 PM, Josh Wills <jwills@cloudera.com
>>>>>>>>>>> > wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey Everett,
>>>>>>>>>>>>
>>>>>>>>>>>> Initial thought-- there are lots of reasons for lease expired
>>>>>>>>>>>> exceptions, and they're usually more symptomatic of other problems in the
>>>>>>>>>>>> pipeline. Are you sure none of the jobs in the Crunch pipeline on the
>>>>>>>>>>>> non-SSD instances are failing for some other reason? I'd be surprised if no
>>>>>>>>>>>> other errors showed up in the app master, although there are reports of
>>>>>>>>>>>> some weirdness around LeaseExpireds when writing to S3-- but you're not
>>>>>>>>>>>> doing that here, right?
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> We're reading from and writing to HDFS, here. (We've copied in
>>>>>>>>>>> input from S3 to HDFS in another step.)
>>>>>>>>>>>
>>>>>>>>>>> There are a few exceptions in the logs. Most seem related to
>>>>>>>>>>> missing temp files.
>>>>>>>>>>>
>>>>>>>>>>> Let me see if I can reproduce it with crunch.max.running.jobs
>>>>>>>>>>> set to 1 to try to narrow down the originating failure.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> J
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Aug 14, 2015 at 2:10 PM, Everett Anderson <
>>>>>>>>>>>> everett@nuna.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I recently started trying to run our Crunch pipeline on more
>>>>>>>>>>>>> data and have been trying out different AWS instance types in anticipation
>>>>>>>>>>>>> of our storage and compute needs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I was using EMR 3.8 (so Hadoop 2.4.0) with Crunch 0.12
>>>>>>>>>>>>> (patched with the CRUNCH-553
>>>>>>>>>>>>> <https://issues.apache.org/jira/browse/CRUNCH-553> fix).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Our pipeline finishes fine in these cluster configurations:
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - 50 c3.4xlarge Core, 0 Task
>>>>>>>>>>>>>    - 10 c3.8xlarge Core, 0 Task
>>>>>>>>>>>>>    - 25 c3.8xlarge Core, 0 Task
>>>>>>>>>>>>>
>>>>>>>>>>>>> However, it always fails on the same data when using 10
>>>>>>>>>>>>> cc2.8xlarge Core instances.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The biggest obvious hardware difference is that the
>>>>>>>>>>>>> cc2.8xlarges use hard disks instead of SSDs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> While it's a little hard to track down the exact originating
>>>>>>>>>>>>> failure, I think it's from errors like:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2015-08-13 21:34:38,379 ERROR [IPC Server handler 24 on 45711]
>>>>>>>>>>>>> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task:
>>>>>>>>>>>>> attempt_1439499407003_0028_r_000153_1 - exited :
>>>>>>>>>>>>> org.apache.crunch.CrunchRuntimeException:
>>>>>>>>>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>>>>>>>>>> No lease on
>>>>>>>>>>>>> /tmp/crunch-970849245/p662/output/_temporary/1/_temporary/attempt_1439499407003_out7_0028_r_000153_1/out7-r-00153:
>>>>>>>>>>>>> File does not exist. Holder
>>>>>>>>>>>>> DFSClient_attempt_1439499407003_0028_r_000153_1_609888542_1 does not have
>>>>>>>>>>>>> any open files.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Those paths look like these side effect files
>>>>>>>>>>>>> <https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapred/FileOutputFormat.html#getWorkOutputPath(org.apache.hadoop.mapred.JobConf)>
>>>>>>>>>>>>> .
>>>>>>>>>>>>>
>>>>>>>>>>>>> Would Crunch have generated applications that depend on side
>>>>>>>>>>>>> effect paths as input across MapReduce applications, with something in HDFS
>>>>>>>>>>>>> cleaning up those paths, unaware of the higher level dependencies? AWS
>>>>>>>>>>>>> configures Hadoop differently for each instance type, and might have more
>>>>>>>>>>>>> aggressive cleanup settings on HDs, though this is a very uninformed
>>>>>>>>>>>>> hypothesis.
>>>>>>>>>>>>>
>>>>>>>>>>>>> A sample full log is attached.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for any guidance!
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Everett
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Director of Data Science
>>>>>>>>>>>> Cloudera <http://www.cloudera.com>
>>>>>>>>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Director of Data Science
>>>>>>>>> Cloudera <http://www.cloudera.com>
>>>>>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Director of Data Science
>>>>>>>> Cloudera <http://www.cloudera.com>
>>>>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Director of Data Science
>>>>>> Cloudera <http://www.cloudera.com>
>>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>
>
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Re: LeaseExpiredExceptions and temp side effect files

Posted by Everett Anderson <ev...@nuna.com>.
Hey,

We have some leads. Increasing the datanode memory seems to help the
immediate issue.

However, we need a solution to our buildup of temporary outputs. We're
exploring segmenting our pipeline with run()/cleanup() calls.
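
Concretely, the segmentation we're exploring looks roughly like the sketch
below. The paths are made up, and it assumes the cleanup(boolean force) call
in our Crunch version deletes temp directories that later stages no longer
reference:

import org.apache.crunch.PCollection;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.From;
import org.apache.crunch.io.To;
import org.apache.hadoop.conf.Configuration;

public class SegmentedDriver {
  public static void main(String[] args) {
    MRPipeline pipeline = new MRPipeline(SegmentedDriver.class, new Configuration());

    // Stage 1: persist the intermediate result explicitly instead of leaving
    // it as a Crunch-managed temp output under /tmp/crunch-XXXXXXX.
    PCollection<String> raw = pipeline.read(From.textFile("/data/input"));
    raw.write(To.textFile("/data/stage1"));
    pipeline.run();          // run only the jobs planned so far
    pipeline.cleanup(false); // assumed: drops temp dirs with no pending consumers

    // Stage 2: continue from the persisted intermediate.
    PCollection<String> stage1 = pipeline.read(From.textFile("/data/stage1"));
    stage1.write(To.textFile("/data/final"));
    pipeline.done();         // run remaining jobs and clean up everything
  }
}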

I'm curious, though --

Do you think it'd be possible for us to make a Crunch modification to
optionally actively cleanup temporary outputs? It seems like the planner
would know what those are.

A temporary output would be any PCollection that isn't referenced
outside of Crunch (or perhaps ones that aren't explicitly marked as cached).


On Thu, Sep 24, 2015 at 5:46 PM, Josh Wills <jw...@cloudera.com> wrote:

> Hrm. If you never call Pipeline.done, you should never cleanup the
> temporary files for the job...
>
> On Thu, Sep 24, 2015 at 5:44 PM, Everett Anderson <ev...@nuna.com>
> wrote:
>
>> While we tried to take comfort in the fact that we'd only seen this on
>> HD-based cc2.8xlarges, I'm afraid we're now seeing it when processing
>> larger amounts of data on SSD-based c3.4x8larges.
>>
>> My two hypotheses are
>>
>> 1) Somehow these temp files are getting cleaned up before they're
>> accessed for the last time. Perhaps either something in HDFS or Hadoop
>> cleans up these temp directories, or perhaps there's a bug in Crunch's
>> planner.
>>
>> 2) HDFS has chosen 3 machines to replicate data to, but it is performing
>> a very lopsided replication. While the cluster overall looks like it has
>> HDFS capacity, perhaps a small subset of the machines is actually at
>> capacity. Things seem to fail in obscure ways when running out of disk.
>>
>>
>> 2015-09-24 23:28:58,850 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.crunch.CrunchRuntimeException: Could not read runtime node information
>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:48)
>> 	at org.apache.crunch.impl.mr.run.CrunchReducer.setup(CrunchReducer.java:40)
>> 	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:172)
>> 	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>> 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>> 	at java.security.AccessController.doPrivileged(Native Method)
>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>> 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>> Caused by: java.io.FileNotFoundException: File does not exist: /tmp/crunch-2031291770/p567/REDUCE
>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
>> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
>> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
>> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>> 	at java.security.AccessController.doPrivileged(Native Method)
>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>
>> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>> 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>> 	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>> 	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>> 	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>> 	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1147)
>> 	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1135)
>> 	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1125)
>> 	at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:273)
>> 	at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:240)
>> 	at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:233)
>> 	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1298)
>> 	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:300)
>> 	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:296)
>> 	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>> 	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:296)
>> 	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
>> 	at org.apache.crunch.util.DistCache.read(DistCache.java:72)
>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:46)
>> 	... 9 more
>> Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /tmp/crunch-2031291770/p567/REDUCE
>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
>> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
>> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
>> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>> 	at java.security.AccessController.doPrivileged(Native Method)
>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>
>> 	at org.apache.hadoop.ipc.Client.call(Client.java:1410)
>> 	at org.apache.hadoop.ipc.Client.call(Client.java:1363)
>> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
>> 	at com.sun.proxy.$Proxy13.getBlockLocations(Unknown Source)
>> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> 	at java.lang.reflect.Method.invoke(Method.java:606)
>> 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
>> 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
>> 	at com.sun.proxy.$Proxy13.getBlockLocations(Unknown Source)
>> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:219)
>> 	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1145)
>> 	... 22 more
>>
>>
>> On Fri, Aug 21, 2015 at 3:52 PM, Jeff Quinn <je...@nuna.com> wrote:
>>
>>> Also worth noting, we inspected the hadoop configuration defaults that
>>> the AWS EMR service populates for the two different instance types, for
>>> mapred-site.xml, core-site.xml, and hdfs-site.xml all settings were
>>> identical, with the exception of slight differences in JVM memory allotted.
>>> Further investigated the max number of file descriptors for each instance
>>> type via ulimit, and saw no differences there either.
>>>
>>> So not sure what the main difference is between these two clusters that
>>> would cause these very different outcomes, other than cc2.8xlarge having
>>> spinning disks and c3.8xlarge having SSDs.
>>>
>>> On Fri, Aug 21, 2015 at 1:03 PM, Everett Anderson <ev...@nuna.com>
>>> wrote:
>>>
>>>> Hey,
>>>>
>>>> Jeff graciously agreed to try it out.
>>>>
>>>> I'm afraid we're still getting failures on that instance type, though
>>>> with 0.11 with the patches, the cluster ended up in a state that no new
>>>> applications could be submitted afterwards.
>>>>
>>>> The errors when running the pipeline seem to be similarly HDFS related.
>>>> It's quite odd.
>>>>
>>>> Examples when using 0.11 + the patches:
>>>>
>>>>
>>>> 2015-08-20 23:17:50,455 WARN [Thread-38]
>>>> org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source
>>>> file
>>>> "/tmp/crunch-274499863/p504/output/_temporary/1/_temporary/attempt_1440102643297_out0_0107_r_000001_0/out0-r-00001"
>>>> - Aborting...
>>>>
>>>>
>>>> 2015-08-20 22:39:42,184 WARN [Thread-51]
>>>> org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>> No lease on
>>>> /tmp/crunch-274499863/p510/output/_temporary/1/_temporary/attempt_1440102643297_out12_0103_r_000167_2/out12-r-00167
>>>> (inode 83784): File does not exist. [Lease.  Holder:
>>>> DFSClient_attempt_1440102643297_0103_r_000167_2_964529009_1,
>>>> pendingcreates: 24]
>>>> at
>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3516)
>>>> at
>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.abandonBlock(FSNamesystem.java:3486)
>>>> at
>>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.abandonBlock(NameNodeRpcServer.java:687)
>>>> at
>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.abandonBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:467)
>>>> at
>>>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>> at
>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:635)
>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>>>> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
>>>> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>> at javax.security.auth.Subject.doAs(Subject.java:415)
>>>> at
>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
>>>>
>>>> at org.apache.hadoop.ipc.Client.call(Client.java:1468)
>>>> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>>>> at
>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:241)
>>>> at com.sun.proxy.$Proxy13.abandonBlock(Unknown Source)
>>>> at
>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.abandonBlock(ClientNamenodeProtocolTranslatorPB.java:376)
>>>> at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>>>> at
>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>> at java.lang.reflect.Method.invoke(Method.java:606)
>>>> at
>>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>>>> at
>>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>>>> at com.sun.proxy.$Proxy14.abandonBlock(Unknown Source)
>>>> at
>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1377)
>>>> at
>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
>>>> 2015-08-20 22:39:42,184 WARN [Thread-51]
>>>> org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source
>>>> file
>>>> "/tmp/crunch-274499863/p510/output/_temporary/1/_temporary/attempt_1440102643297_out12_0103_r_000167_2/out12-r-00167"
>>>> - Aborting...
>>>>
>>>>
>>>>
>>>> 2015-08-20 23:34:59,276 INFO [Thread-37]
>>>> org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
>>>> java.io.IOException: Bad connect ack with firstBadLink as
>>>> 10.55.1.103:50010
>>>> at
>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
>>>> at
>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
>>>> at
>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
>>>> 2015-08-20 23:34:59,276 INFO [Thread-37]
>>>> org.apache.hadoop.hdfs.DFSClient: Abandoning
>>>> BP-835517662-10.55.1.32-1440102626965:blk_1073828261_95268
>>>> 2015-08-20 23:34:59,278 INFO [Thread-37]
>>>> org.apache.hadoop.hdfs.DFSClient: Excluding datanode 10.55.1.103:50010
>>>> 2015-08-20 23:34:59,278 WARN [Thread-37]
>>>> org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
>>>> java.io.IOException: Unable to create new block.
>>>> at
>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1386)
>>>> at
>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
>>>> 2015-08-20 23:34:59,278 WARN [Thread-37]
>>>> org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source
>>>> file
>>>> "/tmp/crunch-274499863/p504/output/_temporary/1/_temporary/attempt_1440102643297_out0_0107_r_000001_2/out0-r-00001"
>>>> - Aborting...
>>>> 2015-08-20 23:34:59,279 WARN [main] org.apache.hadoop.mapred.YarnChild:
>>>> Exception running child : org.apache.crunch.CrunchRuntimeException:
>>>> java.io.IOException: Bad connect ack with firstBadLink as
>>>> 10.55.1.103:50010
>>>> at
>>>> org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>>>> at
>>>> org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>>>> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
>>>> at
>>>> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>>> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:171)
>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>> at javax.security.auth.Subject.doAs(Subject.java:415)
>>>> at
>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:166)
>>>> Caused by: java.io.IOException: Bad connect ack with firstBadLink as
>>>> 10.55.1.103:50010
>>>> at
>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
>>>> at
>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
>>>> at
>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Aug 21, 2015 at 11:59 AM, Josh Wills <jw...@cloudera.com>
>>>> wrote:
>>>>
>>>>> Curious how this went. :)
>>>>>
>>>>> On Tue, Aug 18, 2015 at 4:26 PM, Everett Anderson <ev...@nuna.com>
>>>>> wrote:
>>>>>
>>>>>> Sure, let me give it a try. I'm going to take 0.11 and patch it with
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/CRUNCH-553
>>>>>> https://issues.apache.org/jira/browse/CRUNCH-517
>>>>>>
>>>>>> as we also rely on 517.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Aug 18, 2015 at 4:09 PM, Josh Wills <jw...@cloudera.com>
>>>>>> wrote:
>>>>>>
>>>>>>> (In particular, I'm wondering if something in CRUNCH-481 is related
>>>>>>> to this problem.)
>>>>>>>
>>>>>>> On Tue, Aug 18, 2015 at 4:07 PM, Josh Wills <jw...@cloudera.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hey Everett,
>>>>>>>>
>>>>>>>> Shot in the dark-- would you mind trying it w/0.11.0-hadoop2 w/the
>>>>>>>> 553 patch? Is that easy to do?
>>>>>>>>
>>>>>>>> J
>>>>>>>>
>>>>>>>> On Tue, Aug 18, 2015 at 3:18 PM, Everett Anderson <everett@nuna.com
>>>>>>>> > wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I verified that the pipeline succeeds on the same cc2.8xlarge
>>>>>>>>> hardware when setting crunch.max.running.jobs to 1. At this point I
>>>>>>>>> generally feel like the pipeline application logic itself is sound. It
>>>>>>>>> could be that this is just taxing these machines too hard and we need to
>>>>>>>>> increase the number of retries?
>>>>>>>>>
>>>>>>>>> It reliably fails on this hardware when crunch.max.running.jobs is
>>>>>>>>> set to its default.
>>>>>>>>>
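For reference, a minimal sketch of how this knob can be set programmatically
(the class name and the surrounding pipeline setup are illustrative
placeholders, not the actual pipeline discussed in this thread):

    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.hadoop.conf.Configuration;

    public class SerializedJobsExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Limit Crunch to running one MapReduce job at a time, mirroring the
        // crunch.max.running.jobs=1 experiment described above.
        conf.setInt("crunch.max.running.jobs", 1);
        Pipeline pipeline = new MRPipeline(SerializedJobsExample.class, conf);
        // ... build PCollections and write targets as usual, then:
        pipeline.done();
      }
    }

If the driver runs through ToolRunner and builds the pipeline from the Tool's
configuration, the same property can also be passed on the command line as
-Dcrunch.max.running.jobs=1.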
>>>>>>>>> Can you explain a little what the /tmp/crunch-XXXXXXX files are as
>>>>>>>>> well as how Crunch uses side effect files? Do you know if HDFS would clean
>>>>>>>>> up those directories from underneath Crunch?
>>>>>>>>>
>>>>>>>>> There are usually 4 failed applications, all failing in the reduce
>>>>>>>>> phase. The failures seem to be one of the following three kinds -- (1) No
>>>>>>>>> lease on a <side effect file>, (2) File not found for a
>>>>>>>>> </tmp/crunch-XXXXXXX> file, (3) SocketTimeoutException.
>>>>>>>>>
>>>>>>>>> Examples:
>>>>>>>>>
>>>>>>>>> [1] No lease exception
>>>>>>>>>
>>>>>>>>> Error: org.apache.crunch.CrunchRuntimeException:
>>>>>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>>>>>> No lease on
>>>>>>>>> /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003:
>>>>>>>>> File does not exist. Holder
>>>>>>>>> DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have
>>>>>>>>> any open files. at
>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
>>>>>>>>> at
>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
>>>>>>>>> at
>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
>>>>>>>>> at
>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
>>>>>>>>> at
>>>>>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
>>>>>>>>> at
>>>>>>>>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>>>>>> at
>>>>>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>>>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at
>>>>>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at
>>>>>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at
>>>>>>>>> java.security.AccessController.doPrivileged(Native Method) at
>>>>>>>>> javax.security.auth.Subject.doAs(Subject.java:415) at
>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at
>>>>>>>>> org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>>>>>>>>> at
>>>>>>>>> org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>>>>>>>>> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195) at
>>>>>>>>> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656) at
>>>>>>>>> org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394) at
>>>>>>>>> org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175) at
>>>>>>>>> java.security.AccessController.doPrivileged(Native Method) at
>>>>>>>>> javax.security.auth.Subject.doAs(Subject.java:415) at
>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170) Caused by:
>>>>>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>>>>>> No lease on
>>>>>>>>> /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003:
>>>>>>>>> File does not exist. Holder
>>>>>>>>> DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have
>>>>>>>>> any open files. at
>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
>>>>>>>>> at
>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
>>>>>>>>> at
>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
>>>>>>>>> at
>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
>>>>>>>>> at
>>>>>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
>>>>>>>>> at
>>>>>>>>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>>>>>> at
>>>>>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>>>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at
>>>>>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at
>>>>>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at
>>>>>>>>> java.security.AccessController.doPrivileged(Native Method) at
>>>>>>>>> javax.security.auth.Subject.doAs(Subject.java:415) at
>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at
>>>>>>>>> org.apache.hadoop.ipc.Client.call(Client.java:1410) at
>>>>>>>>> org.apache.hadoop.ipc.Client.call(Client.java:1363) at
>>>>>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
>>>>>>>>> at com.sun.proxy.$Proxy13.complete(Unknown Source) at
>>>>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
>>>>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>>>>> at
>>>>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>>> at java.lang.reflect.Method.invoke(Method.java:606) at
>>>>>>>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
>>>>>>>>> at
>>>>>>>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
>>>>>>>>> at com.sun.proxy.$Proxy13.complete(Unknown Source) at
>>>>>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.complete(ClientNamenodeProtocolTranslatorPB.java:404)
>>>>>>>>> at
>>>>>>>>> org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2130)
>>>>>>>>> at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2114)
>>>>>>>>> at
>>>>>>>>> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
>>>>>>>>> at
>>>>>>>>> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
>>>>>>>>> at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:1289)
>>>>>>>>> at
>>>>>>>>> org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.close(SequenceFileOutputFormat.java:87)
>>>>>>>>> at
>>>>>>>>> org.apache.crunch.io.CrunchOutputs$OutputState.close(CrunchOutputs.java:300)
>>>>>>>>> at org.apache.crunch.io.CrunchOutputs.close(CrunchOutputs.java:180) at
>>>>>>>>> org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:72)
>>>>>>>>> ... 9 more
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> [2] File does not exist
>>>>>>>>>
>>>>>>>>> 2015-08-18 17:36:10,195 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1439917295505_0034_r_000004_1: Error: org.apache.crunch.CrunchRuntimeException: Could not read runtime node information
>>>>>>>>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:48)
>>>>>>>>> 	at org.apache.crunch.impl.mr.run.CrunchReducer.setup(CrunchReducer.java:40)
>>>>>>>>> 	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:172)
>>>>>>>>> 	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>>>>>>>> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>>>>>>>> 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>>>>>>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>>>>>>>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>>>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>> 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>>>>>>>>> Caused by: java.io.FileNotFoundException: File does not exist: /tmp/crunch-4694113/p470/REDUCE
>>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
>>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
>>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
>>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
>>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
>>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
>>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
>>>>>>>>> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
>>>>>>>>> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>>>>>> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>>>>>> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>>>>>>>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>>>>>>>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>>>>>>>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>>>>>>>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>>>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>>>>>>>>
>>>>>>>>> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>>>>>>> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>>>>>>>> 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>>>>>>> 	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>>>>>>>> 	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>>>>>>>>> 	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1147)
>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1135)
>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1125)
>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:273)
>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:240)
>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:233)
>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1298)
>>>>>>>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:300)
>>>>>>>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:296)
>>>>>>>>> 	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>>>>>>>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:296)
>>>>>>>>> 	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
>>>>>>>>> 	at org.apache.crunch.util.DistCache.read(DistCache.java:72)
>>>>>>>>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:46)
>>>>>>>>> 	... 9 more
>>>>>>>>>
>>>>>>>>> [3] SocketTimeoutException
>>>>>>>>>
>>>>>>>>> Error: org.apache.crunch.CrunchRuntimeException: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200] at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74) at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64) at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195) at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170) Caused by: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200] at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118) at java.io.FilterInputStream.read(FilterInputStream.java:83) at java.io.FilterInputStream.read(FilterInputStream.java:83) at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1985) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1075) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1042) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1186) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:935) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:491)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Aug 14, 2015 at 3:54 PM, Everett Anderson <
>>>>>>>>> everett@nuna.com> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Aug 14, 2015 at 3:26 PM, Josh Wills <jw...@cloudera.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey Everett,
>>>>>>>>>>>
>>>>>>>>>>> Initial thought-- there are lots of reasons for lease expired
>>>>>>>>>>> exceptions, and they're usually more symptomatic of other problems in the
>>>>>>>>>>> pipeline. Are you sure none of the jobs in the Crunch pipeline on the
>>>>>>>>>>> non-SSD instances are failing for some other reason? I'd be surprised if no
>>>>>>>>>>> other errors showed up in the app master, although there are reports of
>>>>>>>>>>> some weirdness around LeaseExpireds when writing to S3-- but you're not
>>>>>>>>>>> doing that here, right?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> We're reading from and writing to HDFS here. (We copied the input
>>>>>>>>>> from S3 into HDFS in another step.)
>>>>>>>>>>
>>>>>>>>>> There are a few exceptions in the logs. Most seem related to
>>>>>>>>>> missing temp files.
>>>>>>>>>>
>>>>>>>>>> Let me see if I can reproduce it with crunch.max.running.jobs
>>>>>>>>>> set to 1 to try to narrow down the originating failure.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> J
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Aug 14, 2015 at 2:10 PM, Everett Anderson <
>>>>>>>>>>> everett@nuna.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I recently started trying to run our Crunch pipeline on more
>>>>>>>>>>>> data and have been trying out different AWS instance types in anticipation
>>>>>>>>>>>> of our storage and compute needs.
>>>>>>>>>>>>
>>>>>>>>>>>> I was using EMR 3.8 (so Hadoop 2.4.0) with Crunch 0.12 (patched
>>>>>>>>>>>> with the CRUNCH-553
>>>>>>>>>>>> <https://issues.apache.org/jira/browse/CRUNCH-553> fix).
>>>>>>>>>>>>
>>>>>>>>>>>> Our pipeline finishes fine in these cluster configurations:
>>>>>>>>>>>>
>>>>>>>>>>>>    - 50 c3.4xlarge Core, 0 Task
>>>>>>>>>>>>    - 10 c3.8xlarge Core, 0 Task
>>>>>>>>>>>>    - 25 c3.8xlarge Core, 0 Task
>>>>>>>>>>>>
>>>>>>>>>>>> However, it always fails on the same data when using 10
>>>>>>>>>>>> cc2.8xlarge Core instances.
>>>>>>>>>>>>
>>>>>>>>>>>> The biggest obvious hardware difference is that the
>>>>>>>>>>>> cc2.8xlarges use hard disks instead of SSDs.
>>>>>>>>>>>>
>>>>>>>>>>>> While it's a little hard to track down the exact originating
>>>>>>>>>>>> failure, I think it's from errors like:
>>>>>>>>>>>>
>>>>>>>>>>>> 2015-08-13 21:34:38,379 ERROR [IPC Server handler 24 on 45711]
>>>>>>>>>>>> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task:
>>>>>>>>>>>> attempt_1439499407003_0028_r_000153_1 - exited :
>>>>>>>>>>>> org.apache.crunch.CrunchRuntimeException:
>>>>>>>>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>>>>>>>>> No lease on
>>>>>>>>>>>> /tmp/crunch-970849245/p662/output/_temporary/1/_temporary/attempt_1439499407003_out7_0028_r_000153_1/out7-r-00153:
>>>>>>>>>>>> File does not exist. Holder
>>>>>>>>>>>> DFSClient_attempt_1439499407003_0028_r_000153_1_609888542_1 does not have
>>>>>>>>>>>> any open files.
>>>>>>>>>>>>
>>>>>>>>>>>> Those paths look like these side effect files
>>>>>>>>>>>> <https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapred/FileOutputFormat.html#getWorkOutputPath(org.apache.hadoop.mapred.JobConf)>
>>>>>>>>>>>> .
>>>>>>>>>>>>
>>>>>>>>>>>> Would Crunch have generated applications that depend on side
>>>>>>>>>>>> effect paths as input across MapReduce applications and something in HDFS
>>>>>>>>>>>> is cleaning up those paths, unaware of the higher level dependencies? AWS
>>>>>>>>>>>> configures Hadoop differently for each instance type, and might have more
>>>>>>>>>>>> aggressive cleanup settings on HDs, though this is very uninformed
>>>>>>>>>>>> hypothesis.
>>>>>>>>>>>>
>>>>>>>>>>>> A sample full log is attached.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for any guidance!
>>>>>>>>>>>>
>>>>>>>>>>>> - Everett
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Director of Data Science
>>>>>>>>>>> Cloudera <http://www.cloudera.com>
>>>>>>>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Director of Data Science
>>>>>>>> Cloudera <http://www.cloudera.com>
>>>>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Director of Data Science
>>>>>>> Cloudera <http://www.cloudera.com>
>>>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Director of Data Science
>>>>> Cloudera <http://www.cloudera.com>
>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>>
>>>>
>>>>
>>>
>>
>>
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>

-- 
*DISCLAIMER:* The contents of this email, including any attachments, may 
contain information that is confidential, proprietary in nature, protected 
health information (PHI), or otherwise protected by law from disclosure, 
and is solely for the use of the intended recipient(s). If you are not the 
intended recipient, you are hereby notified that any use, disclosure or 
copying of this email, including any attachments, is unauthorized and 
strictly prohibited. If you have received this email in error, please 
notify the sender of this email. Please delete this and all copies of this 
email from your system. Any opinions either expressed or implied in this 
email and all attachments, are those of its author only, and do not 
necessarily reflect those of Nuna Health, Inc.

Re: LeaseExpiredExceptions and temp side effect files

Posted by Everett Anderson <ev...@nuna.com>.
On Sat, Sep 26, 2015 at 2:15 PM, Josh Wills <jo...@gmail.com> wrote:

> You can mix in a combination of Pipeline.run and Pipeline.cleanup calls to
> control job execution and cleanup.


Thanks, Josh. I was somewhat familiar with run() but had never noticed
cleanup()!
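
As a minimal sketch, mixing explicit run() and cleanup() calls might look
something like the following (the stage boundaries and paths are made up, and
the boolean force argument to cleanup is an assumption to verify against the
Pipeline interface in the Crunch version in use):

    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.hadoop.conf.Configuration;

    public class SegmentedPipelineExample {
      public static void main(String[] args) throws Exception {
        Pipeline pipeline =
            new MRPipeline(SegmentedPipelineExample.class, new Configuration());

        // Stage 1: read, transform, and persist an intermediate output.
        PCollection<String> raw = pipeline.readTextFile(args[0]);
        // ... stage 1 transformations on 'raw' ...
        pipeline.writeTextFile(raw, args[1]);

        // Execute the jobs planned so far instead of waiting for done().
        pipeline.run();

        // Ask Crunch to drop temporary artifacts that no later stage needs.
        pipeline.cleanup(false);

        // Stage 2: continue from the materialized stage 1 output.
        PCollection<String> stage1 = pipeline.readTextFile(args[1]);
        // ... stage 2 transformations ...
        pipeline.writeTextFile(stage1, args[2]);

        pipeline.done();
      }
    }

The idea is simply that each run() forces the work planned so far to execute,
after which cleanup() should be able to remove /tmp/crunch-XXXXXXX working
directories that nothing later reads, rather than letting them accumulate
until done().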



>
> On Sat, Sep 26, 2015 at 1:48 PM Everett Anderson <ev...@nuna.com> wrote:
>
>> On Thu, Sep 24, 2015 at 5:46 PM, Josh Wills <jw...@cloudera.com> wrote:
>>
>>> Hrm. If you never call Pipeline.done, the temporary files for the job
>>> should never get cleaned up...
>>>
>>
>> Interesting.
>>
>> We're currently exploring giving the datanodes more memory as there's
>> some evidence they were getting overloaded.
>>
>> Right now, our Crunch pipeline is long, with many stages, but not all
>> data is used in each stage. If our problem is that we're overloading some
>> part of HDFS (and in other cluster configs we have seen ourselves hit our
>> disk capacity cap), I wonder if it'd help if we DID somehow prune away
>> temporary outputs that were no longer necessary.
>>
>>
>>
>>
>>
>>
>>>
>>> On Thu, Sep 24, 2015 at 5:44 PM, Everett Anderson <ev...@nuna.com>
>>> wrote:
>>>
>>>> While we tried to take comfort in the fact that we'd only seen this on
>>>> HD-based cc2.8xlarges, I'm afraid we're now seeing it when processing
>>>> larger amounts of data on SSD-based c3.4x8larges.
>>>>
>>>> My two hypotheses are
>>>>
>>>> 1) Somehow these temp files are getting cleaned up before they're
>>>> accessed for the last time. Perhaps either something in HDFS or Hadoop
>>>> cleans up these temp directories, or perhaps there's a bug in Crunch's
>>>> planner.
>>>>
>>>> 2) HDFS has chosen 3 machines to replicate data to, but it is
>>>> performing a very lopsided replication. While the cluster overall looks
>>>> like it has HDFS capacity, perhaps a small subset of the machines is
>>>> actually at capacity. Things seem to fail in obscure ways when running out
>>>> of disk.
>>>>
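One quick way to check the second hypothesis is to look at per-datanode usage
rather than the cluster total, e.g. with hdfs dfsadmin -report, or from Java
with something like the sketch below (a sketch using the DistributedFileSystem
client API; the formatting is arbitrary and it assumes the default filesystem
is HDFS):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

    public class DatanodeUsageCheck {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Only valid when fs.defaultFS points at HDFS.
        DistributedFileSystem dfs = (DistributedFileSystem) fs;
        // Print DFS usage per datanode to spot lopsided replication.
        for (DatanodeInfo dn : dfs.getDataNodeStats()) {
          long pctUsed = 100L * dn.getDfsUsed() / Math.max(1L, dn.getCapacity());
          System.out.println(dn.getHostName() + ": " + pctUsed + "% DFS used, "
              + (dn.getRemaining() >> 30) + " GB remaining");
        }
      }
    }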
>>>>
>>>> 2015-09-24 23:28:58,850 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.crunch.CrunchRuntimeException: Could not read runtime node information
>>>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:48)
>>>> 	at org.apache.crunch.impl.mr.run.CrunchReducer.setup(CrunchReducer.java:40)
>>>> 	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:172)
>>>> 	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>>> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>>> 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>> 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>>>> Caused by: java.io.FileNotFoundException: File does not exist: /tmp/crunch-2031291770/p567/REDUCE
>>>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
>>>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
>>>> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
>>>> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
>>>> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>>>
>>>> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>>> 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>> 	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>>> 	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>>>> 	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>>>> 	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1147)
>>>> 	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1135)
>>>> 	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1125)
>>>> 	at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:273)
>>>> 	at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:240)
>>>> 	at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:233)
>>>> 	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1298)
>>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:300)
>>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:296)
>>>> 	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:296)
>>>> 	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
>>>> 	at org.apache.crunch.util.DistCache.read(DistCache.java:72)
>>>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:46)
>>>> 	... 9 more
>>>> Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /tmp/crunch-2031291770/p567/REDUCE
>>>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
>>>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
>>>> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
>>>> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
>>>> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>>>
>>>> 	at org.apache.hadoop.ipc.Client.call(Client.java:1410)
>>>> 	at org.apache.hadoop.ipc.Client.call(Client.java:1363)
>>>> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
>>>> 	at com.sun.proxy.$Proxy13.getBlockLocations(Unknown Source)
>>>> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>> 	at java.lang.reflect.Method.invoke(Method.java:606)
>>>> 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
>>>> 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
>>>> 	at com.sun.proxy.$Proxy13.getBlockLocations(Unknown Source)
>>>> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:219)
>>>> 	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1145)
>>>> 	... 22 more
>>>>
>>>>
>>>> On Fri, Aug 21, 2015 at 3:52 PM, Jeff Quinn <je...@nuna.com> wrote:
>>>>
>>>>> Also worth noting, we inspected the Hadoop configuration defaults that
>>>>> the AWS EMR service populates for the two different instance types. For
>>>>> mapred-site.xml, core-site.xml, and hdfs-site.xml, all settings were
>>>>> identical, with the exception of slight differences in the JVM memory
>>>>> allotted. We also investigated the max number of file descriptors for
>>>>> each instance type via ulimit, and saw no differences there either.
>>>>>
>>>>> So we're not sure what the main difference is between these two clusters
>>>>> that would cause these very different outcomes, other than c3.8xlarge
>>>>> having SSDs and cc2.8xlarge having spinning disks.
>>>>>
>>>>> On Fri, Aug 21, 2015 at 1:03 PM, Everett Anderson <ev...@nuna.com>
>>>>> wrote:
>>>>>
>>>>>> Hey,
>>>>>>
>>>>>> Jeff graciously agreed to try it out.
>>>>>>
>>>>>> I'm afraid we're still getting failures on that instance type, and with
>>>>>> 0.11 plus the patches, the cluster ended up in a state in which no new
>>>>>> applications could be submitted afterwards.
>>>>>>
>>>>>> The errors when running the pipeline seem to be similarly HDFS
>>>>>> related. It's quite odd.
>>>>>>
>>>>>> Examples when using 0.11 + the patches:
>>>>>>
>>>>>>
>>>>>> 2015-08-20 23:17:50,455 WARN [Thread-38]
>>>>>> org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source
>>>>>> file
>>>>>> "/tmp/crunch-274499863/p504/output/_temporary/1/_temporary/attempt_1440102643297_out0_0107_r_000001_0/out0-r-00001"
>>>>>> - Aborting...
>>>>>>
>>>>>>
>>>>>> 2015-08-20 22:39:42,184 WARN [Thread-51]
>>>>>> org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
>>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>>> No lease on
>>>>>> /tmp/crunch-274499863/p510/output/_temporary/1/_temporary/attempt_1440102643297_out12_0103_r_000167_2/out12-r-00167
>>>>>> (inode 83784): File does not exist. [Lease.  Holder:
>>>>>> DFSClient_attempt_1440102643297_0103_r_000167_2_964529009_1,
>>>>>> pendingcreates: 24]
>>>>>> at
>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3516)
>>>>>> at
>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.abandonBlock(FSNamesystem.java:3486)
>>>>>> at
>>>>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.abandonBlock(NameNodeRpcServer.java:687)
>>>>>> at
>>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.abandonBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:467)
>>>>>> at
>>>>>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>>> at
>>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:635)
>>>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>>>>>> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
>>>>>> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
>>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>>> at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>> at
>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>>>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
>>>>>>
>>>>>> at org.apache.hadoop.ipc.Client.call(Client.java:1468)
>>>>>> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>>>>>> at
>>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:241)
>>>>>> at com.sun.proxy.$Proxy13.abandonBlock(Unknown Source)
>>>>>> at
>>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.abandonBlock(ClientNamenodeProtocolTranslatorPB.java:376)
>>>>>> at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>>>>>> at
>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>> at java.lang.reflect.Method.invoke(Method.java:606)
>>>>>> at
>>>>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>>>>>> at
>>>>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>>>>>> at com.sun.proxy.$Proxy14.abandonBlock(Unknown Source)
>>>>>> at
>>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1377)
>>>>>> at
>>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
>>>>>> 2015-08-20 22:39:42,184 WARN [Thread-51]
>>>>>> org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source
>>>>>> file
>>>>>> "/tmp/crunch-274499863/p510/output/_temporary/1/_temporary/attempt_1440102643297_out12_0103_r_000167_2/out12-r-00167"
>>>>>> - Aborting...
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2015-08-20 23:34:59,276 INFO [Thread-37]
>>>>>> org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
>>>>>> java.io.IOException: Bad connect ack with firstBadLink as
>>>>>> 10.55.1.103:50010
>>>>>> at
>>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
>>>>>> at
>>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
>>>>>> at
>>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
>>>>>> 2015-08-20 23:34:59,276 INFO [Thread-37]
>>>>>> org.apache.hadoop.hdfs.DFSClient: Abandoning
>>>>>> BP-835517662-10.55.1.32-1440102626965:blk_1073828261_95268
>>>>>> 2015-08-20 23:34:59,278 INFO [Thread-37]
>>>>>> org.apache.hadoop.hdfs.DFSClient: Excluding datanode
>>>>>> 10.55.1.103:50010
>>>>>> 2015-08-20 23:34:59,278 WARN [Thread-37]
>>>>>> org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
>>>>>> java.io.IOException: Unable to create new block.
>>>>>> at
>>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1386)
>>>>>> at
>>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
>>>>>> 2015-08-20 23:34:59,278 WARN [Thread-37]
>>>>>> org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source
>>>>>> file
>>>>>> "/tmp/crunch-274499863/p504/output/_temporary/1/_temporary/attempt_1440102643297_out0_0107_r_000001_2/out0-r-00001"
>>>>>> - Aborting...
>>>>>> 2015-08-20 23:34:59,279 WARN [main]
>>>>>> org.apache.hadoop.mapred.YarnChild: Exception running child :
>>>>>> org.apache.crunch.CrunchRuntimeException: java.io.IOException: Bad connect
>>>>>> ack with firstBadLink as 10.55.1.103:50010
>>>>>> at
>>>>>> org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>>>>>> at
>>>>>> org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>>>>>> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
>>>>>> at
>>>>>> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>>>>> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>>>>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:171)
>>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>>> at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>> at
>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>>>>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:166)
>>>>>> Caused by: java.io.IOException: Bad connect ack with firstBadLink as
>>>>>> 10.55.1.103:50010
>>>>>> at
>>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
>>>>>> at
>>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
>>>>>> at
>>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Aug 21, 2015 at 11:59 AM, Josh Wills <jw...@cloudera.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Curious how this went. :)
>>>>>>>
>>>>>>> On Tue, Aug 18, 2015 at 4:26 PM, Everett Anderson <ev...@nuna.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Sure, let me give it a try. I'm going to take 0.11 and patch it with
>>>>>>>>
>>>>>>>> https://issues.apache.org/jira/browse/CRUNCH-553
>>>>>>>> https://issues.apache.org/jira/browse/CRUNCH-517
>>>>>>>>
>>>>>>>> as we also rely on 517.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Aug 18, 2015 at 4:09 PM, Josh Wills <jw...@cloudera.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> (In particular, I'm wondering if something in CRUNCH-481 is
>>>>>>>>> related to this problem.)
>>>>>>>>>
>>>>>>>>> On Tue, Aug 18, 2015 at 4:07 PM, Josh Wills <jw...@cloudera.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hey Everett,
>>>>>>>>>>
>>>>>>>>>> Shot in the dark-- would you mind trying it w/0.11.0-hadoop2
>>>>>>>>>> w/the 553 patch? Is that easy to do?
>>>>>>>>>>
>>>>>>>>>> J
>>>>>>>>>>
>>>>>>>>>> On Tue, Aug 18, 2015 at 3:18 PM, Everett Anderson <
>>>>>>>>>> everett@nuna.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I verified that the pipeline succeeds on the same cc2.8xlarge
>>>>>>>>>>> hardware when setting crunch.max.running.jobs to 1. At this point I
>>>>>>>>>>> generally feel like the pipeline application logic itself is sound. It
>>>>>>>>>>> could be that this is just taxing these machines too hard and we need to
>>>>>>>>>>> increase the number of retries?
>>>>>>>>>>>
>>>>>>>>>>> It reliably fails on this hardware when crunch.max.running.jobs is
>>>>>>>>>>> set to its default.
>>>>>>>>>>>
>>>>>>>>>>> Can you explain a little what the /tmp/crunch-XXXXXXX files are
>>>>>>>>>>> as well as how Crunch uses side effect files? Do you know if HDFS would
>>>>>>>>>>> clean up those directories from underneath Crunch?
>>>>>>>>>>>
>>>>>>>>>>> There are usually 4 failed applications, all failing in the reduce
>>>>>>>>>>> phase. The failures seem to be one of the following three kinds -- (1) No
>>>>>>>>>>> lease on a <side effect file>, (2) File not found for a
>>>>>>>>>>> </tmp/crunch-XXXXXXX> file, (3) SocketTimeoutException.
>>>>>>>>>>>
>>>>>>>>>>> Examples:
>>>>>>>>>>>
>>>>>>>>>>> [1] No lease exception
>>>>>>>>>>>
>>>>>>>>>>> Error: org.apache.crunch.CrunchRuntimeException:
>>>>>>>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>>>>>>>> No lease on
>>>>>>>>>>> /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003:
>>>>>>>>>>> File does not exist. Holder
>>>>>>>>>>> DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have
>>>>>>>>>>> any open files. at
>>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
>>>>>>>>>>> at
>>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
>>>>>>>>>>> at
>>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
>>>>>>>>>>> at
>>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
>>>>>>>>>>> at
>>>>>>>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
>>>>>>>>>>> at
>>>>>>>>>>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>>>>>>>> at
>>>>>>>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>>>>>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at
>>>>>>>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at
>>>>>>>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at
>>>>>>>>>>> java.security.AccessController.doPrivileged(Native Method) at
>>>>>>>>>>> javax.security.auth.Subject.doAs(Subject.java:415) at
>>>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at
>>>>>>>>>>> org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>>>>>>>>>>> at
>>>>>>>>>>> org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>>>>>>>>>>> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195) at
>>>>>>>>>>> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656) at
>>>>>>>>>>> org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394) at
>>>>>>>>>>> org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175) at
>>>>>>>>>>> java.security.AccessController.doPrivileged(Native Method) at
>>>>>>>>>>> javax.security.auth.Subject.doAs(Subject.java:415) at
>>>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170) Caused by:
>>>>>>>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>>>>>>>> No lease on
>>>>>>>>>>> /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003:
>>>>>>>>>>> File does not exist. Holder
>>>>>>>>>>> DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have
>>>>>>>>>>> any open files. at
>>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
>>>>>>>>>>> at
>>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
>>>>>>>>>>> at
>>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
>>>>>>>>>>> at
>>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
>>>>>>>>>>> at
>>>>>>>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
>>>>>>>>>>> at
>>>>>>>>>>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>>>>>>>> at
>>>>>>>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>>>>>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at
>>>>>>>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at
>>>>>>>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at
>>>>>>>>>>> java.security.AccessController.doPrivileged(Native Method) at
>>>>>>>>>>> javax.security.auth.Subject.doAs(Subject.java:415) at
>>>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at
>>>>>>>>>>> org.apache.hadoop.ipc.Client.call(Client.java:1410) at
>>>>>>>>>>> org.apache.hadoop.ipc.Client.call(Client.java:1363) at
>>>>>>>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
>>>>>>>>>>> at com.sun.proxy.$Proxy13.complete(Unknown Source) at
>>>>>>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
>>>>>>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>>>>>>> at
>>>>>>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>>>>> at java.lang.reflect.Method.invoke(Method.java:606) at
>>>>>>>>>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
>>>>>>>>>>> at
>>>>>>>>>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
>>>>>>>>>>> at com.sun.proxy.$Proxy13.complete(Unknown Source) at
>>>>>>>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.complete(ClientNamenodeProtocolTranslatorPB.java:404)
>>>>>>>>>>> at
>>>>>>>>>>> org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2130)
>>>>>>>>>>> at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2114)
>>>>>>>>>>> at
>>>>>>>>>>> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
>>>>>>>>>>> at
>>>>>>>>>>> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
>>>>>>>>>>> at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:1289)
>>>>>>>>>>> at
>>>>>>>>>>> org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.close(SequenceFileOutputFormat.java:87)
>>>>>>>>>>> at
>>>>>>>>>>> org.apache.crunch.io.CrunchOutputs$OutputState.close(CrunchOutputs.java:300)
>>>>>>>>>>> at org.apache.crunch.io.CrunchOutputs.close(CrunchOutputs.java:180) at
>>>>>>>>>>> org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:72)
>>>>>>>>>>> ... 9 more
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> [2] File does not exist
>>>>>>>>>>>
>>>>>>>>>>> 2015-08-18 17:36:10,195 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1439917295505_0034_r_000004_1: Error: org.apache.crunch.CrunchRuntimeException: Could not read runtime node information
>>>>>>>>>>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:48)
>>>>>>>>>>> 	at org.apache.crunch.impl.mr.run.CrunchReducer.setup(CrunchReducer.java:40)
>>>>>>>>>>> 	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:172)
>>>>>>>>>>> 	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>>>>>>>>>> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>>>>>>>>>> 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>>>>>>>>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>>>>>>>>>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>>>>>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>>>> 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>>>>>>>>>>> Caused by: java.io.FileNotFoundException: File does not exist: /tmp/crunch-4694113/p470/REDUCE
>>>>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
>>>>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
>>>>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
>>>>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
>>>>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
>>>>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
>>>>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
>>>>>>>>>>> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
>>>>>>>>>>> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>>>>>>>> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>>>>>>>> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>>>>>>>>>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>>>>>>>>>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>>>>>>>>>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>>>>>>>>>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>>>>>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>>>> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>>>>>>>>>>
>>>>>>>>>>> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>>>>>>>>> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>>>>>>>>>> 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>>>>>>>>> 	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>>>>>>>>>> 	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>>>>>>>>>>> 	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>>>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1147)
>>>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1135)
>>>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1125)
>>>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:273)
>>>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:240)
>>>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:233)
>>>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1298)
>>>>>>>>>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:300)
>>>>>>>>>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:296)
>>>>>>>>>>> 	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>>>>>>>>>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:296)
>>>>>>>>>>> 	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
>>>>>>>>>>> 	at org.apache.crunch.util.DistCache.read(DistCache.java:72)
>>>>>>>>>>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:46)
>>>>>>>>>>> 	... 9 more
>>>>>>>>>>>
>>>>>>>>>>> [3] SocketTimeoutException
>>>>>>>>>>>
>>>>>>>>>>> Error: org.apache.crunch.CrunchRuntimeException: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200]
>>>>>>>>>>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>>>>>>>>>>> 	at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>>>>>>>>>>> 	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
>>>>>>>>>>> 	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>>>>>>>>>> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>>>>>>>>>> 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>>>>>>>>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>>>>>>>>>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>>>>>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>>>> 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>>>>>>>>>>> Caused by: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200]
>>>>>>>>>>> 	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>>>>>>>>>>> 	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>>>>>>>>>>> 	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>>>>>>>>>>> 	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
>>>>>>>>>>> 	at java.io.FilterInputStream.read(FilterInputStream.java:83)
>>>>>>>>>>> 	at java.io.FilterInputStream.read(FilterInputStream.java:83)
>>>>>>>>>>> 	at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1985)
>>>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1075)
>>>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1042)
>>>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1186)
>>>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:935)
>>>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:491)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Aug 14, 2015 at 3:54 PM, Everett Anderson <
>>>>>>>>>>> everett@nuna.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Aug 14, 2015 at 3:26 PM, Josh Wills <
>>>>>>>>>>>> jwills@cloudera.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hey Everett,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Initial thought-- there are lots of reasons for lease expired
>>>>>>>>>>>>> exceptions, and they're usually more symptomatic of other problems in the
>>>>>>>>>>>>> pipeline. Are you sure none of the jobs in the Crunch pipeline on the
>>>>>>>>>>>>> non-SSD instances are failing for some other reason? I'd be surprised if no
>>>>>>>>>>>>> other errors showed up in the app master, although there are reports of
>>>>>>>>>>>>> some weirdness around LeaseExpireds when writing to S3-- but you're not
>>>>>>>>>>>>> doing that here, right?
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> We're reading from and writing to HDFS, here. (We've copied in
>>>>>>>>>>>> input from S3 to HDFS in another step.)
>>>>>>>>>>>>
>>>>>>>>>>>> There are a few exceptions in the logs. Most seem related to
>>>>>>>>>>>> missing temp files.
>>>>>>>>>>>>
>>>>>>>>>>>> Let me see if I can reproduce it with crunch.max.running.jobs
>>>>>>>>>>>> set to 1 to try to narrow down the originating failure.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> J
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Aug 14, 2015 at 2:10 PM, Everett Anderson <
>>>>>>>>>>>>> everett@nuna.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I recently started trying to run our Crunch pipeline on more
>>>>>>>>>>>>>> data and have been trying out different AWS instance types in anticipation
>>>>>>>>>>>>>> of our storage and compute needs.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I was using EMR 3.8 (so Hadoop 2.4.0) with Crunch 0.12
>>>>>>>>>>>>>> (patched with the CRUNCH-553
>>>>>>>>>>>>>> <https://issues.apache.org/jira/browse/CRUNCH-553> fix).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Our pipeline finishes fine in these cluster configurations:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    - 50 c3.4xlarge Core, 0 Task
>>>>>>>>>>>>>>    - 10 c3.8xlarge Core, 0 Task
>>>>>>>>>>>>>>    - 25 c3.8xlarge Core, 0 Task
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> However, it always fails on the same data when using 10
>>>>>>>>>>>>>> cc2.8xlarge Core instances.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The biggest obvious hardware difference is that the
>>>>>>>>>>>>>> cc2.8xlarges use hard disks instead of SSDs.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> While it's a little hard to track down the exact originating
>>>>>>>>>>>>>> failure, I think it's from errors like:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2015-08-13 21:34:38,379 ERROR [IPC Server handler 24 on
>>>>>>>>>>>>>> 45711] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task:
>>>>>>>>>>>>>> attempt_1439499407003_0028_r_000153_1 - exited :
>>>>>>>>>>>>>> org.apache.crunch.CrunchRuntimeException:
>>>>>>>>>>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>>>>>>>>>>> No lease on
>>>>>>>>>>>>>> /tmp/crunch-970849245/p662/output/_temporary/1/_temporary/attempt_1439499407003_out7_0028_r_000153_1/out7-r-00153:
>>>>>>>>>>>>>> File does not exist. Holder
>>>>>>>>>>>>>> DFSClient_attempt_1439499407003_0028_r_000153_1_609888542_1 does not have
>>>>>>>>>>>>>> any open files.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Those paths look like these side effect files
>>>>>>>>>>>>>> <https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapred/FileOutputFormat.html#getWorkOutputPath(org.apache.hadoop.mapred.JobConf)>
>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Would Crunch have generated applications that depend on side
>>>>>>>>>>>>>> effect paths as input across MapReduce applications and something in HDFS
>>>>>>>>>>>>>> is cleaning up those paths, unaware of the higher level dependencies? AWS
>>>>>>>>>>>>>> configures Hadoop differently for each instance type, and might have more
>>>>>>>>>>>>>> aggressive cleanup settings on HDs, though this is very uninformed
>>>>>>>>>>>>>> hypothesis.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> A sample full log is attached.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for any guidance!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Everett
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Director of Data Science
>>>>>>>>>>>>> Cloudera <http://www.cloudera.com>
>>>>>>>>>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Director of Data Science
>>>>>>>>>> Cloudera <http://www.cloudera.com>
>>>>>>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Director of Data Science
>>>>>>>>> Cloudera <http://www.cloudera.com>
>>>>>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Director of Data Science
>>>>>>> Cloudera <http://www.cloudera.com>
>>>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Director of Data Science
>>> Cloudera <http://www.cloudera.com>
>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>
>>
>
>

-- 
*DISCLAIMER:* The contents of this email, including any attachments, may 
contain information that is confidential, proprietary in nature, protected 
health information (PHI), or otherwise protected by law from disclosure, 
and is solely for the use of the intended recipient(s). If you are not the 
intended recipient, you are hereby notified that any use, disclosure or 
copying of this email, including any attachments, is unauthorized and 
strictly prohibited. If you have received this email in error, please 
notify the sender of this email. Please delete this and all copies of this 
email from your system. Any opinions either expressed or implied in this 
email and all attachments, are those of its author only, and do not 
necessarily reflect those of Nuna Health, Inc.

Re: LeaseExpiredExceptions and temp side effect files

Posted by Josh Wills <jo...@gmail.com>.
You can mix in a combination of Pipeline.run and Pipeline.cleanup calls to
control job execution and cleanup.
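
A minimal sketch of that kind of segmentation, assuming a plain MRPipeline and
that your Crunch version exposes cleanup(boolean force) (the stage functions,
class name, and argument paths below are placeholders, not from this thread):

import org.apache.crunch.MapFn;
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.hadoop.conf.Configuration;

public class SegmentedPipelineSketch {

  // Trivial placeholder transformations standing in for real pipeline stages.
  static class UpperCaseFn extends MapFn<String, String> {
    @Override
    public String map(String line) {
      return line.toUpperCase();
    }
  }

  static class TrimFn extends MapFn<String, String> {
    @Override
    public String map(String line) {
      return line.trim();
    }
  }

  public static void main(String[] args) {
    Pipeline pipeline = new MRPipeline(SegmentedPipelineSketch.class, new Configuration());

    PCollection<String> input = pipeline.readTextFile(args[0]);
    PCollection<String> stageOne = input.parallelDo(new UpperCaseFn(), input.getPType());
    pipeline.writeTextFile(stageOne, args[1]);

    // Execute the jobs planned so far instead of letting everything build up
    // into one large plan.
    pipeline.run();

    // Ask Crunch to delete temporary outputs it no longer needs. The boolean
    // force flag is an assumption here; false is assumed to keep anything a
    // later stage still depends on.
    pipeline.cleanup(false);

    PCollection<String> stageTwo = stageOne.parallelDo(new TrimFn(), stageOne.getPType());
    pipeline.writeTextFile(stageTwo, args[2]);

    // done() runs whatever is left and removes all remaining temp files.
    pipeline.done();
  }
}

The idea is that run() forces the jobs planned so far to execute, cleanup()
drops temporary paths that are no longer referenced, and done() at the very
end still performs the full cleanup.
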
On Sat, Sep 26, 2015 at 1:48 PM Everett Anderson <ev...@nuna.com> wrote:

> On Thu, Sep 24, 2015 at 5:46 PM, Josh Wills <jw...@cloudera.com> wrote:
>
>> Hrm. If you never call Pipeline.done, you should never cleanup the
>> temporary files for the job...
>>
>
> Interesting.
>
> We're currently exploring giving the datanodes more memory as there's some
> evidence they were getting overloaded.
>
> Right now, our Crunch pipeline is long, with many stages, but not all data
> is used in each stage. If our problem is that we're overloading some part
> of HDFS (and in other cluster configs we have seen ourselves hit our disk
> capacity cap), I wonder if it'd help if we DID somehow prune away temporary
> outputs that were no longer necessary.
>
>
>
>
>
>
>>
>> On Thu, Sep 24, 2015 at 5:44 PM, Everett Anderson <ev...@nuna.com>
>> wrote:
>>
>>> While we tried to take comfort in the fact that we'd only seen this on
>>> HD-based cc2.8xlarges, I'm afraid we're now seeing it when processing
>>> larger amounts of data on SSD-based c3.4x8larges.
>>>
>>> My two hypotheses are
>>>
>>> 1) Somehow these temp files are getting cleaned up before they're
>>> accessed for the last time. Perhaps either something in HDFS or Hadoop
>>> cleans up these temp directories, or perhaps there's a bug in Crunch's
>>> planner.
>>>
>>> 2) HDFS has chosen 3 machines to replicate data to, but it is performing
>>> a very lopsided replication. While the cluster overall looks like it has
>>> HDFS capacity, perhaps a small subset of the machines is actually at
>>> capacity. Things seem to fail in obscure ways when running out of disk.
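
A quick way to test hypothesis 2 is to ask the NameNode for per-datanode
usage. A rough sketch using the stock HDFS client API, assuming the cluster's
Hadoop configuration is on the classpath (the class name is illustrative):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class DatanodeUsageSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    if (!(fs instanceof DistributedFileSystem)) {
      System.err.println("fs.defaultFS is not HDFS; nothing to report");
      return;
    }
    DistributedFileSystem dfs = (DistributedFileSystem) fs;
    // Print per-datanode usage; a large spread between nodes would support
    // the "lopsided replication" hypothesis.
    for (DatanodeInfo node : dfs.getDataNodeStats()) {
      long capacity = node.getCapacity();
      long used = node.getDfsUsed();
      double pct = capacity == 0 ? 0.0 : 100.0 * used / capacity;
      System.out.printf("%s: %.1f%% used, %d bytes remaining%n",
          node.getHostName(), pct, node.getRemaining());
    }
  }
}
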
>>>
>>>
>>> 2015-09-24 23:28:58,850 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.crunch.CrunchRuntimeException: Could not read runtime node information
>>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:48)
>>> 	at org.apache.crunch.impl.mr.run.CrunchReducer.setup(CrunchReducer.java:40)
>>> 	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:172)
>>> 	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>> 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>> 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>>> Caused by: java.io.FileNotFoundException: File does not exist: /tmp/crunch-2031291770/p567/REDUCE
>>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
>>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
>>> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
>>> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
>>> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>>
>>> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>> 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>> 	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>> 	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>>> 	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>>> 	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1147)
>>> 	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1135)
>>> 	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1125)
>>> 	at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:273)
>>> 	at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:240)
>>> 	at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:233)
>>> 	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1298)
>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:300)
>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:296)
>>> 	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:296)
>>> 	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
>>> 	at org.apache.crunch.util.DistCache.read(DistCache.java:72)
>>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:46)
>>> 	... 9 more
>>> Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /tmp/crunch-2031291770/p567/REDUCE
>>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
>>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
>>> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
>>> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
>>> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>>
>>> 	at org.apache.hadoop.ipc.Client.call(Client.java:1410)
>>> 	at org.apache.hadoop.ipc.Client.call(Client.java:1363)
>>> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
>>> 	at com.sun.proxy.$Proxy13.getBlockLocations(Unknown Source)
>>> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> 	at java.lang.reflect.Method.invoke(Method.java:606)
>>> 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
>>> 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
>>> 	at com.sun.proxy.$Proxy13.getBlockLocations(Unknown Source)
>>> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:219)
>>> 	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1145)
>>> 	... 22 more
>>>
>>>
>>> On Fri, Aug 21, 2015 at 3:52 PM, Jeff Quinn <je...@nuna.com> wrote:
>>>
>>>> Also worth noting: we inspected the Hadoop configuration defaults that
>>>> the AWS EMR service populates for the two different instance types. For
>>>> mapred-site.xml, core-site.xml, and hdfs-site.xml, all settings were
>>>> identical, with the exception of slight differences in JVM memory allotted.
>>>> We further investigated the max number of file descriptors for each instance
>>>> type via ulimit, and saw no differences there either.
>>>>
>>>> So we're not sure what the main difference is between these two clusters that
>>>> would cause these very different outcomes, other than cc2.8xlarge having
>>>> spinning disks and c3.8xlarge having SSDs.
>>>>
>>>> On Fri, Aug 21, 2015 at 1:03 PM, Everett Anderson <ev...@nuna.com>
>>>> wrote:
>>>>
>>>>> Hey,
>>>>>
>>>>> Jeff graciously agreed to try it out.
>>>>>
>>>>> I'm afraid we're still getting failures on that instance type, and
>>>>> with 0.11 plus the patches, the cluster ended up in a state where no new
>>>>> applications could be submitted afterwards.
>>>>>
>>>>> The errors when running the pipeline seem to be similarly HDFS
>>>>> related. It's quite odd.
>>>>>
>>>>> Examples when using 0.11 + the patches:
>>>>>
>>>>>
>>>>> 2015-08-20 23:17:50,455 WARN [Thread-38]
>>>>> org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source
>>>>> file
>>>>> "/tmp/crunch-274499863/p504/output/_temporary/1/_temporary/attempt_1440102643297_out0_0107_r_000001_0/out0-r-00001"
>>>>> - Aborting...
>>>>>
>>>>>
>>>>> 2015-08-20 22:39:42,184 WARN [Thread-51]
>>>>> org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>> No lease on
>>>>> /tmp/crunch-274499863/p510/output/_temporary/1/_temporary/attempt_1440102643297_out12_0103_r_000167_2/out12-r-00167
>>>>> (inode 83784): File does not exist. [Lease.  Holder:
>>>>> DFSClient_attempt_1440102643297_0103_r_000167_2_964529009_1,
>>>>> pendingcreates: 24]
>>>>> at
>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3516)
>>>>> at
>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.abandonBlock(FSNamesystem.java:3486)
>>>>> at
>>>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.abandonBlock(NameNodeRpcServer.java:687)
>>>>> at
>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.abandonBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:467)
>>>>> at
>>>>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>> at
>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:635)
>>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>>>>> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
>>>>> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>> at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>> at
>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
>>>>>
>>>>> at org.apache.hadoop.ipc.Client.call(Client.java:1468)
>>>>> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>>>>> at
>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:241)
>>>>> at com.sun.proxy.$Proxy13.abandonBlock(Unknown Source)
>>>>> at
>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.abandonBlock(ClientNamenodeProtocolTranslatorPB.java:376)
>>>>> at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>>>>> at
>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>> at java.lang.reflect.Method.invoke(Method.java:606)
>>>>> at
>>>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>>>>> at
>>>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>>>>> at com.sun.proxy.$Proxy14.abandonBlock(Unknown Source)
>>>>> at
>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1377)
>>>>> at
>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
>>>>> 2015-08-20 22:39:42,184 WARN [Thread-51]
>>>>> org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source
>>>>> file
>>>>> "/tmp/crunch-274499863/p510/output/_temporary/1/_temporary/attempt_1440102643297_out12_0103_r_000167_2/out12-r-00167"
>>>>> - Aborting...
>>>>>
>>>>>
>>>>>
>>>>> 2015-08-20 23:34:59,276 INFO [Thread-37]
>>>>> org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
>>>>> java.io.IOException: Bad connect ack with firstBadLink as
>>>>> 10.55.1.103:50010
>>>>> at
>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
>>>>> at
>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
>>>>> at
>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
>>>>> 2015-08-20 23:34:59,276 INFO [Thread-37]
>>>>> org.apache.hadoop.hdfs.DFSClient: Abandoning
>>>>> BP-835517662-10.55.1.32-1440102626965:blk_1073828261_95268
>>>>> 2015-08-20 23:34:59,278 INFO [Thread-37]
>>>>> org.apache.hadoop.hdfs.DFSClient: Excluding datanode 10.55.1.103:50010
>>>>> 2015-08-20 23:34:59,278 WARN [Thread-37]
>>>>> org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
>>>>> java.io.IOException: Unable to create new block.
>>>>> at
>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1386)
>>>>> at
>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
>>>>> 2015-08-20 23:34:59,278 WARN [Thread-37]
>>>>> org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source
>>>>> file
>>>>> "/tmp/crunch-274499863/p504/output/_temporary/1/_temporary/attempt_1440102643297_out0_0107_r_000001_2/out0-r-00001"
>>>>> - Aborting...
>>>>> 2015-08-20 23:34:59,279 WARN [main]
>>>>> org.apache.hadoop.mapred.YarnChild: Exception running child :
>>>>> org.apache.crunch.CrunchRuntimeException: java.io.IOException: Bad connect
>>>>> ack with firstBadLink as 10.55.1.103:50010
>>>>> at
>>>>> org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>>>>> at
>>>>> org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>>>>> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
>>>>> at
>>>>> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>>>> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>>>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:171)
>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>> at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>> at
>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>>>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:166)
>>>>> Caused by: java.io.IOException: Bad connect ack with firstBadLink as
>>>>> 10.55.1.103:50010
>>>>> at
>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
>>>>> at
>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
>>>>> at
>>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Aug 21, 2015 at 11:59 AM, Josh Wills <jw...@cloudera.com>
>>>>> wrote:
>>>>>
>>>>>> Curious how this went. :)
>>>>>>
>>>>>> On Tue, Aug 18, 2015 at 4:26 PM, Everett Anderson <ev...@nuna.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Sure, let me give it a try. I'm going to take 0.11 and patch it with
>>>>>>>
>>>>>>> https://issues.apache.org/jira/browse/CRUNCH-553
>>>>>>> https://issues.apache.org/jira/browse/CRUNCH-517
>>>>>>>
>>>>>>> as we also rely on 517.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Aug 18, 2015 at 4:09 PM, Josh Wills <jw...@cloudera.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> (In particular, I'm wondering if something in CRUNCH-481 is related
>>>>>>>> to this problem.)
>>>>>>>>
>>>>>>>> On Tue, Aug 18, 2015 at 4:07 PM, Josh Wills <jw...@cloudera.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hey Everett,
>>>>>>>>>
>>>>>>>>> Shot in the dark-- would you mind trying it w/0.11.0-hadoop2 w/the
>>>>>>>>> 553 patch? Is that easy to do?
>>>>>>>>>
>>>>>>>>> J
>>>>>>>>>
>>>>>>>>> On Tue, Aug 18, 2015 at 3:18 PM, Everett Anderson <
>>>>>>>>> everett@nuna.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I verified that the pipeline succeeds on the same cc2.8xlarge
>>>>>>>>>> hardware when setting crunch.max.running.jobs to 1. I generally
>>>>>>>>>> feel like the pipeline application logic itself is sound at this point. It
>>>>>>>>>> could be that this is just taxing these machines too hard and we need to
>>>>>>>>>> increase the number of retries?
>>>>>>>>>>
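
For anyone reproducing this kind of narrowing-down, a rough sketch of forcing
Crunch to run one MapReduce job at a time follows; the property name is the
one used in this thread, and the driver class and paths are made up for
illustration:

import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.hadoop.conf.Configuration;

public class SerialJobsSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Run one MapReduce job at a time so a failure can be traced back to a
    // single originating stage (the default allows several jobs in parallel).
    conf.setInt("crunch.max.running.jobs", 1);

    Pipeline pipeline = new MRPipeline(SerialJobsSketch.class, conf);
    PCollection<String> lines = pipeline.readTextFile(args[0]);
    // ... real transformations would go here ...
    pipeline.writeTextFile(lines, args[1]);
    pipeline.done();
  }
}
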
>>>>>>>>>> It reliably fails on this hardware when crunch.max.running.jobs
>>>>>>>>>> is set to its default.
>>>>>>>>>>
>>>>>>>>>> Can you explain a little what the /tmp/crunch-XXXXXXX files are
>>>>>>>>>> as well as how Crunch uses side effect files? Do you know if HDFS would
>>>>>>>>>> clean up those directories from underneath Crunch?
>>>>>>>>>>
>>>>>>>>>> There are usually 4 failed applications, each failing in its reduce tasks.
>>>>>>>>>> The failures seem to be one of the following three kinds -- (1) No lease on
>>>>>>>>>> a <side effect file>, (2) File not found for a </tmp/crunch-XXXXXXX> file, (3)
>>>>>>>>>> SocketTimeoutException.
>>>>>>>>>>
>>>>>>>>>> Examples:
>>>>>>>>>>
>>>>>>>>>> [1] No lease exception
>>>>>>>>>>
>>>>>>>>>> Error: org.apache.crunch.CrunchRuntimeException:
>>>>>>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>>>>>>> No lease on
>>>>>>>>>> /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003:
>>>>>>>>>> File does not exist. Holder
>>>>>>>>>> DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have
>>>>>>>>>> any open files. at
>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>>>>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at
>>>>>>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at
>>>>>>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at
>>>>>>>>>> java.security.AccessController.doPrivileged(Native Method) at
>>>>>>>>>> javax.security.auth.Subject.doAs(Subject.java:415) at
>>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at
>>>>>>>>>> org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>>>>>>>>>> at
>>>>>>>>>> org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>>>>>>>>>> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195) at
>>>>>>>>>> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656) at
>>>>>>>>>> org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394) at
>>>>>>>>>> org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175) at
>>>>>>>>>> java.security.AccessController.doPrivileged(Native Method) at
>>>>>>>>>> javax.security.auth.Subject.doAs(Subject.java:415) at
>>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170) Caused by:
>>>>>>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>>>>>>> No lease on
>>>>>>>>>> /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003:
>>>>>>>>>> File does not exist. Holder
>>>>>>>>>> DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have
>>>>>>>>>> any open files. at
>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>>>>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at
>>>>>>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at
>>>>>>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at
>>>>>>>>>> java.security.AccessController.doPrivileged(Native Method) at
>>>>>>>>>> javax.security.auth.Subject.doAs(Subject.java:415) at
>>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at
>>>>>>>>>> org.apache.hadoop.ipc.Client.call(Client.java:1410) at
>>>>>>>>>> org.apache.hadoop.ipc.Client.call(Client.java:1363) at
>>>>>>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
>>>>>>>>>> at com.sun.proxy.$Proxy13.complete(Unknown Source) at
>>>>>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
>>>>>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>>>>>> at
>>>>>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>>>> at java.lang.reflect.Method.invoke(Method.java:606) at
>>>>>>>>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
>>>>>>>>>> at com.sun.proxy.$Proxy13.complete(Unknown Source) at
>>>>>>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.complete(ClientNamenodeProtocolTranslatorPB.java:404)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2130)
>>>>>>>>>> at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2114)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
>>>>>>>>>> at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:1289)
>>>>>>>>>> at
>>>>>>>>>> org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.close(SequenceFileOutputFormat.java:87)
>>>>>>>>>> at
>>>>>>>>>> org.apache.crunch.io.CrunchOutputs$OutputState.close(CrunchOutputs.java:300)
>>>>>>>>>> at org.apache.crunch.io.CrunchOutputs.close(CrunchOutputs.java:180) at
>>>>>>>>>> org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:72)
>>>>>>>>>> ... 9 more
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> [2] File does not exist
>>>>>>>>>>
>>>>>>>>>> 2015-08-18 17:36:10,195 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1439917295505_0034_r_000004_1: Error: org.apache.crunch.CrunchRuntimeException: Could not read runtime node information
>>>>>>>>>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:48)
>>>>>>>>>> 	at org.apache.crunch.impl.mr.run.CrunchReducer.setup(CrunchReducer.java:40)
>>>>>>>>>> 	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:172)
>>>>>>>>>> 	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>>>>>>>>> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>>>>>>>>> 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>>>>>>>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>>>>>>>>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>>>>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>>> 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>>>>>>>>>> Caused by: java.io.FileNotFoundException: File does not exist: /tmp/crunch-4694113/p470/REDUCE
>>>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>>>>>>> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>>>>>>> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>>>>>>>>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>>>>>>>>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>>>>>>>>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>>>>>>>>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>>>>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>>> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>>>>>>>>>
>>>>>>>>>> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>>>>>>>> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>>>>>>>>> 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>>>>>>>> 	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>>>>>>>>> 	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>>>>>>>>>> 	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1147)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1135)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1125)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:273)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:240)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:233)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1298)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:300)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:296)
>>>>>>>>>> 	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:296)
>>>>>>>>>> 	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
>>>>>>>>>> 	at org.apache.crunch.util.DistCache.read(DistCache.java:72)
>>>>>>>>>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:46)
>>>>>>>>>> 	... 9 more
>>>>>>>>>>
>>>>>>>>>> [3] SocketTimeoutException
>>>>>>>>>>
>>>>>>>>>> Error: org.apache.crunch.CrunchRuntimeException: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200]
>>>>>>>>>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>>>>>>>>>> 	at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>>>>>>>>>> 	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
>>>>>>>>>> 	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>>>>>>>>> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>>>>>>>>> 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>>>>>>>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>>>>>>>>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>>>>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>>> 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>>>>>>>>>> Caused by: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200]
>>>>>>>>>> 	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>>>>>>>>>> 	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>>>>>>>>>> 	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>>>>>>>>>> 	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
>>>>>>>>>> 	at java.io.FilterInputStream.read(FilterInputStream.java:83)
>>>>>>>>>> 	at java.io.FilterInputStream.read(FilterInputStream.java:83)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1985)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1075)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1042)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1186)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:935)
>>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:491)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Aug 14, 2015 at 3:54 PM, Everett Anderson <
>>>>>>>>>> everett@nuna.com> wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Aug 14, 2015 at 3:26 PM, Josh Wills <jwills@cloudera.com
>>>>>>>>>>> > wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey Everett,
>>>>>>>>>>>>
>>>>>>>>>>>> Initial thought-- there are lots of reasons for lease expired
>>>>>>>>>>>> exceptions, and they're usually more symptomatic of other problems in the
>>>>>>>>>>>> pipeline. Are you sure none of the jobs in the Crunch pipeline on the
>>>>>>>>>>>> non-SSD instances are failing for some other reason? I'd be surprised if no
>>>>>>>>>>>> other errors showed up in the app master, although there are reports of
>>>>>>>>>>>> some weirdness around LeaseExpireds when writing to S3-- but you're not
>>>>>>>>>>>> doing that here, right?
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> We're reading from and writing to HDFS, here. (We've copied in
>>>>>>>>>>> input from S3 to HDFS in another step.)
>>>>>>>>>>>
>>>>>>>>>>> There are a few exceptions in the logs. Most seem related to
>>>>>>>>>>> missing temp files.
>>>>>>>>>>>
>>>>>>>>>>> Let me see if I can reproduce it with crunch.max.running.jobs
>>>>>>>>>>> set to 1 to try to narrow down the originating failure.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> J
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Aug 14, 2015 at 2:10 PM, Everett Anderson <
>>>>>>>>>>>> everett@nuna.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I recently started trying to run our Crunch pipeline on more
>>>>>>>>>>>>> data and have been trying out different AWS instance types in anticipation
>>>>>>>>>>>>> of our storage and compute needs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I was using EMR 3.8 (so Hadoop 2.4.0) with Crunch 0.12
>>>>>>>>>>>>> (patched with the CRUNCH-553
>>>>>>>>>>>>> <https://issues.apache.org/jira/browse/CRUNCH-553> fix).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Our pipeline finishes fine in these cluster configurations:
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - 50 c3.4xlarge Core, 0 Task
>>>>>>>>>>>>>    - 10 c3.8xlarge Core, 0 Task
>>>>>>>>>>>>>    - 25 c3.8xlarge Core, 0 Task
>>>>>>>>>>>>>
>>>>>>>>>>>>> However, it always fails on the same data when using 10
>>>>>>>>>>>>> cc2.8xlarge Core instances.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The biggest obvious hardware difference is that the
>>>>>>>>>>>>> cc2.8xlarges use hard disks instead of SSDs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> While it's a little hard to track down the exact originating
>>>>>>>>>>>>> failure, I think it's from errors like:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2015-08-13 21:34:38,379 ERROR [IPC Server handler 24 on 45711]
>>>>>>>>>>>>> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task:
>>>>>>>>>>>>> attempt_1439499407003_0028_r_000153_1 - exited :
>>>>>>>>>>>>> org.apache.crunch.CrunchRuntimeException:
>>>>>>>>>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>>>>>>>>>> No lease on
>>>>>>>>>>>>> /tmp/crunch-970849245/p662/output/_temporary/1/_temporary/attempt_1439499407003_out7_0028_r_000153_1/out7-r-00153:
>>>>>>>>>>>>> File does not exist. Holder
>>>>>>>>>>>>> DFSClient_attempt_1439499407003_0028_r_000153_1_609888542_1 does not have
>>>>>>>>>>>>> any open files.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Those paths look like these side effect files
>>>>>>>>>>>>> <https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapred/FileOutputFormat.html#getWorkOutputPath(org.apache.hadoop.mapred.JobConf)>
>>>>>>>>>>>>> .
>>>>>>>>>>>>>
>>>>>>>>>>>>> Would Crunch have generated applications that depend on side
>>>>>>>>>>>>> effect paths as input across MapReduce applications and something in HDFS
>>>>>>>>>>>>> is cleaning up those paths, unaware of the higher level dependencies? AWS
>>>>>>>>>>>>> configures Hadoop differently for each instance type, and might have more
>>>>>>>>>>>>> aggressive cleanup settings on HDs, though this is very uninformed
>>>>>>>>>>>>> hypothesis.
>>>>>>>>>>>>>
>>>>>>>>>>>>> A sample full log is attached.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for any guidance!
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Everett
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Director of Data Science
>>>>>>>>>>>> Cloudera <http://www.cloudera.com>
>>>>>>>>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Director of Data Science
>>>>>>>>> Cloudera <http://www.cloudera.com>
>>>>>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Director of Data Science
>>>>>>>> Cloudera <http://www.cloudera.com>
>>>>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Director of Data Science
>>>>>> Cloudera <http://www.cloudera.com>
>>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>> *DISCLAIMER:* The contents of this email, including any attachments,
>>> may contain information that is confidential, proprietary in nature,
>>> protected health information (PHI), or otherwise protected by law from
>>> disclosure, and is solely for the use of the intended recipient(s). If you
>>> are not the intended recipient, you are hereby notified that any use,
>>> disclosure or copying of this email, including any attachments, is
>>> unauthorized and strictly prohibited. If you have received this email in
>>> error, please notify the sender of this email. Please delete this and all
>>> copies of this email from your system. Any opinions either expressed or
>>> implied in this email and all attachments, are those of its author only,
>>> and do not necessarily reflect those of Nuna Health, Inc.
>>>
>>
>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>
> *DISCLAIMER:* The contents of this email, including any attachments, may
> contain information that is confidential, proprietary in nature, protected
> health information (PHI), or otherwise protected by law from disclosure,
> and is solely for the use of the intended recipient(s). If you are not the
> intended recipient, you are hereby notified that any use, disclosure or
> copying of this email, including any attachments, is unauthorized and
> strictly prohibited. If you have received this email in error, please
> notify the sender of this email. Please delete this and all copies of this
> email from your system. Any opinions either expressed or implied in this
> email and all attachments, are those of its author only, and do not
> necessarily reflect those of Nuna Health, Inc.

Re: LeaseExpiredExceptions and temp side effect files

Posted by Everett Anderson <ev...@nuna.com>.
On Thu, Sep 24, 2015 at 5:46 PM, Josh Wills <jw...@cloudera.com> wrote:

> Hrm. If you never call Pipeline.done, the temporary files for the job
> should never get cleaned up...
>

Interesting.

We're currently exploring giving the datanodes more memory as there's some
evidence they were getting overloaded.

Right now, our Crunch pipeline is long, with many stages, but not all data
is used in each stage. If our problem is that we're overloading some part
of HDFS (in other cluster configurations we have hit our disk capacity
cap), I wonder if it would help if we DID somehow prune away temporary
outputs that are no longer needed.
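
To make the idea concrete, here's a minimal sketch of what I mean -- the class
name SegmentedDriver, the UpperCaseFn placeholder DoFn, and the /data paths are
all made up, and this is only an outline of the approach, not something we've
verified fixes the HDFS errors. Each segment writes its output to a named HDFS
path and is forced to completion with run(), so the next segment reads durable
files rather than Crunch's temporary side effect outputs:

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.From;
import org.apache.crunch.io.To;
import org.apache.hadoop.conf.Configuration;

public class SegmentedDriver {

  // Placeholder transformation; stands in for the real per-stage logic.
  static class UpperCaseFn extends DoFn<String, String> {
    @Override
    public void process(String input, Emitter<String> emitter) {
      emitter.emit(input.toUpperCase());
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Pipeline pipeline = new MRPipeline(SegmentedDriver.class, conf);

    // Segment 1: read raw input, transform it, and persist the result to a
    // named path instead of leaving it only in Crunch's temp directory.
    PCollection<String> raw = pipeline.readTextFile("/data/input");
    PCollection<String> stageOne = raw.parallelDo(new UpperCaseFn(), raw.getPType());
    pipeline.write(stageOne, To.textFile("/data/stage1"));
    pipeline.run();  // force the segment-1 jobs to finish here

    // Segment 2: read the durable segment-1 output back in, so later jobs
    // no longer depend on segment-1's temporary side effect files.
    PCollection<String> stageTwo =
        pipeline.read(From.textFile("/data/stage1"))
                .parallelDo(new UpperCaseFn(), raw.getPType());
    pipeline.write(stageTwo, To.textFile("/data/output"));
    pipeline.done();  // run the rest and clean up temp files at the very end
  }
}

The explicit run() between segments is what creates the boundary; whether that
actually relieves the pressure on HDFS, or just trades temp space for
duplicated durable output, is what we'd want to measure.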

>
> On Thu, Sep 24, 2015 at 5:44 PM, Everett Anderson <ev...@nuna.com>
> wrote:
>
>> While we tried to take comfort in the fact that we'd only seen this on
>> HD-based cc2.8xlarges, I'm afraid we're now seeing it when processing
>> larger amounts of data on SSD-based c3.4x8larges.
>>
>> My two hypotheses are
>>
>> 1) Somehow these temp files are getting cleaned up before they're
>> accessed for the last time. Perhaps either something in HDFS or Hadoop
>> cleans up these temp directories, or perhaps there's a bug in Crunch's
>> planner.
>>
>> 2) HDFS has chosen 3 machines to replicate data to, but it is performing
>> a very lopsided replication. While the cluster overall looks like it has
>> HDFS capacity, perhaps a small subset of the machines is actually at
>> capacity. Things seem to fail in obscure ways when running out of disk.
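>>
>> (As a way to sanity-check hypothesis 2, the sketch below asks the NameNode for
>> per-datanode usage -- assuming it runs with the cluster's Hadoop configuration
>> on the classpath and fs.defaultFS pointing at HDFS; the class name is made up.)
>>
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.fs.FileSystem;
>> import org.apache.hadoop.hdfs.DistributedFileSystem;
>> import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
>>
>> public class DatanodeUsage {
>>   public static void main(String[] args) throws Exception {
>>     DistributedFileSystem dfs =
>>         (DistributedFileSystem) FileSystem.get(new Configuration());
>>     // A lopsided cluster shows up as a few nodes that are nearly full even
>>     // though the aggregate capacity still looks comfortable.
>>     for (DatanodeInfo dn : dfs.getDataNodeStats()) {
>>       long pctUsed = 100L * dn.getDfsUsed() / Math.max(1L, dn.getCapacity());
>>       System.out.println(dn.getHostName() + ": " + pctUsed + "% used, "
>>           + dn.getRemaining() / (1024L * 1024 * 1024) + " GB remaining");
>>     }
>>   }
>> }
>>
>> (hdfs dfsadmin -report should show the same per-node numbers from the command
>> line.)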
>>
>>
>> 2015-09-24 23:28:58,850 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.crunch.CrunchRuntimeException: Could not read runtime node information
>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:48)
>> 	at org.apache.crunch.impl.mr.run.CrunchReducer.setup(CrunchReducer.java:40)
>> 	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:172)
>> 	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>> 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>> 	at java.security.AccessController.doPrivileged(Native Method)
>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>> 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>> Caused by: java.io.FileNotFoundException: File does not exist: /tmp/crunch-2031291770/p567/REDUCE
>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
>> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
>> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
>> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>> 	at java.security.AccessController.doPrivileged(Native Method)
>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>
>> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>> 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>> 	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>> 	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>> 	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>> 	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1147)
>> 	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1135)
>> 	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1125)
>> 	at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:273)
>> 	at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:240)
>> 	at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:233)
>> 	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1298)
>> 	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:300)
>> 	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:296)
>> 	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>> 	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:296)
>> 	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
>> 	at org.apache.crunch.util.DistCache.read(DistCache.java:72)
>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:46)
>> 	... 9 more
>> Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /tmp/crunch-2031291770/p567/REDUCE
>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
>> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
>> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
>> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>> 	at java.security.AccessController.doPrivileged(Native Method)
>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>
>> 	at org.apache.hadoop.ipc.Client.call(Client.java:1410)
>> 	at org.apache.hadoop.ipc.Client.call(Client.java:1363)
>> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
>> 	at com.sun.proxy.$Proxy13.getBlockLocations(Unknown Source)
>> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> 	at java.lang.reflect.Method.invoke(Method.java:606)
>> 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
>> 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
>> 	at com.sun.proxy.$Proxy13.getBlockLocations(Unknown Source)
>> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:219)
>> 	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1145)
>> 	... 22 more
>>
>>
>> On Fri, Aug 21, 2015 at 3:52 PM, Jeff Quinn <je...@nuna.com> wrote:
>>
>>> Also worth noting: we inspected the Hadoop configuration defaults that
>>> the AWS EMR service populates for the two different instance types. For
>>> mapred-site.xml, core-site.xml, and hdfs-site.xml, all settings were
>>> identical, with the exception of slight differences in the JVM memory
>>> allotted. We also investigated the max number of file descriptors for
>>> each instance type via ulimit and saw no differences there either.
>>>
>>> So we're not sure what the main difference is between these two clusters
>>> that would cause such different outcomes, other than cc2.8xlarge having
>>> spinning disks and c3.8xlarge having SSDs.
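>>>
>>> (One way to double-check that the jobs really see identical settings is to
>>> dump the effective configuration on each cluster and diff the output -- a
>>> quick sketch, with the class name made up:)
>>>
>>> import java.io.OutputStreamWriter;
>>> import org.apache.hadoop.conf.Configuration;
>>>
>>> public class DumpEffectiveConf {
>>>   public static void main(String[] args) throws Exception {
>>>     // Load the same site files the jobs use from the classpath.
>>>     Configuration conf = new Configuration();
>>>     conf.addResource("hdfs-site.xml");
>>>     conf.addResource("mapred-site.xml");
>>>     // Writes every effective property (and the resource it came from) as JSON.
>>>     Configuration.dumpConfiguration(conf, new OutputStreamWriter(System.out, "UTF-8"));
>>>   }
>>> }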
>>>
>>> On Fri, Aug 21, 2015 at 1:03 PM, Everett Anderson <ev...@nuna.com>
>>> wrote:
>>>
>>>> Hey,
>>>>
>>>> Jeff graciously agreed to try it out.
>>>>
>>>> I'm afraid we're still getting failures on that instance type, and with
>>>> 0.11 plus the patches, the cluster ended up in a state in which no new
>>>> applications could be submitted afterwards.
>>>>
>>>> The errors when running the pipeline seem to be similarly HDFS related.
>>>> It's quite odd.
>>>>
>>>> Examples when using 0.11 + the patches:
>>>>
>>>>
>>>> 2015-08-20 23:17:50,455 WARN [Thread-38]
>>>> org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source
>>>> file
>>>> "/tmp/crunch-274499863/p504/output/_temporary/1/_temporary/attempt_1440102643297_out0_0107_r_000001_0/out0-r-00001"
>>>> - Aborting...
>>>>
>>>>
>>>> 2015-08-20 22:39:42,184 WARN [Thread-51]
>>>> org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>> No lease on
>>>> /tmp/crunch-274499863/p510/output/_temporary/1/_temporary/attempt_1440102643297_out12_0103_r_000167_2/out12-r-00167
>>>> (inode 83784): File does not exist. [Lease.  Holder:
>>>> DFSClient_attempt_1440102643297_0103_r_000167_2_964529009_1,
>>>> pendingcreates: 24]
>>>> at
>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3516)
>>>> at
>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.abandonBlock(FSNamesystem.java:3486)
>>>> at
>>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.abandonBlock(NameNodeRpcServer.java:687)
>>>> at
>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.abandonBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:467)
>>>> at
>>>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>> at
>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:635)
>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>>>> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
>>>> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>> at javax.security.auth.Subject.doAs(Subject.java:415)
>>>> at
>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
>>>>
>>>> at org.apache.hadoop.ipc.Client.call(Client.java:1468)
>>>> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>>>> at
>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:241)
>>>> at com.sun.proxy.$Proxy13.abandonBlock(Unknown Source)
>>>> at
>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.abandonBlock(ClientNamenodeProtocolTranslatorPB.java:376)
>>>> at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>>>> at
>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>> at java.lang.reflect.Method.invoke(Method.java:606)
>>>> at
>>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>>>> at
>>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>>>> at com.sun.proxy.$Proxy14.abandonBlock(Unknown Source)
>>>> at
>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1377)
>>>> at
>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
>>>> 2015-08-20 22:39:42,184 WARN [Thread-51]
>>>> org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source
>>>> file
>>>> "/tmp/crunch-274499863/p510/output/_temporary/1/_temporary/attempt_1440102643297_out12_0103_r_000167_2/out12-r-00167"
>>>> - Aborting...
>>>>
>>>>
>>>>
>>>> 2015-08-20 23:34:59,276 INFO [Thread-37]
>>>> org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
>>>> java.io.IOException: Bad connect ack with firstBadLink as
>>>> 10.55.1.103:50010
>>>> at
>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
>>>> at
>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
>>>> at
>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
>>>> 2015-08-20 23:34:59,276 INFO [Thread-37]
>>>> org.apache.hadoop.hdfs.DFSClient: Abandoning
>>>> BP-835517662-10.55.1.32-1440102626965:blk_1073828261_95268
>>>> 2015-08-20 23:34:59,278 INFO [Thread-37]
>>>> org.apache.hadoop.hdfs.DFSClient: Excluding datanode 10.55.1.103:50010
>>>> 2015-08-20 23:34:59,278 WARN [Thread-37]
>>>> org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
>>>> java.io.IOException: Unable to create new block.
>>>> at
>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1386)
>>>> at
>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
>>>> 2015-08-20 23:34:59,278 WARN [Thread-37]
>>>> org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source
>>>> file
>>>> "/tmp/crunch-274499863/p504/output/_temporary/1/_temporary/attempt_1440102643297_out0_0107_r_000001_2/out0-r-00001"
>>>> - Aborting...
>>>> 2015-08-20 23:34:59,279 WARN [main] org.apache.hadoop.mapred.YarnChild:
>>>> Exception running child : org.apache.crunch.CrunchRuntimeException:
>>>> java.io.IOException: Bad connect ack with firstBadLink as
>>>> 10.55.1.103:50010
>>>> at
>>>> org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>>>> at
>>>> org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>>>> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
>>>> at
>>>> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>>> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:171)
>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>> at javax.security.auth.Subject.doAs(Subject.java:415)
>>>> at
>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:166)
>>>> Caused by: java.io.IOException: Bad connect ack with firstBadLink as
>>>> 10.55.1.103:50010
>>>> at
>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
>>>> at
>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
>>>> at
>>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Aug 21, 2015 at 11:59 AM, Josh Wills <jw...@cloudera.com>
>>>> wrote:
>>>>
>>>>> Curious how this went. :)
>>>>>
>>>>> On Tue, Aug 18, 2015 at 4:26 PM, Everett Anderson <ev...@nuna.com>
>>>>> wrote:
>>>>>
>>>>>> Sure, let me give it a try. I'm going to take 0.11 and patch it with
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/CRUNCH-553
>>>>>> https://issues.apache.org/jira/browse/CRUNCH-517
>>>>>>
>>>>>> as we also rely on 517.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Aug 18, 2015 at 4:09 PM, Josh Wills <jw...@cloudera.com>
>>>>>> wrote:
>>>>>>
>>>>>>> (In particular, I'm wondering if something in CRUNCH-481 is related
>>>>>>> to this problem.)
>>>>>>>
>>>>>>> On Tue, Aug 18, 2015 at 4:07 PM, Josh Wills <jw...@cloudera.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hey Everett,
>>>>>>>>
>>>>>>>> Shot in the dark-- would you mind trying it w/0.11.0-hadoop2 w/the
>>>>>>>> 553 patch? Is that easy to do?
>>>>>>>>
>>>>>>>> J
>>>>>>>>
>>>>>>>> On Tue, Aug 18, 2015 at 3:18 PM, Everett Anderson <everett@nuna.com
>>>>>>>> > wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I verified that the pipeline succeeds on the same cc2.8xlarge
>>>>>>>>> hardware when setting crunch.max.running.jobs to 1. I generally
>>>>>>>>> feel like the pipeline application logic itself is sound at this point. It
>>>>>>>>> could be that this is just taxing these machines too hard and we need to
>>>>>>>>> increase the number of retries?
>>>>>>>>>
>>>>>>>>> It reliably fails on this hardware when crunch.max.running.jobs is
>>>>>>>>> set to its default.
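>>>>>>>>>
>>>>>>>>> (For reference, the throttle can also be set programmatically -- a minimal
>>>>>>>>> sketch, assuming the driver builds its own Configuration; ThrottledDriver is
>>>>>>>>> a made-up name, and only the crunch.max.running.jobs property comes from
>>>>>>>>> Crunch itself:)
>>>>>>>>>
>>>>>>>>> import org.apache.crunch.Pipeline;
>>>>>>>>> import org.apache.crunch.impl.mr.MRPipeline;
>>>>>>>>> import org.apache.hadoop.conf.Configuration;
>>>>>>>>>
>>>>>>>>> public class ThrottledDriver {
>>>>>>>>>   public static void main(String[] args) throws Exception {
>>>>>>>>>     Configuration conf = new Configuration();
>>>>>>>>>     // Run at most one MapReduce job at a time, mirroring the setup
>>>>>>>>>     // under which the pipeline succeeded.
>>>>>>>>>     conf.setInt("crunch.max.running.jobs", 1);
>>>>>>>>>     Pipeline pipeline = new MRPipeline(ThrottledDriver.class, conf);
>>>>>>>>>     // ... build and run the pipeline as usual ...
>>>>>>>>>   }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> (Passing -Dcrunch.max.running.jobs=1 on the command line should also work
>>>>>>>>> when the driver goes through ToolRunner.)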
>>>>>>>>>
>>>>>>>>> Can you explain a little what the /tmp/crunch-XXXXXXX files are as
>>>>>>>>> well as how Crunch uses side effect files? Do you know if HDFS would clean
>>>>>>>>> up those directories from underneath Crunch?
>>>>>>>>>
>>>>>>>>> There are usually 4 failed applications, all failing in their reduce tasks.
>>>>>>>>> The failures seem to be of one of the following three kinds: (1) no lease on a
>>>>>>>>> <side effect file>, (2) file not found for a </tmp/crunch-XXXXXXX> path, (3)
>>>>>>>>> SocketTimeoutException.
>>>>>>>>>
>>>>>>>>> Examples:
>>>>>>>>>
>>>>>>>>> [1] No lease exception
>>>>>>>>>
>>>>>>>>> Error: org.apache.crunch.CrunchRuntimeException:
>>>>>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>>>>>> No lease on
>>>>>>>>> /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003:
>>>>>>>>> File does not exist. Holder
>>>>>>>>> DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have
>>>>>>>>> any open files. at
>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
>>>>>>>>> at
>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
>>>>>>>>> at
>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
>>>>>>>>> at
>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
>>>>>>>>> at
>>>>>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
>>>>>>>>> at
>>>>>>>>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>>>>>> at
>>>>>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>>>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at
>>>>>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at
>>>>>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at
>>>>>>>>> java.security.AccessController.doPrivileged(Native Method) at
>>>>>>>>> javax.security.auth.Subject.doAs(Subject.java:415) at
>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at
>>>>>>>>> org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>>>>>>>>> at
>>>>>>>>> org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>>>>>>>>> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195) at
>>>>>>>>> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656) at
>>>>>>>>> org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394) at
>>>>>>>>> org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175) at
>>>>>>>>> java.security.AccessController.doPrivileged(Native Method) at
>>>>>>>>> javax.security.auth.Subject.doAs(Subject.java:415) at
>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170) Caused by:
>>>>>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>>>>>> No lease on
>>>>>>>>> /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003:
>>>>>>>>> File does not exist. Holder
>>>>>>>>> DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have
>>>>>>>>> any open files. at
>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
>>>>>>>>> at
>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
>>>>>>>>> at
>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
>>>>>>>>> at
>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
>>>>>>>>> at
>>>>>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
>>>>>>>>> at
>>>>>>>>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>>>>>> at
>>>>>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>>>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at
>>>>>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at
>>>>>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at
>>>>>>>>> java.security.AccessController.doPrivileged(Native Method) at
>>>>>>>>> javax.security.auth.Subject.doAs(Subject.java:415) at
>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at
>>>>>>>>> org.apache.hadoop.ipc.Client.call(Client.java:1410) at
>>>>>>>>> org.apache.hadoop.ipc.Client.call(Client.java:1363) at
>>>>>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
>>>>>>>>> at com.sun.proxy.$Proxy13.complete(Unknown Source) at
>>>>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
>>>>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>>>>> at
>>>>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>>> at java.lang.reflect.Method.invoke(Method.java:606) at
>>>>>>>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
>>>>>>>>> at
>>>>>>>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
>>>>>>>>> at com.sun.proxy.$Proxy13.complete(Unknown Source) at
>>>>>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.complete(ClientNamenodeProtocolTranslatorPB.java:404)
>>>>>>>>> at
>>>>>>>>> org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2130)
>>>>>>>>> at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2114)
>>>>>>>>> at
>>>>>>>>> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
>>>>>>>>> at
>>>>>>>>> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
>>>>>>>>> at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:1289)
>>>>>>>>> at
>>>>>>>>> org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.close(SequenceFileOutputFormat.java:87)
>>>>>>>>> at
>>>>>>>>> org.apache.crunch.io.CrunchOutputs$OutputState.close(CrunchOutputs.java:300)
>>>>>>>>> at org.apache.crunch.io.CrunchOutputs.close(CrunchOutputs.java:180) at
>>>>>>>>> org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:72)
>>>>>>>>> ... 9 more
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> [2] File does not exist
>>>>>>>>>
>>>>>>>>> 2015-08-18 17:36:10,195 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1439917295505_0034_r_000004_1: Error: org.apache.crunch.CrunchRuntimeException: Could not read runtime node information
>>>>>>>>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:48)
>>>>>>>>> 	at org.apache.crunch.impl.mr.run.CrunchReducer.setup(CrunchReducer.java:40)
>>>>>>>>> 	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:172)
>>>>>>>>> 	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>>>>>>>> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>>>>>>>> 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>>>>>>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>>>>>>>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>>>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>> 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>>>>>>>>> Caused by: java.io.FileNotFoundException: File does not exist: /tmp/crunch-4694113/p470/REDUCE
>>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
>>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
>>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
>>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
>>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
>>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
>>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
>>>>>>>>> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
>>>>>>>>> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>>>>>> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>>>>>> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>>>>>>>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>>>>>>>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>>>>>>>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>>>>>>>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>>>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>>> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>>>>>>>>
>>>>>>>>> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>>>>>>> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>>>>>>>> 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>>>>>>> 	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>>>>>>>> 	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>>>>>>>>> 	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1147)
>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1135)
>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1125)
>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:273)
>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:240)
>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:233)
>>>>>>>>> 	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1298)
>>>>>>>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:300)
>>>>>>>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:296)
>>>>>>>>> 	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>>>>>>>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:296)
>>>>>>>>> 	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
>>>>>>>>> 	at org.apache.crunch.util.DistCache.read(DistCache.java:72)
>>>>>>>>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:46)
>>>>>>>>> 	... 9 more
>>>>>>>>>
>>>>>>>>> [3] SocketTimeoutException
>>>>>>>>>
>>>>>>>>> Error: org.apache.crunch.CrunchRuntimeException: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200] at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74) at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64) at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195) at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170) Caused by: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200] at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118) at java.io.FilterInputStream.read(FilterInputStream.java:83) at java.io.FilterInputStream.read(FilterInputStream.java:83) at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1985) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1075) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1042) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1186) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:935) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:491)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Aug 14, 2015 at 3:54 PM, Everett Anderson <
>>>>>>>>> everett@nuna.com> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Aug 14, 2015 at 3:26 PM, Josh Wills <jw...@cloudera.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey Everett,
>>>>>>>>>>>
>>>>>>>>>>> Initial thought-- there are lots of reasons for lease expired
>>>>>>>>>>> exceptions, and they're usually more symptomatic of other problems in the
>>>>>>>>>>> pipeline. Are you sure none of the jobs in the Crunch pipeline on the
>>>>>>>>>>> non-SSD instances are failing for some other reason? I'd be surprised if no
>>>>>>>>>>> other errors showed up in the app master, although there are reports of
>>>>>>>>>>> some weirdness around LeaseExpireds when writing to S3-- but you're not
>>>>>>>>>>> doing that here, right?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> We're reading from and writing to HDFS here. (We copied the
>>>>>>>>>> input from S3 to HDFS in another step.)
>>>>>>>>>>
>>>>>>>>>> There are a few exceptions in the logs. Most seem related to
>>>>>>>>>> missing temp files.
>>>>>>>>>>
>>>>>>>>>> Let me see if I can reproduce it with crunch.max.running.jobs
>>>>>>>>>> set to 1 to try to narrow down the originating failure.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> J
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Aug 14, 2015 at 2:10 PM, Everett Anderson <
>>>>>>>>>>> everett@nuna.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I recently started trying to run our Crunch pipeline on more
>>>>>>>>>>>> data and have been trying out different AWS instance types in anticipation
>>>>>>>>>>>> of our storage and compute needs.
>>>>>>>>>>>>
>>>>>>>>>>>> I was using EMR 3.8 (so Hadoop 2.4.0) with Crunch 0.12 (patched
>>>>>>>>>>>> with the CRUNCH-553
>>>>>>>>>>>> <https://issues.apache.org/jira/browse/CRUNCH-553> fix).
>>>>>>>>>>>>
>>>>>>>>>>>> Our pipeline finishes fine in these cluster configurations:
>>>>>>>>>>>>
>>>>>>>>>>>>    - 50 c3.4xlarge Core, 0 Task
>>>>>>>>>>>>    - 10 c3.8xlarge Core, 0 Task
>>>>>>>>>>>>    - 25 c3.8xlarge Core, 0 Task
>>>>>>>>>>>>
>>>>>>>>>>>> However, it always fails on the same data when using 10
>>>>>>>>>>>> cc2.8xlarge Core instances.
>>>>>>>>>>>>
>>>>>>>>>>>> The biggest obvious hardware difference is that the
>>>>>>>>>>>> cc2.8xlarges use hard disks instead of SSDs.
>>>>>>>>>>>>
>>>>>>>>>>>> While it's a little hard to track down the exact originating
>>>>>>>>>>>> failure, I think it's from errors like:
>>>>>>>>>>>>
>>>>>>>>>>>> 2015-08-13 21:34:38,379 ERROR [IPC Server handler 24 on 45711]
>>>>>>>>>>>> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task:
>>>>>>>>>>>> attempt_1439499407003_0028_r_000153_1 - exited :
>>>>>>>>>>>> org.apache.crunch.CrunchRuntimeException:
>>>>>>>>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>>>>>>>>> No lease on
>>>>>>>>>>>> /tmp/crunch-970849245/p662/output/_temporary/1/_temporary/attempt_1439499407003_out7_0028_r_000153_1/out7-r-00153:
>>>>>>>>>>>> File does not exist. Holder
>>>>>>>>>>>> DFSClient_attempt_1439499407003_0028_r_000153_1_609888542_1 does not have
>>>>>>>>>>>> any open files.
>>>>>>>>>>>>
>>>>>>>>>>>> Those paths look like these side effect files
>>>>>>>>>>>> <https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapred/FileOutputFormat.html#getWorkOutputPath(org.apache.hadoop.mapred.JobConf)>
>>>>>>>>>>>> .
>>>>>>>>>>>>
>>>>>>>>>>>> Would Crunch have generated applications that depend on side
>>>>>>>>>>>> effect paths as input across MapReduce applications and something in HDFS
>>>>>>>>>>>> is cleaning up those paths, unaware of the higher level dependencies? AWS
>>>>>>>>>>>> configures Hadoop differently for each instance type, and might have more
>>>>>>>>>>>> aggressive cleanup settings on HDs, though this is very uninformed
>>>>>>>>>>>> hypothesis.
>>>>>>>>>>>>
>>>>>>>>>>>> A sample full log is attached.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for any guidance!
>>>>>>>>>>>>
>>>>>>>>>>>> - Everett

-- 
*DISCLAIMER:* The contents of this email, including any attachments, may 
contain information that is confidential, proprietary in nature, protected 
health information (PHI), or otherwise protected by law from disclosure, 
and is solely for the use of the intended recipient(s). If you are not the 
intended recipient, you are hereby notified that any use, disclosure or 
copying of this email, including any attachments, is unauthorized and 
strictly prohibited. If you have received this email in error, please 
notify the sender of this email. Please delete this and all copies of this 
email from your system. Any opinions either expressed or implied in this 
email and all attachments, are those of its author only, and do not 
necessarily reflect those of Nuna Health, Inc.

Re: LeaseExpiredExceptions and temp side effect files

Posted by Josh Wills <jw...@cloudera.com>.
Hrm. If you never call Pipeline.done, the temporary files for the job
should never get cleaned up...

On Thu, Sep 24, 2015 at 5:44 PM, Everett Anderson <ev...@nuna.com> wrote:

> While we tried to take comfort in the fact that we'd only seen this on
> HD-based cc2.8xlarges, I'm afraid we're now seeing it when processing
> larger amounts of data on SSD-based c3.4x8larges.
>
> My two hypotheses are
>
> 1) Somehow these temp files are getting cleaned up before they're accessed
> for the last time. Perhaps either something in HDFS or Hadoop cleans up
> these temp directories, or perhaps there's a bug in Crunch's planner.
>
> 2) HDFS has chosen 3 machines to replicate data to, but it is performing a
> very lopsided replication. While the cluster overall looks like it has HDFS
> capacity, perhaps a small subset of the machines is actually at capacity.
> Things seem to fail in obscure ways when running out of disk.
>
>
> 2015-09-24 23:28:58,850 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.crunch.CrunchRuntimeException: Could not read runtime node information
> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:48)
> 	at org.apache.crunch.impl.mr.run.CrunchReducer.setup(CrunchReducer.java:40)
> 	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:172)
> 	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
> 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:415)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
> Caused by: java.io.FileNotFoundException: File does not exist: /tmp/crunch-2031291770/p567/REDUCE
> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:415)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>
> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> 	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> 	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
> 	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
> 	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1147)
> 	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1135)
> 	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1125)
> 	at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:273)
> 	at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:240)
> 	at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:233)
> 	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1298)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:300)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:296)
> 	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:296)
> 	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
> 	at org.apache.crunch.util.DistCache.read(DistCache.java:72)
> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:46)
> 	... 9 more
> Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /tmp/crunch-2031291770/p567/REDUCE
> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:415)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1410)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1363)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
> 	at com.sun.proxy.$Proxy13.getBlockLocations(Unknown Source)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:606)
> 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
> 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
> 	at com.sun.proxy.$Proxy13.getBlockLocations(Unknown Source)
> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:219)
> 	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1145)
> 	... 22 more
>
>
> On Fri, Aug 21, 2015 at 3:52 PM, Jeff Quinn <je...@nuna.com> wrote:
>
>> Also worth noting: we inspected the Hadoop configuration defaults that
>> the AWS EMR service populates for the two different instance types. For
>> mapred-site.xml, core-site.xml, and hdfs-site.xml, all settings were
>> identical, with the exception of slight differences in the JVM memory
>> allotted. We also investigated the max number of file descriptors for
>> each instance type via ulimit and saw no differences there either.
>>
>> So we're not sure what the main difference is between these two clusters
>> that would cause such different outcomes, other than cc2.8xlarge having
>> spinning disks and c3.8xlarge having SSDs.
>>
>> On Fri, Aug 21, 2015 at 1:03 PM, Everett Anderson <ev...@nuna.com>
>> wrote:
>>
>>> Hey,
>>>
>>> Jeff graciously agreed to try it out.
>>>
>>> I'm afraid we're still getting failures on that instance type, and with
>>> 0.11 plus the patches, the cluster ended up in a state in which no new
>>> applications could be submitted afterwards.
>>>
>>> The errors when running the pipeline seem to be similarly HDFS related.
>>> It's quite odd.
>>>
>>> Examples when using 0.11 + the patches:
>>>
>>>
>>> 2015-08-20 23:17:50,455 WARN [Thread-38]
>>> org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source
>>> file
>>> "/tmp/crunch-274499863/p504/output/_temporary/1/_temporary/attempt_1440102643297_out0_0107_r_000001_0/out0-r-00001"
>>> - Aborting...
>>>
>>>
>>> 2015-08-20 22:39:42,184 WARN [Thread-51]
>>> org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>> No lease on
>>> /tmp/crunch-274499863/p510/output/_temporary/1/_temporary/attempt_1440102643297_out12_0103_r_000167_2/out12-r-00167
>>> (inode 83784): File does not exist. [Lease.  Holder:
>>> DFSClient_attempt_1440102643297_0103_r_000167_2_964529009_1,
>>> pendingcreates: 24]
>>> at
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3516)
>>> at
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.abandonBlock(FSNamesystem.java:3486)
>>> at
>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.abandonBlock(NameNodeRpcServer.java:687)
>>> at
>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.abandonBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:467)
>>> at
>>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>> at
>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:635)
>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>>> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
>>> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
>>> at java.security.AccessController.doPrivileged(Native Method)
>>> at javax.security.auth.Subject.doAs(Subject.java:415)
>>> at
>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
>>>
>>> at org.apache.hadoop.ipc.Client.call(Client.java:1468)
>>> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>>> at
>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:241)
>>> at com.sun.proxy.$Proxy13.abandonBlock(Unknown Source)
>>> at
>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.abandonBlock(ClientNamenodeProtocolTranslatorPB.java:376)
>>> at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>>> at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> at java.lang.reflect.Method.invoke(Method.java:606)
>>> at
>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>>> at
>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>>> at com.sun.proxy.$Proxy14.abandonBlock(Unknown Source)
>>> at
>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1377)
>>> at
>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
>>> 2015-08-20 22:39:42,184 WARN [Thread-51]
>>> org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source
>>> file
>>> "/tmp/crunch-274499863/p510/output/_temporary/1/_temporary/attempt_1440102643297_out12_0103_r_000167_2/out12-r-00167"
>>> - Aborting...
>>>
>>>
>>>
>>> 2015-08-20 23:34:59,276 INFO [Thread-37]
>>> org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
>>> java.io.IOException: Bad connect ack with firstBadLink as
>>> 10.55.1.103:50010
>>> at
>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
>>> at
>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
>>> at
>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
>>> 2015-08-20 23:34:59,276 INFO [Thread-37]
>>> org.apache.hadoop.hdfs.DFSClient: Abandoning
>>> BP-835517662-10.55.1.32-1440102626965:blk_1073828261_95268
>>> 2015-08-20 23:34:59,278 INFO [Thread-37]
>>> org.apache.hadoop.hdfs.DFSClient: Excluding datanode 10.55.1.103:50010
>>> 2015-08-20 23:34:59,278 WARN [Thread-37]
>>> org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
>>> java.io.IOException: Unable to create new block.
>>> at
>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1386)
>>> at
>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
>>> 2015-08-20 23:34:59,278 WARN [Thread-37]
>>> org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source
>>> file
>>> "/tmp/crunch-274499863/p504/output/_temporary/1/_temporary/attempt_1440102643297_out0_0107_r_000001_2/out0-r-00001"
>>> - Aborting...
>>> 2015-08-20 23:34:59,279 WARN [main] org.apache.hadoop.mapred.YarnChild:
>>> Exception running child : org.apache.crunch.CrunchRuntimeException:
>>> java.io.IOException: Bad connect ack with firstBadLink as
>>> 10.55.1.103:50010
>>> at
>>> org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>>> at
>>> org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>>> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
>>> at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:171)
>>> at java.security.AccessController.doPrivileged(Native Method)
>>> at javax.security.auth.Subject.doAs(Subject.java:415)
>>> at
>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:166)
>>> Caused by: java.io.IOException: Bad connect ack with firstBadLink as
>>> 10.55.1.103:50010
>>> at
>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
>>> at
>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
>>> at
>>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Fri, Aug 21, 2015 at 11:59 AM, Josh Wills <jw...@cloudera.com>
>>> wrote:
>>>
>>>> Curious how this went. :)
>>>>
>>>> On Tue, Aug 18, 2015 at 4:26 PM, Everett Anderson <ev...@nuna.com>
>>>> wrote:
>>>>
>>>>> Sure, let me give it a try. I'm going to take 0.11 and patch it with
>>>>>
>>>>> https://issues.apache.org/jira/browse/CRUNCH-553
>>>>> https://issues.apache.org/jira/browse/CRUNCH-517
>>>>>
>>>>> as we also rely on 517.
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Aug 18, 2015 at 4:09 PM, Josh Wills <jw...@cloudera.com>
>>>>> wrote:
>>>>>
>>>>>> (In particular, I'm wondering if something in CRUNCH-481 is related
>>>>>> to this problem.)
>>>>>>
>>>>>> On Tue, Aug 18, 2015 at 4:07 PM, Josh Wills <jw...@cloudera.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hey Everett,
>>>>>>>
>>>>>>> Shot in the dark-- would you mind trying it w/0.11.0-hadoop2 w/the
>>>>>>> 553 patch? Is that easy to do?
>>>>>>>
>>>>>>> J
>>>>>>>
>>>>>>> On Tue, Aug 18, 2015 at 3:18 PM, Everett Anderson <ev...@nuna.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I verified that the pipeline succeeds on the same cc2.8xlarge
>>>>>>>> hardware when setting crunch.max.running.jobs to 1. I generally
>>>>>>>> feel like the pipeline application logic itself is sound at this point. It
>>>>>>>> could be that this is just taxing these machines too hard and we need to
>>>>>>>> increase the number of retries?
>>>>>>>>
>>>>>>>> It reliably fails on this hardware when crunch.max.running.jobs is
>>>>>>>> set to its default.
>>>>>>>>
>>>>>>>> Can you explain a little what the /tmp/crunch-XXXXXXX files are as
>>>>>>>> well as how Crunch uses side effect files? Do you know if HDFS would clean
>>>>>>>> up those directories from underneath Crunch?
>>>>>>>>
>>>>>>>> There are usually 4 failed applications, failing in their reduce tasks.
>>>>>>>> The failures seem to be one of the following three kinds -- (1) No lease on
>>>>>>>> <side effect file>, (2) File not found </tmp/crunch-XXXXXXX> file, (3)
>>>>>>>> SocketTimeoutException.
>>>>>>>>
>>>>>>>> Examples:
>>>>>>>>
>>>>>>>> [1] No lease exception
>>>>>>>>
>>>>>>>> Error: org.apache.crunch.CrunchRuntimeException:
>>>>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>>>>> No lease on
>>>>>>>> /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003:
>>>>>>>> File does not exist. Holder
>>>>>>>> DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have
>>>>>>>> any open files. at
>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
>>>>>>>> at
>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
>>>>>>>> at
>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
>>>>>>>> at
>>>>>>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
>>>>>>>> at
>>>>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
>>>>>>>> at
>>>>>>>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>>>>> at
>>>>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at
>>>>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at
>>>>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at
>>>>>>>> java.security.AccessController.doPrivileged(Native Method) at
>>>>>>>> javax.security.auth.Subject.doAs(Subject.java:415) at
>>>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at
>>>>>>>> org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>>>>>>>> at
>>>>>>>> org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>>>>>>>> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195) at
>>>>>>>> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656) at
>>>>>>>> org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394) at
>>>>>>>> org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175) at
>>>>>>>> java.security.AccessController.doPrivileged(Native Method) at
>>>>>>>> javax.security.auth.Subject.doAs(Subject.java:415) at
>>>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170) Caused by:
>>>>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>>>>> No lease on
>>>>>>>> /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003:
>>>>>>>> File does not exist. Holder
>>>>>>>> DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have
>>>>>>>> any open files. at
>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
>>>>>>>> at
>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
>>>>>>>> at
>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
>>>>>>>> at
>>>>>>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
>>>>>>>> at
>>>>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
>>>>>>>> at
>>>>>>>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>>>>> at
>>>>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at
>>>>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at
>>>>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at
>>>>>>>> java.security.AccessController.doPrivileged(Native Method) at
>>>>>>>> javax.security.auth.Subject.doAs(Subject.java:415) at
>>>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at
>>>>>>>> org.apache.hadoop.ipc.Client.call(Client.java:1410) at
>>>>>>>> org.apache.hadoop.ipc.Client.call(Client.java:1363) at
>>>>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
>>>>>>>> at com.sun.proxy.$Proxy13.complete(Unknown Source) at
>>>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
>>>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>>>> at
>>>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>> at java.lang.reflect.Method.invoke(Method.java:606) at
>>>>>>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
>>>>>>>> at
>>>>>>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
>>>>>>>> at com.sun.proxy.$Proxy13.complete(Unknown Source) at
>>>>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.complete(ClientNamenodeProtocolTranslatorPB.java:404)
>>>>>>>> at
>>>>>>>> org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2130)
>>>>>>>> at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2114)
>>>>>>>> at
>>>>>>>> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
>>>>>>>> at
>>>>>>>> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
>>>>>>>> at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:1289)
>>>>>>>> at
>>>>>>>> org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.close(SequenceFileOutputFormat.java:87)
>>>>>>>> at
>>>>>>>> org.apache.crunch.io.CrunchOutputs$OutputState.close(CrunchOutputs.java:300)
>>>>>>>> at org.apache.crunch.io.CrunchOutputs.close(CrunchOutputs.java:180) at
>>>>>>>> org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:72)
>>>>>>>> ... 9 more
>>>>>>>>
>>>>>>>>
>>>>>>>> [2] File does not exist
>>>>>>>>
>>>>>>>> 2015-08-18 17:36:10,195 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1439917295505_0034_r_000004_1: Error: org.apache.crunch.CrunchRuntimeException: Could not read runtime node information
>>>>>>>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:48)
>>>>>>>> 	at org.apache.crunch.impl.mr.run.CrunchReducer.setup(CrunchReducer.java:40)
>>>>>>>> 	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:172)
>>>>>>>> 	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>>>>>>> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>>>>>>> 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>>>>>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>>>>>>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>> 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>>>>>>>> Caused by: java.io.FileNotFoundException: File does not exist: /tmp/crunch-4694113/p470/REDUCE
>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
>>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
>>>>>>>> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
>>>>>>>> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>>>>> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>>>>> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>>>>>>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>>>>>>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>>>>>>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>>>>>>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>>> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>>>>>>>
>>>>>>>> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>>>>>> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>>>>>>> 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>>>>>> 	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>>>>>>> 	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>>>>>>>> 	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>>>>>>>> 	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1147)
>>>>>>>> 	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1135)
>>>>>>>> 	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1125)
>>>>>>>> 	at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:273)
>>>>>>>> 	at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:240)
>>>>>>>> 	at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:233)
>>>>>>>> 	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1298)
>>>>>>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:300)
>>>>>>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:296)
>>>>>>>> 	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>>>>>>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:296)
>>>>>>>> 	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
>>>>>>>> 	at org.apache.crunch.util.DistCache.read(DistCache.java:72)
>>>>>>>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:46)
>>>>>>>> 	... 9 more
>>>>>>>>
>>>>>>>> [3] SocketTimeoutException
>>>>>>>>
>>>>>>>> Error: org.apache.crunch.CrunchRuntimeException: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200] at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74) at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64) at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195) at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170) Caused by: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200] at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118) at java.io.FilterInputStream.read(FilterInputStream.java:83) at java.io.FilterInputStream.read(FilterInputStream.java:83) at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1985) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1075) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1042) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1186) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:935) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:491)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Aug 14, 2015 at 3:54 PM, Everett Anderson <everett@nuna.com
>>>>>>>> > wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Aug 14, 2015 at 3:26 PM, Josh Wills <jw...@cloudera.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hey Everett,
>>>>>>>>>>
>>>>>>>>>> Initial thought-- there are lots of reasons for lease expired
>>>>>>>>>> exceptions, and they're usually more symptomatic of other problems in the
>>>>>>>>>> pipeline. Are you sure none of the jobs in the Crunch pipeline on the
>>>>>>>>>> non-SSD instances are failing for some other reason? I'd be surprised if no
>>>>>>>>>> other errors showed up in the app master, although there are reports of
>>>>>>>>>> some weirdness around LeaseExpireds when writing to S3-- but you're not
>>>>>>>>>> doing that here, right?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> We're reading from and writing to HDFS, here. (We've copied in
>>>>>>>>> input from S3 to HDFS in another step.)
>>>>>>>>>
>>>>>>>>> There are a few exceptions in the logs. Most seem related to
>>>>>>>>> missing temp files.
>>>>>>>>>
>>>>>>>>> Let me see if I can reproduce it with crunch.max.running.jobs set
>>>>>>>>> to 1 to try to narrow down the originating failure.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> J
>>>>>>>>>>
>>>>>>>>>> On Fri, Aug 14, 2015 at 2:10 PM, Everett Anderson <
>>>>>>>>>> everett@nuna.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I recently started trying to run our Crunch pipeline on more
>>>>>>>>>>> data and have been trying out different AWS instance types in anticipation
>>>>>>>>>>> of our storage and compute needs.
>>>>>>>>>>>
>>>>>>>>>>> I was using EMR 3.8 (so Hadoop 2.4.0) with Crunch 0.12 (patched
>>>>>>>>>>> with the CRUNCH-553
>>>>>>>>>>> <https://issues.apache.org/jira/browse/CRUNCH-553> fix).
>>>>>>>>>>>
>>>>>>>>>>> Our pipeline finishes fine in these cluster configurations:
>>>>>>>>>>>
>>>>>>>>>>>    - 50 c3.4xlarge Core, 0 Task
>>>>>>>>>>>    - 10 c3.8xlarge Core, 0 Task
>>>>>>>>>>>    - 25 c3.8xlarge Core, 0 Task
>>>>>>>>>>>
>>>>>>>>>>> However, it always fails on the same data when using 10
>>>>>>>>>>> cc2.8xlarge Core instances.
>>>>>>>>>>>
>>>>>>>>>>> The biggest obvious hardware difference is that the cc2.8xlarges
>>>>>>>>>>> use hard disks instead of SSDs.
>>>>>>>>>>>
>>>>>>>>>>> While it's a little hard to track down the exact originating
>>>>>>>>>>> failure, I think it's from errors like:
>>>>>>>>>>>
>>>>>>>>>>> 2015-08-13 21:34:38,379 ERROR [IPC Server handler 24 on 45711]
>>>>>>>>>>> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task:
>>>>>>>>>>> attempt_1439499407003_0028_r_000153_1 - exited :
>>>>>>>>>>> org.apache.crunch.CrunchRuntimeException:
>>>>>>>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>>>>>>>> No lease on
>>>>>>>>>>> /tmp/crunch-970849245/p662/output/_temporary/1/_temporary/attempt_1439499407003_out7_0028_r_000153_1/out7-r-00153:
>>>>>>>>>>> File does not exist. Holder
>>>>>>>>>>> DFSClient_attempt_1439499407003_0028_r_000153_1_609888542_1 does not have
>>>>>>>>>>> any open files.
>>>>>>>>>>>
>>>>>>>>>>> Those paths look like these side effect files
>>>>>>>>>>> <https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapred/FileOutputFormat.html#getWorkOutputPath(org.apache.hadoop.mapred.JobConf)>
>>>>>>>>>>> .
>>>>>>>>>>>
>>>>>>>>>>> Would Crunch have generated applications that depend on side
>>>>>>>>>>> effect paths as input across MapReduce applications and something in HDFS
>>>>>>>>>>> is cleaning up those paths, unaware of the higher level dependencies? AWS
>>>>>>>>>>> configures Hadoop differently for each instance type, and might have more
>>>>>>>>>>> aggressive cleanup settings on HDs, though this is very uninformed
>>>>>>>>>>> hypothesis.
>>>>>>>>>>>
>>>>>>>>>>> A sample full log is attached.
>>>>>>>>>>>
>>>>>>>>>>> Thanks for any guidance!
>>>>>>>>>>>
>>>>>>>>>>> - Everett
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> *DISCLAIMER:* The contents of this email, including any
>>>>>>>>>>> attachments, may contain information that is confidential, proprietary in
>>>>>>>>>>> nature, protected health information (PHI), or otherwise protected by law
>>>>>>>>>>> from disclosure, and is solely for the use of the intended recipient(s). If
>>>>>>>>>>> you are not the intended recipient, you are hereby notified that any use,
>>>>>>>>>>> disclosure or copying of this email, including any attachments, is
>>>>>>>>>>> unauthorized and strictly prohibited. If you have received this email in
>>>>>>>>>>> error, please notify the sender of this email. Please delete this and all
>>>>>>>>>>> copies of this email from your system. Any opinions either expressed or
>>>>>>>>>>> implied in this email and all attachments, are those of its author only,
>>>>>>>>>>> and do not necessarily reflect those of Nuna Health, Inc.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Director of Data Science
>>>>>>>>>> Cloudera <http://www.cloudera.com>
>>>>>>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> *DISCLAIMER:* The contents of this email, including any
>>>>>>>> attachments, may contain information that is confidential, proprietary in
>>>>>>>> nature, protected health information (PHI), or otherwise protected by law
>>>>>>>> from disclosure, and is solely for the use of the intended recipient(s). If
>>>>>>>> you are not the intended recipient, you are hereby notified that any use,
>>>>>>>> disclosure or copying of this email, including any attachments, is
>>>>>>>> unauthorized and strictly prohibited. If you have received this email in
>>>>>>>> error, please notify the sender of this email. Please delete this and all
>>>>>>>> copies of this email from your system. Any opinions either expressed or
>>>>>>>> implied in this email and all attachments, are those of its author only,
>>>>>>>> and do not necessarily reflect those of Nuna Health, Inc.
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Director of Data Science
>>>>>>> Cloudera <http://www.cloudera.com>
>>>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Director of Data Science
>>>>>> Cloudera <http://www.cloudera.com>
>>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>>>
>>>>>
>>>>>
>>>>> *DISCLAIMER:* The contents of this email, including any attachments,
>>>>> may contain information that is confidential, proprietary in nature,
>>>>> protected health information (PHI), or otherwise protected by law from
>>>>> disclosure, and is solely for the use of the intended recipient(s). If you
>>>>> are not the intended recipient, you are hereby notified that any use,
>>>>> disclosure or copying of this email, including any attachments, is
>>>>> unauthorized and strictly prohibited. If you have received this email in
>>>>> error, please notify the sender of this email. Please delete this and all
>>>>> copies of this email from your system. Any opinions either expressed or
>>>>> implied in this email and all attachments, are those of its author only,
>>>>> and do not necessarily reflect those of Nuna Health, Inc.
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Director of Data Science
>>>> Cloudera <http://www.cloudera.com>
>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>
>>>
>>>
>>
>
> *DISCLAIMER:* The contents of this email, including any attachments, may
> contain information that is confidential, proprietary in nature, protected
> health information (PHI), or otherwise protected by law from disclosure,
> and is solely for the use of the intended recipient(s). If you are not the
> intended recipient, you are hereby notified that any use, disclosure or
> copying of this email, including any attachments, is unauthorized and
> strictly prohibited. If you have received this email in error, please
> notify the sender of this email. Please delete this and all copies of this
> email from your system. Any opinions either expressed or implied in this
> email and all attachments, are those of its author only, and do not
> necessarily reflect those of Nuna Health, Inc.
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Re: LeaseExpiredExceptions and temp side effect files

Posted by Everett Anderson <ev...@nuna.com>.
While we tried to take comfort in the fact that we'd only seen this on
HD-based cc2.8xlarges, I'm afraid we're now seeing it when processing
larger amounts of data on SSD-based c3.4x8larges.

My two hypotheses are:

1) Somehow these temp files are getting cleaned up before they're accessed
for the last time. Perhaps either something in HDFS or Hadoop cleans up
these temp directories, or perhaps there's a bug in Crunch's planner.

2) HDFS has chosen 3 machines to replicate data to, but it is performing a
very lopsided replication. While the cluster overall looks like it has HDFS
capacity, perhaps a small subset of the machines is actually at capacity.
Things seem to fail in obscure ways when running out of disk.
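
If it turns out to be hypothesis 1, one workaround we could try is pointing
Crunch's temp root somewhere other than /tmp, so that nothing sweeping /tmp
can touch the planner's intermediate output. A minimal sketch of what I mean,
assuming the standard crunch.tmp.dir property is what controls where these
directories land (the target path and class name below are made up):

import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.hadoop.conf.Configuration;

public class PipelineLauncher {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Keep Crunch's intermediate output out of /tmp (hypothetical path).
    conf.set("crunch.tmp.dir", "/user/hadoop/crunch-staging");
    // Run one MR job at a time while debugging, as we did earlier.
    conf.set("crunch.max.running.jobs", "1");
    Pipeline pipeline = new MRPipeline(PipelineLauncher.class, conf);
    // ... build the PCollections/PTables and writes as usual, then:
    pipeline.done();
  }
}

For hypothesis 2, comparing per-datanode usage from "hdfs dfsadmin -report"
while the pipeline is running should show whether a handful of nodes are
filling up even though the aggregate HDFS capacity looks fine.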


2015-09-24 23:28:58,850 WARN [main]
org.apache.hadoop.mapred.YarnChild: Exception running child :
org.apache.crunch.CrunchRuntimeException: Could not read runtime node
information
	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:48)
	at org.apache.crunch.impl.mr.run.CrunchReducer.setup(CrunchReducer.java:40)
	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:172)
	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
Caused by: java.io.FileNotFoundException: File does not exist:
/tmp/crunch-2031291770/p567/REDUCE
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)

	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1147)
	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1135)
	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1125)
	at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:273)
	at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:240)
	at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:233)
	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1298)
	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:300)
	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:296)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:296)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
	at org.apache.crunch.util.DistCache.read(DistCache.java:72)
	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:46)
	... 9 more
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException):
File does not exist: /tmp/crunch-2031291770/p567/REDUCE
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)

	at org.apache.hadoop.ipc.Client.call(Client.java:1410)
	at org.apache.hadoop.ipc.Client.call(Client.java:1363)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
	at com.sun.proxy.$Proxy13.getBlockLocations(Unknown Source)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
	at com.sun.proxy.$Proxy13.getBlockLocations(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:219)
	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1145)
	... 22 more


On Fri, Aug 21, 2015 at 3:52 PM, Jeff Quinn <je...@nuna.com> wrote:

> Also worth noting: we inspected the Hadoop configuration defaults that the
> AWS EMR service populates for the two different instance types. For
> mapred-site.xml, core-site.xml, and hdfs-site.xml, all settings were
> identical, with the exception of slight differences in the JVM memory
> allotted. We further investigated the max number of file descriptors for
> each instance type via ulimit, and saw no differences there either.
>
> So we're not sure what the main difference is between these two clusters
> that would cause these very different outcomes, other than the c3.8xlarges
> having SSDs and the cc2.8xlarges having spinning disks.
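>
> (If anyone wants to repeat the comparison: diffing the generated *-site.xml
> files on a node of each cluster type, plus spot checks along the lines of
> "hdfs getconf -confKey dfs.datanode.du.reserved" and "ulimit -n", should be
> enough to reproduce it. Those particular keys are just examples, not
> necessarily the exact ones we looked at.)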
>
> On Fri, Aug 21, 2015 at 1:03 PM, Everett Anderson <ev...@nuna.com>
> wrote:
>
>> Hey,
>>
>> Jeff graciously agreed to try it out.
>>
>> I'm afraid we're still getting failures on that instance type, though
>> with 0.11 plus the patches, the cluster ended up in a state where no new
>> applications could be submitted afterwards.
>>
>> The errors when running the pipeline seem to be similarly HDFS related.
>> It's quite odd.
>>
>> Examples when using 0.11 + the patches:
>>
>>
>> 2015-08-20 23:17:50,455 WARN [Thread-38]
>> org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source
>> file
>> "/tmp/crunch-274499863/p504/output/_temporary/1/_temporary/attempt_1440102643297_out0_0107_r_000001_0/out0-r-00001"
>> - Aborting...
>>
>>
>> 2015-08-20 22:39:42,184 WARN [Thread-51]
>> org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>> No lease on
>> /tmp/crunch-274499863/p510/output/_temporary/1/_temporary/attempt_1440102643297_out12_0103_r_000167_2/out12-r-00167
>> (inode 83784): File does not exist. [Lease.  Holder:
>> DFSClient_attempt_1440102643297_0103_r_000167_2_964529009_1,
>> pendingcreates: 24]
>> at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3516)
>> at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.abandonBlock(FSNamesystem.java:3486)
>> at
>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.abandonBlock(NameNodeRpcServer.java:687)
>> at
>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.abandonBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:467)
>> at
>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>> at
>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:635)
>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
>> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at javax.security.auth.Subject.doAs(Subject.java:415)
>> at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
>>
>> at org.apache.hadoop.ipc.Client.call(Client.java:1468)
>> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>> at
>> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:241)
>> at com.sun.proxy.$Proxy13.abandonBlock(Unknown Source)
>> at
>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.abandonBlock(ClientNamenodeProtocolTranslatorPB.java:376)
>> at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606)
>> at
>> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>> at
>> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>> at com.sun.proxy.$Proxy14.abandonBlock(Unknown Source)
>> at
>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1377)
>> at
>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
>> 2015-08-20 22:39:42,184 WARN [Thread-51]
>> org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source
>> file
>> "/tmp/crunch-274499863/p510/output/_temporary/1/_temporary/attempt_1440102643297_out12_0103_r_000167_2/out12-r-00167"
>> - Aborting...
>>
>>
>>
>> 2015-08-20 23:34:59,276 INFO [Thread-37]
>> org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
>> java.io.IOException: Bad connect ack with firstBadLink as
>> 10.55.1.103:50010
>> at
>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
>> at
>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
>> at
>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
>> 2015-08-20 23:34:59,276 INFO [Thread-37]
>> org.apache.hadoop.hdfs.DFSClient: Abandoning
>> BP-835517662-10.55.1.32-1440102626965:blk_1073828261_95268
>> 2015-08-20 23:34:59,278 INFO [Thread-37]
>> org.apache.hadoop.hdfs.DFSClient: Excluding datanode 10.55.1.103:50010
>> 2015-08-20 23:34:59,278 WARN [Thread-37]
>> org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
>> java.io.IOException: Unable to create new block.
>> at
>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1386)
>> at
>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
>> 2015-08-20 23:34:59,278 WARN [Thread-37]
>> org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source
>> file
>> "/tmp/crunch-274499863/p504/output/_temporary/1/_temporary/attempt_1440102643297_out0_0107_r_000001_2/out0-r-00001"
>> - Aborting...
>> 2015-08-20 23:34:59,279 WARN [main] org.apache.hadoop.mapred.YarnChild:
>> Exception running child : org.apache.crunch.CrunchRuntimeException:
>> java.io.IOException: Bad connect ack with firstBadLink as
>> 10.55.1.103:50010
>> at
>> org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>> at
>> org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
>> at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:171)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at javax.security.auth.Subject.doAs(Subject.java:415)
>> at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:166)
>> Caused by: java.io.IOException: Bad connect ack with firstBadLink as
>> 10.55.1.103:50010
>> at
>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
>> at
>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
>> at
>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Fri, Aug 21, 2015 at 11:59 AM, Josh Wills <jw...@cloudera.com> wrote:
>>
>>> Curious how this went. :)
>>>
>>> On Tue, Aug 18, 2015 at 4:26 PM, Everett Anderson <ev...@nuna.com>
>>> wrote:
>>>
>>>> Sure, let me give it a try. I'm going to take 0.11 and patch it with
>>>>
>>>> https://issues.apache.org/jira/browse/CRUNCH-553
>>>> https://issues.apache.org/jira/browse/CRUNCH-517
>>>>
>>>> as we also rely on 517.
>>>>
>>>>
>>>>
>>>> On Tue, Aug 18, 2015 at 4:09 PM, Josh Wills <jw...@cloudera.com>
>>>> wrote:
>>>>
>>>>> (In particular, I'm wondering if something in CRUNCH-481 is related to
>>>>> this problem.)
>>>>>
>>>>> On Tue, Aug 18, 2015 at 4:07 PM, Josh Wills <jw...@cloudera.com>
>>>>> wrote:
>>>>>
>>>>>> Hey Everett,
>>>>>>
>>>>>> Shot in the dark-- would you mind trying it w/0.11.0-hadoop2 w/the
>>>>>> 553 patch? Is that easy to do?
>>>>>>
>>>>>> J
>>>>>>
>>>>>> On Tue, Aug 18, 2015 at 3:18 PM, Everett Anderson <ev...@nuna.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I verified that the pipeline succeeds on the same cc2.8xlarge
>>>>>>> hardware when setting crunch.max.running.jobs to 1. I generally
>>>>>>> feel like the pipeline application logic itself is sound at this point. It
>>>>>>> could be that this is just taxing these machines too hard and we need to
>>>>>>> increase the number of retries?
>>>>>>>
>>>>>>> It reliably fails on this hardware when crunch.max.running.jobs is set
>>>>>>> to its default.
>>>>>>>
>>>>>>> Can you explain a little what the /tmp/crunch-XXXXXXX files are as
>>>>>>> well as how Crunch uses side effect files? Do you know if HDFS would clean
>>>>>>> up those directories from underneath Crunch?
>>>>>>>
>>>>>>> There are usually 4 failed applications, failing in their reduce tasks. The
>>>>>>> failures seem to be one of the following three kinds -- (1) No lease on
>>>>>>> <side effect file>, (2) File not found </tmp/crunch-XXXXXXX> file, (3)
>>>>>>> SocketTimeoutException.
>>>>>>>
>>>>>>> Examples:
>>>>>>>
>>>>>>> [1] No lease exception
>>>>>>>
>>>>>>> Error: org.apache.crunch.CrunchRuntimeException:
>>>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>>>> No lease on
>>>>>>> /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003:
>>>>>>> File does not exist. Holder
>>>>>>> DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have
>>>>>>> any open files. at
>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
>>>>>>> at
>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
>>>>>>> at
>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
>>>>>>> at
>>>>>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
>>>>>>> at
>>>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
>>>>>>> at
>>>>>>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>>>> at
>>>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at
>>>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at
>>>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at
>>>>>>> java.security.AccessController.doPrivileged(Native Method) at
>>>>>>> javax.security.auth.Subject.doAs(Subject.java:415) at
>>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at
>>>>>>> org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>>>>>>> at
>>>>>>> org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>>>>>>> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195) at
>>>>>>> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656) at
>>>>>>> org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394) at
>>>>>>> org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175) at
>>>>>>> java.security.AccessController.doPrivileged(Native Method) at
>>>>>>> javax.security.auth.Subject.doAs(Subject.java:415) at
>>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170) Caused by:
>>>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>>>> No lease on
>>>>>>> /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003:
>>>>>>> File does not exist. Holder
>>>>>>> DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have
>>>>>>> any open files. at
>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
>>>>>>> at
>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
>>>>>>> at
>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
>>>>>>> at
>>>>>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
>>>>>>> at
>>>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
>>>>>>> at
>>>>>>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>>>> at
>>>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at
>>>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at
>>>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at
>>>>>>> java.security.AccessController.doPrivileged(Native Method) at
>>>>>>> javax.security.auth.Subject.doAs(Subject.java:415) at
>>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at
>>>>>>> org.apache.hadoop.ipc.Client.call(Client.java:1410) at
>>>>>>> org.apache.hadoop.ipc.Client.call(Client.java:1363) at
>>>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
>>>>>>> at com.sun.proxy.$Proxy13.complete(Unknown Source) at
>>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
>>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>>> at
>>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>> at java.lang.reflect.Method.invoke(Method.java:606) at
>>>>>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
>>>>>>> at
>>>>>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
>>>>>>> at com.sun.proxy.$Proxy13.complete(Unknown Source) at
>>>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.complete(ClientNamenodeProtocolTranslatorPB.java:404)
>>>>>>> at
>>>>>>> org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2130)
>>>>>>> at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2114)
>>>>>>> at
>>>>>>> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
>>>>>>> at
>>>>>>> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
>>>>>>> at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:1289)
>>>>>>> at
>>>>>>> org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.close(SequenceFileOutputFormat.java:87)
>>>>>>> at
>>>>>>> org.apache.crunch.io.CrunchOutputs$OutputState.close(CrunchOutputs.java:300)
>>>>>>> at org.apache.crunch.io.CrunchOutputs.close(CrunchOutputs.java:180) at
>>>>>>> org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:72)
>>>>>>> ... 9 more
>>>>>>>
>>>>>>>
>>>>>>> [2] File does not exist
>>>>>>>
>>>>>>> 2015-08-18 17:36:10,195 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1439917295505_0034_r_000004_1: Error: org.apache.crunch.CrunchRuntimeException: Could not read runtime node information
>>>>>>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:48)
>>>>>>> 	at org.apache.crunch.impl.mr.run.CrunchReducer.setup(CrunchReducer.java:40)
>>>>>>> 	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:172)
>>>>>>> 	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>>>>>> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>>>>>> 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>>>>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>>>>>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>> 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>>>>>>> Caused by: java.io.FileNotFoundException: File does not exist: /tmp/crunch-4694113/p470/REDUCE
>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
>>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
>>>>>>> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
>>>>>>> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>>>> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>>>> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>>>>>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>>>>>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>>>>>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>>>>>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>>> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>>>>>>
>>>>>>> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>>>>> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>>>>>> 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>>>>> 	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>>>>>> 	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>>>>>>> 	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>>>>>>> 	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1147)
>>>>>>> 	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1135)
>>>>>>> 	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1125)
>>>>>>> 	at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:273)
>>>>>>> 	at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:240)
>>>>>>> 	at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:233)
>>>>>>> 	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1298)
>>>>>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:300)
>>>>>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:296)
>>>>>>> 	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>>>>>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:296)
>>>>>>> 	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
>>>>>>> 	at org.apache.crunch.util.DistCache.read(DistCache.java:72)
>>>>>>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:46)
>>>>>>> 	... 9 more
>>>>>>>
>>>>>>> [3] SocketTimeoutException
>>>>>>>
>>>>>>> Error: org.apache.crunch.CrunchRuntimeException: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200] at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74) at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64) at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195) at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170) Caused by: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200] at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118) at java.io.FilterInputStream.read(FilterInputStream.java:83) at java.io.FilterInputStream.read(FilterInputStream.java:83) at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1985) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1075) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1042) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1186) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:935) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:491)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Aug 14, 2015 at 3:54 PM, Everett Anderson <ev...@nuna.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Aug 14, 2015 at 3:26 PM, Josh Wills <jw...@cloudera.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hey Everett,
>>>>>>>>>
>>>>>>>>> Initial thought-- there are lots of reasons for lease expired
>>>>>>>>> exceptions, and they're usually more symptomatic of other problems in the
>>>>>>>>> pipeline. Are you sure none of the jobs in the Crunch pipeline on the
>>>>>>>>> non-SSD instances are failing for some other reason? I'd be surprised if no
>>>>>>>>> other errors showed up in the app master, although there are reports of
>>>>>>>>> some weirdness around LeaseExpireds when writing to S3-- but you're not
>>>>>>>>> doing that here, right?
>>>>>>>>>
>>>>>>>>
>>>>>>>> We're reading from and writing to HDFS, here. (We've copied in
>>>>>>>> input from S3 to HDFS in another step.)
>>>>>>>>
>>>>>>>> There are a few exceptions in the logs. Most seem related to
>>>>>>>> missing temp files.
>>>>>>>>
>>>>>>>> Let me see if I can reproduce it with crunch.max.running.jobs set
>>>>>>>> to 1 to try to narrow down the originating failure.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> J
>>>>>>>>>
>>>>>>>>> On Fri, Aug 14, 2015 at 2:10 PM, Everett Anderson <
>>>>>>>>> everett@nuna.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I recently started trying to run our Crunch pipeline on more data
>>>>>>>>>> and have been trying out different AWS instance types in anticipation of
>>>>>>>>>> our storage and compute needs.
>>>>>>>>>>
>>>>>>>>>> I was using EMR 3.8 (so Hadoop 2.4.0) with Crunch 0.12 (patched
>>>>>>>>>> with the CRUNCH-553
>>>>>>>>>> <https://issues.apache.org/jira/browse/CRUNCH-553> fix).
>>>>>>>>>>
>>>>>>>>>> Our pipeline finishes fine in these cluster configurations:
>>>>>>>>>>
>>>>>>>>>>    - 50 c3.4xlarge Core, 0 Task
>>>>>>>>>>    - 10 c3.8xlarge Core, 0 Task
>>>>>>>>>>    - 25 c3.8xlarge Core, 0 Task
>>>>>>>>>>
>>>>>>>>>> However, it always fails on the same data when using 10
>>>>>>>>>> cc2.8xlarge Core instances.
>>>>>>>>>>
>>>>>>>>>> The biggest obvious hardware difference is that the cc2.8xlarges
>>>>>>>>>> use hard disks instead of SSDs.
>>>>>>>>>>
>>>>>>>>>> While it's a little hard to track down the exact originating
>>>>>>>>>> failure, I think it's from errors like:
>>>>>>>>>>
>>>>>>>>>> 2015-08-13 21:34:38,379 ERROR [IPC Server handler 24 on 45711]
>>>>>>>>>> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task:
>>>>>>>>>> attempt_1439499407003_0028_r_000153_1 - exited :
>>>>>>>>>> org.apache.crunch.CrunchRuntimeException:
>>>>>>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>>>>>>> No lease on
>>>>>>>>>> /tmp/crunch-970849245/p662/output/_temporary/1/_temporary/attempt_1439499407003_out7_0028_r_000153_1/out7-r-00153:
>>>>>>>>>> File does not exist. Holder
>>>>>>>>>> DFSClient_attempt_1439499407003_0028_r_000153_1_609888542_1 does not have
>>>>>>>>>> any open files.
>>>>>>>>>>
>>>>>>>>>> Those paths look like these side effect files
>>>>>>>>>> <https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapred/FileOutputFormat.html#getWorkOutputPath(org.apache.hadoop.mapred.JobConf)>
>>>>>>>>>> .
>>>>>>>>>>
>>>>>>>>>> Would Crunch have generated applications that depend on side effect
>>>>>>>>>> paths as input across MapReduce applications, and is something in
>>>>>>>>>> HDFS cleaning up those paths, unaware of the higher-level
>>>>>>>>>> dependencies? AWS configures Hadoop differently for each instance
>>>>>>>>>> type, and might have more aggressive cleanup settings on HDs, though
>>>>>>>>>> this is a very uninformed hypothesis.
>>>>>>>>>>
>>>>>>>>>> A sample full log is attached.
>>>>>>>>>>
>>>>>>>>>> Thanks for any guidance!
>>>>>>>>>>
>>>>>>>>>> - Everett
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> *DISCLAIMER:* The contents of this email, including any
>>>>>>>>>> attachments, may contain information that is confidential, proprietary in
>>>>>>>>>> nature, protected health information (PHI), or otherwise protected by law
>>>>>>>>>> from disclosure, and is solely for the use of the intended recipient(s). If
>>>>>>>>>> you are not the intended recipient, you are hereby notified that any use,
>>>>>>>>>> disclosure or copying of this email, including any attachments, is
>>>>>>>>>> unauthorized and strictly prohibited. If you have received this email in
>>>>>>>>>> error, please notify the sender of this email. Please delete this and all
>>>>>>>>>> copies of this email from your system. Any opinions either expressed or
>>>>>>>>>> implied in this email and all attachments, are those of its author only,
>>>>>>>>>> and do not necessarily reflect those of Nuna Health, Inc.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Director of Data Science
>>>>>>>>> Cloudera <http://www.cloudera.com>
>>>>>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> *DISCLAIMER:* The contents of this email, including any
>>>>>>> attachments, may contain information that is confidential, proprietary in
>>>>>>> nature, protected health information (PHI), or otherwise protected by law
>>>>>>> from disclosure, and is solely for the use of the intended recipient(s). If
>>>>>>> you are not the intended recipient, you are hereby notified that any use,
>>>>>>> disclosure or copying of this email, including any attachments, is
>>>>>>> unauthorized and strictly prohibited. If you have received this email in
>>>>>>> error, please notify the sender of this email. Please delete this and all
>>>>>>> copies of this email from your system. Any opinions either expressed or
>>>>>>> implied in this email and all attachments, are those of its author only,
>>>>>>> and do not necessarily reflect those of Nuna Health, Inc.
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Director of Data Science
>>>>>> Cloudera <http://www.cloudera.com>
>>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Director of Data Science
>>>>> Cloudera <http://www.cloudera.com>
>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>>
>>>>
>>>>
>>>> *DISCLAIMER:* The contents of this email, including any attachments,
>>>> may contain information that is confidential, proprietary in nature,
>>>> protected health information (PHI), or otherwise protected by law from
>>>> disclosure, and is solely for the use of the intended recipient(s). If you
>>>> are not the intended recipient, you are hereby notified that any use,
>>>> disclosure or copying of this email, including any attachments, is
>>>> unauthorized and strictly prohibited. If you have received this email in
>>>> error, please notify the sender of this email. Please delete this and all
>>>> copies of this email from your system. Any opinions either expressed or
>>>> implied in this email and all attachments, are those of its author only,
>>>> and do not necessarily reflect those of Nuna Health, Inc.
>>>>
>>>
>>>
>>>
>>> --
>>> Director of Data Science
>>> Cloudera <http://www.cloudera.com>
>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>
>>
>>
>

-- 
*DISCLAIMER:* The contents of this email, including any attachments, may 
contain information that is confidential, proprietary in nature, protected 
health information (PHI), or otherwise protected by law from disclosure, 
and is solely for the use of the intended recipient(s). If you are not the 
intended recipient, you are hereby notified that any use, disclosure or 
copying of this email, including any attachments, is unauthorized and 
strictly prohibited. If you have received this email in error, please 
notify the sender of this email. Please delete this and all copies of this 
email from your system. Any opinions either expressed or implied in this 
email and all attachments, are those of its author only, and do not 
necessarily reflect those of Nuna Health, Inc.

Re: LeaseExpiredExceptions and temp side effect files

Posted by Jeff Quinn <je...@nuna.com>.
Also worth noting: we inspected the Hadoop configuration defaults that the
AWS EMR service populates for the two different instance types. For
mapred-site.xml, core-site.xml, and hdfs-site.xml, all settings were
identical except for slight differences in the JVM memory allotted. We also
checked the maximum number of file descriptors for each instance type via
ulimit, and saw no differences there either.
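
In case anyone wants to repeat the comparison, here is a rough sketch of one
way to dump the effective settings on each instance type so the two can be
diffed (not exactly what we ran, just an illustration in Java against
Hadoop's Configuration API; it assumes the site files are on the classpath,
and the class name is only illustrative):

import java.util.Map;
import org.apache.hadoop.conf.Configuration;

public class DumpHadoopConf {
  public static void main(String[] args) {
    // new Configuration() loads core-default.xml and core-site.xml; add the
    // other site files explicitly so their settings show up in the dump.
    Configuration conf = new Configuration();
    conf.addResource("mapred-site.xml");
    conf.addResource("hdfs-site.xml");
    // Configuration is Iterable<Map.Entry<String, String>>, so this prints
    // every key=value pair visible to the client. Pipe the output through
    // sort before diffing the two clusters.
    for (Map.Entry<String, String> e : conf) {
      System.out.println(e.getKey() + "=" + e.getValue());
    }
  }
}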

So we're not sure what difference between these two clusters could be
causing such different outcomes, other than the cc2.8xlarge having spinning
disks and the c3.8xlarge having SSDs.

On Fri, Aug 21, 2015 at 1:03 PM, Everett Anderson <ev...@nuna.com> wrote:

> Hey,
>
> Jeff graciously agreed to try it out.
>
> I'm afraid we're still getting failures on that instance type, though with
> 0.11 plus the patches, the cluster ended up in a state where no new
> applications could be submitted afterwards.
>
> The errors when running the pipeline seem to be similarly HDFS related.
> It's quite odd.
>
> Examples when using 0.11 + the patches:
>
>
> 2015-08-20 23:17:50,455 WARN [Thread-38] org.apache.hadoop.hdfs.DFSClient:
> Could not get block locations. Source file
> "/tmp/crunch-274499863/p504/output/_temporary/1/_temporary/attempt_1440102643297_out0_0107_r_000001_0/out0-r-00001"
> - Aborting...
>
>
> 2015-08-20 22:39:42,184 WARN [Thread-51] org.apache.hadoop.hdfs.DFSClient:
> DataStreamer Exception
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> No lease on
> /tmp/crunch-274499863/p510/output/_temporary/1/_temporary/attempt_1440102643297_out12_0103_r_000167_2/out12-r-00167
> (inode 83784): File does not exist. [Lease.  Holder:
> DFSClient_attempt_1440102643297_0103_r_000167_2_964529009_1,
> pendingcreates: 24]
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3516)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.abandonBlock(FSNamesystem.java:3486)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.abandonBlock(NameNodeRpcServer.java:687)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.abandonBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:467)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:635)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
>
> at org.apache.hadoop.ipc.Client.call(Client.java:1468)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:241)
> at com.sun.proxy.$Proxy13.abandonBlock(Unknown Source)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.abandonBlock(ClientNamenodeProtocolTranslatorPB.java:376)
> at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy14.abandonBlock(Unknown Source)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1377)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
> 2015-08-20 22:39:42,184 WARN [Thread-51] org.apache.hadoop.hdfs.DFSClient:
> Could not get block locations. Source file
> "/tmp/crunch-274499863/p510/output/_temporary/1/_temporary/attempt_1440102643297_out12_0103_r_000167_2/out12-r-00167"
> - Aborting...
>
>
>
> 2015-08-20 23:34:59,276 INFO [Thread-37] org.apache.hadoop.hdfs.DFSClient:
> Exception in createBlockOutputStream
> java.io.IOException: Bad connect ack with firstBadLink as
> 10.55.1.103:50010
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
> 2015-08-20 23:34:59,276 INFO [Thread-37] org.apache.hadoop.hdfs.DFSClient:
> Abandoning BP-835517662-10.55.1.32-1440102626965:blk_1073828261_95268
> 2015-08-20 23:34:59,278 INFO [Thread-37] org.apache.hadoop.hdfs.DFSClient:
> Excluding datanode 10.55.1.103:50010
> 2015-08-20 23:34:59,278 WARN [Thread-37] org.apache.hadoop.hdfs.DFSClient:
> DataStreamer Exception
> java.io.IOException: Unable to create new block.
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1386)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
> 2015-08-20 23:34:59,278 WARN [Thread-37] org.apache.hadoop.hdfs.DFSClient:
> Could not get block locations. Source file
> "/tmp/crunch-274499863/p504/output/_temporary/1/_temporary/attempt_1440102643297_out0_0107_r_000001_2/out0-r-00001"
> - Aborting...
> 2015-08-20 23:34:59,279 WARN [main] org.apache.hadoop.mapred.YarnChild:
> Exception running child : org.apache.crunch.CrunchRuntimeException:
> java.io.IOException: Bad connect ack with firstBadLink as
> 10.55.1.103:50010
> at
> org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
> at
> org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
> at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:171)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:166)
> Caused by: java.io.IOException: Bad connect ack with firstBadLink as
> 10.55.1.103:50010
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
>
>
>
>
>
>
>
>
>
> On Fri, Aug 21, 2015 at 11:59 AM, Josh Wills <jw...@cloudera.com> wrote:
>
>> Curious how this went. :)
>>
>> On Tue, Aug 18, 2015 at 4:26 PM, Everett Anderson <ev...@nuna.com>
>> wrote:
>>
>>> Sure, let me give it a try. I'm going to take 0.11 and patch it with
>>>
>>> https://issues.apache.org/jira/browse/CRUNCH-553
>>> https://issues.apache.org/jira/browse/CRUNCH-517
>>>
>>> as we also rely on 517.
>>>
>>>
>>>
>>> On Tue, Aug 18, 2015 at 4:09 PM, Josh Wills <jw...@cloudera.com> wrote:
>>>
>>>> (In particular, I'm wondering if something in CRUNCH-481 is related to
>>>> this problem.)
>>>>
>>>> On Tue, Aug 18, 2015 at 4:07 PM, Josh Wills <jw...@cloudera.com>
>>>> wrote:
>>>>
>>>>> Hey Everett,
>>>>>
>>>>> Shot in the dark-- would you mind trying it w/0.11.0-hadoop2 w/the 553
>>>>> patch? Is that easy to do?
>>>>>
>>>>> J
>>>>>
>>>>> On Tue, Aug 18, 2015 at 3:18 PM, Everett Anderson <ev...@nuna.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I verified that the pipeline succeeds on the same cc2.8xlarge
>>>>>> hardware when setting crunch.max.running.jobs to 1. I generally feel
>>>>>> like the pipeline application logic itself is sound at this point. It
>>>>>> could be that this is just taxing these machines too hard and we need to
>>>>>> increase the number of retries?
>>>>>>
>>>>>> It reliably fails on this hardware when crunch.max.running.jobs is set
>>>>>> to its default.
>>>>>>
>>>>>> Can you explain a little what the /tmp/crunch-XXXXXXX files are as
>>>>>> well as how Crunch uses side effect files? Do you know if HDFS would clean
>>>>>> up those directories from underneath Crunch?
>>>>>>
>>>>>> There are usually 4 failed applications, failing in their reduce tasks. The
>>>>>> failures seem to be one of the following three kinds -- (1) No lease on
>>>>>> <side effect file>, (2) File not found </tmp/crunch-XXXXXXX> file, (3)
>>>>>> SocketTimeoutException.
>>>>>>
>>>>>> Examples:
>>>>>>
>>>>>> [1] No lease exception
>>>>>>
>>>>>> Error: org.apache.crunch.CrunchRuntimeException:
>>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>>> No lease on
>>>>>> /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003:
>>>>>> File does not exist. Holder
>>>>>> DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have
>>>>>> any open files. at
>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
>>>>>> at
>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
>>>>>> at
>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
>>>>>> at
>>>>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
>>>>>> at
>>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
>>>>>> at
>>>>>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>>> at
>>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at
>>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at
>>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at
>>>>>> java.security.AccessController.doPrivileged(Native Method) at
>>>>>> javax.security.auth.Subject.doAs(Subject.java:415) at
>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at
>>>>>> org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>>>>>> at
>>>>>> org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>>>>>> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195) at
>>>>>> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656) at
>>>>>> org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394) at
>>>>>> org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175) at
>>>>>> java.security.AccessController.doPrivileged(Native Method) at
>>>>>> javax.security.auth.Subject.doAs(Subject.java:415) at
>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170) Caused by:
>>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>>> No lease on
>>>>>> /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003:
>>>>>> File does not exist. Holder
>>>>>> DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have
>>>>>> any open files. at
>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
>>>>>> at
>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
>>>>>> at
>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
>>>>>> at
>>>>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
>>>>>> at
>>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
>>>>>> at
>>>>>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>>> at
>>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at
>>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at
>>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at
>>>>>> java.security.AccessController.doPrivileged(Native Method) at
>>>>>> javax.security.auth.Subject.doAs(Subject.java:415) at
>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at
>>>>>> org.apache.hadoop.ipc.Client.call(Client.java:1410) at
>>>>>> org.apache.hadoop.ipc.Client.call(Client.java:1363) at
>>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
>>>>>> at com.sun.proxy.$Proxy13.complete(Unknown Source) at
>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>> at
>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>> at java.lang.reflect.Method.invoke(Method.java:606) at
>>>>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
>>>>>> at
>>>>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
>>>>>> at com.sun.proxy.$Proxy13.complete(Unknown Source) at
>>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.complete(ClientNamenodeProtocolTranslatorPB.java:404)
>>>>>> at
>>>>>> org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2130)
>>>>>> at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2114)
>>>>>> at
>>>>>> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
>>>>>> at
>>>>>> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
>>>>>> at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:1289)
>>>>>> at
>>>>>> org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.close(SequenceFileOutputFormat.java:87)
>>>>>> at
>>>>>> org.apache.crunch.io.CrunchOutputs$OutputState.close(CrunchOutputs.java:300)
>>>>>> at org.apache.crunch.io.CrunchOutputs.close(CrunchOutputs.java:180) at
>>>>>> org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:72)
>>>>>> ... 9 more
>>>>>>
>>>>>>
>>>>>> [2] File does not exist
>>>>>>
>>>>>> 2015-08-18 17:36:10,195 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1439917295505_0034_r_000004_1: Error: org.apache.crunch.CrunchRuntimeException: Could not read runtime node information
>>>>>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:48)
>>>>>> 	at org.apache.crunch.impl.mr.run.CrunchReducer.setup(CrunchReducer.java:40)
>>>>>> 	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:172)
>>>>>> 	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>>>>> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>>>>> 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>>>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>>>>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>> 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>>>>>> Caused by: java.io.FileNotFoundException: File does not exist: /tmp/crunch-4694113/p470/REDUCE
>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
>>>>>> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
>>>>>> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
>>>>>> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>>> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>>> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>>>>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>>>>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>>>>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>>>>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>>> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>>>>>
>>>>>> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>>>> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>>>>> 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>>>> 	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>>>>> 	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>>>>>> 	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>>>>>> 	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1147)
>>>>>> 	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1135)
>>>>>> 	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1125)
>>>>>> 	at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:273)
>>>>>> 	at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:240)
>>>>>> 	at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:233)
>>>>>> 	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1298)
>>>>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:300)
>>>>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:296)
>>>>>> 	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>>>>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:296)
>>>>>> 	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
>>>>>> 	at org.apache.crunch.util.DistCache.read(DistCache.java:72)
>>>>>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:46)
>>>>>> 	... 9 more
>>>>>>
>>>>>> [3] SocketTimeoutException
>>>>>>
>>>>>> Error: org.apache.crunch.CrunchRuntimeException: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200] at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74) at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64) at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195) at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170) Caused by: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200] at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118) at java.io.FilterInputStream.read(FilterInputStream.java:83) at java.io.FilterInputStream.read(FilterInputStream.java:83) at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1985) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1075) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1042) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1186) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:935) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:491)
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Aug 14, 2015 at 3:54 PM, Everett Anderson <ev...@nuna.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Aug 14, 2015 at 3:26 PM, Josh Wills <jw...@cloudera.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hey Everett,
>>>>>>>>
>>>>>>>> Initial thought-- there are lots of reasons for lease expired
>>>>>>>> exceptions, and they're usually more symptomatic of other problems in the
>>>>>>>> pipeline. Are you sure none of the jobs in the Crunch pipeline on the
>>>>>>>> non-SSD instances are failing for some other reason? I'd be surprised if no
>>>>>>>> other errors showed up in the app master, although there are reports of
>>>>>>>> some weirdness around LeaseExpireds when writing to S3-- but you're not
>>>>>>>> doing that here, right?
>>>>>>>>
>>>>>>>
>>>>>>> We're reading from and writing to HDFS, here. (We've copied in input
>>>>>>> from S3 to HDFS in another step.)
>>>>>>>
>>>>>>> There are a few exceptions in the logs. Most seem related to missing
>>>>>>> temp files.
>>>>>>>
>>>>>>> Let me see if I can reproduce it with crunch.max.running.jobs set
>>>>>>> to 1 to try to narrow down the originating failure.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> J
>>>>>>>>
>>>>>>>> On Fri, Aug 14, 2015 at 2:10 PM, Everett Anderson <everett@nuna.com
>>>>>>>> > wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I recently started trying to run our Crunch pipeline on more data
>>>>>>>>> and have been trying out different AWS instance types in anticipation of
>>>>>>>>> our storage and compute needs.
>>>>>>>>>
>>>>>>>>> I was using EMR 3.8 (so Hadoop 2.4.0) with Crunch 0.12 (patched
>>>>>>>>> with the CRUNCH-553
>>>>>>>>> <https://issues.apache.org/jira/browse/CRUNCH-553> fix).
>>>>>>>>>
>>>>>>>>> Our pipeline finishes fine in these cluster configurations:
>>>>>>>>>
>>>>>>>>>    - 50 c3.4xlarge Core, 0 Task
>>>>>>>>>    - 10 c3.8xlarge Core, 0 Task
>>>>>>>>>    - 25 c3.8xlarge Core, 0 Task
>>>>>>>>>
>>>>>>>>> However, it always fails on the same data when using 10
>>>>>>>>> cc2.8xlarge Core instances.
>>>>>>>>>
>>>>>>>>> The biggest obvious hardware difference is that the cc2.8xlarges
>>>>>>>>> use hard disks instead of SSDs.
>>>>>>>>>
>>>>>>>>> While it's a little hard to track down the exact originating
>>>>>>>>> failure, I think it's from errors like:
>>>>>>>>>
>>>>>>>>> 2015-08-13 21:34:38,379 ERROR [IPC Server handler 24 on 45711]
>>>>>>>>> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task:
>>>>>>>>> attempt_1439499407003_0028_r_000153_1 - exited :
>>>>>>>>> org.apache.crunch.CrunchRuntimeException:
>>>>>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>>>>>> No lease on
>>>>>>>>> /tmp/crunch-970849245/p662/output/_temporary/1/_temporary/attempt_1439499407003_out7_0028_r_000153_1/out7-r-00153:
>>>>>>>>> File does not exist. Holder
>>>>>>>>> DFSClient_attempt_1439499407003_0028_r_000153_1_609888542_1 does not have
>>>>>>>>> any open files.
>>>>>>>>>
>>>>>>>>> Those paths look like these side effect files
>>>>>>>>> <https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapred/FileOutputFormat.html#getWorkOutputPath(org.apache.hadoop.mapred.JobConf)>
>>>>>>>>> .
>>>>>>>>>
>>>>>>>>> Would Crunch have generated applications that depend on side effect
>>>>>>>>> paths as input across MapReduce applications, and is something in
>>>>>>>>> HDFS cleaning up those paths, unaware of the higher-level
>>>>>>>>> dependencies? AWS configures Hadoop differently for each instance
>>>>>>>>> type, and might have more aggressive cleanup settings on HDs, though
>>>>>>>>> this is a very uninformed hypothesis.
>>>>>>>>>
>>>>>>>>> A sample full log is attached.
>>>>>>>>>
>>>>>>>>> Thanks for any guidance!
>>>>>>>>>
>>>>>>>>> - Everett
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *DISCLAIMER:* The contents of this email, including any
>>>>>>>>> attachments, may contain information that is confidential, proprietary in
>>>>>>>>> nature, protected health information (PHI), or otherwise protected by law
>>>>>>>>> from disclosure, and is solely for the use of the intended recipient(s). If
>>>>>>>>> you are not the intended recipient, you are hereby notified that any use,
>>>>>>>>> disclosure or copying of this email, including any attachments, is
>>>>>>>>> unauthorized and strictly prohibited. If you have received this email in
>>>>>>>>> error, please notify the sender of this email. Please delete this and all
>>>>>>>>> copies of this email from your system. Any opinions either expressed or
>>>>>>>>> implied in this email and all attachments, are those of its author only,
>>>>>>>>> and do not necessarily reflect those of Nuna Health, Inc.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Director of Data Science
>>>>>>>> Cloudera <http://www.cloudera.com>
>>>>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> *DISCLAIMER:* The contents of this email, including any attachments,
>>>>>> may contain information that is confidential, proprietary in nature,
>>>>>> protected health information (PHI), or otherwise protected by law from
>>>>>> disclosure, and is solely for the use of the intended recipient(s). If you
>>>>>> are not the intended recipient, you are hereby notified that any use,
>>>>>> disclosure or copying of this email, including any attachments, is
>>>>>> unauthorized and strictly prohibited. If you have received this email in
>>>>>> error, please notify the sender of this email. Please delete this and all
>>>>>> copies of this email from your system. Any opinions either expressed or
>>>>>> implied in this email and all attachments, are those of its author only,
>>>>>> and do not necessarily reflect those of Nuna Health, Inc.
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Director of Data Science
>>>>> Cloudera <http://www.cloudera.com>
>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Director of Data Science
>>>> Cloudera <http://www.cloudera.com>
>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>
>>>
>>>
>>> *DISCLAIMER:* The contents of this email, including any attachments,
>>> may contain information that is confidential, proprietary in nature,
>>> protected health information (PHI), or otherwise protected by law from
>>> disclosure, and is solely for the use of the intended recipient(s). If you
>>> are not the intended recipient, you are hereby notified that any use,
>>> disclosure or copying of this email, including any attachments, is
>>> unauthorized and strictly prohibited. If you have received this email in
>>> error, please notify the sender of this email. Please delete this and all
>>> copies of this email from your system. Any opinions either expressed or
>>> implied in this email and all attachments, are those of its author only,
>>> and do not necessarily reflect those of Nuna Health, Inc.
>>>
>>
>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>
>

-- 
*DISCLAIMER:* The contents of this email, including any attachments, may 
contain information that is confidential, proprietary in nature, protected 
health information (PHI), or otherwise protected by law from disclosure, 
and is solely for the use of the intended recipient(s). If you are not the 
intended recipient, you are hereby notified that any use, disclosure or 
copying of this email, including any attachments, is unauthorized and 
strictly prohibited. If you have received this email in error, please 
notify the sender of this email. Please delete this and all copies of this 
email from your system. Any opinions either expressed or implied in this 
email and all attachments, are those of its author only, and do not 
necessarily reflect those of Nuna Health, Inc.

Re: LeaseExpiredExceptions and temp side effect files

Posted by Everett Anderson <ev...@nuna.com>.
Hey,

Jeff graciously agreed to try it out.

I'm afraid we're still getting failures on that instance type, though with
0.11 plus the patches, the cluster ended up in a state where no new
applications could be submitted afterwards.

The errors when running the pipeline seem to be similarly HDFS related.
It's quite odd.

Examples when using 0.11 + the patches:


2015-08-20 23:17:50,455 WARN [Thread-38] org.apache.hadoop.hdfs.DFSClient:
Could not get block locations. Source file
"/tmp/crunch-274499863/p504/output/_temporary/1/_temporary/attempt_1440102643297_out0_0107_r_000001_0/out0-r-00001"
- Aborting...


2015-08-20 22:39:42,184 WARN [Thread-51] org.apache.hadoop.hdfs.DFSClient:
DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
No lease on
/tmp/crunch-274499863/p510/output/_temporary/1/_temporary/attempt_1440102643297_out12_0103_r_000167_2/out12-r-00167
(inode 83784): File does not exist. [Lease.  Holder:
DFSClient_attempt_1440102643297_0103_r_000167_2_964529009_1,
pendingcreates: 24]
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3516)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.abandonBlock(FSNamesystem.java:3486)
at
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.abandonBlock(NameNodeRpcServer.java:687)
at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.abandonBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:467)
at
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:635)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)

at org.apache.hadoop.ipc.Client.call(Client.java:1468)
at org.apache.hadoop.ipc.Client.call(Client.java:1399)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:241)
at com.sun.proxy.$Proxy13.abandonBlock(Unknown Source)
at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.abandonBlock(ClientNamenodeProtocolTranslatorPB.java:376)
at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy14.abandonBlock(Unknown Source)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1377)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
2015-08-20 22:39:42,184 WARN [Thread-51] org.apache.hadoop.hdfs.DFSClient:
Could not get block locations. Source file
"/tmp/crunch-274499863/p510/output/_temporary/1/_temporary/attempt_1440102643297_out12_0103_r_000167_2/out12-r-00167"
- Aborting...



2015-08-20 23:34:59,276 INFO [Thread-37] org.apache.hadoop.hdfs.DFSClient:
Exception in createBlockOutputStream
java.io.IOException: Bad connect ack with firstBadLink as 10.55.1.103:50010
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
2015-08-20 23:34:59,276 INFO [Thread-37] org.apache.hadoop.hdfs.DFSClient:
Abandoning BP-835517662-10.55.1.32-1440102626965:blk_1073828261_95268
2015-08-20 23:34:59,278 INFO [Thread-37] org.apache.hadoop.hdfs.DFSClient:
Excluding datanode 10.55.1.103:50010
2015-08-20 23:34:59,278 WARN [Thread-37] org.apache.hadoop.hdfs.DFSClient:
DataStreamer Exception
java.io.IOException: Unable to create new block.
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1386)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
2015-08-20 23:34:59,278 WARN [Thread-37] org.apache.hadoop.hdfs.DFSClient:
Could not get block locations. Source file
"/tmp/crunch-274499863/p504/output/_temporary/1/_temporary/attempt_1440102643297_out0_0107_r_000001_2/out0-r-00001"
- Aborting...
2015-08-20 23:34:59,279 WARN [main] org.apache.hadoop.mapred.YarnChild:
Exception running child : org.apache.crunch.CrunchRuntimeException:
java.io.IOException: Bad connect ack with firstBadLink as 10.55.1.103:50010
at
org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
at
org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:171)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:166)
Caused by: java.io.IOException: Bad connect ack with firstBadLink as
10.55.1.103:50010
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
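
(For reference, the crunch.max.running.jobs workaround discussed further down
the thread can be set as a property on the pipeline's Configuration. A
minimal sketch, assuming an MRPipeline-based driver -- the class name here is
illustrative, not our real driver:)

import org.apache.hadoop.conf.Configuration;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;

public class SerialPipelineDriver {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Limit Crunch to one running MapReduce job at a time instead of
    // letting the planner launch independent jobs in parallel.
    conf.setInt("crunch.max.running.jobs", 1);
    Pipeline pipeline = new MRPipeline(SerialPipelineDriver.class, conf);
    // ... build PCollections and writes as usual, then:
    pipeline.done();  // plans and runs the remaining jobs
  }
}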









On Fri, Aug 21, 2015 at 11:59 AM, Josh Wills <jw...@cloudera.com> wrote:

> Curious how this went. :)
>
> On Tue, Aug 18, 2015 at 4:26 PM, Everett Anderson <ev...@nuna.com>
> wrote:
>
>> Sure, let me give it a try. I'm going to take 0.11 and patch it with
>>
>> https://issues.apache.org/jira/browse/CRUNCH-553
>> https://issues.apache.org/jira/browse/CRUNCH-517
>>
>> as we also rely on 517.
>>
>>
>>
>> On Tue, Aug 18, 2015 at 4:09 PM, Josh Wills <jw...@cloudera.com> wrote:
>>
>>> (In particular, I'm wondering if something in CRUNCH-481 is related to
>>> this problem.)
>>>
>>> On Tue, Aug 18, 2015 at 4:07 PM, Josh Wills <jw...@cloudera.com> wrote:
>>>
>>>> Hey Everett,
>>>>
>>>> Shot in the dark-- would you mind trying it w/0.11.0-hadoop2 w/the 553
>>>> patch? Is that easy to do?
>>>>
>>>> J
>>>>
>>>> On Tue, Aug 18, 2015 at 3:18 PM, Everett Anderson <ev...@nuna.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I verified that the pipeline succeeds on the same cc2.8xlarge hardware
>>>>> when setting crunch.max.running.jobs to 1. I generally feel like the
>>>>> pipeline application logic itself is sound at this point. It could be that
>>>>> this is just taxing these machines too hard and we need to increase the
>>>>> number of retries?
>>>>>
>>>>> It reliably fails on this hardware when crunch.max.running.jobs is set
>>>>> to its default.
>>>>>
>>>>> Can you explain a little what the /tmp/crunch-XXXXXXX files are as
>>>>> well as how Crunch uses side effect files? Do you know if HDFS would clean
>>>>> up those directories from underneath Crunch?
>>>>>
>>>>> There are usually 4 failed applications, failing in their reduce tasks. The
>>>>> failures seem to be one of the following three kinds -- (1) No lease on
>>>>> <side effect file>, (2) File not found </tmp/crunch-XXXXXXX> file, (3)
>>>>> SocketTimeoutException.
>>>>>
>>>>> Examples:
>>>>>
>>>>> [1] No lease exception
>>>>>
>>>>> Error: org.apache.crunch.CrunchRuntimeException:
>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>> No lease on
>>>>> /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003:
>>>>> File does not exist. Holder
>>>>> DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have
>>>>> any open files. at
>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
>>>>> at
>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
>>>>> at
>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
>>>>> at
>>>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
>>>>> at
>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
>>>>> at
>>>>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>> at
>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at
>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at
>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at
>>>>> java.security.AccessController.doPrivileged(Native Method) at
>>>>> javax.security.auth.Subject.doAs(Subject.java:415) at
>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at
>>>>> org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>>>>> at
>>>>> org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>>>>> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195) at
>>>>> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656) at
>>>>> org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394) at
>>>>> org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175) at
>>>>> java.security.AccessController.doPrivileged(Native Method) at
>>>>> javax.security.auth.Subject.doAs(Subject.java:415) at
>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170) Caused by:
>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>> No lease on
>>>>> /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003:
>>>>> File does not exist. Holder
>>>>> DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have
>>>>> any open files. at
>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
>>>>> at
>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
>>>>> at
>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
>>>>> at
>>>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
>>>>> at
>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
>>>>> at
>>>>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>> at
>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at
>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at
>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at
>>>>> java.security.AccessController.doPrivileged(Native Method) at
>>>>> javax.security.auth.Subject.doAs(Subject.java:415) at
>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at
>>>>> org.apache.hadoop.ipc.Client.call(Client.java:1410) at
>>>>> org.apache.hadoop.ipc.Client.call(Client.java:1363) at
>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
>>>>> at com.sun.proxy.$Proxy13.complete(Unknown Source) at
>>>>> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>> at
>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>> at java.lang.reflect.Method.invoke(Method.java:606) at
>>>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
>>>>> at
>>>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
>>>>> at com.sun.proxy.$Proxy13.complete(Unknown Source) at
>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.complete(ClientNamenodeProtocolTranslatorPB.java:404)
>>>>> at
>>>>> org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2130)
>>>>> at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2114)
>>>>> at
>>>>> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
>>>>> at
>>>>> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
>>>>> at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:1289)
>>>>> at
>>>>> org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.close(SequenceFileOutputFormat.java:87)
>>>>> at
>>>>> org.apache.crunch.io.CrunchOutputs$OutputState.close(CrunchOutputs.java:300)
>>>>> at org.apache.crunch.io.CrunchOutputs.close(CrunchOutputs.java:180) at
>>>>> org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:72)
>>>>> ... 9 more
>>>>>
>>>>>
>>>>> [2] File does not exist
>>>>>
>>>>> 2015-08-18 17:36:10,195 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1439917295505_0034_r_000004_1: Error: org.apache.crunch.CrunchRuntimeException: Could not read runtime node information
>>>>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:48)
>>>>> 	at org.apache.crunch.impl.mr.run.CrunchReducer.setup(CrunchReducer.java:40)
>>>>> 	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:172)
>>>>> 	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>>>> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>>>> 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>>>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>> 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>>>>> Caused by: java.io.FileNotFoundException: File does not exist: /tmp/crunch-4694113/p470/REDUCE
>>>>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
>>>>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
>>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
>>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
>>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
>>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
>>>>> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
>>>>> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
>>>>> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>>> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>>>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>>>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>>>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>>>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>>> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>>>>
>>>>> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>>> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>>>> 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>>> 	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>>>> 	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>>>>> 	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>>>>> 	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1147)
>>>>> 	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1135)
>>>>> 	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1125)
>>>>> 	at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:273)
>>>>> 	at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:240)
>>>>> 	at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:233)
>>>>> 	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1298)
>>>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:300)
>>>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:296)
>>>>> 	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>>>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:296)
>>>>> 	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
>>>>> 	at org.apache.crunch.util.DistCache.read(DistCache.java:72)
>>>>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:46)
>>>>> 	... 9 more
>>>>>
>>>>> [3] SocketTimeoutException
>>>>>
>>>>> Error: org.apache.crunch.CrunchRuntimeException: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200] at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74) at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64) at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195) at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170) Caused by: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200] at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118) at java.io.FilterInputStream.read(FilterInputStream.java:83) at java.io.FilterInputStream.read(FilterInputStream.java:83) at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1985) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1075) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1042) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1186) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:935) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:491)
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Aug 14, 2015 at 3:54 PM, Everett Anderson <ev...@nuna.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Aug 14, 2015 at 3:26 PM, Josh Wills <jw...@cloudera.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hey Everett,
>>>>>>>
>>>>>>> Initial thought-- there are lots of reasons for lease expired
>>>>>>> exceptions, and they're usually more symptomatic of other problems in the
>>>>>>> pipeline. Are you sure none of the jobs in the Crunch pipeline on the
>>>>>>> non-SSD instances are failing for some other reason? I'd be surprised if no
>>>>>>> other errors showed up in the app master, although there are reports of
>>>>>>> some weirdness around LeaseExpireds when writing to S3-- but you're not
>>>>>>> doing that here, right?
>>>>>>>
>>>>>>
>>>>>> We're reading from and writing to HDFS, here. (We've copied in input
>>>>>> from S3 to HDFS in another step.)
>>>>>>
>>>>>> There are a few exceptions in the logs. Most seem related to missing
>>>>>> temp files.
>>>>>>
>>>>>> Let me see if I can reproduce it with crunch.max.running.jobs set to
>>>>>> 1 to try to narrow down the originating failure.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> J
>>>>>>>
>>>>>>> On Fri, Aug 14, 2015 at 2:10 PM, Everett Anderson <ev...@nuna.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I recently started trying to run our Crunch pipeline on more data
>>>>>>>> and have been trying out different AWS instance types in anticipation of
>>>>>>>> our storage and compute needs.
>>>>>>>>
>>>>>>>> I was using EMR 3.8 (so Hadoop 2.4.0) with Crunch 0.12 (patched
>>>>>>>> with the CRUNCH-553
>>>>>>>> <https://issues.apache.org/jira/browse/CRUNCH-553> fix).
>>>>>>>>
>>>>>>>> Our pipeline finishes fine in these cluster configurations:
>>>>>>>>
>>>>>>>>    - 50 c3.4xlarge Core, 0 Task
>>>>>>>>    - 10 c3.8xlarge Core, 0 Task
>>>>>>>>    - 25 c3.8xlarge Core, 0 Task
>>>>>>>>
>>>>>>>> However, it always fails on the same data when using 10 cc2.8xlarge
>>>>>>>> Core instances.
>>>>>>>>
>>>>>>>> The biggest obvious hardware difference is that the cc2.8xlarges
>>>>>>>> use hard disks instead of SSDs.
>>>>>>>>
>>>>>>>> While it's a little hard to track down the exact originating
>>>>>>>> failure, I think it's from errors like:
>>>>>>>>
>>>>>>>> 2015-08-13 21:34:38,379 ERROR [IPC Server handler 24 on 45711]
>>>>>>>> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task:
>>>>>>>> attempt_1439499407003_0028_r_000153_1 - exited :
>>>>>>>> org.apache.crunch.CrunchRuntimeException:
>>>>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>>>>> No lease on
>>>>>>>> /tmp/crunch-970849245/p662/output/_temporary/1/_temporary/attempt_1439499407003_out7_0028_r_000153_1/out7-r-00153:
>>>>>>>> File does not exist. Holder
>>>>>>>> DFSClient_attempt_1439499407003_0028_r_000153_1_609888542_1 does not have
>>>>>>>> any open files.
>>>>>>>>
>>>>>>>> Those paths look like these side effect files
>>>>>>>> <https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapred/FileOutputFormat.html#getWorkOutputPath(org.apache.hadoop.mapred.JobConf)>
>>>>>>>> .
>>>>>>>>
>>>>>>>> Would Crunch have generated applications that depend on side effect
>>>>>>>> paths as input across MapReduce applications and something in HDFS is
>>>>>>>> cleaning up those paths, unaware of the higher level dependencies? AWS
>>>>>>>> configures Hadoop differently for each instance type, and might have more
>>>>>>>> aggressive cleanup settings on HDs, though this is very uninformed
>>>>>>>> hypothesis.
>>>>>>>>
>>>>>>>> A sample full log is attached.
>>>>>>>>
>>>>>>>> Thanks for any guidance!
>>>>>>>>
>>>>>>>> - Everett
>>>>>>>>
>>>>>>>>
>>>>>>>> *DISCLAIMER:* The contents of this email, including any
>>>>>>>> attachments, may contain information that is confidential, proprietary in
>>>>>>>> nature, protected health information (PHI), or otherwise protected by law
>>>>>>>> from disclosure, and is solely for the use of the intended recipient(s). If
>>>>>>>> you are not the intended recipient, you are hereby notified that any use,
>>>>>>>> disclosure or copying of this email, including any attachments, is
>>>>>>>> unauthorized and strictly prohibited. If you have received this email in
>>>>>>>> error, please notify the sender of this email. Please delete this and all
>>>>>>>> copies of this email from your system. Any opinions either expressed or
>>>>>>>> implied in this email and all attachments, are those of its author only,
>>>>>>>> and do not necessarily reflect those of Nuna Health, Inc.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Director of Data Science
>>>>>>> Cloudera <http://www.cloudera.com>
>>>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> *DISCLAIMER:* The contents of this email, including any attachments,
>>>>> may contain information that is confidential, proprietary in nature,
>>>>> protected health information (PHI), or otherwise protected by law from
>>>>> disclosure, and is solely for the use of the intended recipient(s). If you
>>>>> are not the intended recipient, you are hereby notified that any use,
>>>>> disclosure or copying of this email, including any attachments, is
>>>>> unauthorized and strictly prohibited. If you have received this email in
>>>>> error, please notify the sender of this email. Please delete this and all
>>>>> copies of this email from your system. Any opinions either expressed or
>>>>> implied in this email and all attachments, are those of its author only,
>>>>> and do not necessarily reflect those of Nuna Health, Inc.
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Director of Data Science
>>>> Cloudera <http://www.cloudera.com>
>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>
>>>
>>>
>>>
>>> --
>>> Director of Data Science
>>> Cloudera <http://www.cloudera.com>
>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>
>>
>>
>> *DISCLAIMER:* The contents of this email, including any attachments, may
>> contain information that is confidential, proprietary in nature, protected
>> health information (PHI), or otherwise protected by law from disclosure,
>> and is solely for the use of the intended recipient(s). If you are not the
>> intended recipient, you are hereby notified that any use, disclosure or
>> copying of this email, including any attachments, is unauthorized and
>> strictly prohibited. If you have received this email in error, please
>> notify the sender of this email. Please delete this and all copies of this
>> email from your system. Any opinions either expressed or implied in this
>> email and all attachments, are those of its author only, and do not
>> necessarily reflect those of Nuna Health, Inc.
>>
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>

-- 
*DISCLAIMER:* The contents of this email, including any attachments, may 
contain information that is confidential, proprietary in nature, protected 
health information (PHI), or otherwise protected by law from disclosure, 
and is solely for the use of the intended recipient(s). If you are not the 
intended recipient, you are hereby notified that any use, disclosure or 
copying of this email, including any attachments, is unauthorized and 
strictly prohibited. If you have received this email in error, please 
notify the sender of this email. Please delete this and all copies of this 
email from your system. Any opinions either expressed or implied in this 
email and all attachments, are those of its author only, and do not 
necessarily reflect those of Nuna Health, Inc.

Re: LeaseExpiredExceptions and temp side effect files

Posted by Josh Wills <jw...@cloudera.com>.
Curious how this went. :)

On Tue, Aug 18, 2015 at 4:26 PM, Everett Anderson <ev...@nuna.com> wrote:

> Sure, let me give it a try. I'm going to take 0.11 and patch it with
>
> https://issues.apache.org/jira/browse/CRUNCH-553
> https://issues.apache.org/jira/browse/CRUNCH-517
>
> as we also rely on 517.
>
>
>
> On Tue, Aug 18, 2015 at 4:09 PM, Josh Wills <jw...@cloudera.com> wrote:
>
>> (In particular, I'm wondering if something in CRUNCH-481 is related to
>> this problem.)
>>
>> On Tue, Aug 18, 2015 at 4:07 PM, Josh Wills <jw...@cloudera.com> wrote:
>>
>>> Hey Everett,
>>>
>>> Shot in the dark-- would you mind trying it w/0.11.0-hadoop2 w/the 553
>>> patch? Is that easy to do?
>>>
>>> J
>>>
>>> On Tue, Aug 18, 2015 at 3:18 PM, Everett Anderson <ev...@nuna.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I verified that the pipeline succeeds on the same cc2.8xlarge hardware
>>>> when setting crunch.max.running.jobs to 1. I generally feel like the
>>>> pipeline application logic itself is sound at this point. It could be that
>>>> this is just taxing these machines too hard and we need to increase the
>>>> number of retries?
>>>>
>>>> It reliably fails on this hardware when crunch.max.running.jobs is set to
>>>> its default.
>>>>
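>>>> For reference, this is roughly how we pin the run down to one job at a time --
>>>> a minimal sketch of a driver, not our real code (the class name and the retry
>>>> override are illustrative; crunch.max.running.jobs is the only setting we've
>>>> actually verified makes the difference):
>>>>
>>>> import org.apache.crunch.Pipeline;
>>>> import org.apache.crunch.PipelineResult;
>>>> import org.apache.crunch.impl.mr.MRPipeline;
>>>> import org.apache.hadoop.conf.Configuration;
>>>>
>>>> public class SerialRunExample {
>>>>   public static void main(String[] args) throws Exception {
>>>>     Configuration conf = new Configuration();
>>>>     // Run the planned MapReduce jobs one at a time instead of in parallel.
>>>>     conf.setInt("crunch.max.running.jobs", 1);
>>>>     // If this is just load, allowing more reduce attempts might also help
>>>>     // (Hadoop's default is 4) -- we haven't tried this yet.
>>>>     conf.setInt("mapreduce.reduce.maxattempts", 8);
>>>>
>>>>     Pipeline pipeline = new MRPipeline(SerialRunExample.class, conf);
>>>>     // ... the same reads, DoFns, and writes as our real pipeline ...
>>>>     PipelineResult result = pipeline.done();
>>>>     System.exit(result.succeeded() ? 0 : 1);
>>>>   }
>>>> }
>>>>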
>>>> Can you explain a little what the /tmp/crunch-XXXXXXX files are as well
>>>> as how Crunch uses side effect files? Do you know if HDFS would clean up
>>>> those directories from underneath Crunch?
>>>>
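>>>> My rough mental model of a "side effect" file, going by the
>>>> FileOutputFormat#getWorkOutputPath javadoc -- just a sketch so we're talking
>>>> about the same thing (the reducer and file name here are made up):
>>>>
>>>> import java.io.IOException;
>>>> import org.apache.hadoop.fs.FSDataOutputStream;
>>>> import org.apache.hadoop.fs.FileSystem;
>>>> import org.apache.hadoop.fs.Path;
>>>> import org.apache.hadoop.io.NullWritable;
>>>> import org.apache.hadoop.io.Text;
>>>> import org.apache.hadoop.mapreduce.Reducer;
>>>> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>>>>
>>>> public class SideEffectReducer extends Reducer<Text, Text, NullWritable, NullWritable> {
>>>>   @Override
>>>>   protected void reduce(Text key, Iterable<Text> values, Context context)
>>>>       throws IOException, InterruptedException {
>>>>     // Files created under the task attempt's work path (the kind of
>>>>     // .../_temporary/1/_temporary/attempt_*/ paths in these errors) are
>>>>     // "side effect" files: they only get promoted to the real output
>>>>     // directory if this attempt commits, and are discarded if it fails.
>>>>     Path workDir = FileOutputFormat.getWorkOutputPath(context);
>>>>     Path sideEffect = new Path(workDir, "example-side-effect");
>>>>     FileSystem fs = workDir.getFileSystem(context.getConfiguration());
>>>>     try (FSDataOutputStream out = fs.create(sideEffect)) {
>>>>       out.writeUTF(key.toString());
>>>>     }
>>>>   }
>>>> }
>>>>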
>>>> There are usually 4 failed applications, all failing in the reduce phase. The
>>>> failures seem to be one of the following three kinds -- (1) No lease on
>>>> <side effect file>, (2) File not found </tmp/crunch-XXXXXXX> file, (3)
>>>> SocketTimeoutException.
>>>>
>>>> Examples:
>>>>
>>>> [1] No lease exception
>>>>
>>>> Error: org.apache.crunch.CrunchRuntimeException:
>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>> No lease on
>>>> /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003:
>>>> File does not exist. Holder
>>>> DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have
>>>> any open files. at
>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
>>>> at
>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
>>>> at
>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
>>>> at
>>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
>>>> at
>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
>>>> at
>>>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>> at
>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at
>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at
>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at
>>>> java.security.AccessController.doPrivileged(Native Method) at
>>>> javax.security.auth.Subject.doAs(Subject.java:415) at
>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at
>>>> org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>>>> at
>>>> org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>>>> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195) at
>>>> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656) at
>>>> org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394) at
>>>> org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175) at
>>>> java.security.AccessController.doPrivileged(Native Method) at
>>>> javax.security.auth.Subject.doAs(Subject.java:415) at
>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170) Caused by:
>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>> No lease on
>>>> /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003:
>>>> File does not exist. Holder
>>>> DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have
>>>> any open files. at
>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
>>>> at
>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
>>>> at
>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
>>>> at
>>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
>>>> at
>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
>>>> at
>>>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>> at
>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at
>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at
>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at
>>>> java.security.AccessController.doPrivileged(Native Method) at
>>>> javax.security.auth.Subject.doAs(Subject.java:415) at
>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at
>>>> org.apache.hadoop.ipc.Client.call(Client.java:1410) at
>>>> org.apache.hadoop.ipc.Client.call(Client.java:1363) at
>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
>>>> at com.sun.proxy.$Proxy13.complete(Unknown Source) at
>>>> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>> at
>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>> at java.lang.reflect.Method.invoke(Method.java:606) at
>>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
>>>> at
>>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
>>>> at com.sun.proxy.$Proxy13.complete(Unknown Source) at
>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.complete(ClientNamenodeProtocolTranslatorPB.java:404)
>>>> at
>>>> org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2130)
>>>> at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2114)
>>>> at
>>>> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
>>>> at
>>>> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
>>>> at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:1289)
>>>> at
>>>> org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.close(SequenceFileOutputFormat.java:87)
>>>> at
>>>> org.apache.crunch.io.CrunchOutputs$OutputState.close(CrunchOutputs.java:300)
>>>> at org.apache.crunch.io.CrunchOutputs.close(CrunchOutputs.java:180) at
>>>> org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:72)
>>>> ... 9 more
>>>>
>>>>
>>>> [2] File does not exist
>>>>
>>>> 2015-08-18 17:36:10,195 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1439917295505_0034_r_000004_1: Error: org.apache.crunch.CrunchRuntimeException: Could not read runtime node information
>>>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:48)
>>>> 	at org.apache.crunch.impl.mr.run.CrunchReducer.setup(CrunchReducer.java:40)
>>>> 	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:172)
>>>> 	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>>> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>>> 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>> 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>>>> Caused by: java.io.FileNotFoundException: File does not exist: /tmp/crunch-4694113/p470/REDUCE
>>>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
>>>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
>>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
>>>> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
>>>> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
>>>> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>>> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>>>
>>>> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>>> 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>> 	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>>> 	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>>>> 	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>>>> 	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1147)
>>>> 	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1135)
>>>> 	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1125)
>>>> 	at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:273)
>>>> 	at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:240)
>>>> 	at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:233)
>>>> 	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1298)
>>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:300)
>>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:296)
>>>> 	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:296)
>>>> 	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
>>>> 	at org.apache.crunch.util.DistCache.read(DistCache.java:72)
>>>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:46)
>>>> 	... 9 more
>>>>
>>>> [3] SocketTimeoutException
>>>>
>>>> Error: org.apache.crunch.CrunchRuntimeException: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200]
>>>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>>>> 	at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>>>> 	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
>>>> 	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>>> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>>> 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>> 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>>>> Caused by: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200]
>>>> 	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>>>> 	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>>>> 	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>>>> 	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
>>>> 	at java.io.FilterInputStream.read(FilterInputStream.java:83)
>>>> 	at java.io.FilterInputStream.read(FilterInputStream.java:83)
>>>> 	at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1985)
>>>> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1075)
>>>> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1042)
>>>> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1186)
>>>> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:935)
>>>> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:491)
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Aug 14, 2015 at 3:54 PM, Everett Anderson <ev...@nuna.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Fri, Aug 14, 2015 at 3:26 PM, Josh Wills <jw...@cloudera.com>
>>>>> wrote:
>>>>>
>>>>>> Hey Everett,
>>>>>>
>>>>>> Initial thought-- there are lots of reasons for lease expired
>>>>>> exceptions, and they're usually more symptomatic of other problems in the
>>>>>> pipeline. Are you sure none of the jobs in the Crunch pipeline on the
>>>>>> non-SSD instances are failing for some other reason? I'd be surprised if no
>>>>>> other errors showed up in the app master, although there are reports of
>>>>>> some weirdness around LeaseExpireds when writing to S3-- but you're not
>>>>>> doing that here, right?
>>>>>>
>>>>>
>>>>> We're reading from and writing to HDFS, here. (We've copied in input
>>>>> from S3 to HDFS in another step.)
>>>>>
>>>>> There are a few exceptions in the logs. Most seem related to missing
>>>>> temp files.
>>>>>
>>>>> Let me see if I can reproduce it with crunch.max.running.jobs set to
>>>>> 1 to try to narrow down the originating failure.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> J
>>>>>>
>>>>>> On Fri, Aug 14, 2015 at 2:10 PM, Everett Anderson <ev...@nuna.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I recently started trying to run our Crunch pipeline on more data
>>>>>>> and have been trying out different AWS instance types in anticipation of
>>>>>>> our storage and compute needs.
>>>>>>>
>>>>>>> I was using EMR 3.8 (so Hadoop 2.4.0) with Crunch 0.12 (patched with
>>>>>>> the CRUNCH-553 <https://issues.apache.org/jira/browse/CRUNCH-553>
>>>>>>> fix).
>>>>>>>
>>>>>>> Our pipeline finishes fine in these cluster configurations:
>>>>>>>
>>>>>>>    - 50 c3.4xlarge Core, 0 Task
>>>>>>>    - 10 c3.8xlarge Core, 0 Task
>>>>>>>    - 25 c3.8xlarge Core, 0 Task
>>>>>>>
>>>>>>> However, it always fails on the same data when using 10 cc2.8xlarge
>>>>>>> Core instances.
>>>>>>>
>>>>>>> The biggest obvious hardware difference is that the cc2.8xlarges use
>>>>>>> hard disks instead of SSDs.
>>>>>>>
>>>>>>> While it's a little hard to track down the exact originating
>>>>>>> failure, I think it's from errors like:
>>>>>>>
>>>>>>> 2015-08-13 21:34:38,379 ERROR [IPC Server handler 24 on 45711]
>>>>>>> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task:
>>>>>>> attempt_1439499407003_0028_r_000153_1 - exited :
>>>>>>> org.apache.crunch.CrunchRuntimeException:
>>>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>>>> No lease on
>>>>>>> /tmp/crunch-970849245/p662/output/_temporary/1/_temporary/attempt_1439499407003_out7_0028_r_000153_1/out7-r-00153:
>>>>>>> File does not exist. Holder
>>>>>>> DFSClient_attempt_1439499407003_0028_r_000153_1_609888542_1 does not have
>>>>>>> any open files.
>>>>>>>
>>>>>>> Those paths look like these side effect files
>>>>>>> <https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapred/FileOutputFormat.html#getWorkOutputPath(org.apache.hadoop.mapred.JobConf)>
>>>>>>> .
>>>>>>>
>>>>>>> Would Crunch have generated applications that depend on side effect
>>>>>>> paths as input across MapReduce applications and something in HDFS is
>>>>>>> cleaning up those paths, unaware of the higher level dependencies? AWS
>>>>>>> configures Hadoop differently for each instance type, and might have more
>>>>>>> aggressive cleanup settings on HDs, though this is very uninformed
>>>>>>> hypothesis.
>>>>>>>
>>>>>>> A sample full log is attached.
>>>>>>>
>>>>>>> Thanks for any guidance!
>>>>>>>
>>>>>>> - Everett
>>>>>>>
>>>>>>>
>>>>>>> *DISCLAIMER:* The contents of this email, including any
>>>>>>> attachments, may contain information that is confidential, proprietary in
>>>>>>> nature, protected health information (PHI), or otherwise protected by law
>>>>>>> from disclosure, and is solely for the use of the intended recipient(s). If
>>>>>>> you are not the intended recipient, you are hereby notified that any use,
>>>>>>> disclosure or copying of this email, including any attachments, is
>>>>>>> unauthorized and strictly prohibited. If you have received this email in
>>>>>>> error, please notify the sender of this email. Please delete this and all
>>>>>>> copies of this email from your system. Any opinions either expressed or
>>>>>>> implied in this email and all attachments, are those of its author only,
>>>>>>> and do not necessarily reflect those of Nuna Health, Inc.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Director of Data Science
>>>>>> Cloudera <http://www.cloudera.com>
>>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>>>
>>>>>
>>>>>
>>>>
>>>> *DISCLAIMER:* The contents of this email, including any attachments,
>>>> may contain information that is confidential, proprietary in nature,
>>>> protected health information (PHI), or otherwise protected by law from
>>>> disclosure, and is solely for the use of the intended recipient(s). If you
>>>> are not the intended recipient, you are hereby notified that any use,
>>>> disclosure or copying of this email, including any attachments, is
>>>> unauthorized and strictly prohibited. If you have received this email in
>>>> error, please notify the sender of this email. Please delete this and all
>>>> copies of this email from your system. Any opinions either expressed or
>>>> implied in this email and all attachments, are those of its author only,
>>>> and do not necessarily reflect those of Nuna Health, Inc.
>>>>
>>>
>>>
>>>
>>> --
>>> Director of Data Science
>>> Cloudera <http://www.cloudera.com>
>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>
>>
>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>
>
> *DISCLAIMER:* The contents of this email, including any attachments, may
> contain information that is confidential, proprietary in nature, protected
> health information (PHI), or otherwise protected by law from disclosure,
> and is solely for the use of the intended recipient(s). If you are not the
> intended recipient, you are hereby notified that any use, disclosure or
> copying of this email, including any attachments, is unauthorized and
> strictly prohibited. If you have received this email in error, please
> notify the sender of this email. Please delete this and all copies of this
> email from your system. Any opinions either expressed or implied in this
> email and all attachments, are those of its author only, and do not
> necessarily reflect those of Nuna Health, Inc.
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Re: LeaseExpiredExceptions and temp side effect files

Posted by Everett Anderson <ev...@nuna.com>.
Sure, let me give it a try. I'm going to take 0.11 and patch it with

https://issues.apache.org/jira/browse/CRUNCH-553
https://issues.apache.org/jira/browse/CRUNCH-517

as we also rely on 517.



On Tue, Aug 18, 2015 at 4:09 PM, Josh Wills <jw...@cloudera.com> wrote:

> (In particular, I'm wondering if something in CRUNCH-481 is related to
> this problem.)
>
> On Tue, Aug 18, 2015 at 4:07 PM, Josh Wills <jw...@cloudera.com> wrote:
>
>> Hey Everett,
>>
>> Shot in the dark-- would you mind trying it w/0.11.0-hadoop2 w/the 553
>> patch? Is that easy to do?
>>
>> J
>>
>> On Tue, Aug 18, 2015 at 3:18 PM, Everett Anderson <ev...@nuna.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I verified that the pipeline succeeds on the same cc2.8xlarge hardware
>>> when setting crunch.max.running.jobs to 1. I generally feel like the
>>> pipeline application logic itself is sound at this point. It could be that
>>> this is just taxing these machines too hard and we need to increase the
>>> number of retries?
>>>
>>> It reliably fails on this hardware when crunch.max.running.jobs is set to
>>> its default.
>>>
>>> Can you explain a little what the /tmp/crunch-XXXXXXX files are as well
>>> as how Crunch uses side effect files? Do you know if HDFS would clean up
>>> those directories from underneath Crunch?
>>>
>>> There are usually 4 failed applications, all failing in the reduce phase. The
>>> failures seem to be one of the following three kinds -- (1) No lease on
>>> <side effect file>, (2) File not found </tmp/crunch-XXXXXXX> file, (3)
>>> SocketTimeoutException.
>>>
>>> Examples:
>>>
>>> [1] No lease exception
>>>
>>> Error: org.apache.crunch.CrunchRuntimeException:
>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>> No lease on
>>> /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003:
>>> File does not exist. Holder
>>> DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have
>>> any open files. at
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
>>> at
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
>>> at
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
>>> at
>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
>>> at
>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
>>> at
>>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>> at
>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at
>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at
>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at
>>> java.security.AccessController.doPrivileged(Native Method) at
>>> javax.security.auth.Subject.doAs(Subject.java:415) at
>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at
>>> org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>>> at
>>> org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>>> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195) at
>>> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656) at
>>> org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394) at
>>> org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175) at
>>> java.security.AccessController.doPrivileged(Native Method) at
>>> javax.security.auth.Subject.doAs(Subject.java:415) at
>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170) Caused by:
>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>> No lease on
>>> /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003:
>>> File does not exist. Holder
>>> DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have
>>> any open files. at
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
>>> at
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
>>> at
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
>>> at
>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
>>> at
>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
>>> at
>>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>> at
>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at
>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at
>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at
>>> java.security.AccessController.doPrivileged(Native Method) at
>>> javax.security.auth.Subject.doAs(Subject.java:415) at
>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at
>>> org.apache.hadoop.ipc.Client.call(Client.java:1410) at
>>> org.apache.hadoop.ipc.Client.call(Client.java:1363) at
>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
>>> at com.sun.proxy.$Proxy13.complete(Unknown Source) at
>>> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>> at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> at java.lang.reflect.Method.invoke(Method.java:606) at
>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
>>> at
>>> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
>>> at com.sun.proxy.$Proxy13.complete(Unknown Source) at
>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.complete(ClientNamenodeProtocolTranslatorPB.java:404)
>>> at
>>> org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2130)
>>> at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2114)
>>> at
>>> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
>>> at
>>> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
>>> at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:1289)
>>> at
>>> org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.close(SequenceFileOutputFormat.java:87)
>>> at
>>> org.apache.crunch.io.CrunchOutputs$OutputState.close(CrunchOutputs.java:300)
>>> at org.apache.crunch.io.CrunchOutputs.close(CrunchOutputs.java:180) at
>>> org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:72)
>>> ... 9 more
>>>
>>>
>>> [2] File does not exist
>>>
>>> 2015-08-18 17:36:10,195 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1439917295505_0034_r_000004_1: Error: org.apache.crunch.CrunchRuntimeException: Could not read runtime node information
>>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:48)
>>> 	at org.apache.crunch.impl.mr.run.CrunchReducer.setup(CrunchReducer.java:40)
>>> 	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:172)
>>> 	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>> 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>> 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>>> Caused by: java.io.FileNotFoundException: File does not exist: /tmp/crunch-4694113/p470/REDUCE
>>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
>>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
>>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
>>> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
>>> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
>>> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>>> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>>
>>> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>> 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>> 	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>> 	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>>> 	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>>> 	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1147)
>>> 	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1135)
>>> 	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1125)
>>> 	at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:273)
>>> 	at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:240)
>>> 	at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:233)
>>> 	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1298)
>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:300)
>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:296)
>>> 	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>>> 	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:296)
>>> 	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
>>> 	at org.apache.crunch.util.DistCache.read(DistCache.java:72)
>>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:46)
>>> 	... 9 more
>>>
>>> [3] SocketTimeoutException
>>>
>>> Error: org.apache.crunch.CrunchRuntimeException: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200]
>>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>>> 	at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>>> 	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195)
>>> 	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>>> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>>> 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>> 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>>> Caused by: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200]
>>> 	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>>> 	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>>> 	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>>> 	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
>>> 	at java.io.FilterInputStream.read(FilterInputStream.java:83)
>>> 	at java.io.FilterInputStream.read(FilterInputStream.java:83)
>>> 	at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1985)
>>> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1075)
>>> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1042)
>>> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1186)
>>> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:935)
>>> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:491)
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Fri, Aug 14, 2015 at 3:54 PM, Everett Anderson <ev...@nuna.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Fri, Aug 14, 2015 at 3:26 PM, Josh Wills <jw...@cloudera.com>
>>>> wrote:
>>>>
>>>>> Hey Everett,
>>>>>
>>>>> Initial thought-- there are lots of reasons for lease expired
>>>>> exceptions, and they're usually more symptomatic of other problems in the
>>>>> pipeline. Are you sure none of the jobs in the Crunch pipeline on the
>>>>> non-SSD instances are failing for some other reason? I'd be surprised if no
>>>>> other errors showed up in the app master, although there are reports of
>>>>> some weirdness around LeaseExpireds when writing to S3-- but you're not
>>>>> doing that here, right?
>>>>>
>>>>
>>>> We're reading from and writing to HDFS, here. (We've copied in input
>>>> from S3 to HDFS in another step.)
>>>>
>>>> There are a few exceptions in the logs. Most seem related to missing
>>>> temp files.
>>>>
>>>> Let me see if I can reproduce it with crunch.max.running.jobs set to 1
>>>> to try to narrow down the originating failure.
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>> J
>>>>>
>>>>> On Fri, Aug 14, 2015 at 2:10 PM, Everett Anderson <ev...@nuna.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I recently started trying to run our Crunch pipeline on more data and
>>>>>> have been trying out different AWS instance types in anticipation of our
>>>>>> storage and compute needs.
>>>>>>
>>>>>> I was using EMR 3.8 (so Hadoop 2.4.0) with Crunch 0.12 (patched with
>>>>>> the CRUNCH-553 <https://issues.apache.org/jira/browse/CRUNCH-553>
>>>>>> fix).
>>>>>>
>>>>>> Our pipeline finishes fine in these cluster configurations:
>>>>>>
>>>>>>    - 50 c3.4xlarge Core, 0 Task
>>>>>>    - 10 c3.8xlarge Core, 0 Task
>>>>>>    - 25 c3.8xlarge Core, 0 Task
>>>>>>
>>>>>> However, it always fails on the same data when using 10 cc2.8xlarge
>>>>>> Core instances.
>>>>>>
>>>>>> The biggest obvious hardware difference is that the cc2.8xlarges use
>>>>>> hard disks instead of SSDs.
>>>>>>
>>>>>> While it's a little hard to track down the exact originating failure,
>>>>>> I think it's from errors like:
>>>>>>
>>>>>> 2015-08-13 21:34:38,379 ERROR [IPC Server handler 24 on 45711]
>>>>>> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task:
>>>>>> attempt_1439499407003_0028_r_000153_1 - exited :
>>>>>> org.apache.crunch.CrunchRuntimeException:
>>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>>> No lease on
>>>>>> /tmp/crunch-970849245/p662/output/_temporary/1/_temporary/attempt_1439499407003_out7_0028_r_000153_1/out7-r-00153:
>>>>>> File does not exist. Holder
>>>>>> DFSClient_attempt_1439499407003_0028_r_000153_1_609888542_1 does not have
>>>>>> any open files.
>>>>>>
>>>>>> Those paths look like these side effect files
>>>>>> <https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapred/FileOutputFormat.html#getWorkOutputPath(org.apache.hadoop.mapred.JobConf)>
>>>>>> .
>>>>>>
>>>>>> Would Crunch have generated applications that depend on side effect
>>>>>> paths as input across MapReduce applications and something in HDFS is
>>>>>> cleaning up those paths, unaware of the higher level dependencies? AWS
>>>>>> configures Hadoop differently for each instance type, and might have more
>>>>>> aggressive cleanup settings on HDs, though this is very uninformed
>>>>>> hypothesis.
>>>>>>
>>>>>> A sample full log is attached.
>>>>>>
>>>>>> Thanks for any guidance!
>>>>>>
>>>>>> - Everett
>>>>>>
>>>>>>
>>>>>> *DISCLAIMER:* The contents of this email, including any attachments,
>>>>>> may contain information that is confidential, proprietary in nature,
>>>>>> protected health information (PHI), or otherwise protected by law from
>>>>>> disclosure, and is solely for the use of the intended recipient(s). If you
>>>>>> are not the intended recipient, you are hereby notified that any use,
>>>>>> disclosure or copying of this email, including any attachments, is
>>>>>> unauthorized and strictly prohibited. If you have received this email in
>>>>>> error, please notify the sender of this email. Please delete this and all
>>>>>> copies of this email from your system. Any opinions either expressed or
>>>>>> implied in this email and all attachments, are those of its author only,
>>>>>> and do not necessarily reflect those of Nuna Health, Inc.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Director of Data Science
>>>>> Cloudera <http://www.cloudera.com>
>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>>
>>>>
>>>>
>>>
>>> *DISCLAIMER:* The contents of this email, including any attachments,
>>> may contain information that is confidential, proprietary in nature,
>>> protected health information (PHI), or otherwise protected by law from
>>> disclosure, and is solely for the use of the intended recipient(s). If you
>>> are not the intended recipient, you are hereby notified that any use,
>>> disclosure or copying of this email, including any attachments, is
>>> unauthorized and strictly prohibited. If you have received this email in
>>> error, please notify the sender of this email. Please delete this and all
>>> copies of this email from your system. Any opinions either expressed or
>>> implied in this email and all attachments, are those of its author only,
>>> and do not necessarily reflect those of Nuna Health, Inc.
>>>
>>
>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>

-- 
*DISCLAIMER:* The contents of this email, including any attachments, may 
contain information that is confidential, proprietary in nature, protected 
health information (PHI), or otherwise protected by law from disclosure, 
and is solely for the use of the intended recipient(s). If you are not the 
intended recipient, you are hereby notified that any use, disclosure or 
copying of this email, including any attachments, is unauthorized and 
strictly prohibited. If you have received this email in error, please 
notify the sender of this email. Please delete this and all copies of this 
email from your system. Any opinions either expressed or implied in this 
email and all attachments, are those of its author only, and do not 
necessarily reflect those of Nuna Health, Inc.

Re: LeaseExpiredExceptions and temp side effect files

Posted by Josh Wills <jw...@cloudera.com>.
(In particular, I'm wondering if something in CRUNCH-481 is related to this
problem.)

On Tue, Aug 18, 2015 at 4:07 PM, Josh Wills <jw...@cloudera.com> wrote:

> Hey Everett,
>
> Shot in the dark-- would you mind trying it w/0.11.0-hadoop2 w/the 553
> patch? Is that easy to do?
>
> J
>
> On Tue, Aug 18, 2015 at 3:18 PM, Everett Anderson <ev...@nuna.com>
> wrote:
>
>> Hi,
>>
>> I verified that the pipeline succeeds on the same cc2.8xlarge hardware
>> when setting crunch.max.running.jobs to 1. I generally feel like the
>> pipeline application logic itself is sound at this point. It could be that
>> this is just taxing these machines too hard and we need to increase the
>> number of retries?
>>
>> It reliably fails on this hardware when crunch.max.running.jobs is set to
>> its default.
>>
>> Can you explain a little what the /tmp/crunch-XXXXXXX files are as well
>> as how Crunch uses side effect files? Do you know if HDFS would clean up
>> those directories from underneath Crunch?
>>
>> There are usually 4 failed applications, all failing in the reduce phase. The
>> failures seem to be one of the following three kinds -- (1) No lease on
>> <side effect file>, (2) File not found </tmp/crunch-XXXXXXX> file, (3)
>> SocketTimeoutException.
>>
>> Examples:
>>
>> [1] No lease exception
>>
>> Error: org.apache.crunch.CrunchRuntimeException:
>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>> No lease on
>> /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003:
>> File does not exist. Holder
>> DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have
>> any open files. at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
>> at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
>> at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
>> at
>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
>> at
>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
>> at
>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>> at
>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at
>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at
>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at
>> java.security.AccessController.doPrivileged(Native Method) at
>> javax.security.auth.Subject.doAs(Subject.java:415) at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at
>> org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
>> at
>> org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
>> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195) at
>> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656) at
>> org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394) at
>> org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175) at
>> java.security.AccessController.doPrivileged(Native Method) at
>> javax.security.auth.Subject.doAs(Subject.java:415) at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170) Caused by:
>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>> No lease on
>> /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003:
>> File does not exist. Holder
>> DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have
>> any open files. at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
>> at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
>> at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
>> at
>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
>> at
>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
>> at
>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>> at
>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at
>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at
>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at
>> java.security.AccessController.doPrivileged(Native Method) at
>> javax.security.auth.Subject.doAs(Subject.java:415) at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at
>> org.apache.hadoop.ipc.Client.call(Client.java:1410) at
>> org.apache.hadoop.ipc.Client.call(Client.java:1363) at
>> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
>> at com.sun.proxy.$Proxy13.complete(Unknown Source) at
>> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606) at
>> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
>> at
>> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
>> at com.sun.proxy.$Proxy13.complete(Unknown Source) at
>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.complete(ClientNamenodeProtocolTranslatorPB.java:404)
>> at
>> org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2130)
>> at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2114)
>> at
>> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
>> at
>> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
>> at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:1289)
>> at
>> org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.close(SequenceFileOutputFormat.java:87)
>> at
>> org.apache.crunch.io.CrunchOutputs$OutputState.close(CrunchOutputs.java:300)
>> at org.apache.crunch.io.CrunchOutputs.close(CrunchOutputs.java:180) at
>> org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:72)
>> ... 9 more
>>
>>
>> [2] File does not exist
>>
>> 2015-08-18 17:36:10,195 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1439917295505_0034_r_000004_1: Error: org.apache.crunch.CrunchRuntimeException: Could not read runtime node information
>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:48)
>> 	at org.apache.crunch.impl.mr.run.CrunchReducer.setup(CrunchReducer.java:40)
>> 	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:172)
>> 	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
>> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
>> 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>> 	at java.security.AccessController.doPrivileged(Native Method)
>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>> 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
>> Caused by: java.io.FileNotFoundException: File does not exist: /tmp/crunch-4694113/p470/REDUCE
>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
>> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
>> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
>> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
>> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
>> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
>> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>> 	at java.security.AccessController.doPrivileged(Native Method)
>> 	at javax.security.auth.Subject.doAs(Subject.java:415)
>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>>
>> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>> 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>> 	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>> 	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>> 	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>> 	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1147)
>> 	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1135)
>> 	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1125)
>> 	at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:273)
>> 	at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:240)
>> 	at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:233)
>> 	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1298)
>> 	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:300)
>> 	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:296)
>> 	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>> 	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:296)
>> 	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
>> 	at org.apache.crunch.util.DistCache.read(DistCache.java:72)
>> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:46)
>> 	... 9 more
>>
>> [3] SocketTimeoutException
>>
>> Error: org.apache.crunch.CrunchRuntimeException: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200] at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74) at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64) at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195) at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170) Caused by: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200] at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118) at java.io.FilterInputStream.read(FilterInputStream.java:83) at java.io.FilterInputStream.read(FilterInputStream.java:83) at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1985) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1075) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1042) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1186) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:935) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:491)
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Fri, Aug 14, 2015 at 3:54 PM, Everett Anderson <ev...@nuna.com>
>> wrote:
>>
>>>
>>>
>>> On Fri, Aug 14, 2015 at 3:26 PM, Josh Wills <jw...@cloudera.com> wrote:
>>>
>>>> Hey Everett,
>>>>
>>>> Initial thought-- there are lots of reasons for lease expired
>>>> exceptions, and they're usually more symptomatic of other problems in the
>>>> pipeline. Are you sure none of the jobs in the Crunch pipeline on the
>>>> non-SSD instances are failing for some other reason? I'd be surprised if no
>>>> other errors showed up in the app master, although there are reports of
>>>> some weirdness around LeaseExpireds when writing to S3-- but you're not
>>>> doing that here, right?
>>>>
>>>
>>> We're reading from and writing to HDFS, here. (We've copied in input
>>> from S3 to HDFS in another step.)
>>>
>>> There are a few exceptions in the logs. Most seem related to missing
>>> temp files.
>>>
>>> Let me see if I can reproduce it with crunch.max.running.jobs set to 1
>>> to try to narrow down the originating failure.
>>>
>>>
>>>
>>>
>>>>
>>>> J
>>>>
>>>> On Fri, Aug 14, 2015 at 2:10 PM, Everett Anderson <ev...@nuna.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I recently started trying to run our Crunch pipeline on more data and
>>>>> have been trying out different AWS instance types in anticipation of our
>>>>> storage and compute needs.
>>>>>
>>>>> I was using EMR 3.8 (so Hadoop 2.4.0) with Crunch 0.12 (patched with
>>>>> the CRUNCH-553 <https://issues.apache.org/jira/browse/CRUNCH-553>
>>>>> fix).
>>>>>
>>>>> Our pipeline finishes fine in these cluster configurations:
>>>>>
>>>>>    - 50 c3.4xlarge Core, 0 Task
>>>>>    - 10 c3.8xlarge Core, 0 Task
>>>>>    - 25 c3.8xlarge Core, 0 Task
>>>>>
>>>>> However, it always fails on the same data when using 10 cc2.8xlarge
>>>>> Core instances.
>>>>>
>>>>> The biggest obvious hardware difference is that the cc2.8xlarges use
>>>>> hard disks instead of SSDs.
>>>>>
>>>>> While it's a little hard to track down the exact originating failure,
>>>>> I think it's from errors like:
>>>>>
>>>>> 2015-08-13 21:34:38,379 ERROR [IPC Server handler 24 on 45711]
>>>>> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task:
>>>>> attempt_1439499407003_0028_r_000153_1 - exited :
>>>>> org.apache.crunch.CrunchRuntimeException:
>>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>>> No lease on
>>>>> /tmp/crunch-970849245/p662/output/_temporary/1/_temporary/attempt_1439499407003_out7_0028_r_000153_1/out7-r-00153:
>>>>> File does not exist. Holder
>>>>> DFSClient_attempt_1439499407003_0028_r_000153_1_609888542_1 does not have
>>>>> any open files.
>>>>>
>>>>> Those paths look like these side effect files
>>>>> <https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapred/FileOutputFormat.html#getWorkOutputPath(org.apache.hadoop.mapred.JobConf)>
>>>>> .
>>>>>
>>>>> Would Crunch have generated applications that depend on side effect
>>>>> paths as input across MapReduce applications and something in HDFS is
>>>>> cleaning up those paths, unaware of the higher level dependencies? AWS
>>>>> configures Hadoop differently for each instance type, and might have more
>>>>> aggressive cleanup settings on HDs, though this is very uninformed
>>>>> hypothesis.
>>>>>
>>>>> A sample full log is attached.
>>>>>
>>>>> Thanks for any guidance!
>>>>>
>>>>> - Everett
>>>>>
>>>>>
>>>>> *DISCLAIMER:* The contents of this email, including any attachments,
>>>>> may contain information that is confidential, proprietary in nature,
>>>>> protected health information (PHI), or otherwise protected by law from
>>>>> disclosure, and is solely for the use of the intended recipient(s). If you
>>>>> are not the intended recipient, you are hereby notified that any use,
>>>>> disclosure or copying of this email, including any attachments, is
>>>>> unauthorized and strictly prohibited. If you have received this email in
>>>>> error, please notify the sender of this email. Please delete this and all
>>>>> copies of this email from your system. Any opinions either expressed or
>>>>> implied in this email and all attachments, are those of its author only,
>>>>> and do not necessarily reflect those of Nuna Health, Inc.
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Director of Data Science
>>>> Cloudera <http://www.cloudera.com>
>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>
>>>
>>>
>>
>> *DISCLAIMER:* The contents of this email, including any attachments, may
>> contain information that is confidential, proprietary in nature, protected
>> health information (PHI), or otherwise protected by law from disclosure,
>> and is solely for the use of the intended recipient(s). If you are not the
>> intended recipient, you are hereby notified that any use, disclosure or
>> copying of this email, including any attachments, is unauthorized and
>> strictly prohibited. If you have received this email in error, please
>> notify the sender of this email. Please delete this and all copies of this
>> email from your system. Any opinions either expressed or implied in this
>> email and all attachments, are those of its author only, and do not
>> necessarily reflect those of Nuna Health, Inc.
>>
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Re: LeaseExpiredExceptions and temp side effect files

Posted by Josh Wills <jw...@cloudera.com>.
Hey Everett,

Shot in the dark-- would you mind trying it w/0.11.0-hadoop2 w/the 553
patch? Is that easy to do?

J

On Tue, Aug 18, 2015 at 3:18 PM, Everett Anderson <ev...@nuna.com> wrote:

> Hi,
>
> I verified that the pipeline succeeds on the same cc2.8xlarge hardware
> when setting crunch.max.running.jobs to 1. I generally feel like the
> pipeline application logic itself is sound at this point. It could be that
> this is just taxing these machines too hard and we need to increase the
> number of retries?
>
> It reliably fails on this hardware when crunch.max.running.jobs is set to
> its default.
>
> Can you explain a little what the /tmp/crunch-XXXXXXX files are as well as
> how Crunch uses side effect files? Do you know if HDFS would clean up those
> directories from underneath Crunch?
>
> There are usually 4 failed applications, all failing in their reduce tasks. The
> failures seem to be one of the following three kinds -- (1) No lease on
> <side effect file>, (2) File not found </tmp/crunch-XXXXXXX> file, (3)
> SocketTimeoutException.
>
> Examples:
>
> [1] No lease exception
>
> Error: org.apache.crunch.CrunchRuntimeException:
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> No lease on
> /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003:
> File does not exist. Holder
> DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have
> any open files. at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at
> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at
> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at
> java.security.AccessController.doPrivileged(Native Method) at
> javax.security.auth.Subject.doAs(Subject.java:415) at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at
> org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
> at
> org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195) at
> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656) at
> org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394) at
> org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175) at
> java.security.AccessController.doPrivileged(Native Method) at
> javax.security.auth.Subject.doAs(Subject.java:415) at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170) Caused by:
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> No lease on
> /tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003:
> File does not exist. Holder
> DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have
> any open files. at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at
> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at
> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at
> java.security.AccessController.doPrivileged(Native Method) at
> javax.security.auth.Subject.doAs(Subject.java:415) at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at
> org.apache.hadoop.ipc.Client.call(Client.java:1410) at
> org.apache.hadoop.ipc.Client.call(Client.java:1363) at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
> at com.sun.proxy.$Proxy13.complete(Unknown Source) at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606) at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
> at com.sun.proxy.$Proxy13.complete(Unknown Source) at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.complete(ClientNamenodeProtocolTranslatorPB.java:404)
> at
> org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2130)
> at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2114)
> at
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
> at
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
> at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:1289)
> at
> org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.close(SequenceFileOutputFormat.java:87)
> at
> org.apache.crunch.io.CrunchOutputs$OutputState.close(CrunchOutputs.java:300)
> at org.apache.crunch.io.CrunchOutputs.close(CrunchOutputs.java:180) at
> org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:72)
> ... 9 more
>
>
> [2] File does not exist
>
> 2015-08-18 17:36:10,195 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1439917295505_0034_r_000004_1: Error: org.apache.crunch.CrunchRuntimeException: Could not read runtime node information
> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:48)
> 	at org.apache.crunch.impl.mr.run.CrunchReducer.setup(CrunchReducer.java:40)
> 	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:172)
> 	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
> 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:415)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
> Caused by: java.io.FileNotFoundException: File does not exist: /tmp/crunch-4694113/p470/REDUCE
> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:415)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>
> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> 	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> 	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
> 	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
> 	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1147)
> 	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1135)
> 	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1125)
> 	at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:273)
> 	at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:240)
> 	at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:233)
> 	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1298)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:300)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:296)
> 	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:296)
> 	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
> 	at org.apache.crunch.util.DistCache.read(DistCache.java:72)
> 	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:46)
> 	... 9 more
>
> [3] SocketTimeoutException
>
> Error: org.apache.crunch.CrunchRuntimeException: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200] at org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74) at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64) at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195) at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170) Caused by: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720 remote=/10.55.1.230:9200] at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118) at java.io.FilterInputStream.read(FilterInputStream.java:83) at java.io.FilterInputStream.read(FilterInputStream.java:83) at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1985) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1075) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1042) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1186) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:935) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:491)
>
>
>
>
>
>
>
>
>
>
>
>
>
> On Fri, Aug 14, 2015 at 3:54 PM, Everett Anderson <ev...@nuna.com>
> wrote:
>
>>
>>
>> On Fri, Aug 14, 2015 at 3:26 PM, Josh Wills <jw...@cloudera.com> wrote:
>>
>>> Hey Everett,
>>>
>>> Initial thought-- there are lots of reasons for lease expired
>>> exceptions, and they're usually more symptomatic of other problems in the
>>> pipeline. Are you sure none of the jobs in the Crunch pipeline on the
>>> non-SSD instances are failing for some other reason? I'd be surprised if no
>>> other errors showed up in the app master, although there are reports of
>>> some weirdness around LeaseExpireds when writing to S3-- but you're not
>>> doing that here, right?
>>>
>>
>> We're reading from and writing to HDFS, here. (We've copied in input from
>> S3 to HDFS in another step.)
>>
>> There are a few exceptions in the logs. Most seem related to missing temp
>> files.
>>
>> Let me see if I can reproduce it with crunch.max.running.jobs set to 1
>> to try to narrow down the originating failure.
>>
>>
>>
>>
>>>
>>> J
>>>
>>> On Fri, Aug 14, 2015 at 2:10 PM, Everett Anderson <ev...@nuna.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I recently started trying to run our Crunch pipeline on more data and
>>>> have been trying out different AWS instance types in anticipation of our
>>>> storage and compute needs.
>>>>
>>>> I was using EMR 3.8 (so Hadoop 2.4.0) with Crunch 0.12 (patched with
>>>> the CRUNCH-553 <https://issues.apache.org/jira/browse/CRUNCH-553> fix).
>>>>
>>>> Our pipeline finishes fine in these cluster configurations:
>>>>
>>>>    - 50 c3.4xlarge Core, 0 Task
>>>>    - 10 c3.8xlarge Core, 0 Task
>>>>    - 25 c3.8xlarge Core, 0 Task
>>>>
>>>> However, it always fails on the same data when using 10 cc2.8xlarge
>>>> Core instances.
>>>>
>>>> The biggest obvious hardware difference is that the cc2.8xlarges use
>>>> hard disks instead of SSDs.
>>>>
>>>> While it's a little hard to track down the exact originating failure, I
>>>> think it's from errors like:
>>>>
>>>> 2015-08-13 21:34:38,379 ERROR [IPC Server handler 24 on 45711]
>>>> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task:
>>>> attempt_1439499407003_0028_r_000153_1 - exited :
>>>> org.apache.crunch.CrunchRuntimeException:
>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>> No lease on
>>>> /tmp/crunch-970849245/p662/output/_temporary/1/_temporary/attempt_1439499407003_out7_0028_r_000153_1/out7-r-00153:
>>>> File does not exist. Holder
>>>> DFSClient_attempt_1439499407003_0028_r_000153_1_609888542_1 does not have
>>>> any open files.
>>>>
>>>> Those paths look like these side effect files
>>>> <https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapred/FileOutputFormat.html#getWorkOutputPath(org.apache.hadoop.mapred.JobConf)>
>>>> .
>>>>
>>>> Would Crunch have generated applications that depend on side effect
>>>> paths as input across MapReduce applications and something in HDFS is
>>>> cleaning up those paths, unaware of the higher level dependencies? AWS
>>>> configures Hadoop differently for each instance type, and might have more
>>>> aggressive cleanup settings on HDs, though this is very uninformed
>>>> hypothesis.
>>>>
>>>> A sample full log is attached.
>>>>
>>>> Thanks for any guidance!
>>>>
>>>> - Everett
>>>>
>>>>
>>>> *DISCLAIMER:* The contents of this email, including any attachments,
>>>> may contain information that is confidential, proprietary in nature,
>>>> protected health information (PHI), or otherwise protected by law from
>>>> disclosure, and is solely for the use of the intended recipient(s). If you
>>>> are not the intended recipient, you are hereby notified that any use,
>>>> disclosure or copying of this email, including any attachments, is
>>>> unauthorized and strictly prohibited. If you have received this email in
>>>> error, please notify the sender of this email. Please delete this and all
>>>> copies of this email from your system. Any opinions either expressed or
>>>> implied in this email and all attachments, are those of its author only,
>>>> and do not necessarily reflect those of Nuna Health, Inc.
>>>
>>>
>>>
>>>
>>> --
>>> Director of Data Science
>>> Cloudera <http://www.cloudera.com>
>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>
>>
>>
>
> *DISCLAIMER:* The contents of this email, including any attachments, may
> contain information that is confidential, proprietary in nature, protected
> health information (PHI), or otherwise protected by law from disclosure,
> and is solely for the use of the intended recipient(s). If you are not the
> intended recipient, you are hereby notified that any use, disclosure or
> copying of this email, including any attachments, is unauthorized and
> strictly prohibited. If you have received this email in error, please
> notify the sender of this email. Please delete this and all copies of this
> email from your system. Any opinions either expressed or implied in this
> email and all attachments, are those of its author only, and do not
> necessarily reflect those of Nuna Health, Inc.
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Re: LeaseExpiredExceptions and temp side effect files

Posted by Everett Anderson <ev...@nuna.com>.
Hi,

I verified that the pipeline succeeds on the same cc2.8xlarge hardware when
setting crunch.max.running.jobs to 1. I generally feel like the pipeline
application logic itself is sound at this point. It could be that this is
just taxing these machines too hard and we need to increase the number of
retries?

It reliably fails on this hardware when crunch.max.running.jobs is set to its
default.
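
For reference, this is roughly how I'm forcing serial execution for these
runs -- just a minimal sketch of a driver with placeholder class and path
names, not our actual pipeline code:

import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.hadoop.conf.Configuration;

public class SerialRunDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Run the planned MapReduce jobs one at a time instead of concurrently.
    conf.set("crunch.max.running.jobs", "1");
    // If it really is just load, we could also allow more task retries
    // (standard Hadoop 2 settings; the default is 4 attempts).
    conf.setInt("mapreduce.map.maxattempts", 8);
    conf.setInt("mapreduce.reduce.maxattempts", 8);

    Pipeline pipeline = new MRPipeline(SerialRunDriver.class, conf);
    PCollection<String> lines = pipeline.readTextFile(args[0]); // placeholder input
    pipeline.writeTextFile(lines, args[1]);                     // placeholder output
    pipeline.done();
  }
}

The maxattempts lines are only there because of the retry question above; I
haven't confirmed whether they actually help.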

Can you explain a little what the /tmp/crunch-XXXXXXX files are as well as
how Crunch uses side effect files? Do you know if HDFS would clean up those
directories from underneath Crunch?
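
For clarity, by "side effect files" I mean the task-attempt work files under
an output's _temporary directory -- the kind of paths showing up in the
traces below. A minimal illustration using the plain mapred API (not Crunch
code; "part-extra" is just a made-up file name):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

public class SideEffectPathExample {
  // getWorkOutputPath returns the attempt-scoped work directory, which
  // typically looks like
  //   <output>/_temporary/<appAttempt>/_temporary/attempt_<taskAttemptId>
  // i.e. the same shape as the paths in the LeaseExpiredException messages.
  public static Path sideEffectFile(JobConf job) {
    Path workDir = FileOutputFormat.getWorkOutputPath(job);
    return new Path(workDir, "part-extra"); // hypothetical side effect file
  }
}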

There are usually 4 failed applications, all failing in their reduce tasks. The
failures seem to be one of the following three kinds -- (1) No lease on
<side effect file>, (2) File not found </tmp/crunch-XXXXXXX> file, (3)
SocketTimeoutException.

Examples:

[1] No lease exception

Error: org.apache.crunch.CrunchRuntimeException:
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
No lease on
/tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003:
File does not exist. Holder
DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have
any open files. at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
at
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
at
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at
java.security.AccessController.doPrivileged(Native Method) at
javax.security.auth.Subject.doAs(Subject.java:415) at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at
org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
at
org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195) at
org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656) at
org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394) at
org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175) at
java.security.AccessController.doPrivileged(Native Method) at
javax.security.auth.Subject.doAs(Subject.java:415) at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170) Caused by:
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
No lease on
/tmp/crunch-4694113/p662/output/_temporary/1/_temporary/attempt_1439917295505_out7_0018_r_000003_1/out7-r-00003:
File does not exist. Holder
DFSClient_attempt_1439917295505_0018_r_000003_1_824053899_1 does not have
any open files. at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2944)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3008)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2988)
at
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:641)
at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:484)
at
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at
java.security.AccessController.doPrivileged(Native Method) at
javax.security.auth.Subject.doAs(Subject.java:415) at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at
org.apache.hadoop.ipc.Client.call(Client.java:1410) at
org.apache.hadoop.ipc.Client.call(Client.java:1363) at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
at com.sun.proxy.$Proxy13.complete(Unknown Source) at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606) at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
at com.sun.proxy.$Proxy13.complete(Unknown Source) at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.complete(ClientNamenodeProtocolTranslatorPB.java:404)
at
org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2130)
at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2114)
at
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
at
org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:1289)
at
org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.close(SequenceFileOutputFormat.java:87)
at
org.apache.crunch.io.CrunchOutputs$OutputState.close(CrunchOutputs.java:300)
at org.apache.crunch.io.CrunchOutputs.close(CrunchOutputs.java:180) at
org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:72)
... 9 more


[2] File does not exist

2015-08-18 17:36:10,195 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
Diagnostics report from attempt_1439917295505_0034_r_000004_1: Error:
org.apache.crunch.CrunchRuntimeException: Could not read runtime node
information
	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:48)
	at org.apache.crunch.impl.mr.run.CrunchReducer.setup(CrunchReducer.java:40)
	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:172)
	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
Caused by: java.io.FileNotFoundException: File does not exist:
/tmp/crunch-4694113/p470/REDUCE
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1726)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1669)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1649)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1621)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:599)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)

	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1147)
	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1135)
	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1125)
	at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:273)
	at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:240)
	at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:233)
	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1298)
	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:300)
	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:296)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:296)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
	at org.apache.crunch.util.DistCache.read(DistCache.java:72)
	at org.apache.crunch.impl.mr.run.CrunchTaskContext.<init>(CrunchTaskContext.java:46)
	... 9 more

[3] SocketTimeoutException

Error: org.apache.crunch.CrunchRuntimeException:
java.net.SocketTimeoutException: 70000 millis timeout while waiting
for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720
remote=/10.55.1.230:9200] at
org.apache.crunch.impl.mr.run.CrunchTaskContext.cleanup(CrunchTaskContext.java:74)
at org.apache.crunch.impl.mr.run.CrunchReducer.cleanup(CrunchReducer.java:64)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:195) at
org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:656)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:394) at
org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175) at
java.security.AccessController.doPrivileged(Native Method) at
javax.security.auth.Subject.doAs(Subject.java:415) at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170) Caused
by: java.net.SocketTimeoutException: 70000 millis timeout while
waiting for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/10.55.1.229:35720
remote=/10.55.1.230:9200] at
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
at java.io.FilterInputStream.read(FilterInputStream.java:83) at
java.io.FilterInputStream.read(FilterInputStream.java:83) at
org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1985)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1075)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1042)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1186)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:935)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:491)













On Fri, Aug 14, 2015 at 3:54 PM, Everett Anderson <ev...@nuna.com> wrote:

>
>
> On Fri, Aug 14, 2015 at 3:26 PM, Josh Wills <jw...@cloudera.com> wrote:
>
>> Hey Everett,
>>
>> Initial thought-- there are lots of reasons for lease expired exceptions,
>> and they're usually more symptomatic of other problems in the pipeline. Are
>> you sure none of the jobs in the Crunch pipeline on the non-SSD instances
>> are failing for some other reason? I'd be surprised if no other errors
>> showed up in the app master, although there are reports of some weirdness
>> around LeaseExpireds when writing to S3-- but you're not doing that here,
>> right?
>>
>
> We're reading from and writing to HDFS, here. (We've copied in input from
> S3 to HDFS in another step.)
>
> There are a few exceptions in the logs. Most seem related to missing temp
> files.
>
> Let me see if I can reproduce it with crunch.max.running.jobs set to 1 to
> try to narrow down the originating failure.
>
>
>
>
>>
>> J
>>
>> On Fri, Aug 14, 2015 at 2:10 PM, Everett Anderson <ev...@nuna.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I recently started trying to run our Crunch pipeline on more data and
>>> have been trying out different AWS instance types in anticipation of our
>>> storage and compute needs.
>>>
>>> I was using EMR 3.8 (so Hadoop 2.4.0) with Crunch 0.12 (patched with the
>>> CRUNCH-553 <https://issues.apache.org/jira/browse/CRUNCH-553> fix).
>>>
>>> Our pipeline finishes fine in these cluster configurations:
>>>
>>>    - 50 c3.4xlarge Core, 0 Task
>>>    - 10 c3.8xlarge Core, 0 Task
>>>    - 25 c3.8xlarge Core, 0 Task
>>>
>>> However, it always fails on the same data when using 10 cc2.8xlarge Core
>>> instances.
>>>
>>> The biggest obvious hardware difference is that the cc2.8xlarges use
>>> hard disks instead of SSDs.
>>>
>>> While it's a little hard to track down the exact originating failure, I
>>> think it's from errors like:
>>>
>>> 2015-08-13 21:34:38,379 ERROR [IPC Server handler 24 on 45711]
>>> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task:
>>> attempt_1439499407003_0028_r_000153_1 - exited :
>>> org.apache.crunch.CrunchRuntimeException:
>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>> No lease on
>>> /tmp/crunch-970849245/p662/output/_temporary/1/_temporary/attempt_1439499407003_out7_0028_r_000153_1/out7-r-00153:
>>> File does not exist. Holder
>>> DFSClient_attempt_1439499407003_0028_r_000153_1_609888542_1 does not have
>>> any open files.
>>>
>>> Those paths look like these side effect files
>>> <https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapred/FileOutputFormat.html#getWorkOutputPath(org.apache.hadoop.mapred.JobConf)>
>>> .
>>>
>>> Would Crunch have generated applications that depend on side effect
>>> paths as input across MapReduce applications and something in HDFS is
>>> cleaning up those paths, unaware of the higher level dependencies? AWS
>>> configures Hadoop differently for each instance type, and might have more
>>> aggressive cleanup settings on HDs, though this is a very uninformed
>>> hypothesis.
>>>
>>> A sample full log is attached.
>>>
>>> Thanks for any guidance!
>>>
>>> - Everett
>>>
>>>
>>> *DISCLAIMER:* The contents of this email, including any attachments,
>>> may contain information that is confidential, proprietary in nature,
>>> protected health information (PHI), or otherwise protected by law from
>>> disclosure, and is solely for the use of the intended recipient(s). If you
>>> are not the intended recipient, you are hereby notified that any use,
>>> disclosure or copying of this email, including any attachments, is
>>> unauthorized and strictly prohibited. If you have received this email in
>>> error, please notify the sender of this email. Please delete this and all
>>> copies of this email from your system. Any opinions either expressed or
>>> implied in this email and all attachments, are those of its author only,
>>> and do not necessarily reflect those of Nuna Health, Inc.
>>
>>
>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>
>

-- 
*DISCLAIMER:* The contents of this email, including any attachments, may 
contain information that is confidential, proprietary in nature, protected 
health information (PHI), or otherwise protected by law from disclosure, 
and is solely for the use of the intended recipient(s). If you are not the 
intended recipient, you are hereby notified that any use, disclosure or 
copying of this email, including any attachments, is unauthorized and 
strictly prohibited. If you have received this email in error, please 
notify the sender of this email. Please delete this and all copies of this 
email from your system. Any opinions either expressed or implied in this 
email and all attachments, are those of its author only, and do not 
necessarily reflect those of Nuna Health, Inc.

Re: LeaseExpiredExceptions and temp side effect files

Posted by Everett Anderson <ev...@nuna.com>.
On Fri, Aug 14, 2015 at 3:26 PM, Josh Wills <jw...@cloudera.com> wrote:

> Hey Everett,
>
> Initial thought-- there are lots of reasons for lease expired exceptions,
> and they're usually more symptomatic of other problems in the pipeline. Are
> you sure none of the jobs in the Crunch pipeline on the non-SSD instances
> are failing for some other reason? I'd be surprised if no other errors
> showed up in the app master, although there are reports of some weirdness
> around LeaseExpireds when writing to S3-- but you're not doing that here,
> right?
>

We're reading from and writing to HDFS, here. (We've copied in input from
S3 to HDFS in another step.)

There are a few exceptions in the logs. Most seem related to missing temp
files.

Let me see if I can reproduce it with crunch.max.running.jobs set to 1 to
try to narrow down the originating failure.




>
> J
>
> On Fri, Aug 14, 2015 at 2:10 PM, Everett Anderson <ev...@nuna.com>
> wrote:
>
>> Hi,
>>
>> I recently started trying to run our Crunch pipeline on more data and
>> have been trying out different AWS instance types in anticipation of our
>> storage and compute needs.
>>
>> I was using EMR 3.8 (so Hadoop 2.4.0) with Crunch 0.12 (patched with the
>> CRUNCH-553 <https://issues.apache.org/jira/browse/CRUNCH-553> fix).
>>
>> Our pipeline finishes fine in these cluster configurations:
>>
>>    - 50 c3.4xlarge Core, 0 Task
>>    - 10 c3.8xlarge Core, 0 Task
>>    - 25 c3.8xlarge Core, 0 Task
>>
>> However, it always fails on the same data when using 10 cc2.8xlarge Core
>> instances.
>>
>> The biggest obvious hardware difference is that the cc2.8xlarges use hard
>> disks instead of SSDs.
>>
>> While it's a little hard to track down the exact originating failure, I
>> think it's from errors like:
>>
>> 2015-08-13 21:34:38,379 ERROR [IPC Server handler 24 on 45711]
>> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task:
>> attempt_1439499407003_0028_r_000153_1 - exited :
>> org.apache.crunch.CrunchRuntimeException:
>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>> No lease on
>> /tmp/crunch-970849245/p662/output/_temporary/1/_temporary/attempt_1439499407003_out7_0028_r_000153_1/out7-r-00153:
>> File does not exist. Holder
>> DFSClient_attempt_1439499407003_0028_r_000153_1_609888542_1 does not have
>> any open files.
>>
>> Those paths look like these side effect files
>> <https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapred/FileOutputFormat.html#getWorkOutputPath(org.apache.hadoop.mapred.JobConf)>
>> .
>>
>> Would Crunch have generated applications that depend on side effect paths
>> as input across MapReduce applications and something in HDFS is cleaning up
>> those paths, unaware of the higher level dependencies? AWS configures
>> Hadoop differently for each instance type, and might have more aggressive
>> cleanup settings on HDs, though this is very uninformed hypothesis.
>>
>> A sample full log is attached.
>>
>> Thanks for any guidance!
>>
>> - Everett
>>
>>
>> *DISCLAIMER:* The contents of this email, including any attachments, may
>> contain information that is confidential, proprietary in nature, protected
>> health information (PHI), or otherwise protected by law from disclosure,
>> and is solely for the use of the intended recipient(s). If you are not the
>> intended recipient, you are hereby notified that any use, disclosure or
>> copying of this email, including any attachments, is unauthorized and
>> strictly prohibited. If you have received this email in error, please
>> notify the sender of this email. Please delete this and all copies of this
>> email from your system. Any opinions either expressed or implied in this
>> email and all attachments, are those of its author only, and do not
>> necessarily reflect those of Nuna Health, Inc.
>
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>

-- 
*DISCLAIMER:* The contents of this email, including any attachments, may 
contain information that is confidential, proprietary in nature, protected 
health information (PHI), or otherwise protected by law from disclosure, 
and is solely for the use of the intended recipient(s). If you are not the 
intended recipient, you are hereby notified that any use, disclosure or 
copying of this email, including any attachments, is unauthorized and 
strictly prohibited. If you have received this email in error, please 
notify the sender of this email. Please delete this and all copies of this 
email from your system. Any opinions either expressed or implied in this 
email and all attachments, are those of its author only, and do not 
necessarily reflect those of Nuna Health, Inc.

Re: LeaseExpiredExceptions and temp side effect files

Posted by Josh Wills <jw...@cloudera.com>.
Hey Everett,

Initial thought-- there are lots of reasons for lease expired exceptions,
and they're usually more symptomatic of other problems in the pipeline. Are
you sure none of the jobs in the Crunch pipeline on the non-SSD instances
are failing for some other reason? I'd be surprised if no other errors
showed up in the app master, although there are reports of some weirdness
around LeaseExpireds when writing to S3-- but you're not doing that here,
right?

J

On Fri, Aug 14, 2015 at 2:10 PM, Everett Anderson <ev...@nuna.com> wrote:

> Hi,
>
> I recently started trying to run our Crunch pipeline on more data and have
> been trying out different AWS instance types in anticipation of our storage
> and compute needs.
>
> I was using EMR 3.8 (so Hadoop 2.4.0) with Crunch 0.12 (patched with the
> CRUNCH-553 <https://issues.apache.org/jira/browse/CRUNCH-553> fix).
>
> Our pipeline finishes fine in these cluster configurations:
>
>    - 50 c3.4xlarge Core, 0 Task
>    - 10 c3.8xlarge Core, 0 Task
>    - 25 c3.8xlarge Core, 0 Task
>
> However, it always fails on the same data when using 10 cc2.8xlarge Core
> instances.
>
> The biggest obvious hardware difference is that the cc2.8xlarges use hard
> disks instead of SSDs.
>
> While it's a little hard to track down the exact originating failure, I
> think it's from errors like:
>
> 2015-08-13 21:34:38,379 ERROR [IPC Server handler 24 on 45711]
> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task:
> attempt_1439499407003_0028_r_000153_1 - exited :
> org.apache.crunch.CrunchRuntimeException:
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> No lease on
> /tmp/crunch-970849245/p662/output/_temporary/1/_temporary/attempt_1439499407003_out7_0028_r_000153_1/out7-r-00153:
> File does not exist. Holder
> DFSClient_attempt_1439499407003_0028_r_000153_1_609888542_1 does not have
> any open files.
>
> Those paths look like these side effect files
> <https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapred/FileOutputFormat.html#getWorkOutputPath(org.apache.hadoop.mapred.JobConf)>
> .
>
> Would Crunch have generated applications that depend on side effect paths
> as input across MapReduce applications and something in HDFS is cleaning up
> those paths, unaware of the higher level dependencies? AWS configures
> Hadoop differently for each instance type, and might have more aggressive
> cleanup settings on HDs, though this is a very uninformed hypothesis.
>
> A sample full log is attached.
>
> Thanks for any guidance!
>
> - Everett
>
>
> *DISCLAIMER:* The contents of this email, including any attachments, may
> contain information that is confidential, proprietary in nature, protected
> health information (PHI), or otherwise protected by law from disclosure,
> and is solely for the use of the intended recipient(s). If you are not the
> intended recipient, you are hereby notified that any use, disclosure or
> copying of this email, including any attachments, is unauthorized and
> strictly prohibited. If you have received this email in error, please
> notify the sender of this email. Please delete this and all copies of this
> email from your system. Any opinions either expressed or implied in this
> email and all attachments, are those of its author only, and do not
> necessarily reflect those of Nuna Health, Inc.




-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>