Posted to user@giraph.apache.org by Eli Reisman <ap...@gmail.com> on 2014/02/09 19:49:04 UTC

Re: DataStreamer Exception - LeaseExpiredException

Thanks Kristen,

I see from another email that someone else is having trouble with this too. The
problem is a shim that compensates for the extra task (the Application Master)
that non-YARN Hadoop does not use. I think that in the time since my original
pre-Hadoop-2.2 implementation, some things have changed and the old
shim isn't working any more.

The issue is that I think the GIRAPH-747 solution works for us but will
break the non-YARN implementation. We need to sit down and figure out a better
solution, and so far I haven't had time. Thanks for keeping the conversation
going. I'm sure one of you, Mohammed, or I will sit down and code up a
better fix soon.

Thanks,

Eli
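
A note for readers following the race discussed in this thread: one common way to avoid deleting output that a writer still holds open is to make cleanup wait on an explicit "all writers have closed" signal. The Java sketch below illustrates only that ordering idea. It is not the GIRAPH-747 patch and not Giraph code: the class and method names are hypothetical, threads stand in for worker tasks, a local temporary directory stands in for the HDFS _temporary directory, and it needs Java 8 or later.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.stream.Stream;

/** Hypothetical illustration only; none of these names exist in Giraph. */
final class OrderedCleanupSketch {

    /**
     * Runs `workers` writer tasks, waits until each one has finished with its
     * part file, and only then deletes the temporary directory. Returns the
     * number of part files present when cleanup starts.
     */
    static int runJob(int workers) {
        try {
            return runJobChecked(workers);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    private static int runJobChecked(int workers)
            throws IOException, InterruptedException {
        Path tempDir = Files.createTempDirectory("sketch_temporary");
        CountDownLatch writersDone = new CountDownLatch(workers);
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        for (int i = 0; i < workers; i++) {
            final Path part = tempDir.resolve(String.format("part-m-%05d", i));
            pool.execute(() -> {
                try {
                    // Files.write opens, writes, and closes the file before returning.
                    Files.write(part, "vertex output\n".getBytes(StandardCharsets.UTF_8));
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                } finally {
                    // Signal even on failure so the "master" side cannot hang forever.
                    writersDone.countDown();
                }
            });
        }

        // "Master" side: block until every writer has signalled before touching tempDir.
        if (!writersDone.await(30, TimeUnit.SECONDS)) {
            pool.shutdownNow();
            throw new IllegalStateException("writers did not finish in time");
        }
        pool.shutdown();

        int written;
        try (Stream<Path> parts = Files.list(tempDir)) {
            written = (int) parts.count();
        }

        // Cleanup runs strictly after the barrier; children are deleted before the parent.
        try (Stream<Path> walk = Files.walk(tempDir)) {
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
        return written;
    }
}
```

Calling OrderedCleanupSketch.runJob(3) writes three part files and removes the directory only after the latch reaches zero. In a real distributed job the signal would have to come from a shared coordination service rather than an in-process latch, so treat this purely as a picture of the ordering.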


On Thu, Jan 30, 2014 at 1:29 PM, Kristen Hardwick <kh...@spryinc.com> wrote:

> Eli, Chuan,
>
> Thanks for taking a look into my issue! GIRAPH-747 definitely seems to
> address the exact issue I'm running into, even down to the class I thought
> was causing the problem. I created a bug ticket a few days ago (GIRAPH-828)
> which has the details of my environment, including the command I'm running
> and the full logs where the problem occurs. I just linked my ticket to
> GIRAPH-747, but if it makes sense for me to delete mine instead, please let
> me know.
>
> I will definitely put a comment in there so that people watching it are
> aware of Chuan's patch. Avery Ching was asking me for more information in
> the comments, so he might be able to help validate the solution.
>
> Thanks again,
> Kristen
>
>
> On Wed, Jan 29, 2014 at 9:35 PM, Eli Reisman <ap...@gmail.com> wrote:
>
>> Sorry, I do think this will solve it, and it makes sense that people are
>> encountering the problem when using -w 1. I'll get this reviewed and
>> committed (patch 747).
>>
>> Mohammed, any objections?
>>
>>
>>
>> On Wed, Jan 29, 2014 at 6:22 PM, Chuan Lei <le...@gmail.com> wrote:
>>
>>> Hi Kristen,
>>>
>>> I had this problem before and submitted a Jira ticket (GIRAPH-747) with
>>> a patch. You may want to take a look at it. Hope that can solve your problem.
>>>
>>> Thanks,
>>> Chuan
>>>
>>> On Jan 29, 2014, at 9:16 PM, Eli Reisman <ap...@gmail.com>
>>> wrote:
>>>
>>> > Hi Kristen, thanks for posting this. During the port to YARN I
>>> encountered some race problems with the output sequence. The YARN
>>> implementation has to handle this a bit differently than the non-YARN and
>>> although we got it figured out at the time, I haven't really looked at it
>>> in many months and non-YARN Giraph has evolved quickly since then. It
>>> wouldn't shock me if there is trouble here; if I recall, the solution seemed
>>> a bit delicate.
>>> >
>>> > If you have some ideas for a patch, I'd be happy to review. I am pretty
>>> strapped for time right now, but if you post a ticket to the Giraph JIRA and
>>> no one else attempts a patch, I'm sure either Mohammed or I will take a
>>> swipe at it eventually. Thanks!
>>> >
>>> > Eli
>>> >
>>> >
>>> > On Mon, Jan 20, 2014 at 9:01 AM, Kristen Hardwick <
>>> khardwick@spryinc.com> wrote:
>>> > Sorry to bug everyone again, but does anyone have any ideas on this?
>>> Please let me know if I'm leaving out any crucial information that could
>>> get me some help.
>>> >
>>> > Thanks!
>>> > Kristen
>>> >
>>> >
>>> > On Mon, Jan 13, 2014 at 5:48 PM, Kristen Hardwick <
>>> khardwick@spryinc.com> wrote:
>>> > Hi all,
>>> >
>>> > I had a very productive day today getting this stuff figured out.
>>> Unfortunately, it appears that I've stumbled onto a possible race condition
>>> during the cleanup step of the code for the application.
>>> >
>>> > I put some information at http://pastebin.com/Qswb98dq that explains why
>>> I think it is a race condition. Basically, I tried the exact same command
>>> twice, making no other changes: the first time it failed and the second
>>> time it succeeded.
>>> >
>>> > This makes me think that the
>>> LeaseExpiredException/DataStreamerException is thrown because the files
>>> have been cleaned up just before they are needed. This possibly happens
>>> inside the BspServiceMaster, but I am not at all sure about that.
>>> >
>>> > Is anyone already aware of this? Should I log it as a bug? I do have
>>> access to (DEBUG) logs of both the successful and failed attempts if anyone
>>> wants to see them.
>>> >
>>> > Thanks,
>>> > Kristen Hardwick
>>> >
>>> >
>>> > On Mon, Jan 13, 2014 at 11:03 AM, Kristen Hardwick <
>>> khardwick@spryinc.com> wrote:
>>> > Hi Avery (or anyone else that knows),
>>> >
>>> > Could you please give me some details that would help me find the past
>>> threads that might address this issue? I searched Google with various
>>> combinations of "giraph datastreamer exception yarn lease expired
>>> zookeeper" and didn't really come up with anything that seemed relevant.
>>> >
>>> > Is it possible that it's just a memory issue on my end? I'm running
>>> inside a VM - a single node cluster with 8 GB of memory allocated to it.
>>> Could that have anything to do with it? Right now I'm investigating the
>>> code to try to lower the amount of memory allocated to the containers.
>>> >
>>> > Thanks,
>>> > Kristen
>>> >
>>> >
>>> > On Fri, Jan 10, 2014 at 8:45 PM, Avery Ching <ac...@apache.org>
>>> wrote:
>>> > This looks more like the Zookeeper/YARN issues mentioned in the past.
>>>  Unfortunately, I do not have a YARN instance to test this with.  Does
>>> anyone else have any insights here?
>>> >
>>> >
>>> > On 1/10/14 1:48 PM, Kristen Hardwick wrote:
>>> >> Hi all, I'm requesting help again! I'm trying to get this
>>> SimpleShortestPathsComputation example working, but I'm stuck again. Now
>>> the job begins to run and seems to work until the final step (it performs 3
>>> supersteps), but the overall job is failing.
>>> >>
>>> >> In the master, among other things, I see:
>>> >>
>>> >> ...
>>> >> 14/01/10 15:04:17 INFO master.MasterThread: setup: Took 0.87 seconds.
>>> >> 14/01/10 15:04:17 INFO master.MasterThread: input superstep: Took
>>> 0.708 seconds.
>>> >> 14/01/10 15:04:17 INFO master.MasterThread: superstep 0: Took 0.158
>>> seconds.
>>> >> 14/01/10 15:04:17 INFO master.MasterThread: superstep 1: Took 0.344
>>> seconds.
>>> >> 14/01/10 15:04:17 INFO master.MasterThread: superstep 2: Took 0.064
>>> seconds.
>>> >> 14/01/10 15:04:17 INFO master.MasterThread: shutdown: Took 0.162
>>> seconds.
>>> >> 14/01/10 15:04:17 INFO master.MasterThread: total: Took 2.31 seconds.
>>> >> 14/01/10 15:04:17 INFO yarn.GiraphYarnTask: Master is ready to commit
>>> final job output data.
>>> >> 14/01/10 15:04:18 INFO yarn.GiraphYarnTask: Master has committed the
>>> final job output data.
>>> >> ...
>>> >>
>>> >> To me, that looks promising - like the job was successful. However,
>>> in the WORKER_ONLY containers, I see these things:
>>> >>
>>> >> ...
>>> >> 14/01/10 15:04:17 INFO graph.GraphTaskManager: cleanup: Starting for
>>> WORKER_ONLY
>>> >> 14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and
>>> unprocessed event
>>> (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_applicationAttemptsDir/0/_superstepDir/1/_addressesAndPartitions,
>>> type=NodeDeleted, state=SyncConnected)
>>> >> 14/01/10 15:04:17 INFO worker.BspServiceWorker: processEvent :
>>> partitionExchangeChildrenChanged (at least one worker is done sending
>>> partitions)
>>> >> 14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and
>>> unprocessed event
>>> (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_applicationAttemptsDir/0/_superstepDir/1/_superstepFinished,
>>> type=NodeDeleted, state=SyncConnected)
>>> >> 14/01/10 15:04:17 INFO netty.NettyClient: stop: reached wait
>>> threshold, 1 connections closed, releasing NettyClient.bootstrap resources
>>> now.
>>> >> 14/01/10 15:04:17 INFO worker.BspServiceWorker: processEvent: Job
>>> state changed, checking to see if it needs to restart
>>> >> 14/01/10 15:04:17 INFO bsp.BspService: getJobState: Job state already
>>> exists
>>> (/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_masterJobState)
>>> >> 14/01/10 15:04:17 INFO yarn.GiraphYarnTask: [STATUS: task-1]
>>> saveVertices: Starting to save 2 vertices using 1 threads
>>> >> 14/01/10 15:04:17 INFO worker.BspServiceWorker: saveVertices:
>>> Starting to save 2 vertices using 1 threads
>>> >> 14/01/10 15:04:17 INFO worker.BspServiceWorker: processEvent: Job
>>> state changed, checking to see if it needs to restart
>>> >> 14/01/10 15:04:17 INFO bsp.BspService: getJobState: Job state already
>>> exists
>>> (/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_masterJobState)
>>> >> 14/01/10 15:04:17 INFO bsp.BspService: getJobState: Job state path is
>>> empty! -
>>> /_hadoopBsp/giraph_yarn_application_1389300168420_0024/_masterJobState
>>> >> 14/01/10 15:04:17 ERROR zookeeper.ClientCnxn: Error while calling
>>> watcher
>>> >> java.lang.NullPointerException
>>> >>         at java.io.StringReader.<init>(StringReader.java:50)
>>> >>         at org.json.JSONTokener.<init>(JSONTokener.java:66)
>>> >>         at org.json.JSONObject.<init>(JSONObject.java:402)
>>> >>         at
>>> org.apache.giraph.bsp.BspService.getJobState(BspService.java:716)
>>> >>         at
>>> org.apache.giraph.worker.BspServiceWorker.processEvent(BspServiceWorker.java:1563)
>>> >>         at
>>> org.apache.giraph.bsp.BspService.process(BspService.java:1095)
>>> >>         at
>>> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
>>> >>         at
>>> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
>>> >> 14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and
>>> unprocessed event
>>> (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_vertexInputSplitsAllReady,
>>> type=NodeDeleted, state=SyncConnected)
>>> >> 14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and
>>> unprocessed event
>>> (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_applicationAttemptsDir/0/_superstepDir/2/_addressesAndPartitions,
>>> type=NodeDeleted, state=SyncConnected)
>>> >> 14/01/10 15:04:17 INFO worker.BspServiceWorker: processEvent :
>>> partitionExchangeChildrenChanged (at least one worker is done sending
>>> partitions)
>>> >> 14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and
>>> unprocessed event
>>> (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_applicationAttemptsDir/0/_superstepDir/2/_superstepFinished,
>>> type=NodeDeleted, state=SyncConnected)
>>> >> ...
>>> >> 14/01/10 15:04:17 WARN hdfs.DFSClient: DataStreamer Exception
>>> >>
>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>> No lease on
>>> /user/spry/Shortest/_temporary/1/_temporary/attempt_1389300168420_0024_m_000001_1/part-m-00001:
>>> File does not exist. Holder DFSClient_NONMAPREDUCE_-643344145_1 does not
>>> have any open files.
>>> >>         at
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2755)
>>> >>         at
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2567)
>>> >>         at
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2480)
>>> >>         at
>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:555)
>>> >>         at
>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:387)
>>> >> ...
>>> >>
>>> >> I apologize for the wall of error messages, but I tried to leave in at
>>> least some of the parts that might be useful. I put the entire YARN log
>>> here: http://tny.cz/af229738
>>> >>
>>> >> Has anyone ever seen this before? This is the command I'm using to
>>> run:
>>> >>
>>> >> hadoop jar
>>> giraph-core/target/giraph-1.1.0-SNAPSHOT-for-hadoop-2.2.0-jar-with-dependencies.jar
>>> org.apache.giraph.GiraphRunner -Dgiraph.SplitMasterWorker=false
>>> -Dgiraph.zkList="localhost:2181" -Dgiraph.zkSessionMsecTimeout=600000
>>> -Dgiraph.useInputSplitLocality=false
>>> org.apache.giraph.examples.SimpleShortestPathsComputation -vif
>>> org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat
>>> -vip /user/spry/input -vof
>>> org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op
>>> /user/spry/Shortest -w 1
>>> >>
>>> >> My setup is still the same as the other email if you saw it:
>>> >>
>>> >> I compiled Giraph with this command, and everything built
>>> successfully except "Apache Giraph Distribution" which it doesn't seem like
>>> I need:
>>> >>
>>> >> mvn -Phadoop_yarn -Dhadoop.version=2.2.0 -DskipTests clean package
>>> >>
>>> >> I am running with the following components:
>>> >>
>>> >> Single node cluster
>>> >> Giraph 1.1
>>> >> Hadoop 2.2.0 (Hortonworks)
>>> >> Java 1.7.0_45
>>> >>
>>> >> Thanks in advance,
>>> >> -Kristen Hardwick
>>> >>
>>> >
>>> >
>>> >
>>> >
>>> >
>>>
>>>
>>
>
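A note on the NullPointerException in the quoted worker log: the trace shows it being thrown from the java.io.StringReader constructor, reached through org.json.JSONObject from BspService.getJobState, immediately after the "Job state path is empty!" message. That pattern is consistent with a null string being handed to the JSON parser once the job state znode had no data, although the thread does not confirm this. A minimal sketch of a guard for that case follows. The class and method names are hypothetical and not part of Giraph, the sketch only converts raw bytes and leaves JSON parsing to the caller, and it uses java.util.Optional, so it needs Java 8 or later.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;

/** Hypothetical helper; not part of Giraph's BspService. */
final class JobStateGuard {

    /**
     * Converts raw znode data into a JSON string, or returns Optional.empty()
     * when the node carried no data (for example because it was already
     * deleted). A caller would build its JSON object only when a value is
     * present, instead of passing null to the parser.
     */
    static Optional<String> jobStateJson(byte[] znodeData) {
        if (znodeData == null || znodeData.length == 0) {
            return Optional.empty();
        }
        return Optional.of(new String(znodeData, StandardCharsets.UTF_8));
    }
}
```

With a guard like this, a watcher that fires during shutdown could treat an empty result as "no state to process" rather than failing inside the ZooKeeper event thread. Whether skipping the event is actually safe for Giraph's restart logic is an open question that this sketch does not answer.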