
Posted to user@mesos.apache.org by Brian Topping <br...@gmail.com> on 2015/05/06 17:32:30 UTC

Debugging hadoop-mesos

Hi all, I'm happy to report that I'm very close to getting Hadoop 2.6.0-cdh5.4.0 integrated against Mesos 0.22.1 with the hadoop-mesos 0.10 code on GitHub. I'm hoping someone might have a few minutes to parse what I've got here and suggest something to try.

https://gist.github.com/briantopping/0dfd0777ff4ce5a81219 hopefully has all the necessary data: the console output of the client run, the Mesos master and slave console output, the XML configuration of the JT, and the output it generated. Please let me know if I've left something out.

I iterated a few times getting all the errors from missing paths or libraries sorted out, but the example client ultimately just sits waiting forever at "map 0% reduce 0%".

Any input kindly appreciated!

Brian

RESOLVED -- Re: Debugging hadoop-mesos

Posted by Brian Topping <br...@gmail.com>.

OK, I stared at the code for a long time and came up with https://github.com/mesos/hadoop/pull/55. It probably should have been two PRs, one for the cleanups and method shuffling and another for the meat of the changes; sorry about that. The PR itself should have a decent description. Please feel free to ask questions or critique it in the PR.

It seems like the build needs help with unit testing and the release process. I think it will need a CI build that can build against various versions of CDH and assign the version to an artifact classifier before the artifacts can be easily managed on Maven Central. I'm happy to pitch in on these if anyone is interested. Testing this kind of code is a little tricky, but it generally results in better patterns when it's all finished.

Thanks for all of your help!! I'm looking forward to starting what I came to this stack to work on :)

Brian

> On May 8, 2015, at 3:06 PM, Brian Topping <br...@gmail.com> wrote:
> 
> Indeed, this was all that was left to get jobs working, thanks!
> 
> The last thing I need to do for initial setup is get rid of these messages, which show up by the thousands at about three or four per second. I'm running against 2.6.0-mr1-cdh5.4.0; maybe there was a change to the API semantics.
> 
>> 2015-05-08 03:33:24,421 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
>> 2015-05-08 03:33:24,724 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
>> 2015-05-08 03:33:25,028 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
>> 2015-05-08 03:33:25,331 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
>> 2015-05-08 03:33:25,636 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
>> 2015-05-08 03:33:25,940 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
>> 2015-05-08 03:33:26,243 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
>> 2015-05-08 03:33:26,546 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
>> 2015-05-08 03:33:26,850 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
>> 2015-05-08 03:33:27,153 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
>> 2015-05-08 03:33:27,456 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.

Re: Debugging hadoop-mesos

Posted by Brian Topping <br...@gmail.com>.

Indeed, this was all that was left to get jobs working, thanks!

The last thing I need to do for initial setup is get rid of these messages, which show up by the thousands at about three or four per second. I'm running against 2.6.0-mr1-cdh5.4.0; maybe there was a change to the API semantics.

> 2015-05-08 03:33:24,421 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
> 2015-05-08 03:33:24,724 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
> 2015-05-08 03:33:25,028 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
> 2015-05-08 03:33:25,331 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
> 2015-05-08 03:33:25,636 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
> 2015-05-08 03:33:25,940 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
> 2015-05-08 03:33:26,243 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
> 2015-05-08 03:33:26,546 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
> 2015-05-08 03:33:26,850 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
> 2015-05-08 03:33:27,153 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
> 2015-05-08 03:33:27,456 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
> 
> On May 8, 2015, at 2:47 PM, haosdent <ha...@gmail.com> wrote:
> 
> I think you could export HADOOP_LOG_DIR=/tmp to point the logs at the temp directory, and then try again.
> 
> On Fri, May 8, 2015 at 3:43 PM, Brian Topping <brian.topping@gmail.com> wrote:
> Mesos runs as root; Hadoop runs as a separate user.
> 
>> On May 8, 2015, at 2:41 PM, haosdent <haosdent@gmail.com> wrote:
>> 
>> Do you run everything as root?
>> 
>> On Fri, May 8, 2015 at 3:38 PM, haosdent <haosdent@gmail.com> wrote:
>> It seems you don't have permission for this directory:
>> 
>> java.io.IOException: Could not create job user log directory: file:/usr/lib/hadoop/logs/userlogs/job_201505080220_0001
>> 
>> at javax.security.auth.Subject.doAs(Subject.java:415)
>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
>> 
>> 
>> On Fri, May 8, 2015 at 3:32 PM, Brian Topping <brian.topping@gmail.com> wrote:
>> Thanks Haosdent, I've updated https://gist.github.com/briantopping/311960f8e5454dbe9aab with the output logs of what I am currently seeing. I've edited them for size; the message "INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060" appeared a few thousand times in the logs. The configuration I have is probably still broken. Port 50060 is a Jetty port that returns a Cloudera string when I telnet to it.
>> 
>> The errors I saw below were apparently the result of building against the older version of CDH. When I updated the hadoop-mesos POM to match my deployment version, the incorrectly calculated "slots" problem from my previous message was resolved.
>> 
>> My current problem is a Hadoop logging problem that has nothing to do with Mesos, so I didn't post. I set hadoop.log.dir=/var/log/hadoop in /etc/hadoop/conf.pseudo.mr1/log4j.properties, but it didn't make any difference. Just getting back into it now.
>> 
>>> On May 8, 2015, at 1:56 PM, haosdent <haosdent@gmail.com> wrote:
>>> 
>>> Could you post the logs from the executors that run the JobTracker and TaskTrackers? They would help find the cause of this problem.
>>> 
>>> On Fri, May 8, 2015 at 3:05 AM, Brian Topping <brian.topping@gmail.com> wrote:
>>> I think there's something weird here:
>>>>   cpus: offered 2.0 needed at least 1.0
>>>>   mem : offered 1724.0 needed at least 1024.0
>>>>   disk: offered 44124.0 needed at least 1024.0
>>>>   ports:  at least 2 (sufficient)
>>> 
>>> Am I misreading this? All of the requirements seem to be met.
>>> 
>>> Presumably it's this code from o.a.h.mapred.ResourcePolicyVariable:
>>> 
>>>> int slots = mapSlotsMax + reduceSlotsMax;
>>>> slots = (int) Math.min(slots, (cpus - containerCpus) / slotCpus);
>>>> slots = (int) Math.min(slots, (mem - containerMem) / slotMem);
>>>> slots = (int) Math.min(slots, (disk - containerDisk) / slotDisk);
>>>> 
>>>> // Is this offer too small for even the minimum slots?
>>>> if (slots < 1) {
>>>>   return false;
>>>> }
>>> 
>>> Not exactly sure what this is doing.
>>> 
>>> Sorry for the noise.
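
For anyone following along, the slot arithmetic quoted above can be tried in isolation. This is a minimal sketch, not the hadoop-mesos source: the class and method names are made up, and the container and per-slot sizes are hypothetical rather than taken from the configuration in this thread. It only illustrates how an offer whose totals look sufficient can still truncate to zero slots once a container overhead is subtracted.

```java
// Hypothetical stand-alone sketch of the slot arithmetic; not the hadoop-mesos source.
final class SlotSketch {

    /** Number of slots that fit in an offer after reserving the container overhead. */
    static int slotsFor(int maxSlots,
                        double cpus, double mem, double disk,
                        double containerCpus, double containerMem, double containerDisk,
                        double slotCpus, double slotMem, double slotDisk) {
        int slots = maxSlots;
        // Each cast truncates toward zero, so 0.68 of a slot becomes 0 slots.
        slots = (int) Math.min(slots, (cpus - containerCpus) / slotCpus);
        slots = (int) Math.min(slots, (mem - containerMem) / slotMem);
        slots = (int) Math.min(slots, (disk - containerDisk) / slotDisk);
        return slots;
    }

    public static void main(String[] args) {
        // Offer values come from the log above; container and slot sizes are assumptions.
        int tight = slotsFor(5, 2.0, 1724.0, 44124.0, 0.5, 1024.0, 1024.0, 1.0, 1024.0, 1024.0);
        int roomy = slotsFor(5, 2.0, 1724.0, 44124.0, 0.5, 512.0, 1024.0, 1.0, 1024.0, 1024.0);
        System.out.println("containerMem=1024 -> slots=" + tight);
        System.out.println("containerMem=512  -> slots=" + roomy);
    }
}
```

With these made-up numbers the first call yields 0 and the second yields 1, so a "slots < 1" check would reject the first offer even though 1724 MB was offered and 1024 MB was "needed".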
>>> 
>>>> 
>>>> On May 7, 2015, at 6:32 PM, Brian Topping <brian.topping@gmail.com> wrote:
>>>> 
>>>> Presumably https://gist.github.com/briantopping/311960f8e5454dbe9aab has some more of the information necessary at this point. Sorry for the omission.
>>>> 
>>>>> On May 7, 2015, at 6:05 PM, Tom Arnfeld <tom@duedil.com> wrote:
>>>>> 
>>>>> Hi Brian,
>>>>> 
>>>>> At this point you should see the TT attempting to be launched via Mesos. The "launched but no heartbeat yet" count tells us that the framework has accepted resources for 4 slots but the TT hasn't actually come up yet.
>>>>> 
>>>>> Do you see the task in your Mesos cluster UI, and is there anything interesting in the task logs?
>>>>> 
>>>>> --
>>>>> 
>>>>> Tom Arnfeld
>>>>> Developer // DueDil
>>>>> 
>>>>> (+44) 7525940046
>>>>> 25 Christopher Street, London, EC2A 2BS
>>>>> 
>>>>> 
>>>>> On Thu, May 7, 2015 at 12:01 PM, Brian Topping <brian.topping@gmail.com> wrote:
>>>>> 
>>>>> Thanks guys, this was helpful. I started the job tracker as a service, but apparently I never started the task tracker (or it failed to start and I didn't notice). I started it after Haosdent's message, but wasn't able to see any difference and I kept poking around.
>>>>> 
>>>>> After I made some changes, the VM wouldn't boot, so my OCD got the better of me and I reinstalled everything from scratch. There are just too many moving parts to hassle you guys with an imperfect install on my end.
>>>>> 
>>>>> This time through, I felt a lot more confident using the Mesosphere RPMs, but I couldn't find the best way to get things launched. https://docs.mesosphere.com/reference/packages/ has a Last-Modified of Fri, 01 May 2015 18:46:10 GMT (one week ago), but the RHEL 6 RPMs don't have any init.d service descriptions as the packages page would indicate. For now, I just launched them manually, but I would like to get the machine to load everything on boot as services.
>>>>> 
>>>>> At this point, I have tested Mesos with:
>>>>> 
>>>>> 	mesos-execute --master="localhost:5050" --name="test-exec" --command="sleep 10"
>>>>> 
>>>>> The only problem there is that "localhost" doesn't seem to be good enough for my install; it needs to be the FQDN, but it works and the job flows through the UI.
>>>>> 
>>>>> Now, back to a Hadoop job. When I try the job now, the logs show the following stream of repeated messages:
>>>>> 
>>>>>> 2015-05-07 17:52:53,124 INFO org.apache.hadoop.mapred.ResourcePolicy: Satisfied map and reduce slots needed.
>>>>>> 2015-05-07 17:52:53,340 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
>>>>>> [Repeated a few times a second for five seconds]
>>>>>> 2015-05-07 17:49:08,914 INFO org.apache.hadoop.mapred.ResourcePolicy: JobTracker Status
>>>>>>       Pending Map Tasks: 4
>>>>>>    Pending Reduce Tasks: 1
>>>>>>       Running Map Tasks: 0
>>>>>>    Running Reduce Tasks: 0
>>>>>>          Idle Map Slots: 0
>>>>>>       Idle Reduce Slots: 0
>>>>>>      Inactive Map Slots: 4 (launched but no hearbeat yet)
>>>>>>   Inactive Reduce Slots: 1 (launched but no hearbeat yet)
>>>>>>        Needed Map Slots: 0
>>>>>>     Needed Reduce Slots: 0
>>>>>>      Unhealthy Trackers: 0
>>>>> 
>>>>> This looks close.
>>>>> 
>>>>> What's the best way to get a JDWP port set up to break in this code (i.e. learning to fish...)?
>>>>> 
>>>>> best, Brian
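
On the JDWP question, here is a minimal sketch, assuming the JobTracker JVM can be handed extra options (for example through HADOOP_JOBTRACKER_OPTS in hadoop-env.sh, which is an assumption about this install and is not confirmed anywhere in the thread). The -agentlib:jdwp option is the standard JVM debug agent; port 5005 and the class name are arbitrary choices for illustration.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper: builds a JDWP agent option and checks whether a JVM has one.
final class JdwpSketch {

    /** Builds the JVM option that opens a JDWP debug socket on the given port. */
    static String agentOption(int port, boolean suspend) {
        return "-agentlib:jdwp=transport=dt_socket,server=y,suspend="
                + (suspend ? "y" : "n") + ",address=" + port;
    }

    /** Returns true if any JVM argument looks like a JDWP agent option. */
    static boolean hasJdwpAgent(List<String> jvmArgs) {
        for (String arg : jvmArgs) {
            if (arg.startsWith("-agentlib:jdwp") || arg.startsWith("-Xrunjdwp")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Where this option goes (e.g. HADOOP_JOBTRACKER_OPTS) is an assumption.
        System.out.println(agentOption(5005, false));
        // Report whether the JVM running this sketch was itself started with JDWP.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("JDWP enabled in this JVM: " + hasJdwpAgent(jvmArgs));
    }
}
```

Using suspend=y instead makes the JVM wait for a debugger to attach before running, which can help when the code of interest runs at startup.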
>>>>> 
>>>>> 
>>>>>> On May 7, 2015, at 12:11 PM, Adam Bordelon <adam@mesosphere.io> wrote:
>>>>>> 
>>>>>> From the mesos-master log and the JT log, it doesn't look like the MesosScheduler ever registered with Mesos, which should mean that it wouldn't start any TTs or map/reduce tasks. However, your `ps` output does seem to show a tasktracker running. Did you start that yourself (or automatically as a system service)?
>>>>>> 
>>>>>> On Wed, May 6, 2015 at 9:32 AM, haosdent <haosdent@gmail.com> wrote:
>>>>>> Did you start the TaskTracker successfully?
>>>>>> 
>>>>>> On Wed, May 6, 2015 at 11:32 PM, Brian Topping <brian.topping@gmail.com> wrote:
>>>>>> Hi all, I'm happy to report that I'm very close to getting Hadoop 2.6.0-cdh5.4.0 integrated against Mesos 0.22.1 with the hadoop-mesos 0.10 code on GitHub. I'm hoping someone might have a few minutes to parse what I've got here and suggest something to try.
>>>>>> 
>>>>>> https://gist.github.com/briantopping/0dfd0777ff4ce5a81219 hopefully has all the necessary data: the console output of the client run, the Mesos master and slave console output, the XML configuration of the JT, and the output it generated. Please let me know if I've left something out.
>>>>>> 
>>>>>> I iterated a few times getting all the errors from missing paths or libraries sorted out, but the example client ultimately just sits waiting forever at "map 0% reduce 0%".
>>>>>> 
>>>>>> Any input kindly appreciated!
>>>>>> 
>>>>>> Brian

Re: Debugging hadoop-mesos

Posted by haosdent <ha...@gmail.com>.

I think you could export HADOOP_LOG_DIR=/tmp to point the logs at the temp directory, and then try again.
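
A quick way to check whether this helps is to probe the directory as the user that runs the TaskTracker. The following is a minimal sketch under assumptions: the class name is made up, the fallback path is copied from the exception quoted in this thread, and the probe only tests whether a subdirectory can be created under <log dir>/userlogs, which is what the failing call appeared to need.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical diagnostic; run it as the same user as the TaskTracker.
final class LogDirCheck {

    /**
     * Tries to create and then remove a probe directory under logDir/userlogs.
     * Note: on success the userlogs directory itself is left in place.
     */
    static boolean canCreateUserLogs(Path logDir) {
        Path probe = logDir.resolve("userlogs").resolve("probe_" + System.nanoTime());
        try {
            Files.createDirectories(probe);
            Files.delete(probe);
            return true;
        } catch (IOException | SecurityException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Fallback path is taken from the IOException in this thread (an assumption).
        String configured = System.getenv("HADOOP_LOG_DIR");
        Path logDir = Paths.get(configured != null ? configured : "/usr/lib/hadoop/logs");
        System.out.println(logDir + " usable for job user logs: " + canCreateUserLogs(logDir));
    }
}
```

If the probe fails for the default directory but succeeds with HADOOP_LOG_DIR=/tmp, that points at directory ownership or permissions rather than at Mesos.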

On Fri, May 8, 2015 at 3:43 PM, Brian Topping <br...@gmail.com> wrote:

> Mesos runs as root; Hadoop runs as a separate user.


-- 
Best Regards,
Haosdent Huang

Re: Debugging hadoop-mesos

Posted by Brian Topping <br...@gmail.com>.

Mesos runs as root; Hadoop runs as a separate user.

> On May 8, 2015, at 2:41 PM, haosdent <ha...@gmail.com> wrote:
> 
> You run everything in root?
> 
> On Fri, May 8, 2015 at 3:38 PM, haosdent <haosdent@gmail.com> wrote:
> Seems you don't have permission for this directory:
> 
> java.io.IOException: Could not create job user log directory: file:/usr/lib/hadoop/logs/userlogs/job_201505080220_0001
> 
> at javax.security.auth.Subject.doAs(Subject.java:415)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
> 
> 
> On Fri, May 8, 2015 at 3:32 PM, Brian Topping <brian.topping@gmail.com> wrote:
> Thanks Haosdent, I've updated https://gist.github.com/briantopping/311960f8e5454dbe9aab with the output logs of what I am currently seeing. I've edited them for size; the message "INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060" appeared a few thousand times in the logs. The configuration I have is probably still broken; 50060 is a Jetty port that returns a Cloudera string when telnetting to it.
> 
> The errors I saw below were apparently the result of building against the older version of CDH. When I updated the hadoop-mesos POM to match my deployment version, the incorrectly calculated "slots" problem in my previous message was resolved.
> 
> My current problem is a Hadoop logging problem and nothing to do with Mesos, so I didn't post. I changed hadoop.log.dir=/var/log/hadoop in /etc/hadoop/conf.pseudo.mr1/log4j.properties, but it didn't make any difference. Just getting back into it now.
> 
>> On May 8, 2015, at 1:56 PM, haosdent <haosdent@gmail.com> wrote:
>> 
>> Could you post the logs from the executors that run the jobtracker and tasktrackers? It would help to find the cause of this problem.
>> 
>> On Fri, May 8, 2015 at 3:05 AM, Brian Topping <brian.topping@gmail.com> wrote:
>> I think there's something weird here:
>>>   cpus: offered 2.0 needed at least 1.0
>>>   mem : offered 1724.0 needed at least 1024.0
>>>   disk: offered 44124.0 needed at least 1024.0
>>>   ports:  at least 2 (sufficient)
>> 
>> Am I misreading this? All of the requirements seem to be met.
>> 
>> Presumably it's this code from o.a.h.mapred.ResourcePolicyVariable:
>> 
>>> int slots = mapSlotsMax + reduceSlotsMax;
>>> slots = (int) Math.min(slots, (cpus - containerCpus) / slotCpus);
>>> slots = (int) Math.min(slots, (mem - containerMem) / slotMem);
>>> slots = (int) Math.min(slots, (disk - containerDisk) / slotDisk);
>>> 
>>> // Is this offer too small for even the minimum slots?
>>> if (slots < 1) {
>>>   return false;
>>> }
>> 
>> Not exactly sure what this is doing.
>> 
>> Sorry for the noise.
>> 
>>> 
>>> On May 7, 2015, at 6:32 PM, Brian Topping <brian.topping@gmail.com> wrote:
>>> 
>>> Presumably https://gist.github.com/briantopping/311960f8e5454dbe9aab has some more information necessary at this point... sorry for the omission.
>>> 
>>>> On May 7, 2015, at 6:05 PM, Tom Arnfeld <tom@duedil.com> wrote:
>>>> 
>>>> Hi Brian,
>>>> 
>>>> At this point you should see the TT attempting to be launched via Mesos. The "launched but not heartbeat yet" count tells us that the framework has accepted resources for 4 slots but the TT hasn't actually come up yet.
>>>> 
>>>> Do you see the task in your Mesos cluster UI, and is there anything interesting in the task logs?
>>>> 
>>>> --
>>>> 
>>>> Tom Arnfeld
>>>> Developer // DueDil
>>>> 
>>>> (+44) 7525940046
>>>> 25 Christopher Street, London, EC2A 2BS
>>>> 
>>>> 
>>>> On Thu, May 7, 2015 at 12:01 PM, Brian Topping <brian.topping@gmail.com> wrote:
>>>> 
>>>> Thanks guys, this was helpful. I started the job tracker as a service, but apparently I never started the task tracker (or it failed to start and I didn't notice). I started it after Haosdent's message, but wasn't able to see any difference and I kept poking around.
>>>> 
>>>> After making some changes and the VM wouldn't boot, my OCD got the better of me and I reinstalled everything from scratch. There are just too many moving parts to hassle you guys with an imperfect install on my end.
>>>> 
>>>> This time through, I felt a lot more confident to use the Mesosphere RPMs, but I couldn't find the best way to get things launched. https://docs.mesosphere.com/reference/packages/ has a Last-Modified of Fri, 01 May 2015 18:46:10 GMT (one week ago), but the RHEL 6 RPMs don't have any init.d service descriptions as the packages page would indicate. For now, I just launched them manually, but would like to get the machine to completely load on boot as services.
>>>> 
>>>> At this point, I have tested Mesos with:
>>>> 
>>>> 	mesos-execute --master="localhost:5050" --name="test-exec" --command="sleep 10"
>>>> 
>>>> The only problem there is it seems that "localhost" isn't good enough for my install, it needs to be the FQDN, but it works and the job flows through the UI.
>>>> 
>>>> Now, back to a hadoop job. When I try the job now, the logs show the following stream of repeated messages:
>>>> 
>>>>> 2015-05-07 17:52:53,124 INFO org.apache.hadoop.mapred.ResourcePolicy: Satisfied map and reduce slots needed.
>>>>> 2015-05-07 17:52:53,340 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
>>>>> [Repeated a few times a second for five seconds]
>>>>> 2015-05-07 17:49:08,914 INFO org.apache.hadoop.mapred.ResourcePolicy: JobTracker Status
>>>>>       Pending Map Tasks: 4
>>>>>    Pending Reduce Tasks: 1
>>>>>       Running Map Tasks: 0
>>>>>    Running Reduce Tasks: 0
>>>>>          Idle Map Slots: 0
>>>>>       Idle Reduce Slots: 0
>>>>>      Inactive Map Slots: 4 (launched but no hearbeat yet)
>>>>>   Inactive Reduce Slots: 1 (launched but no hearbeat yet)
>>>>>        Needed Map Slots: 0
>>>>>     Needed Reduce Slots: 0
>>>>>      Unhealthy Trackers: 0
>>>> 
>>>> This looks close.
>>>> 
>>>> What's the best way to get a JDWP port set up to break in this code (i.e. learning to fish...)?
>>>> 
>>>> best, Brian
>>>> 
>>>> 
>>>>> On May 7, 2015, at 12:11 PM, Adam Bordelon <adam@mesosphere.io> wrote:
>>>>> 
>>>>> From the mesos-master log and the JT log, it doesn't look like the MesosScheduler ever registered with Mesos, which should mean that it wouldn't start any TTs or map/reduce tasks. However, your `ps` output does seem to show a tasktracker running. Did you start that yourself (or automatically as a system service)?
>>>>> 
>>>>> On Wed, May 6, 2015 at 9:32 AM, haosdent <haosdent@gmail.com> wrote:
>>>>> Do you start tasktracker successfully?
>>>>> 
>>>>> On Wed, May 6, 2015 at 11:32 PM, Brian Topping <brian.topping@gmail.com> wrote:
>>>>> Hi all, I'm happy to report that I'm very close to getting 2.6.0-cdh5.4.0 integrated against Mesos 0.22.1 with the hadoop-mesos 0.10 code on Github. Hoping someone might have a few minutes to parse what I've got here and suggest something to try.
>>>>> 
>>>>> https://gist.github.com/briantopping/0dfd0777ff4ce5a81219 hopefully has all the data necessary between the console output of the client run, the mesos master and slave console, the XML configuration of the JT and the output that was generated by it. Please let me know if I've left something out.
>>>>> 
>>>>> I iterated a few times getting all the errors from missing paths or libraries sorted out, but the example client ultimately just sits waiting forever at "map 0% reduce 0%".
>>>>> 
>>>>> Any input kindly appreciated!
>>>>> 
>>>>> Brian
>>>>> 

Re: Debugging hadoop-mesos

Posted by haosdent <ha...@gmail.com>.

Do you run everything as root?

On Fri, May 8, 2015 at 3:38 PM, haosdent <ha...@gmail.com> wrote:

> Seems you don't have permission for this directory:
>
> java.io.IOException: Could not create job user log directory: file:/usr/lib/hadoop/logs/userlogs/job_201505080220_0001
>
> at javax.security.auth.Subject.doAs(Subject.java:415)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
> 	



-- 
Best Regards,
Haosdent Huang

Re: Debugging hadoop-mesos

Posted by haosdent <ha...@gmail.com>.

You could use these environment variables to set the user:
export HADOOP_USER=test
export HADOOP_GROUP=user

As for the logs, Hadoop logs to ${HADOOP_HOME}/logs, and I think your Hadoop home is
/usr/lib/hadoop. Please ignore mapred.mesos.executor.directory.
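
For illustration, here is a minimal Java sketch of the lookup order that would explain why editing log4j.properties made no difference. It rests on assumptions not confirmed in this thread: that the launcher scripts hand the daemon a log directory taken from HADOOP_LOG_DIR, that they fall back to ${HADOOP_HOME}/logs when it is unset, and that this value wins over the default in log4j.properties. The class and method names are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

public class LogDirSketch {
    /**
     * Hypothetical helper: returns the log directory a daemon would end up
     * with under the assumed lookup order
     * (HADOOP_LOG_DIR, then HADOOP_HOME/logs, then the log4j.properties default).
     */
    public static String effectiveLogDir(Map<String, String> env, String log4jDefault) {
        String logDir = env.get("HADOOP_LOG_DIR");
        if (logDir != null && !logDir.isEmpty()) {
            return logDir;                 // explicit setting wins
        }
        String home = env.get("HADOOP_HOME");
        if (home != null && !home.isEmpty()) {
            return home + "/logs";         // assumed script fallback
        }
        return log4jDefault;               // only used if nothing else is set
    }

    public static void main(String[] args) {
        Map<String, String> env = new HashMap<>();
        env.put("HADOOP_HOME", "/usr/lib/hadoop");
        // The log4j.properties value is ignored while HADOOP_HOME is set.
        System.out.println(effectiveLogDir(env, "/var/log/hadoop"));

        env.put("HADOOP_LOG_DIR", "/var/log/hadoop");
        System.out.println(effectiveLogDir(env, "/var/log/hadoop"));
    }
}
```

If those assumptions hold for this install, exporting HADOOP_LOG_DIR in the environment that launches the TaskTracker would be the place to change the directory, rather than log4j.properties.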

On Fri, May 8, 2015 at 3:40 PM, Brian Topping <br...@gmail.com>
wrote:

> That's correct, but /usr/lib/hadoop/logs doesn't even exist. It should be
> logging to /var/log/hadoop.


-- 
Best Regards,
Haosdent Huang

Re: Debugging hadoop-mesos

Posted by Brian Topping <br...@gmail.com>.

That's correct, but /usr/lib/hadoop/logs doesn't even exist. It should be logging to /var/log/hadoop.

> On May 8, 2015, at 2:38 PM, haosdent <ha...@gmail.com> wrote:
> 
> Seems you don't have permission for this directory:
> 
> java.io.IOException: Could not create job user log directory: file:/usr/lib/hadoop/logs/userlogs/job_201505080220_0001
> 
> at javax.security.auth.Subject.doAs(Subject.java:415)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
> 
> 
> On Fri, May 8, 2015 at 3:32 PM, Brian Topping <brian.topping@gmail.com <ma...@gmail.com>> wrote:
> Thanks Hasodent, I've updated https://gist.github.com/briantopping/311960f8e5454dbe9aab <https://gist.github.com/briantopping/311960f8e5454dbe9aab> with the output logs of what I am currently seeing. I've edited them for size, the message "INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060 <http://10.211.55.16:50060/>" appeared a few thousand times in the logs. The configuration I have is probably still broken, 50060 is a Jetty port that returns a Cloudera string when telnetting to it.
> 
> The error I saw below were apparently the result of building against the older version of CDH, when I updated the hadoop-mesos POM to match my deployment version, the incorrectly calculated "slots" problem in my previous message has resolved.
> 
> My current problem is a Hadoop logging problem and nothing to do with Mesos, so I didn't post. I changed hadoop.log.dir=/var/log/hadoop in /etc/hadoop/conf.pseudo.mr1/log4j.properties, but it didn't make any difference. Just getting back into it now.
> 
>> On May 8, 2015, at 1:56 PM, haosdent <haosdent@gmail.com <ma...@gmail.com>> wrote:
>> 
>> Could you post the log in executors which run jobtracker and taskstracks? It would be helpful to find the cause of this problem.
>> 
>> On Fri, May 8, 2015 at 3:05 AM, Brian Topping <brian.topping@gmail.com <ma...@gmail.com>> wrote:
>> I think there's something weird here:
>>>   cpus: offered 2.0 needed at least 1.0
>>>   mem : offered 1724.0 needed at least 1024.0
>>>   disk: offered 44124.0 needed at least 1024.0
>>>   ports:  at least 2 (sufficient)
>> 
>> Am I misreading this? All of the requirements seem to be met.
>> 
>> Presumably it's this code from o.a.h.mapred.ResourcePolicyVariable:
>> 
>>> int slots = mapSlotsMax + reduceSlotsMax;
>>> slots = (int) Math.min(slots, (cpus - containerCpus) / slotCpus);
>>> slots = (int) Math.min(slots, (mem - containerMem) / slotMem);
>>> slots = (int) Math.min(slots, (disk - containerDisk) / slotDisk);
>>> 
>>> // Is this offer too small for even the minimum slots?
>>> if (slots < 1) {
>>>   return false;
>>> }
>> 
>> Not exactly sure what this is doing.
>> 
>> Sorry for the noise.
>> 
>>> 
>>> On May 7, 2015, at 6:32 PM, Brian Topping <brian.topping@gmail.com <ma...@gmail.com>> wrote:
>>> 
>>> Presumably https://gist.github.com/briantopping/311960f8e5454dbe9aab <https://gist.github.com/briantopping/311960f8e5454dbe9aab> has some more information necessary at this point... sorry for the omission..
>>> 
>>>> On May 7, 2015, at 6:05 PM, Tom Arnfeld <tom@duedil.com <ma...@duedil.com>> wrote:
>>>> 
>>>> Hi Brian,
>>>> 
>>>> At this point you should see the TT attempting to be launched via Mesos. The "launched but not heartbeat yet" count tells us that the framework has accepted resources for 4 slots but the TT hasn't actually come up yet.
>>>> 
>>>> Do you see the task in your Meaos cluster UI, and is there anything interesting in the task logs?
>>>> 
>>>> --
>>>> 
>>>> Tom Arnfeld
>>>> Developer // DueDil
>>>> 
>>>> (+44) 7525940046 <tel:%28%2B44%29%207525940046>
>>>> 25 Christopher Street, London, EC2A 2BS
>>>> 
>>>> 
>>>> On Thu, May 7, 2015 at 12:01 PM, Brian Topping <brian.topping@gmail.com <ma...@gmail.com>> wrote:
>>>> 
>>>> Thanks guys, this was helpful. I started the job tracker as a service, but apparently I never started the task tracker (or it failed to start and I didn't notice). I started it after Haosdent's message, but wasn't able to see any difference and I kept poking around.
>>>> 
>>>> After making some changes and the VM wouldn't boot, my OCD got the better of me and I reinstalled everything from scratch. There are just too many moving parts to hassle you guys with an imperfect install on my end.
>>>> 
>>>> This time through, I felt a lot more confident to use the Mesosphere RPMs, but I couldn't find the best way to get things launched. https://docs.mesosphere.com/reference/packages/ <https://docs.mesosphere.com/reference/packages/> has a Last-Modified of Fri, 01 May 2015 18:46:10 GMT (one week ago), but the RHEL 6 RPMs don't have any init.d service descriptions as the packages page would indicate. For now, I just launched them manually, but would like to get the machine to completely load on boot as services.
>>>> 
>>>> At this point, I have tested Mesos with:
>>>> 
>>>> 	mesos-execute --master="localhost:5050" --name="test-exec" --command="sleep 10"
>>>> 
>>>> The only problem there is it seems that "localhost" isn't good enough for my install, it needs to be the FQDN, but it works and the job flows through the UI.
>>>> 
>>>> Now, back to a hadoop job. When I try the job now, the logs show the following stream of repeated messages:
>>>> 
>>>>> 2015-05-07 17:52:53,124 INFO org.apache.hadoop.mapred.ResourcePolicy: Satisfied map and reduce slots needed.
>>>>> 2015-05-07 17:52:53,340 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060 <http://10.211.55.16:50060/>.
>>>>> [Repeated a few times a second for five seconds]
>>>>> 2015-05-07 17:49:08,914 INFO org.apache.hadoop.mapred.ResourcePolicy: JobTracker Status
>>>>>       Pending Map Tasks: 4
>>>>>    Pending Reduce Tasks: 1
>>>>>       Running Map Tasks: 0
>>>>>    Running Reduce Tasks: 0
>>>>>          Idle Map Slots: 0
>>>>>       Idle Reduce Slots: 0
>>>>>      Inactive Map Slots: 4 (launched but no hearbeat yet)
>>>>>   Inactive Reduce Slots: 1 (launched but no hearbeat yet)
>>>>>        Needed Map Slots: 0
>>>>>     Needed Reduce Slots: 0
>>>>>      Unhealthy Trackers: 0
>>>> 
>>>> This looks close.
>>>> 
>>>> What's the best way to get a JDWP port set up to break in this code (i.e. learning to fish...)?
>>>> 
>>>> best, Brian
>>>> 
>>>> 
>>>>> On May 7, 2015, at 12:11 PM, Adam Bordelon <adam@mesosphere.io <ma...@mesosphere.io>> wrote:
>>>>> 
>>>>> From the mesos-master log and the JT log, it doesn't look like the MesosScheduler ever registered with Mesos, which should mean that it wouldn't start any TTs or map/reduce tasks. However, your `ps` output does seem to show a tasktracker running. Did you start that yourself (or automatically as a system service)?
>>>>> 
>>>>> On Wed, May 6, 2015 at 9:32 AM, haosdent <haosdent@gmail.com <ma...@gmail.com>> wrote:
>>>>> Do you start tasktracker successfully?
>>>>> 
>>>>> On Wed, May 6, 2015 at 11:32 PM, Brian Topping <brian.topping@gmail.com <ma...@gmail.com>> wrote:
>>>>> Hi all, I'm happy to report that I'm very close to getting 2.6.0-cdh5.4.0 integrated against Mesos 0.22.1 with the hadoop-mesos 0.10 code on Github. Hoping someone might have a few minutes to parse what I've got here and suggest something to try.
>>>>> 
>>>>> https://gist.github.com/briantopping/0dfd0777ff4ce5a81219 <https://gist.github.com/briantopping/0dfd0777ff4ce5a81219> hopefully has all the data necessary between the console output of the client run, the mesos master and slave console, the XML configuration of the JT and the output that was generated by it. Please let me know if I've left something out.
>>>>> 
>>>>> I iterated a few times getting all the errors from missing paths or libraries sorted out, but the example client ultimately just sits waiting forever at "map 0% reduce 0%".
>>>>> 
>>>>> Any input kindly appreciated!
>>>>> 
>>>>> Brian
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Best Regards,
>>>>> Haosdent Huang
>>>>> 
>>>> 
>>>> <signature.asc>
>>>> 
>>> 

Re: Debugging hadoop-mesos

Posted by haosdent <ha...@gmail.com>.

It seems you don't have permission to create this directory:

java.io.IOException: Could not create job user log directory: file:/usr/lib/hadoop/logs/userlogs/job_201505080220_0001
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
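
One way to confirm a permission problem like this is to run a small check as the same user that the TaskTracker runs as. The following is only a sketch: the class name LogDirCheck and its helper methods are made up for illustration, and the default path is simply the parent of the directory in the exception above. It walks up to the nearest existing ancestor and reports whether that ancestor is a writable directory.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LogDirCheck {
    /** Returns the given path if it exists, otherwise its nearest existing ancestor. */
    static Path nearestExisting(Path p) {
        Path cur = p.toAbsolutePath();
        while (cur != null && !Files.exists(cur)) {
            cur = cur.getParent();
        }
        return cur;
    }

    /** True if the nearest existing ancestor is a directory the current user can write to. */
    static boolean canCreate(Path dir) {
        Path existing = nearestExisting(dir);
        return existing != null && Files.isDirectory(existing) && Files.isWritable(existing);
    }

    public static void main(String[] args) {
        // Default path taken from the exception message; pass another path as the first argument.
        Path dir = Paths.get(args.length > 0 ? args[0] : "/usr/lib/hadoop/logs/userlogs");
        System.out.println(dir + " creatable by " + System.getProperty("user.name")
                + ": " + canCreate(dir));
    }
}
```

If this prints false for the user running the TaskTracker, fixing the ownership or mode of the nearest existing parent directory is the likely next step.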
	

On Fri, May 8, 2015 at 3:32 PM, Brian Topping <br...@gmail.com>
wrote:

> Thanks Haosdent, I've updated
> https://gist.github.com/briantopping/311960f8e5454dbe9aab with the output
> logs of what I am currently seeing. I've edited them for size, the message
> "INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker:
> http://10.211.55.16:50060" appeared a few thousand times in the logs. The
> configuration I have is probably still broken, 50060 is a Jetty port that
> returns a Cloudera string when telnetting to it.
>
> The errors I saw below were apparently the result of building against the
> older version of CDH, when I updated the hadoop-mesos POM to match my
> deployment version, the incorrectly calculated "slots" problem in my
> previous message has resolved.
>
> My current problem is a Hadoop logging problem and nothing to do with
> Mesos, so I didn't post. I changed hadoop.log.dir=/var/log/hadoop in
> /etc/hadoop/conf.pseudo.mr1/log4j.properties, but it didn't make any
> difference. Just getting back into it now.
>
> On May 8, 2015, at 1:56 PM, haosdent <ha...@gmail.com> wrote:
>
> Could you post the logs from the executors that run the jobtracker and tasktrackers?
> It would be helpful to find the cause of this problem.
>
> On Fri, May 8, 2015 at 3:05 AM, Brian Topping <br...@gmail.com>
> wrote:
>
>> I think there's something weird here:
>>
>>   cpus: offered 2.0 needed at least 1.0
>>   mem : offered 1724.0 needed at least 1024.0
>>   disk: offered 44124.0 needed at least 1024.0
>>   ports:  at least 2 (sufficient)
>>
>>
>> Am I misreading this? All of the requirements seem to be met.
>>
>> Presumably it's this code from o.a.h.mapred.ResourcePolicyVariable:
>>
>> int slots = mapSlotsMax + reduceSlotsMax;
>> slots = (int) Math.min(slots, (cpus - containerCpus) / slotCpus);
>> slots = (int) Math.min(slots, (mem - containerMem) / slotMem);
>> slots = (int) Math.min(slots, (disk - containerDisk) / slotDisk);
>>
>>
>> // Is this offer too small for even the minimum slots?
>> if (slots < 1) {
>>   return false;
>> }
>>
>>
>> Not exactly sure what this is doing.
>>
>> Sorry for the noise.
>>
>>
>> On May 7, 2015, at 6:32 PM, Brian Topping <br...@gmail.com>
>> wrote:
>>
>> Presumably https://gist.github.com/briantopping/311960f8e5454dbe9aab has
>> some more information necessary at this point... sorry for the omission..
>>
>> On May 7, 2015, at 6:05 PM, Tom Arnfeld <to...@duedil.com> wrote:
>>
>> Hi Brian,
>>
>> At this point you should see the TT attempting to be launched via Mesos.
>> The "launched but not heartbeat yet" count tells us that the framework has
>> accepted resources for 4 slots but the TT hasn't actually come up yet.
>>
>> Do you see the task in your Mesos cluster UI, and is there anything
>> interesting in the task logs?
>>
>> --
>>
>> Tom Arnfeld
>> Developer // DueDil
>>
>> (+44) 7525940046
>> 25 Christopher Street, London, EC2A 2BS
>>
>>
>> On Thu, May 7, 2015 at 12:01 PM, Brian Topping <br...@gmail.com>
>> wrote:
>>
>>> Thanks guys, this was helpful. I started the job tracker as a service,
>>> but apparently I never started the task tracker (or it failed to start and
>>> I didn't notice). I started it after Haosdent's message, but wasn't able to
>>> see any difference and I kept poking around.
>>>
>>> After making some changes and the VM wouldn't boot, my OCD got the
>>> better of me and I reinstalled everything from scratch. There are just too
>>> many moving parts to hassle you guys with an imperfect install on my end.
>>>
>>> This time through, I felt a lot more confident to use the Mesosphere
>>> RPMs, but I couldn't find the best way to get things launched.
>>> https://docs.mesosphere.com/reference/packages/ has a Last-Modified
>>> of Fri, 01 May 2015 18:46:10 GMT (one week ago), but the RHEL 6 RPMs don't
>>> have any init.d service descriptions as the packages page would indicate.
>>> For now, I just launched them manually, but would like to get the machine
>>> to completely load on boot as services.
>>>
>>> At this point, I have tested Mesos with:
>>>
>>>  mesos-execute
>>> --master="localhost:5050" --name="test-exec" --command="sleep 10"
>>>
>>> The only problem there is it seems that "localhost" isn't good enough
>>> for my install, it needs to be the FQDN, but it works and the job flows
>>> through the UI.
>>>
>>> Now, back to a hadoop job. When I try the job now, the logs show the
>>> following stream of repeated messages:
>>>
>>>  2015-05-07 17:52:53,124 INFO org.apache.hadoop.mapred.ResourcePolicy:
>>> Satisfied map and reduce slots needed.
>>> 2015-05-07 17:52:53,340 INFO org.apache.hadoop.mapred.MesosScheduler:
>>> Unknown/exited TaskTracker: http://10.211.55.16:50060.
>>> [Repeated a few times a second for five seconds]
>>>
>>> 2015-05-07 17:49:08,914 INFO org.apache.hadoop.mapred.ResourcePolicy:
>>> JobTracker Status
>>>
>>>       Pending Map Tasks: 4
>>>
>>>    Pending Reduce Tasks: 1
>>>       Running Map Tasks: 0
>>>    Running Reduce Tasks: 0
>>>          Idle Map Slots: 0
>>>       Idle Reduce Slots: 0
>>>      Inactive Map Slots: 4 (launched but no hearbeat yet)
>>>   Inactive Reduce Slots: 1 (launched but no hearbeat yet)
>>>        Needed Map Slots: 0
>>>     Needed Reduce Slots: 0
>>>      Unhealthy Trackers: 0
>>>
>>>
>>> This looks close.
>>>
>>> What's the best way to get a JDWP port set up to break in this code
>>> (i.e. learning to fish...)?
>>>
>>> best, Brian
>>>
>>>
>>>  On May 7, 2015, at 12:11 PM, Adam Bordelon <ad...@mesosphere.io> wrote:
>>>
>>> From the mesos-master log and the JT log, it doesn't look like the
>>> MesosScheduler ever registered with Mesos, which should mean that it
>>> wouldn't start any TTs or map/reduce tasks. However, your `ps` output does
>>> seem to show a tasktracker running. Did you start that yourself (or
>>> automatically as a system service)?
>>>
>>> On Wed, May 6, 2015 at 9:32 AM, haosdent <ha...@gmail.com> wrote:
>>>
>>>> Do you start tasktracker successfully?
>>>>
>>>> On Wed, May 6, 2015 at 11:32 PM, Brian Topping <brian.topping@gmail.com
>>>> > wrote:
>>>>
>>>>> Hi all, I'm happy to report that I'm very close to
>>>>> getting 2.6.0-cdh5.4.0 integrated against Mesos 0.22.1 with the
>>>>> hadoop-mesos 0.10 code on Github. Hoping someone might have a few minutes
>>>>> to parse what I've got here and suggest something to try.
>>>>>
>>>>>  https://gist.github.com/briantopping/0dfd0777ff4ce5a81219 hopefully
>>>>> has all the data necessary between the console output of the client run,
>>>>> the mesos master and slave console, the XML configuration of the JT and the
>>>>> output that was generated by it. Please let me know if I've left something
>>>>> out.
>>>>>
>>>>> I iterated a few times getting all the errors from missing paths or
>>>>> libraries sorted out, but the example client ultimately just sits waiting
>>>>> forever at "map 0% reduce 0%".
>>>>>
>>>>> Any input kindly appreciated!
>>>>>
>>>>> Brian
>>>>>
>>>>
>>>>
>>>>
>>>>  --
>>>> Best Regards,
>>>> Haosdent Huang
>>>>
>>>
>>>
>>>  <signature.asc>
>>
>>
>>
>>
>>
>
>
> --
> Best Regards,
> Haosdent Huang
>
>
>


-- 
Best Regards,
Haosdent Huang

Re: Debugging hadoop-mesos

Posted by Brian Topping <br...@gmail.com>.

Thanks Haosdent, I've updated https://gist.github.com/briantopping/311960f8e5454dbe9aab with the output logs of what I am currently seeing. I've edited them for size; the message "INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060" appeared a few thousand times in the logs. The configuration I have is probably still broken; 50060 is a Jetty port that returns a Cloudera string when I telnet to it.

The errors I saw below were apparently the result of building against the older version of CDH. When I updated the hadoop-mesos POM to match my deployment version, the incorrectly calculated "slots" problem from my previous message went away.

My current problem is a Hadoop logging problem and has nothing to do with Mesos, so I didn't post. I set hadoop.log.dir=/var/log/hadoop in /etc/hadoop/conf.pseudo.mr1/log4j.properties, but it didn't make any difference. Just getting back into it now.

> On May 8, 2015, at 1:56 PM, haosdent <ha...@gmail.com> wrote:
> 
> Could you post the logs from the executors that run the jobtracker and tasktrackers? It would be helpful to find the cause of this problem.
> 
> On Fri, May 8, 2015 at 3:05 AM, Brian Topping <brian.topping@gmail.com <ma...@gmail.com>> wrote:
> I think there's something weird here:
>>   cpus: offered 2.0 needed at least 1.0
>>   mem : offered 1724.0 needed at least 1024.0
>>   disk: offered 44124.0 needed at least 1024.0
>>   ports:  at least 2 (sufficient)
> 
> Am I misreading this? All of the requirements seem to be met.
> 
> Presumably it's this code from o.a.h.mapred.ResourcePolicyVariable:
> 
>> int slots = mapSlotsMax + reduceSlotsMax;
>> slots = (int) Math.min(slots, (cpus - containerCpus) / slotCpus);
>> slots = (int) Math.min(slots, (mem - containerMem) / slotMem);
>> slots = (int) Math.min(slots, (disk - containerDisk) / slotDisk);
>> 
>> // Is this offer too small for even the minimum slots?
>> if (slots < 1) {
>>   return false;
>> }
> 
> Not exactly sure what this is doing.
> 
> Sorry for the noise.
> 
>> 
>> On May 7, 2015, at 6:32 PM, Brian Topping <brian.topping@gmail.com <ma...@gmail.com>> wrote:
>> 
>> Presumably https://gist.github.com/briantopping/311960f8e5454dbe9aab <https://gist.github.com/briantopping/311960f8e5454dbe9aab> has some more information necessary at this point... sorry for the omission..
>> 
>>> On May 7, 2015, at 6:05 PM, Tom Arnfeld <tom@duedil.com <ma...@duedil.com>> wrote:
>>> 
>>> Hi Brian,
>>> 
>>> At this point you should see the TT attempting to be launched via Mesos. The "launched but not heartbeat yet" count tells us that the framework has accepted resources for 4 slots but the TT hasn't actually come up yet.
>>> 
>>> Do you see the task in your Mesos cluster UI, and is there anything interesting in the task logs?
>>> 
>>> --
>>> 
>>> Tom Arnfeld
>>> Developer // DueDil
>>> 
>>> (+44) 7525940046
>>> 25 Christopher Street, London, EC2A 2BS
>>> 
>>> 
>>> On Thu, May 7, 2015 at 12:01 PM, Brian Topping <brian.topping@gmail.com <ma...@gmail.com>> wrote:
>>> 
>>> Thanks guys, this was helpful. I started the job tracker as a service, but apparently I never started the task tracker (or it failed to start and I didn't notice). I started it after Haosdent's message, but wasn't able to see any difference and I kept poking around.
>>> 
>>> After making some changes and the VM wouldn't boot, my OCD got the better of me and I reinstalled everything from scratch. There are just too many moving parts to hassle you guys with an imperfect install on my end.
>>> 
>>> This time through, I felt a lot more confident to use the Mesosphere RPMs, but I couldn't find the best way to get things launched. https://docs.mesosphere.com/reference/packages/ <https://docs.mesosphere.com/reference/packages/> has a Last-Modified of Fri, 01 May 2015 18:46:10 GMT (one week ago), but the RHEL 6 RPMs don't have any init.d service descriptions as the packages page would indicate. For now, I just launched them manually, but would like to get the machine to completely load on boot as services.
>>> 
>>> At this point, I have tested Mesos with:
>>> 
>>> 	mesos-execute --master="localhost:5050" --name="test-exec" --command="sleep 10"
>>> 
>>> The only problem there is it seems that "localhost" isn't good enough for my install, it needs to be the FQDN, but it works and the job flows through the UI.
>>> 
>>> Now, back to a hadoop job. When I try the job now, the logs show the following stream of repeated messages:
>>> 
>>>> 2015-05-07 17:52:53,124 INFO org.apache.hadoop.mapred.ResourcePolicy: Satisfied map and reduce slots needed.
>>>> 2015-05-07 17:52:53,340 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060 <http://10.211.55.16:50060/>.
>>>> [Repeated a few times a second for five seconds]
>>>> 2015-05-07 17:49:08,914 INFO org.apache.hadoop.mapred.ResourcePolicy: JobTracker Status
>>>>       Pending Map Tasks: 4
>>>>    Pending Reduce Tasks: 1
>>>>       Running Map Tasks: 0
>>>>    Running Reduce Tasks: 0
>>>>          Idle Map Slots: 0
>>>>       Idle Reduce Slots: 0
>>>>      Inactive Map Slots: 4 (launched but no hearbeat yet)
>>>>   Inactive Reduce Slots: 1 (launched but no hearbeat yet)
>>>>        Needed Map Slots: 0
>>>>     Needed Reduce Slots: 0
>>>>      Unhealthy Trackers: 0
>>> 
>>> This looks close.
>>> 
>>> What's the best way to get a JDWP port set up to break in this code (i.e. learning to fish...)?
>>> 
>>> best, Brian
>>> 
>>> 
>>>> On May 7, 2015, at 12:11 PM, Adam Bordelon <adam@mesosphere.io <ma...@mesosphere.io>> wrote:
>>>> 
>>>> From the mesos-master log and the JT log, it doesn't look like the MesosScheduler ever registered with Mesos, which should mean that it wouldn't start any TTs or map/reduce tasks. However, your `ps` output does seem to show a tasktracker running. Did you start that yourself (or automatically as a system service)?
>>>> 
>>>> On Wed, May 6, 2015 at 9:32 AM, haosdent <haosdent@gmail.com <ma...@gmail.com>> wrote:
>>>> Do you start tasktracker successfully?
>>>> 
>>>> On Wed, May 6, 2015 at 11:32 PM, Brian Topping <brian.topping@gmail.com <ma...@gmail.com>> wrote:
>>>> Hi all, I'm happy to report that I'm very close to getting 2.6.0-cdh5.4.0 integrated against Mesos 0.22.1 with the hadoop-mesos 0.10 code on Github. Hoping someone might have a few minutes to parse what I've got here and suggest something to try.
>>>> 
>>>> https://gist.github.com/briantopping/0dfd0777ff4ce5a81219 <https://gist.github.com/briantopping/0dfd0777ff4ce5a81219> hopefully has all the data necessary between the console output of the client run, the mesos master and slave console, the XML configuration of the JT and the output that was generated by it. Please let me know if I've left something out.
>>>> 
>>>> I iterated a few times getting all the errors from missing paths or libraries sorted out, but the example client ultimately just sits waiting forever at "map 0% reduce 0%".
>>>> 
>>>> Any input kindly appreciated!
>>>> 
>>>> Brian
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Best Regards,
>>>> Haosdent Huang
>>>> 
>>> 
>>> <signature.asc>
>>> 
>> 

Re: Debugging hadoop-mesos

Posted by haosdent <ha...@gmail.com>.

Could you post the logs from the executors that run the jobtracker and
tasktrackers? That would help in finding the cause of this problem.

On Fri, May 8, 2015 at 3:05 AM, Brian Topping <br...@gmail.com>
wrote:

> I think there's something weird here:
>
>   cpus: offered 2.0 needed at least 1.0
>   mem : offered 1724.0 needed at least 1024.0
>   disk: offered 44124.0 needed at least 1024.0
>   ports:  at least 2 (sufficient)
>
>
> Am I misreading this? All of the requirements seem to be met.
>
> Presumably it's this code from o.a.h.mapred.ResourcePolicyVariable:
>
> int slots = mapSlotsMax + reduceSlotsMax;
> slots = (int) Math.min(slots, (cpus - containerCpus) / slotCpus);
> slots = (int) Math.min(slots, (mem - containerMem) / slotMem);
> slots = (int) Math.min(slots, (disk - containerDisk) / slotDisk);
>
>
> // Is this offer too small for even the minimum slots?
> if (slots < 1) {
>   return false;
> }
>
>
> Not exactly sure what this is doing.
>
> Sorry for the noise.
>
>
> On May 7, 2015, at 6:32 PM, Brian Topping <br...@gmail.com> wrote:
>
> Presumably https://gist.github.com/briantopping/311960f8e5454dbe9aab has
> some more information necessary at this point... sorry for the omission..
>
> On May 7, 2015, at 6:05 PM, Tom Arnfeld <to...@duedil.com> wrote:
>
> Hi Brian,
>
> At this point you should see the TT attempting to be launched via Mesos.
> The "launched but not heartbeat yet" count tells us that the framework has
> accepted resources for 4 slots but the TT hasn't actually come up yet.
>
> Do you see the task in your Mesos cluster UI, and is there anything
> interesting in the task logs?
>
> --
>
> Tom Arnfeld
> Developer // DueDil
>
> (+44) 7525940046
> 25 Christopher Street, London, EC2A 2BS
>
>
> On Thu, May 7, 2015 at 12:01 PM, Brian Topping <br...@gmail.com>
> wrote:
>
>> Thanks guys, this was helpful. I started the job tracker as a service,
>> but apparently I never started the task tracker (or it failed to start and
>> I didn't notice). I started it after Haosdent's message, but wasn't able to
>> see any difference and I kept poking around.
>>
>> After making some changes and the VM wouldn't boot, my OCD got the better
>> of me and I reinstalled everything from scratch. There are just too many
>> moving parts to hassle you guys with an imperfect install on my end.
>>
>> This time through, I felt a lot more confident to use the Mesosphere
>> RPMs, but I couldn't find the best way to get things launched.
>> https://docs.mesosphere.com/reference/packages/ has a Last-Modified
>> of Fri, 01 May 2015 18:46:10 GMT (one week ago), but the RHEL 6 RPMs don't
>> have any init.d service descriptions as the packages page would indicate.
>> For now, I just launched them manually, but would like to get the machine
>> to completely load on boot as services.
>>
>> At this point, I have tested Mesos with:
>>
>>  mesos-execute
>> --master="localhost:5050" --name="test-exec" --command="sleep 10"
>>
>> The only problem there is it seems that "localhost" isn't good enough for
>> my install, it needs to be the FQDN, but it works and the job flows through
>> the UI.
>>
>> Now, back to a hadoop job. When I try the job now, the logs show the
>> following stream of repeated messages:
>>
>>  2015-05-07 17:52:53,124 INFO org.apache.hadoop.mapred.ResourcePolicy:
>> Satisfied map and reduce slots needed.
>> 2015-05-07 17:52:53,340 INFO org.apache.hadoop.mapred.MesosScheduler:
>> Unknown/exited TaskTracker: http://10.211.55.16:50060.
>> [Repeated a few times a second for five seconds]
>>
>> 2015-05-07 17:49:08,914 INFO org.apache.hadoop.mapred.ResourcePolicy:
>> JobTracker Status
>>
>>       Pending Map Tasks: 4
>>
>>    Pending Reduce Tasks: 1
>>       Running Map Tasks: 0
>>    Running Reduce Tasks: 0
>>          Idle Map Slots: 0
>>       Idle Reduce Slots: 0
>>      Inactive Map Slots: 4 (launched but no hearbeat yet)
>>   Inactive Reduce Slots: 1 (launched but no hearbeat yet)
>>        Needed Map Slots: 0
>>     Needed Reduce Slots: 0
>>      Unhealthy Trackers: 0
>>
>>
>> This looks close.
>>
>> What's the best way to get a JDWP port set up to break in this code (i.e.
>> learning to fish...)?
>>
>> best, Brian
>>
>>
>>  On May 7, 2015, at 12:11 PM, Adam Bordelon <ad...@mesosphere.io> wrote:
>>
>> From the mesos-master log and the JT log, it doesn't look like the
>> MesosScheduler ever registered with Mesos, which should mean that it
>> wouldn't start any TTs or map/reduce tasks. However, your `ps` output does
>> seem to show a tasktracker running. Did you start that yourself (or
>> automatically as a system service)?
>>
>> On Wed, May 6, 2015 at 9:32 AM, haosdent <ha...@gmail.com> wrote:
>>
>>> Do you start tasktracker successfully?
>>>
>>> On Wed, May 6, 2015 at 11:32 PM, Brian Topping <br...@gmail.com>
>>> wrote:
>>>
>>>> Hi all, I'm happy to report that I'm very close to
>>>> getting 2.6.0-cdh5.4.0 integrated against Mesos 0.22.1 with the
>>>> hadoop-mesos 0.10 code on Github. Hoping someone might have a few minutes
>>>> to parse what I've got here and suggest something to try.
>>>>
>>>>  https://gist.github.com/briantopping/0dfd0777ff4ce5a81219 hopefully
>>>> has all the data necessary between the console output of the client run,
>>>> the mesos master and slave console, the XML configuration of the JT and the
>>>> output that was generated by it. Please let me know if I've left something
>>>> out.
>>>>
>>>> I iterated a few times getting all the errors from missing paths or
>>>> libraries sorted out, but the example client ultimately just sits waiting
>>>> forever at "map 0% reduce 0%".
>>>>
>>>> Any input kindly appreciated!
>>>>
>>>> Brian
>>>>
>>>
>>>
>>>
>>>  --
>>> Best Regards,
>>> Haosdent Huang
>>>
>>
>>
>>  <signature.asc>
>
>
>
>
>


-- 
Best Regards,
Haosdent Huang

Re: Debugging hadoop-mesos

Posted by Brian Topping <br...@gmail.com>.

I think there's something weird here:
>   cpus: offered 2.0 needed at least 1.0
>   mem : offered 1724.0 needed at least 1024.0
>   disk: offered 44124.0 needed at least 1024.0
>   ports:  at least 2 (sufficient)

Am I misreading this? All of the requirements seem to be met.

Presumably it's this code from o.a.h.mapred.ResourcePolicyVariable:

> int slots = mapSlotsMax + reduceSlotsMax;
> slots = (int) Math.min(slots, (cpus - containerCpus) / slotCpus);
> slots = (int) Math.min(slots, (mem - containerMem) / slotMem);
> slots = (int) Math.min(slots, (disk - containerDisk) / slotDisk);
> 
> // Is this offer too small for even the minimum slots?
> if (slots < 1) {
>   return false;
> }

Not exactly sure what this is doing.

Sorry for the noise.
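
For anyone reading along, here is one way an offer can pass every "needed at least" line and still be rejected by arithmetic of this shape. This is only a sketch: the class name SlotEstimate, the slot maximum of 5, and the per-container and per-slot sizes are hypothetical values chosen for illustration, not the actual hadoop-mesos defaults. Only the offered cpus, mem and disk come from the log above. The point is that the container overhead is subtracted first, and the cast to int then truncates any fraction below one whole slot.

```java
public class SlotEstimate {
    /**
     * Mirrors the quoted arithmetic: cap the slot count by each resource
     * after subtracting the per-container overhead. Every value is passed in
     * explicitly; nothing here is read from a real configuration.
     */
    static int slots(int maxSlots, double cpus, double mem, double disk,
                     double containerCpus, double containerMem, double containerDisk,
                     double slotCpus, double slotMem, double slotDisk) {
        int slots = maxSlots;
        slots = (int) Math.min(slots, (cpus - containerCpus) / slotCpus);
        slots = (int) Math.min(slots, (mem - containerMem) / slotMem);
        slots = (int) Math.min(slots, (disk - containerDisk) / slotDisk);
        return slots;
    }

    public static void main(String[] args) {
        // Offer values from the log; overhead and per-slot sizes are hypothetical.
        int n = slots(5, 2.0, 1724.0, 44124.0,
                      1.0, 1024.0, 1024.0,
                      1.0, 1024.0, 1024.0);
        // (1724 - 1024) / 1024 is about 0.68, which truncates to 0 slots.
        System.out.println("slots = " + n);
    }
}
```

With these assumed sizes the memory term limits the result to zero slots, so the "slots < 1" check would reject the offer even though 1724 MB exceeds the 1024 MB minimum. Whether that matches the real numbers depends on the configured overhead and slot sizes.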

> 
> On May 7, 2015, at 6:32 PM, Brian Topping <br...@gmail.com> wrote:
> 
> Presumably https://gist.github.com/briantopping/311960f8e5454dbe9aab <https://gist.github.com/briantopping/311960f8e5454dbe9aab> has some more information necessary at this point... sorry for the omission..
> 
>> On May 7, 2015, at 6:05 PM, Tom Arnfeld <tom@duedil.com <ma...@duedil.com>> wrote:
>> 
>> Hi Brian,
>> 
>> At this point you should see the TT attempting to be launched via Mesos. The "launched but not heartbeat yet" count tells us that the framework has accepted resources for 4 slots but the TT hasn't actually come up yet.
>> 
>> Do you see the task in your Mesos cluster UI, and is there anything interesting in the task logs?
>> 
>> --
>> 
>> Tom Arnfeld
>> Developer // DueDil
>> 
>> (+44) 7525940046
>> 25 Christopher Street, London, EC2A 2BS
>> 
>> 
>> On Thu, May 7, 2015 at 12:01 PM, Brian Topping <brian.topping@gmail.com <ma...@gmail.com>> wrote:
>> 
>> Thanks guys, this was helpful. I started the job tracker as a service, but apparently I never started the task tracker (or it failed to start and I didn't notice). I started it after Haosdent's message, but wasn't able to see any difference and I kept poking around.
>> 
>> After making some changes and the VM wouldn't boot, my OCD got the better of me and I reinstalled everything from scratch. There are just too many moving parts to hassle you guys with an imperfect install on my end.
>> 
>> This time through, I felt a lot more confident to use the Mesosphere RPMs, but I couldn't find the best way to get things launched. https://docs.mesosphere.com/reference/packages/ <https://docs.mesosphere.com/reference/packages/> has a Last-Modified of Fri, 01 May 2015 18:46:10 GMT (one week ago), but the RHEL 6 RPMs don't have any init.d service descriptions as the packages page would indicate. For now, I just launched them manually, but would like to get the machine to completely load on boot as services.
>> 
>> At this point, I have tested Mesos with:
>> 
>> 	mesos-execute --master="localhost:5050" --name="test-exec" --command="sleep 10"
>> 
>> The only problem there is it seems that "localhost" isn't good enough for my install, it needs to be the FQDN, but it works and the job flows through the UI.
>> 
>> Now, back to a hadoop job. When I try the job now, the logs show the following stream of repeated messages:
>> 
>>> 2015-05-07 17:52:53,124 INFO org.apache.hadoop.mapred.ResourcePolicy: Satisfied map and reduce slots needed.
>>> 2015-05-07 17:52:53,340 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060 <http://10.211.55.16:50060/>.
>>> [Repeated a few times a second for five seconds]
>>> 2015-05-07 17:49:08,914 INFO org.apache.hadoop.mapred.ResourcePolicy: JobTracker Status
>>>       Pending Map Tasks: 4
>>>    Pending Reduce Tasks: 1
>>>       Running Map Tasks: 0
>>>    Running Reduce Tasks: 0
>>>          Idle Map Slots: 0
>>>       Idle Reduce Slots: 0
>>>      Inactive Map Slots: 4 (launched but no hearbeat yet)
>>>   Inactive Reduce Slots: 1 (launched but no hearbeat yet)
>>>        Needed Map Slots: 0
>>>     Needed Reduce Slots: 0
>>>      Unhealthy Trackers: 0
>> 
>> This looks close.
>> 
>> What's the best way to get a JDWP port set up to break in this code (i.e. learning to fish...)?
>> 
>> best, Brian
>> 
>> 
>>> On May 7, 2015, at 12:11 PM, Adam Bordelon <adam@mesosphere.io <ma...@mesosphere.io>> wrote:
>>> 
>>> From the mesos-master log and the JT log, it doesn't look like the MesosScheduler ever registered with Mesos, which should mean that it wouldn't start any TTs or map/reduce tasks. However, your `ps` output does seem to show a tasktracker running. Did you start that yourself (or automatically as a system service)?
>>> 
>>> On Wed, May 6, 2015 at 9:32 AM, haosdent <haosdent@gmail.com <ma...@gmail.com>> wrote:
>>> Do you start tasktracker successfully?
>>> 
>>> On Wed, May 6, 2015 at 11:32 PM, Brian Topping <brian.topping@gmail.com <ma...@gmail.com>> wrote:
>>> Hi all, I'm happy to report that I'm very close to getting 2.6.0-cdh5.4.0 integrated against Mesos 0.22.1 with the hadoop-mesos 0.10 code on Github. Hoping someone might have a few minutes to parse what I've got here and suggest something to try.
>>> 
>>> https://gist.github.com/briantopping/0dfd0777ff4ce5a81219 <https://gist.github.com/briantopping/0dfd0777ff4ce5a81219> hopefully has all the data necessary between the console output of the client run, the mesos master and slave console, the XML configuration of the JT and the output that was generated by it. Please let me know if I've left something out.
>>> 
>>> I iterated a few times getting all the errors from missing paths or libraries sorted out, but the example client ultimately just sits waiting forever at "map 0% reduce 0%".
>>> 
>>> Any input kindly appreciated!
>>> 
>>> Brian
>>> 
>>> 
>>> 
>>> --
>>> Best Regards,
>>> Haosdent Huang
>>> 
>> 
>> <signature.asc>
>> 
>

Re: Debugging hadoop-mesos

Posted by Brian Topping <br...@gmail.com>.

Presumably https://gist.github.com/briantopping/311960f8e5454dbe9aab has the additional information needed at this point. Sorry for the omission.

> On May 7, 2015, at 6:05 PM, Tom Arnfeld <to...@duedil.com> wrote:
> 
> Hi Brian,
> 
> At this point you should see the TT attempting to be launched via Mesos. The "launched but not heartbeat yet" count tells us that the framework has accepted resources for 4 slots but the TT hasn't actually come up yet.
> 
> Do you see the task in your Mesos cluster UI, and is there anything interesting in the task logs?
> 
> --
> 
> Tom Arnfeld
> Developer // DueDil
> 
> (+44) 7525940046
> 25 Christopher Street, London, EC2A 2BS
> 
> 
> On Thu, May 7, 2015 at 12:01 PM, Brian Topping <brian.topping@gmail.com <ma...@gmail.com>> wrote:
> 
> Thanks guys, this was helpful. I started the job tracker as a service, but apparently I never started the task tracker (or it failed to start and I didn't notice). I started it after Haosdent's message, but wasn't able to see any difference and I kept poking around.
> 
> After making some changes and the VM wouldn't boot, my OCD got the better of me and I reinstalled everything from scratch. There are just too many moving parts to hassle you guys with an imperfect install on my end.
> 
> This time through, I felt a lot more confident to use the Mesosphere RPMs, but I couldn't find the best way to get things launched. https://docs.mesosphere.com/reference/packages/ <https://docs.mesosphere.com/reference/packages/> has a Last-Modified of Fri, 01 May 2015 18:46:10 GMT (one week ago), but the RHEL 6 RPMs don't have any init.d service descriptions as the packages page would indicate. For now, I just launched them manually, but would like to get the machine to completely load on boot as services.
> 
> At this point, I have tested Mesos with:
> 
> 	mesos-execute --master="localhost:5050" --name="test-exec" --command="sleep 10"
> 
> The only problem there is it seems that "localhost" isn't good enough for my install, it needs to be the FQDN, but it works and the job flows through the UI.
> 
> Now, back to a hadoop job. When I try the job now, the logs show the following stream of repeated messages:
> 
>> 2015-05-07 17:52:53,124 INFO org.apache.hadoop.mapred.ResourcePolicy: Satisfied map and reduce slots needed.
>> 2015-05-07 17:52:53,340 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060 <http://10.211.55.16:50060/>.
>> [Repeated a few times a second for five seconds]
>> 2015-05-07 17:49:08,914 INFO org.apache.hadoop.mapred.ResourcePolicy: JobTracker Status
>>       Pending Map Tasks: 4
>>    Pending Reduce Tasks: 1
>>       Running Map Tasks: 0
>>    Running Reduce Tasks: 0
>>          Idle Map Slots: 0
>>       Idle Reduce Slots: 0
>>      Inactive Map Slots: 4 (launched but no hearbeat yet)
>>   Inactive Reduce Slots: 1 (launched but no hearbeat yet)
>>        Needed Map Slots: 0
>>     Needed Reduce Slots: 0
>>      Unhealthy Trackers: 0
> 
> This looks close.
> 
> What's the best way to get a JDWP port set up to break in this code (i.e. learning to fish...)?
> 
> best, Brian
> 
> 
>> On May 7, 2015, at 12:11 PM, Adam Bordelon <adam@mesosphere.io <ma...@mesosphere.io>> wrote:
>> 
>> From the mesos-master log and the JT log, it doesn't look like the MesosScheduler ever registered with Mesos, which should mean that it wouldn't start any TTs or map/reduce tasks. However, your `ps` output does seem to show a tasktracker running. Did you start that yourself (or automatically as a system service)?
>> 
>> On Wed, May 6, 2015 at 9:32 AM, haosdent <haosdent@gmail.com <ma...@gmail.com>> wrote:
>> Do you start tasktracker successfully?
>> 
>> On Wed, May 6, 2015 at 11:32 PM, Brian Topping <brian.topping@gmail.com <ma...@gmail.com>> wrote:
>> Hi all, I'm happy to report that I'm very close to getting 2.6.0-cdh5.4.0 integrated against Mesos 0.22.1 with the hadoop-mesos 0.10 code on Github. Hoping someone might have a few minutes to parse what I've got here and suggest something to try.
>> 
>> https://gist.github.com/briantopping/0dfd0777ff4ce5a81219 <https://gist.github.com/briantopping/0dfd0777ff4ce5a81219> hopefully has all the data necessary between the console output of the client run, the mesos master and slave console, the XML configuration of the JT and the output that was generated by it. Please let me know if I've left something out.
>> 
>> I iterated a few times getting all the errors from missing paths or libraries sorted out, but the example client ultimately just sits waiting forever at "map 0% reduce 0%".
>> 
>> Any input kindly appreciated!
>> 
>> Brian
>> 
>> 
>> 
>> --
>> Best Regards,
>> Haosdent Huang
>> 
> 
> <signature.asc>
>

Re: Debugging hadoop-mesos

Posted by Brian Topping <br...@gmail.com>.

Thanks Tom! I do see activity in the cluster:

1. mesos-master.WARNING log -- a sequence of repeated messages is being generated:

> W0507 18:10:21.794231 11729 master.cpp:2661] Cannot kill task Task_Tracker_34 of framework 20150507-164120-272093962-5050-11711-0003 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at scheduler-2fed30f4-bbbe-47a5-a587-42202c792150@10.211.55.16:35914 because it is unknown; performing reconciliation

2. The mesos-slave.WARNING log shows "W0507 17:42:50.385308 11753 slave.cpp:1783] Cannot shut down unknown framework 20150507-164120-272093962-5050-11711-0004" from about the time that the job was launched.

3. mesos-master.INFO log -- the following sequence of messages repeats:

> I0507 18:18:40.512228 11730 master.cpp:3760] Sending 1 offers to framework 20150507-164120-272093962-5050-11711-0003 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at scheduler-2fed30f4-bbbe-47a5-a587-42202c792150@10.211.55.16:35914
> I0507 18:18:40.514377 11729 master.cpp:2273] Processing ACCEPT call for offers: [ 20150507-164120-272093962-5050-11711-O556 ] on slave 20150507-164120-272093962-5050-11711-S0 at slave(1)@10.211.55.16:5051 (10.211.55.16) for framework 20150507-164120-272093962-5050-11711-0003 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at scheduler-2fed30f4-bbbe-47a5-a587-42202c792150@10.211.55.16:35914
> I0507 18:18:40.515120 11729 hierarchical.hpp:648] Recovered cpus(*):6; mem(*):2803; disk(*):45148; ports(*):[31000-32000] (total allocatable: cpus(*):6; mem(*):2803; disk(*):45148; ports(*):[31000-32000]) on slave 20150507-164120-272093962-5050-11711-S0 from framework 20150507-164120-272093962-5050-11711-0003
> I0507 18:18:41.798447 11724 http.cpp:516] HTTP request for '/master/state.json'

4. mesos-slave.INFO has nothing but resource allocation messages showing current disk usage.

5. The UI shows several terminated frameworks and one active (the one above). But the detail screen for that framework says there are no active or completed tasks.

Does this help?
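
The log excerpts above follow the glog line layout (severity letter, MMDD date, time, thread id, `file:line]`, message). A minimal sketch, assuming that layout, of how repeating messages could be tallied by source location; the function name `count_by_source` and the abbreviated sample lines are illustrative only and not part of Mesos or hadoop-mesos:

```python
import re
from collections import Counter

# glog layout: severity letter, MMDD, HH:MM:SS.micros, thread id, file:line] message
GLOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) (?P<source>[\w.]+:\d+)\] (?P<message>.*)$"
)


def count_by_source(lines):
    """Count log lines per (severity, file:line) so repeated messages stand out."""
    counts = Counter()
    for line in lines:
        match = GLOG_LINE.match(line.strip())
        if match:  # lines that are not in glog layout are skipped
            counts[(match["severity"], match["source"])] += 1
    return counts


# Sample lines abbreviated from the excerpts in this message.
sample = [
    "W0507 18:10:21.794231 11729 master.cpp:2661] Cannot kill task Task_Tracker_34 (abbreviated)",
    "W0507 17:42:50.385308 11753 slave.cpp:1783] Cannot shut down unknown framework (abbreviated)",
    "I0507 18:18:40.512228 11730 master.cpp:3760] Sending 1 offers to framework (abbreviated)",
    "I0507 18:18:40.514377 11729 master.cpp:2273] Processing ACCEPT call for offers (abbreviated)",
    "I0507 18:18:41.798447 11724 http.cpp:516] HTTP request for '/master/state.json'",
]

if __name__ == "__main__":
    for (severity, source), count in count_by_source(sample).most_common():
        print(severity, source, count)
```

Run against a full log file, the most frequent `(severity, file:line)` pairs point at the messages that are looping, such as the "Cannot kill task" warning in item 1.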

> On May 7, 2015, at 6:05 PM, Tom Arnfeld <to...@duedil.com> wrote:
> 
> Hi Brian,
> 
> At this point you should see the TT attempting to be launched via Mesos. The "launched but not heartbeat yet" count tells us that the framework has accepted resources for 4 slots but the TT hasn't actually come up yet.
> 
> Do you see the task in your Mesos cluster UI, and is there anything interesting in the task logs?
> 
> --
> 
> Tom Arnfeld
> Developer // DueDil
> 
> (+44) 7525940046
> 25 Christopher Street, London, EC2A 2BS
> 
> 
> On Thu, May 7, 2015 at 12:01 PM, Brian Topping <brian.topping@gmail.com> wrote:
> 
> Thanks guys, this was helpful. I started the job tracker as a service, but apparently I never started the task tracker (or it failed to start and I didn't notice). I started it after Haosdent's message, but wasn't able to see any difference and I kept poking around.
> 
> After making some changes and the VM wouldn't boot, my OCD got the better of me and I reinstalled everything from scratch. There are just too many moving parts to hassle you guys with an imperfect install on my end.
> 
> This time through, I felt a lot more confident to use the Mesosphere RPMs, but I couldn't find the best way to get things launched. https://docs.mesosphere.com/reference/packages/ has a Last-Modified of Fri, 01 May 2015 18:46:10 GMT (one week ago), but the RHEL 6 RPMs don't have any init.d service descriptions as the packages page would indicate. For now, I just launched them manually, but would like to get the machine to completely load on boot as services.
> 
> At this point, I have tested Mesos with:
> 
> 	mesos-execute --master="localhost:5050" --name="test-exec" --command="sleep 10"
> 
> The only problem there is it seems that "localhost" isn't good enough for my install, it needs to be the FQDN, but it works and the job flows through the UI.
> 
> Now, back to a hadoop job. When I try the job now, the logs show the following stream of repeated messages:
> 
>> 2015-05-07 17:52:53,124 INFO org.apache.hadoop.mapred.ResourcePolicy: Satisfied map and reduce slots needed.
>> 2015-05-07 17:52:53,340 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
>> [Repeated a few times a second for five seconds]
>> 2015-05-07 17:49:08,914 INFO org.apache.hadoop.mapred.ResourcePolicy: JobTracker Status
>>       Pending Map Tasks: 4
>>    Pending Reduce Tasks: 1
>>       Running Map Tasks: 0
>>    Running Reduce Tasks: 0
>>          Idle Map Slots: 0
>>       Idle Reduce Slots: 0
>>      Inactive Map Slots: 4 (launched but no hearbeat yet)
>>   Inactive Reduce Slots: 1 (launched but no hearbeat yet)
>>        Needed Map Slots: 0
>>     Needed Reduce Slots: 0
>>      Unhealthy Trackers: 0
> 
> This looks close.
> 
> What's the best way to get a JDWP port set up to break in this code (i.e. learning to fish...)?
> 
> best, Brian
> 
> 
>> On May 7, 2015, at 12:11 PM, Adam Bordelon <adam@mesosphere.io> wrote:
>> 
>> From the mesos-master log and the JT log, it doesn't look like the MesosScheduler ever registered with Mesos, which should mean that it wouldn't start any TTs or map/reduce tasks. However, your `ps` output does seem to show a tasktracker running. Did you start that yourself (or automatically as a system service)?
>> 
>> On Wed, May 6, 2015 at 9:32 AM, haosdent <haosdent@gmail.com> wrote:
>> Do you start tasktracker successfully?
>> 
>> On Wed, May 6, 2015 at 11:32 PM, Brian Topping <brian.topping@gmail.com> wrote:
>> Hi all, I'm happy to report that I'm very close to getting 2.6.0-cdh5.4.0 integrated against Mesos 0.22.1 with the hadoop-mesos 0.10 code on Github. Hoping someone might have a few minutes to parse what I've got here and suggest something to try.
>> 
>> https://gist.github.com/briantopping/0dfd0777ff4ce5a81219 hopefully has all the data necessary between the console output of the client run, the mesos master and slave console, the XML configuration of the JT and the output that was generated by it. Please let me know if I've left something out.
>> 
>> I iterated a few times getting all the errors from missing paths or libraries sorted out, but the example client ultimately just sits waiting forever at "map 0% reduce 0%".
>> 
>> Any input kindly appreciated!
>> 
>> Brian
>> 
>> 
>> 
>> --
>> Best Regards,
>> Haosdent Huang
>> 
> 
>

Re: Debugging hadoop-mesos

Posted by Tom Arnfeld <to...@duedil.com>.

Hi Brian,

At this point you should see the TT attempting to be launched via Mesos. The "launched but not heartbeat yet" count tells us that the framework has accepted resources for 4 slots but the TT hasn't actually come up yet.
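
A minimal sketch of the check described above, assuming the "JobTracker Status" block has the layout shown in the quoted log (one "Name: count" pair per line). The helper names `parse_status` and `slots_awaiting_heartbeat` are hypothetical and not part of hadoop-mesos:

```python
import re

# Matches lines such as "     Inactive Map Slots: 4 (launched but no hearbeat yet)"
STATUS_LINE = re.compile(r"^\s*(?P<name>[A-Za-z ]+?):\s*(?P<value>\d+)\b")


def parse_status(block):
    """Turn the status block into a dict of counter name -> integer value."""
    status = {}
    for line in block.splitlines():
        match = STATUS_LINE.match(line)
        if match:
            status[match["name"].strip()] = int(match["value"])
    return status


def slots_awaiting_heartbeat(status):
    """Slots the framework launched whose TaskTracker has not reported in yet."""
    return status.get("Inactive Map Slots", 0) + status.get("Inactive Reduce Slots", 0)


# Values copied from the log excerpt quoted in this thread.
SAMPLE = """\
      Pending Map Tasks: 4
   Pending Reduce Tasks: 1
      Running Map Tasks: 0
   Running Reduce Tasks: 0
         Idle Map Slots: 0
      Idle Reduce Slots: 0
     Inactive Map Slots: 4 (launched but no hearbeat yet)
  Inactive Reduce Slots: 1 (launched but no hearbeat yet)
       Needed Map Slots: 0
    Needed Reduce Slots: 0
     Unhealthy Trackers: 0
"""

if __name__ == "__main__":
    status = parse_status(SAMPLE)
    print("slots awaiting heartbeat:", slots_awaiting_heartbeat(status))
```

A non-zero result that persists across several status dumps suggests the TaskTracker was launched but never came up, which is the case to chase in the task logs.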

Do you see the task in your Mesos cluster UI, and is there anything interesting in the task logs?

--

Tom Arnfeld
Developer // DueDil

(+44) 7525940046
25 Christopher Street, London, EC2A 2BS

On Thu, May 7, 2015 at 12:01 PM, Brian Topping <br...@gmail.com>
wrote:

> Thanks guys, this was helpful. I started the job tracker as a service, but apparently I never started the task tracker (or it failed to start and I didn't notice). I started it after Haosdent's message, but wasn't able to see any difference and I kept poking around.
> After making some changes and the VM wouldn't boot, my OCD got the better of me and I reinstalled everything from scratch. There are just too many moving parts to hassle you guys with an imperfect install on my end.
> This time through, I felt a lot more confident to use the Mesosphere RPMs, but I couldn't find the best way to get things launched. https://docs.mesosphere.com/reference/packages/ has a Last-Modified of Fri, 01 May 2015 18:46:10 GMT (one week ago), but the RHEL 6 RPMs don't have any init.d service descriptions as the packages page would indicate. For now, I just launched them manually, but would like to get the machine to completely load on boot as services.
> At this point, I have tested Mesos with:
> 	mesos-execute --master="localhost:5050" --name="test-exec" --command="sleep 10"
> The only problem there is it seems that "localhost" isn't good enough for my install, it needs to be the FQDN, but it works and the job flows through the UI.
> Now, back to a hadoop job. When I try the job now, the logs show the following stream of repeated messages:
>> 2015-05-07 17:52:53,124 INFO org.apache.hadoop.mapred.ResourcePolicy: Satisfied map and reduce slots needed.
>> 2015-05-07 17:52:53,340 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
>> [Repeated a few times a second for five seconds]
>> 2015-05-07 17:49:08,914 INFO org.apache.hadoop.mapred.ResourcePolicy: JobTracker Status
>>       Pending Map Tasks: 4
>>    Pending Reduce Tasks: 1
>>       Running Map Tasks: 0
>>    Running Reduce Tasks: 0
>>          Idle Map Slots: 0
>>       Idle Reduce Slots: 0
>>      Inactive Map Slots: 4 (launched but no hearbeat yet)
>>   Inactive Reduce Slots: 1 (launched but no hearbeat yet)
>>        Needed Map Slots: 0
>>     Needed Reduce Slots: 0
>>      Unhealthy Trackers: 0
> This looks close.
> What's the best way to get a JDWP port set up to break in this code (i.e. learning to fish...)?
> best, Brian
>> On May 7, 2015, at 12:11 PM, Adam Bordelon <ad...@mesosphere.io> wrote:
>> 
>> From the mesos-master log and the JT log, it doesn't look like the MesosScheduler ever registered with Mesos, which should mean that it wouldn't start any TTs or map/reduce tasks. However, your `ps` output does seem to show a tasktracker running. Did you start that yourself (or automatically as a system service)?
>> 
>> On Wed, May 6, 2015 at 9:32 AM, haosdent <haosdent@gmail.com> wrote:
>> Do you start tasktracker successfully?
>> 
>> On Wed, May 6, 2015 at 11:32 PM, Brian Topping <brian.topping@gmail.com> wrote:
>> Hi all, I'm happy to report that I'm very close to getting 2.6.0-cdh5.4.0 integrated against Mesos 0.22.1 with the hadoop-mesos 0.10 code on Github. Hoping someone might have a few minutes to parse what I've got here and suggest something to try.
>> 
>> https://gist.github.com/briantopping/0dfd0777ff4ce5a81219 hopefully has all the data necessary between the console output of the client run, the mesos master and slave console, the XML configuration of the JT and the output that was generated by it. Please let me know if I've left something out.
>> 
>> I iterated a few times getting all the errors from missing paths or libraries sorted out, but the example client ultimately just sits waiting forever at "map 0% reduce 0%".
>> 
>> Any input kindly appreciated!
>> 
>> Brian
>> 
>> 
>> 
>> --
>> Best Regards,
>> Haosdent Huang
>>

Re: Debugging hadoop-mesos

Posted by Brian Topping <br...@gmail.com>.

Thanks guys, this was helpful. I started the job tracker as a service, but apparently I never started the task tracker (or it failed to start and I didn't notice). I started it after Haosdent's message, but wasn't able to see any difference and I kept poking around.

After I made some changes, the VM wouldn't boot, so my OCD got the better of me and I reinstalled everything from scratch. There are just too many moving parts to hassle you guys with an imperfect install on my end.

This time through, I felt a lot more confident using the Mesosphere RPMs, but I couldn't find the best way to get things launched. https://docs.mesosphere.com/reference/packages/ has a Last-Modified of Fri, 01 May 2015 18:46:10 GMT (one week ago), but the RHEL 6 RPMs don't have any of the init.d service scripts that the packages page describes. For now, I just launched them manually, but I would like to get the machine to bring everything up as services on boot.

At this point, I have tested Mesos with:

	mesos-execute --master="localhost:5050" --name="test-exec" --command="sleep 10"

The only problem there is that "localhost" doesn't seem to be good enough for my install; the master address needs to be the FQDN. With that change it works and the job flows through the UI.
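
A minimal sketch of building that same test command with the FQDN in place of "localhost". The flags are the ones used above; the helper name `build_mesos_execute` is hypothetical, and `socket.getfqdn()` can itself return "localhost" when name resolution is not set up, so the printed command is worth checking before running it:

```python
import shlex
import socket


def build_mesos_execute(master_host, port=5050):
    """Build the mesos-execute argument list used for the smoke test."""
    return [
        "mesos-execute",
        f"--master={master_host}:{port}",
        "--name=test-exec",
        "--command=sleep 10",
    ]


if __name__ == "__main__":
    # Only prints the command; pass the list to subprocess.run() to execute it.
    command = build_mesos_execute(socket.getfqdn())
    print(" ".join(shlex.quote(arg) for arg in command))
```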

Now, back to a hadoop job. When I try the job now, the logs show the following stream of repeated messages:

> 2015-05-07 17:52:53,124 INFO org.apache.hadoop.mapred.ResourcePolicy: Satisfied map and reduce slots needed.
> 2015-05-07 17:52:53,340 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
> [Repeated a few times a second for five seconds]
> 2015-05-07 17:49:08,914 INFO org.apache.hadoop.mapred.ResourcePolicy: JobTracker Status
>       Pending Map Tasks: 4
>    Pending Reduce Tasks: 1
>       Running Map Tasks: 0
>    Running Reduce Tasks: 0
>          Idle Map Slots: 0
>       Idle Reduce Slots: 0
>      Inactive Map Slots: 4 (launched but no hearbeat yet)
>   Inactive Reduce Slots: 1 (launched but no hearbeat yet)
>        Needed Map Slots: 0
>     Needed Reduce Slots: 0
>      Unhealthy Trackers: 0

This looks close.

What's the best way to get a JDWP port set up to break in this code (i.e. learning to fish...)?

best, Brian
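
On the JDWP question above: the JVM's debug agent is enabled with the standard `-agentlib:jdwp=...` option. A minimal sketch that builds the option string; port 5005 is an arbitrary choice, and how the option reaches the JobTracker JVM depends on the install (one assumption is appending it to `HADOOP_JOBTRACKER_OPTS` in hadoop-env.sh, which may differ under CDH packaging):

```python
def jdwp_option(port=5005, suspend=False):
    """Return the JVM option that opens a JDWP socket listener on the given port.

    suspend=True makes the JVM wait for a debugger to attach before running,
    which helps when the code of interest executes during startup.
    """
    suspend_flag = "y" if suspend else "n"
    return (
        "-agentlib:jdwp=transport=dt_socket,server=y,"
        f"suspend={suspend_flag},address={port}"
    )


if __name__ == "__main__":
    print(jdwp_option())
```

With the JVM started this way, a debugger can attach to the chosen port as a remote session and set breakpoints in the scheduler classes.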

> On May 7, 2015, at 12:11 PM, Adam Bordelon <ad...@mesosphere.io> wrote:
> 
> From the mesos-master log and the JT log, it doesn't look like the MesosScheduler ever registered with Mesos, which should mean that it wouldn't start any TTs or map/reduce tasks. However, your `ps` output does seem to show a tasktracker running. Did you start that yourself (or automatically as a system service)?
> 
> On Wed, May 6, 2015 at 9:32 AM, haosdent <haosdent@gmail.com> wrote:
> Do you start tasktracker successfully?
> 
> On Wed, May 6, 2015 at 11:32 PM, Brian Topping <brian.topping@gmail.com> wrote:
> Hi all, I'm happy to report that I'm very close to getting 2.6.0-cdh5.4.0 integrated against Mesos 0.22.1 with the hadoop-mesos 0.10 code on Github. Hoping someone might have a few minutes to parse what I've got here and suggest something to try.
> 
> https://gist.github.com/briantopping/0dfd0777ff4ce5a81219 hopefully has all the data necessary between the console output of the client run, the mesos master and slave console, the XML configuration of the JT and the output that was generated by it. Please let me know if I've left something out.
> 
> I iterated a few times getting all the errors from missing paths or libraries sorted out, but the example client ultimately just sits waiting forever at "map 0% reduce 0%".
> 
> Any input kindly appreciated!
> 
> Brian
> 
> 
> 
> --
> Best Regards,
> Haosdent Huang
>

Re: Debugging hadoop-mesos

Posted by Adam Bordelon <ad...@mesosphere.io>.

From the mesos-master log and the JT log, it doesn't look like the
MesosScheduler ever registered with Mesos, which should mean that it
wouldn't start any TTs or map/reduce tasks. However, your `ps` output does
seem to show a tasktracker running. Did you start that yourself (or
automatically as a system service)?

On Wed, May 6, 2015 at 9:32 AM, haosdent <ha...@gmail.com> wrote:

> Do you start tasktracker successfully?
>
> On Wed, May 6, 2015 at 11:32 PM, Brian Topping <br...@gmail.com>
> wrote:
>
>> Hi all, I'm happy to report that I'm very close to getting 2.6.0-cdh5.4.0
>> integrated against Mesos 0.22.1 with the hadoop-mesos 0.10 code on Github.
>> Hoping someone might have a few minutes to parse what I've got here and
>> suggest something to try.
>>
>> https://gist.github.com/briantopping/0dfd0777ff4ce5a81219 hopefully has
>> all the data necessary between the console output of the client run, the
>> mesos master and slave console, the XML configuration of the JT and the
>> output that was generated by it. Please let me know if I've left something
>> out.
>>
>> I iterated a few times getting all the errors from missing paths or
>> libraries sorted out, but the example client ultimately just sits waiting
>> forever at "map 0% reduce 0%".
>>
>> Any input kindly appreciated!
>>
>> Brian
>>
>
>
>
> --
> Best Regards,
> Haosdent Huang
>

Re: Debugging hadoop-mesos

Posted by haosdent <ha...@gmail.com>.

Did you start the tasktracker successfully?

On Wed, May 6, 2015 at 11:32 PM, Brian Topping <br...@gmail.com>
wrote:

> Hi all, I'm happy to report that I'm very close to getting 2.6.0-cdh5.4.0
> integrated against Mesos 0.22.1 with the hadoop-mesos 0.10 code on Github.
> Hoping someone might have a few minutes to parse what I've got here and
> suggest something to try.
>
> https://gist.github.com/briantopping/0dfd0777ff4ce5a81219 hopefully has
> all the data necessary between the console output of the client run, the
> mesos master and slave console, the XML configuration of the JT and the
> output that was generated by it. Please let me know if I've left something
> out.
>
> I iterated a few times getting all the errors from missing paths or
> libraries sorted out, but the example client ultimately just sits waiting
> forever at "map 0% reduce 0%".
>
> Any input kindly appreciated!
>
> Brian
>



-- 
Best Regards,
Haosdent Huang