Posted to common-user@hadoop.apache.org by David Howell <de...@gmail.com> on 2010/04/03 03:16:52 UTC

losing network interfaces during long running map-reduce jobs

I'm encountering a completely bizarre failure mode in my Hadoop
cluster. A week ago, I switched from vanilla Apache Hadoop 0.20.1 to
CDH 2.

Ever since then, my tasktracker/datanode machines have been regularly
losing their networking during long (> 1 hour) jobs. Restarting the
network interface brings them back online immediately.

I'm mystified as to how this can be happening: anyone care to venture
a hypothesis? I'm running on CentOS 5.2.

Cheers,
David

Re: losing network interfaces during long running map-reduce jobs

Posted by Patrick Angeles <pa...@cloudera.com>.
Hi David,

Strange indeed. I assume nothing in your configs changed. Anything funny in
the logs? You should also rule out the switch itself as being faulty.

It's possible that CDH2 has a patch that's not in 0.20.1 that's causing this
problem, but we haven't heard this exact problem from any of our other
customers / users.

- P

On Fri, Apr 2, 2010 at 6:16 PM, David Howell <de...@gmail.com> wrote:

> I'm encountering a completely bizarre failure mode in my Hadoop
> cluster. A week ago, I switched from vanilla Apache Hadoop 0.20.1 to
> CDH 2.
>
> Ever since then, my tasktracker/datanode machines have been regularly
> losing their networking during long (> 1 hour) jobs. Restarting the
> network interface brings them back online immediately.
>
> I'm mystified as to how this can be happening: anyone care to venture
> a hypothesis? I'm running on CentOS 5.2.
>
> Cheers,
> David
>

Re: losing network interfaces during long running map-reduce jobs

Posted by David Howell <de...@gmail.com>.
> could be just file handles you are losing; have you upped the OS defaults?
>

I have not, and that does seem like a likely culprit. Although, it's a
bit alarming that asking for one socket too many could take down the
networking stack...
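
For anyone following the file handle theory, here is a minimal sketch
(not from the thread) of how a process can report its own descriptor
limit and current usage. It assumes a Linux host with /proc mounted;
the function name fd_usage is made up for illustration.

```python
import os
import resource


def fd_usage():
    """Return (open_fds, soft_limit, hard_limit) for the current process."""
    # RLIMIT_NOFILE is the per-process cap on open file descriptors,
    # which covers sockets as well as regular files.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        # Linux-specific: each entry in /proc/self/fd is one open descriptor.
        # The count can be one too high, because listing the directory
        # briefly uses a descriptor of its own.
        open_fds = len(os.listdir("/proc/self/fd"))
    except OSError:
        open_fds = None  # /proc is not available on this platform
    return open_fds, soft, hard


if __name__ == "__main__":
    open_fds, soft, hard = fd_usage()
    print("open fds: %s, soft limit: %s, hard limit: %s" % (open_fds, soft, hard))
```

If the open count climbs toward the soft limit over the course of a
long job, that would support the file handle theory. It would not by
itself explain a node that stops answering ping.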

Re: losing network interfaces during long running map-reduce jobs

Posted by Steve Loughran <st...@apache.org>.
David Howell wrote:
>> But I haven't seen anything in the dmesg log. I'll have to try looking
>> at the tcpdump output on Monday, once I can get console access again.
>> My apologies that I'm so sketchy on details right now... so far, I
>> haven't been able to find any evidence of something going wrong
>> except for the hadoop log entries when the IOExceptions start.
>>
>> Thanks,
>> -David
>>
> 
> I just lost my networking again. This time, I had switched my cluster
> back to the build I was using before I switched to CDH2.
> 
> It's Hadoop 0.20.1 with these patches applied (for Dumbo):
> 
> HADOOP-1722-v0.20.1
> HADOOP-5450
> MAPREDUCE-764
> HADOOP-5528
> 
> Now I'm wondering if something about my job is the culprit. I have 2
> nodes, both 8 core machines.
> mapred.tasktracker.map|reduce.tasks.maximum are both set to 7.
> 
> The job I'm running is combining lots of gzipped Apache log files into
> sequence files for later analysis... I'm going from one file per
> virtual host per server per day to one file per virtual host per day. The
> last attempt had ~1400 maps/10 reduces.
> 


could be just file handles you are losing; have you upped the OS defaults?

Re: losing network interfaces during long running map-reduce jobs

Posted by David Howell <de...@gmail.com>.
> But I haven't seen anything in the dmesg log. I'll have to try looking
> at the tcpdump output on Monday, once I can get console access again.
> My apologies that I'm so sketchy on details right now... so far, I
> haven't been able to find any evidence of something going wrong
> except for the hadoop log entries when the IOExceptions start.
>
> Thanks,
> -David
>

I just lost my networking again. This time, I had switched my cluster
back to the build I was using before I switched to CDH2.

It's Hadoop 0.20.1 with these patches applied (for Dumbo):

HADOOP-1722-v0.20.1
HADOOP-5450
MAPREDUCE-764
HADOOP-5528

Now I'm wondering if something about my job is the culprit. I have 2
nodes, both 8 core machines.
mapred.tasktracker.map|reduce.tasks.maximum are both set to 7.

The job I'm running is combining lots of gzipped Apache log files into
sequence files for later analysis... I'm going from one file per
virtual host per server per day to one file per virtual host per day. The
last attempt had ~1400 maps/10 reduces.

Is this job some kind of map-reduce anti-pattern that's causing problems?

Here's the source to mapper and reducer:
http://gist.github.com/356750

Cheers,
David
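
The regrouping described above (per server, per virtual host, per day
down to per virtual host, per day) can be sketched in plain Python as
below. This is not the code from the gist. The path layout
"<server>/<vhost>/<day>.log.gz" and the run_local helper are
assumptions made up for illustration, and the shuffle is simulated in
memory rather than run on Hadoop.

```python
from collections import defaultdict


def mapper(path, line):
    # Hypothetical input layout: "<server>/<vhost>/<day>.log.gz".
    server, vhost, filename = path.split("/")
    day = filename.split(".")[0]
    # Leaving the server out of the key is what merges the per-server files.
    yield (vhost, day), line


def reducer(key, lines):
    # Pass every log line through under its (vhost, day) key.
    for line in lines:
        yield key, line


def run_local(records):
    """Simulate the shuffle in memory: group mapper output by key, then reduce."""
    groups = defaultdict(list)
    for path, line in records:
        for key, value in mapper(path, line):
            groups[key].append(value)
    out = defaultdict(list)
    for key in sorted(groups):
        for k, v in reducer(key, groups[key]):
            out[k].append(v)
    return dict(out)
```

With this shape, the number of distinct output keys is the number of
(vhost, day) pairs, regardless of how many servers contributed input.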

Re: losing network interfaces during long running map-reduce jobs

Posted by David Howell <de...@gmail.com>.
> Could you clarify what you mean by "losing their networking"? Can you ping
> the node externally? If you access the node via the console (via ILOM, etc)
> and run tcpdump or tshark, can you see ethernet broadcast traffic at all? Do
> you see anything in dmesg on the machine in question?
>
> Thanks
> -Todd

My cluster is small and the physical servers are managed by my company's
IT department... I just admin the Hadoop install and I don't have
access except through ssh. When one of my nodes goes unresponsive, it
doesn't respond to ping, ssh, or any traffic on any port. I've been
limited so far to trying to investigate logs after my sysadmin
restarts the networking interface.

But I haven't seen anything in the dmesg log. I'll have to try looking
at the tcpdump output on Monday, once I can get console access again.
My apologies that I'm so sketchy on details right now... so far, I
haven't been able to find any evidence of something going wrong
except for the hadoop log entries when the IOExceptions start.

Thanks,
-David
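
Without console access, one way to pin down when a node drops off is
to probe it from a second machine and log the result. The following
is a minimal sketch, not something posted in the thread; the function
name port_reachable is made up for illustration, and it only tests
TCP reachability (it is not a substitute for ping or tcpdump).

```python
import socket


def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
    conn.close()
    return True
```

Calling port_reachable(node, 22) once a minute from cron and writing
the result with a timestamp would give a rough time of failure to line
up against the Hadoop logs. Here node is a placeholder for the
worker's host name and 22 is the ssh port.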

Re: losing network interfaces during long running map-reduce jobs

Posted by Todd Lipcon <to...@cloudera.com>.
Hi David,

On Fri, Apr 2, 2010 at 6:16 PM, David Howell <de...@gmail.com> wrote:

> I'm encountering a completely bizarre failure mode in my Hadoop
> cluster. A week ago, I switched from vanilla Apache Hadoop 0.20.1 to
> CDH 2.
>
> Ever since then, my tasktracker/datanode machines have been regularly
> losing their networking during long (> 1 hour) jobs. Restarting the
> network interface brings them back online immediately.
>
>
Could you clarify what you mean by "losing their networking"? Can you ping
the node externally? If you access the node via the console (via ILOM, etc)
and run tcpdump or tshark, can you see ethernet broadcast traffic at all? Do
you see anything in dmesg on the machine in question?

Thanks
-Todd

-- 
Todd Lipcon
Software Engineer, Cloudera