Posted to mapreduce-user@hadoop.apache.org by Andrei <fa...@gmail.com> on 2013/07/10 14:02:00 UTC

ConnectionException in container, happens only sometimes

Hi,

I'm running a CDH4.3 installation of Hadoop with the following simple setup:

master-host: runs the NameNode, ResourceManager and JobHistoryServer
slave-1-host and slave-2-host: run the DataNodes and NodeManagers.

When I run a simple MapReduce job from the client (either via the streaming
API or the Pi example from the distribution), I see that some tasks fail:

13/07/10 14:40:10 INFO mapreduce.Job:  map 60% reduce 0%
13/07/10 14:40:14 INFO mapreduce.Job: Task Id :
attempt_1373454026937_0005_m_000003_0, Status : FAILED
13/07/10 14:40:14 INFO mapreduce.Job: Task Id :
attempt_1373454026937_0005_m_000005_0, Status : FAILED
...
13/07/10 14:40:23 INFO mapreduce.Job:  map 60% reduce 20%
...

Each run, a different set of tasks/attempts fails. In some cases the number of
failed attempts becomes critical and the whole job fails; in other cases the
job finishes successfully. I can't see any pattern, but I noticed the
following.

Let's say the ApplicationMaster runs on _slave-1-host_. In this case, on
_slave-2-host_ there will be a corresponding syslog with the following
contents:

...
2013-07-10 11:06:10,986 INFO [main] org.apache.hadoop.ipc.Client: Retrying
connect to server: slave-2-host/127.0.0.1:11812. Already tried 0 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1 SECONDS)
2013-07-10 11:06:11,989 INFO [main] org.apache.hadoop.ipc.Client: Retrying
connect to server: slave-2-host/127.0.0.1:11812. Already tried 1 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1 SECONDS)
...
2013-07-10 11:06:20,013 INFO [main] org.apache.hadoop.ipc.Client: Retrying
connect to server: slave-2-host/127.0.0.1:11812. Already tried 9 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1 SECONDS)
2013-07-10 11:06:20,019 WARN [main] org.apache.hadoop.mapred.YarnChild:
Exception running child : java.net.ConnectException: Call From slave-2-host/
127.0.0.1 to slave-2-host:11812 failed on connection exception:
java.net.ConnectException: Connection refused; For more details see:
http://wiki.apache.org/hadoop/ConnectionRefused
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
        at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:782)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:729)
        at org.apache.hadoop.ipc.Client.call(Client.java:1229)
        at
org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:225)
        at com.sun.proxy.$Proxy6.getTask(Unknown Source)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:131)
Caused by: java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:708)
        at
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:207)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:528)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:492)
        at
org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:499)
        at
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:593)
        at
org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:241)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1278)
        at org.apache.hadoop.ipc.Client.call(Client.java:1196)
        ... 3 more
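
For reference, the policy named in the retries above is Hadoop's built-in
fixed-sleep retry policy; the attempt count for connection setup is typically
controlled by ipc.client.connect.max.retries (default 10). A minimal sketch of
constructing the same policy (the class name is illustrative; assumes
hadoop-common on the classpath):

    import java.util.concurrent.TimeUnit;

    import org.apache.hadoop.io.retry.RetryPolicies;
    import org.apache.hadoop.io.retry.RetryPolicy;

    public class RetryPolicyExample {
        public static void main(String[] args) {
            // 10 attempts with a fixed 1-second sleep, matching
            // RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
            RetryPolicy policy = RetryPolicies
                .retryUpToMaximumCountWithFixedSleep(10, 1, TimeUnit.SECONDS);
            System.out.println(policy);
        }
    }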


Notice several things:

1. This exception always happens on a different host than the one the
ApplicationMaster runs on.
2. It always tries to connect to localhost, not to another host in the
cluster (see the resolution check after this list).
3. The port number (11812 in this case) is always different.
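
Point 2 suggests that the hostname resolves to the loopback address on the
failing node (for example, via an /etc/hosts entry mapping the node's own
hostname to 127.0.0.1), which matches the "slave-2-host/127.0.0.1" form in the
log. A minimal, hypothetical check (class name and hostname are placeholders):

    import java.net.InetAddress;

    public class ResolveCheck {
        public static void main(String[] args) throws Exception {
            // Placeholder hostname; substitute the node that logs the failure.
            InetAddress addr = InetAddress.getByName("slave-2-host");
            // If this prints 127.0.0.1, the hostname resolves to loopback on
            // this node, and a server that advertises this name cannot be
            // reached on a real interface from other processes or hosts.
            System.out.println(addr.getHostAddress());
        }
    }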

My questions are:

1. I assume it is the task (container) that is trying to establish the
connection, but what is it trying to connect to?
2. Why does this error happen, and how can I fix it?

Any suggestions are welcome.

Thanks,
Andrei

Re: EBADF: Bad file descriptor

Posted by Sanjay Subramanian <Sa...@wizecommerce.com>.
Thanks
I will look into the logs to see if I see anything else…
sanjay


From: Colin McCabe <cm...@alumni.cmu.edu>
Reply-To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
Date: Wednesday, July 10, 2013 11:52 AM
To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject: Re: EBADF: Bad file descriptor

To clarify a little bit, the readahead pool can sometimes spit out this message if you close a file while a readahead request is in flight.  It's not an error and just reflects the fact that the file was closed hastily, probably because of some other bug which is the real problem.

Colin
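
As an illustration of the race described above (a hypothetical sketch, not the
actual ReadaheadPool code): an asynchronous request that still references an
open file can run after the file has been closed, and the operation then
fails, analogous to posix_fadvise returning EBADF on a closed descriptor.

    import java.io.FileInputStream;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ReadaheadRace {
        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newSingleThreadExecutor();
            FileInputStream in = new FileInputStream("/etc/hosts");
            // Submit an asynchronous read against the open stream, mimicking
            // a readahead request that is "in flight".
            pool.submit(() -> {
                try {
                    Thread.sleep(100); // the request is still pending...
                    in.read();         // ...and runs after the close below
                } catch (Exception e) {
                    // Fails once the stream is closed, analogous to EBADF.
                    System.out.println("Async read failed: " + e);
                }
            });
            in.close();                // closed while the request is in flight
            pool.shutdown();
        }
    }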


On Wed, Jul 10, 2013 at 11:50 AM, Colin McCabe <cm...@alumni.cmu.edu> wrote:
That's just a warning message.  It's not causing your problem-- it's just a symptom.

You will have to find out why the MR job failed.

best,
Colin


On Wed, Jul 10, 2013 at 8:19 AM, Sanjay Subramanian <Sa...@wizecommerce.com> wrote:
2013-07-10 07:11:50,131 WARN [Readahead Thread #1] org.apache.hadoop.io.ReadaheadPool: Failed readahead on ifile
EBADF: Bad file descriptor
at org.apache.hadoop.io.nativeio.NativeIO.posix_fadvise(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO.posixFadviseIfPossible(NativeIO.java:145)
at org.apache.hadoop.io.ReadaheadPool$ReadaheadRequestImpl.run(ReadaheadPool.java:205)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

Hi

I have an Oozie workflow that runs an MR job, and for the past two days I have been getting this error in one of the MR jobs being processed.
However, if I run it again it succeeds :-(  but about 1 hr is wasted in the process.

Any clues?

Or should I post this issue on the Oozie list?

Thanks

sanjay

Configuration
Name                                              Value
impression.log.record.cached.tag                  cached=
impression.log.record.end.tag                     [end
impressions.mapreduce.conf.file.full.path         /workflows/impressions/config/aggregations.conf
mapred.job.queue.name                             default
mapred.mapper.new-api                             true
mapred.reducer.new-api                            true
mapreduce.input.fileinputformat.inputdir          /data/input/impressionlogs/outpdirlogs/9999-99-99
mapreduce.job.inputformat.class                   com.wizecommerce.utils.mapred.ZipMultipleLineRecordInputFormat
mapreduce.job.map.class                           com.wizecommerce.parser.mapred.OutpdirImpressionLogMapper
mapreduce.job.maps                                500
mapreduce.job.name                                OutpdirImpressions_0000475-130611151004460-oozie-oozi-W
mapreduce.job.output.value.class                  org.apache.hadoop.io.Text
mapreduce.job.outputformat.class                  com.wizecommerce.utils.mapred.NextagTextOutputFormat
mapreduce.job.reduce.class                        com.wizecommerce.parser.mapred.OutpdirImpressionLogReducer
mapreduce.job.reduces                             8
mapreduce.map.output.compress                     true
mapreduce.map.output.compress.codec               org.apache.hadoop.io.compress.SnappyCodec
mapreduce.map.output.key.class                    org.apache.hadoop.io.Text
mapreduce.map.output.value.class                  com.wizecommerce.parser.dao.OutpdirLogRecord
mapreduce.output.fileoutputformat.compress        true
mapreduce.output.fileoutputformat.compress.codec  com.hadoop.compression.lzo.LzopCodec
mapreduce.output.fileoutputformat.outputdir       /data/output/impressions/outpdir/9999-99-99/0000475-130611151004460-oozie-oozi-W/outpdir_impressions_ptitle
mapreduce.tasktracker.map.tasks.maximum           12
mapreduce.tasktracker.reduce.tasks.maximum        8
outpdir.log.exclude.processing.datatypes          header,sellerhidden
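
For reference, a minimal driver sketch (hypothetical; it mirrors only a few of
the values listed above and assumes the standard Hadoop client libraries):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;

    public class DriverSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Values mirror the job configuration table above.
            conf.set("mapreduce.map.output.compress", "true");
            conf.set("mapreduce.map.output.compress.codec",
                     "org.apache.hadoop.io.compress.SnappyCodec");
            Job job = Job.getInstance(conf, "OutpdirImpressions");
            job.setNumReduceTasks(8);
            job.setOutputValueClass(Text.class);
            // The custom com.wizecommerce input/output formats and the
            // mapper/reducer classes would be registered here via
            // job.setInputFormatClass(...), job.setMapperClass(...), etc.
        }
    }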

Re: EBADF: Bad file descriptor

Posted by Colin McCabe <cm...@alumni.cmu.edu>.
To clarify a little bit, the readahead pool can sometimes spit out this
message if you close a file while a readahead request is in flight.  It's
not an error and just reflects the fact that the file was closed hastily,
probably because of some other bug which is the real problem.

Colin


On Wed, Jul 10, 2013 at 11:50 AM, Colin McCabe <cm...@alumni.cmu.edu> wrote:

> That's just a warning message.  It's not causing your problem-- it's just
> a symptom.
>
> You will have to find out why the MR job failed.
>
> best,
> Colin
>
>
> On Wed, Jul 10, 2013 at 8:19 AM, Sanjay Subramanian <
> Sanjay.Subramanian@wizecommerce.com> wrote:
>
>>  2013-07-10 07:11:50,131 WARN [Readahead Thread #1]
>> org.apache.hadoop.io.ReadaheadPool: Failed readahead on ifile
>> EBADF: Bad file descriptor
>> at org.apache.hadoop.io.nativeio.NativeIO.posix_fadvise(Native Method)
>> at
>> org.apache.hadoop.io.nativeio.NativeIO.posixFadviseIfPossible(NativeIO.java:145)
>> at
>> org.apache.hadoop.io.ReadaheadPool$ReadaheadRequestImpl.run(ReadaheadPool.java:205)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>> at java.lang.Thread.run(Thread.java:662)
>>
>>  Hi
>>
>>  I have an Oozie workflow that runs an MR job, and for the past two days I
>> have been getting this error in one of the MR jobs being processed.
>> However, if I run it again it succeeds :-(  but about 1 hr is wasted in
>> the process.
>>
>>  Any clues?
>>
>>  Or should I post this issue on the Oozie list?
>>
>>  Thanks
>>
>>  sanjay
>>
>> Configuration
>> Name                                              Value
>> impression.log.record.cached.tag                  cached=
>> impression.log.record.end.tag                     [end
>> impressions.mapreduce.conf.file.full.path         /workflows/impressions/config/aggregations.conf
>> mapred.job.queue.name                             default
>> mapred.mapper.new-api                             true
>> mapred.reducer.new-api                            true
>> mapreduce.input.fileinputformat.inputdir          /data/input/impressionlogs/outpdirlogs/9999-99-99
>> mapreduce.job.inputformat.class                   com.wizecommerce.utils.mapred.ZipMultipleLineRecordInputFormat
>> mapreduce.job.map.class                           com.wizecommerce.parser.mapred.OutpdirImpressionLogMapper
>> mapreduce.job.maps                                500
>> mapreduce.job.name                                OutpdirImpressions_0000475-130611151004460-oozie-oozi-W
>> mapreduce.job.output.value.class                  org.apache.hadoop.io.Text
>> mapreduce.job.outputformat.class                  com.wizecommerce.utils.mapred.NextagTextOutputFormat
>> mapreduce.job.reduce.class                        com.wizecommerce.parser.mapred.OutpdirImpressionLogReducer
>> mapreduce.job.reduces                             8
>> mapreduce.map.output.compress                     true
>> mapreduce.map.output.compress.codec               org.apache.hadoop.io.compress.SnappyCodec
>> mapreduce.map.output.key.class                    org.apache.hadoop.io.Text
>> mapreduce.map.output.value.class                  com.wizecommerce.parser.dao.OutpdirLogRecord
>> mapreduce.output.fileoutputformat.compress        true
>> mapreduce.output.fileoutputformat.compress.codec  com.hadoop.compression.lzo.LzopCodec
>> mapreduce.output.fileoutputformat.outputdir       /data/output/impressions/outpdir/9999-99-99/0000475-130611151004460-oozie-oozi-W/outpdir_impressions_ptitle
>> mapreduce.tasktracker.map.tasks.maximum           12
>> mapreduce.tasktracker.reduce.tasks.maximum        8
>> outpdir.log.exclude.processing.datatypes          header,sellerhidden

Re: EBADF: Bad file descriptor

Posted by Colin McCabe <cm...@alumni.cmu.edu>.
That's just a warning message.  It's not causing your problem-- it's just a
symptom.

You will have to find out why the MR job failed.

best,
Colin


On Wed, Jul 10, 2013 at 8:19 AM, Sanjay Subramanian <
Sanjay.Subramanian@wizecommerce.com> wrote:

>  2013-07-10 07:11:50,131 WARN [Readahead Thread #1]
> org.apache.hadoop.io.ReadaheadPool: Failed readahead on ifile
> EBADF: Bad file descriptor
> at org.apache.hadoop.io.nativeio.NativeIO.posix_fadvise(Native Method)
> at
> org.apache.hadoop.io.nativeio.NativeIO.posixFadviseIfPossible(NativeIO.java:145)
> at
> org.apache.hadoop.io.ReadaheadPool$ReadaheadRequestImpl.run(ReadaheadPool.java:205)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
>
>  Hi
>
>  I have an Oozie workflow that runs an MR job, and for the past two days I
> have been getting this error in one of the MR jobs being processed.
> However, if I run it again it succeeds :-(  but about 1 hr is wasted in
> the process.
>
>  Any clues?
>
>  Or should I post this issue on the Oozie list?
>
>  Thanks
>
>  sanjay
>
> Configuration
> Name                                              Value
> impression.log.record.cached.tag                  cached=
> impression.log.record.end.tag                     [end
> impressions.mapreduce.conf.file.full.path         /workflows/impressions/config/aggregations.conf
> mapred.job.queue.name                             default
> mapred.mapper.new-api                             true
> mapred.reducer.new-api                            true
> mapreduce.input.fileinputformat.inputdir          /data/input/impressionlogs/outpdirlogs/9999-99-99
> mapreduce.job.inputformat.class                   com.wizecommerce.utils.mapred.ZipMultipleLineRecordInputFormat
> mapreduce.job.map.class                           com.wizecommerce.parser.mapred.OutpdirImpressionLogMapper
> mapreduce.job.maps                                500
> mapreduce.job.name                                OutpdirImpressions_0000475-130611151004460-oozie-oozi-W
> mapreduce.job.output.value.class                  org.apache.hadoop.io.Text
> mapreduce.job.outputformat.class                  com.wizecommerce.utils.mapred.NextagTextOutputFormat
> mapreduce.job.reduce.class                        com.wizecommerce.parser.mapred.OutpdirImpressionLogReducer
> mapreduce.job.reduces                             8
> mapreduce.map.output.compress                     true
> mapreduce.map.output.compress.codec               org.apache.hadoop.io.compress.SnappyCodec
> mapreduce.map.output.key.class                    org.apache.hadoop.io.Text
> mapreduce.map.output.value.class                  com.wizecommerce.parser.dao.OutpdirLogRecord
> mapreduce.output.fileoutputformat.compress        true
> mapreduce.output.fileoutputformat.compress.codec  com.hadoop.compression.lzo.LzopCodec
> mapreduce.output.fileoutputformat.outputdir       /data/output/impressions/outpdir/9999-99-99/0000475-130611151004460-oozie-oozi-W/outpdir_impressions_ptitle
> mapreduce.tasktracker.map.tasks.maximum           12
> mapreduce.tasktracker.reduce.tasks.maximum        8
> outpdir.log.exclude.processing.datatypes          header,sellerhidden
>
>
>
> CONFIDENTIALITY NOTICE
> ======================
> This email message and any attachments are for the exclusive use of the
> intended recipient(s) and may contain confidential and privileged
> information. Any unauthorized review, use, disclosure or distribution is
> prohibited. If you are not the intended recipient, please contact the
> sender by reply email and destroy all copies of the original message along
> with any attachments, from your computer system. If you are the intended
> recipient, please be advised that the content of this message is subject to
> access, review and disclosure by the sender's Email System Administrator.
>

Re: EBADF: Bad file descriptor

Posted by Colin McCabe <cm...@alumni.cmu.edu>.
That's just a warning message.  It's not causing your problem-- it's just a
symptom.

You will have to find out why the MR job failed.

best,
Colin


On Wed, Jul 10, 2013 at 8:19 AM, Sanjay Subramanian <
Sanjay.Subramanian@wizecommerce.com> wrote:

>  2013-07-10 07:11:50,131 WARN [Readahead Thread #1]
> org.apache.hadoop.io.ReadaheadPool: Failed readahead on ifile
> EBADF: Bad file descriptor
> at org.apache.hadoop.io.nativeio.NativeIO.posix_fadvise(Native Method)
> at
> org.apache.hadoop.io.nativeio.NativeIO.posixFadviseIfPossible(NativeIO.java:145)
> at
> org.apache.hadoop.io.ReadaheadPool$ReadaheadRequestImpl.run(ReadaheadPool.java:205)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
>
>  Hi
>
>  I have a Oozie workflow that runs a MR job and I have started getting
> this error past two days in one of the MR jobs that is being processed.
> However if I run it again , it succeeds :-(  but about 1 hr is wasted in
> the process.
>
>  Any clues ?
>
>  Or should I post this issue in the Oozie postings ?
>
>  Thanks
>
>  sanjay
>
>    Configuration   Name Value   impression.log.record.cached.tag cached=
> impression.log.record.end.tag [end
> impressions.mapreduce.conf.file.full.path
> /workflows/impressions/config/aggregations.conf<http://thv01:8888/filebrowser/view/workflows/impressions/config/aggregations.conf>
> mapred.job.queue.name default  mapred.mapper.new-api true
> mapred.reducer.new-api true  mapreduce.input.fileinputformat.inputdir
> /data/input/impressionlogs/outpdirlogs/9999-99-99<http://thv01:8888/filebrowser/view/data/input/impressionlogs/outpdirlogs/9999-99-99>
> mapreduce.job.inputformat.class
> com.wizecommerce.utils.mapred.ZipMultipleLineRecordInputFormat
> mapreduce.job.map.class
> com.wizecommerce.parser.mapred.OutpdirImpressionLogMapper
> mapreduce.job.maps 500  mapreduce.job.name
> OutpdirImpressions_0000475-130611151004460-oozie-oozi-W
> mapreduce.job.output.value.class org.apache.hadoop.io.Text
> mapreduce.job.outputformat.class
> com.wizecommerce.utils.mapred.NextagTextOutputFormat
> mapreduce.job.reduce.class
> com.wizecommerce.parser.mapred.OutpdirImpressionLogReducer
> mapreduce.job.reduces 8  mapreduce.map.output.compress true
> mapreduce.map.output.compress.codec
> org.apache.hadoop.io.compress.SnappyCodec  mapreduce.map.output.key.class
> org.apache.hadoop.io.Textorg.apache.hadoop.io.Text
> mapreduce.map.output.value.class
> com.wizecommerce.parser.dao.OutpdirLogRecord
> mapreduce.output.fileoutputformat.compress true
> mapreduce.output.fileoutputformat.compress.codec
> com.hadoop.compression.lzo.LzopCodec
> mapreduce.output.fileoutputformat.outputdir
> /data/output/impressions/outpdir/9999-99-99/0000475-130611151004460-oozie-oozi-W/outpdir_impressions_ptitle<http://thv01:8888/filebrowser/view/data/output/impressions/outpdir/9999-99-99/0000475-130611151004460-oozie-oozi-W/outpdir_impressions_ptitle>
> mapreduce.tasktracker.map.tasks.maximum 12
> mapreduce.tasktracker.reduce.tasks.maximum 8
> outpdir.log.exclude.processing.datatypes header,sellerhidden
>
>
>
> CONFIDENTIALITY NOTICE
> ======================
> This email message and any attachments are for the exclusive use of the
> intended recipient(s) and may contain confidential and privileged
> information. Any unauthorized review, use, disclosure or distribution is
> prohibited. If you are not the intended recipient, please contact the
> sender by reply email and destroy all copies of the original message along
> with any attachments, from your computer system. If you are the intended
> recipient, please be advised that the content of this message is subject to
> access, review and disclosure by the sender's Email System Administrator.
>

Re: EBADF: Bad file descriptor

Posted by Colin McCabe <cm...@alumni.cmu.edu>.
That's just a warning message.  It's not causing your problem-- it's just a
symptom.

You will have to find out why the MR job failed.

best,
Colin


On Wed, Jul 10, 2013 at 8:19 AM, Sanjay Subramanian <
Sanjay.Subramanian@wizecommerce.com> wrote:

>  2013-07-10 07:11:50,131 WARN [Readahead Thread #1]
> org.apache.hadoop.io.ReadaheadPool: Failed readahead on ifile
> EBADF: Bad file descriptor
> at org.apache.hadoop.io.nativeio.NativeIO.posix_fadvise(Native Method)
> at
> org.apache.hadoop.io.nativeio.NativeIO.posixFadviseIfPossible(NativeIO.java:145)
> at
> org.apache.hadoop.io.ReadaheadPool$ReadaheadRequestImpl.run(ReadaheadPool.java:205)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
>
>  Hi
>
>  I have a Oozie workflow that runs a MR job and I have started getting
> this error past two days in one of the MR jobs that is being processed.
> However if I run it again , it succeeds :-(  but about 1 hr is wasted in
> the process.
>
>  Any clues ?
>
>  Or should I post this issue in the Oozie postings ?
>
>  Thanks
>
>  sanjay
>
>    Configuration   Name Value   impression.log.record.cached.tag cached=
> impression.log.record.end.tag [end
> impressions.mapreduce.conf.file.full.path
> /workflows/impressions/config/aggregations.conf<http://thv01:8888/filebrowser/view/workflows/impressions/config/aggregations.conf>
> mapred.job.queue.name default  mapred.mapper.new-api true
> mapred.reducer.new-api true  mapreduce.input.fileinputformat.inputdir
> /data/input/impressionlogs/outpdirlogs/9999-99-99<http://thv01:8888/filebrowser/view/data/input/impressionlogs/outpdirlogs/9999-99-99>
> mapreduce.job.inputformat.class
> com.wizecommerce.utils.mapred.ZipMultipleLineRecordInputFormat
> mapreduce.job.map.class
> com.wizecommerce.parser.mapred.OutpdirImpressionLogMapper
> mapreduce.job.maps 500  mapreduce.job.name
> OutpdirImpressions_0000475-130611151004460-oozie-oozi-W
> mapreduce.job.output.value.class org.apache.hadoop.io.Text
> mapreduce.job.outputformat.class
> com.wizecommerce.utils.mapred.NextagTextOutputFormat
> mapreduce.job.reduce.class
> com.wizecommerce.parser.mapred.OutpdirImpressionLogReducer
> mapreduce.job.reduces 8  mapreduce.map.output.compress true
> mapreduce.map.output.compress.codec
> org.apache.hadoop.io.compress.SnappyCodec  mapreduce.map.output.key.class
> org.apache.hadoop.io.Textorg.apache.hadoop.io.Text
> mapreduce.map.output.value.class
> com.wizecommerce.parser.dao.OutpdirLogRecord
> mapreduce.output.fileoutputformat.compress true
> mapreduce.output.fileoutputformat.compress.codec
> com.hadoop.compression.lzo.LzopCodec
> mapreduce.output.fileoutputformat.outputdir
> /data/output/impressions/outpdir/9999-99-99/0000475-130611151004460-oozie-oozi-W/outpdir_impressions_ptitle<http://thv01:8888/filebrowser/view/data/output/impressions/outpdir/9999-99-99/0000475-130611151004460-oozie-oozi-W/outpdir_impressions_ptitle>
> mapreduce.tasktracker.map.tasks.maximum 12
> mapreduce.tasktracker.reduce.tasks.maximum 8
> outpdir.log.exclude.processing.datatypes header,sellerhidden
>
>
>
> CONFIDENTIALITY NOTICE
> ======================
> This email message and any attachments are for the exclusive use of the
> intended recipient(s) and may contain confidential and privileged
> information. Any unauthorized review, use, disclosure or distribution is
> prohibited. If you are not the intended recipient, please contact the
> sender by reply email and destroy all copies of the original message along
> with any attachments, from your computer system. If you are the intended
> recipient, please be advised that the content of this message is subject to
> access, review and disclosure by the sender's Email System Administrator.
>

Re: EBADF: Bad file descriptor

Posted by Colin McCabe <cm...@alumni.cmu.edu>.
That's just a warning message.  It's not causing your problem-- it's just a
symptom.

You will have to find out why the MR job failed.

best,
Colin


On Wed, Jul 10, 2013 at 8:19 AM, Sanjay Subramanian <
Sanjay.Subramanian@wizecommerce.com> wrote:

>  2013-07-10 07:11:50,131 WARN [Readahead Thread #1]
> org.apache.hadoop.io.ReadaheadPool: Failed readahead on ifile
> EBADF: Bad file descriptor
> at org.apache.hadoop.io.nativeio.NativeIO.posix_fadvise(Native Method)
> at
> org.apache.hadoop.io.nativeio.NativeIO.posixFadviseIfPossible(NativeIO.java:145)
> at
> org.apache.hadoop.io.ReadaheadPool$ReadaheadRequestImpl.run(ReadaheadPool.java:205)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
>
>  Hi
>
>  I have a Oozie workflow that runs a MR job and I have started getting
> this error past two days in one of the MR jobs that is being processed.
> However if I run it again , it succeeds :-(  but about 1 hr is wasted in
> the process.
>
>  Any clues ?
>
>  Or should I post this issue in the Oozie postings ?
>
>  Thanks
>
>  sanjay
>
>    Configuration
>    Name                                                Value
>    impression.log.record.cached.tag                    cached=
>    impression.log.record.end.tag                       [end
>    impressions.mapreduce.conf.file.full.path           /workflows/impressions/config/aggregations.conf
>    mapred.job.queue.name                               default
>    mapred.mapper.new-api                               true
>    mapred.reducer.new-api                              true
>    mapreduce.input.fileinputformat.inputdir            /data/input/impressionlogs/outpdirlogs/9999-99-99
>    mapreduce.job.inputformat.class                     com.wizecommerce.utils.mapred.ZipMultipleLineRecordInputFormat
>    mapreduce.job.map.class                             com.wizecommerce.parser.mapred.OutpdirImpressionLogMapper
>    mapreduce.job.maps                                  500
>    mapreduce.job.name                                  OutpdirImpressions_0000475-130611151004460-oozie-oozi-W
>    mapreduce.job.output.value.class                    org.apache.hadoop.io.Text
>    mapreduce.job.outputformat.class                    com.wizecommerce.utils.mapred.NextagTextOutputFormat
>    mapreduce.job.reduce.class                          com.wizecommerce.parser.mapred.OutpdirImpressionLogReducer
>    mapreduce.job.reduces                               8
>    mapreduce.map.output.compress                       true
>    mapreduce.map.output.compress.codec                 org.apache.hadoop.io.compress.SnappyCodec
>    mapreduce.map.output.key.class                      org.apache.hadoop.io.Text
>    mapreduce.map.output.value.class                    com.wizecommerce.parser.dao.OutpdirLogRecord
>    mapreduce.output.fileoutputformat.compress          true
>    mapreduce.output.fileoutputformat.compress.codec    com.hadoop.compression.lzo.LzopCodec
>    mapreduce.output.fileoutputformat.outputdir         /data/output/impressions/outpdir/9999-99-99/0000475-130611151004460-oozie-oozi-W/outpdir_impressions_ptitle
>    mapreduce.tasktracker.map.tasks.maximum             12
>    mapreduce.tasktracker.reduce.tasks.maximum          8
>    outpdir.log.exclude.processing.datatypes            header,sellerhidden
>

Re: ConnectionException in container, happens only sometimes

Posted by Andrei <fa...@gmail.com>.
Here are logs of RM and 2 NMs:

RM (master-host): http://pastebin.com/q4qJP8Ld
NM where AM ran (slave-1-host): http://pastebin.com/vSsz7mjG
NM where slave container ran (slave-2-host): http://pastebin.com/NMFi6gRp

The only related error I've found in them is the following (from RM logs):

...
2013-07-11 07:46:06,225 ERROR
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService:
AppAttemptId doesnt exist in cache appattempt_1373465780870_0005_000001
2013-07-11 07:46:06,227 WARN org.apache.hadoop.ipc.Server: IPC Server
Responder, call org.apache.hadoop.yarn.api.AMRMProtocolPB.allocate from
10.128.40.184:47101: output error
2013-07-11 07:46:06,228 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 0 on 8030 caught an exception
java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:456)
at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2140)
at org.apache.hadoop.ipc.Server.access$2000(Server.java:108)
at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:939)
at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1005)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1747)
2013-07-11 07:46:11,238 INFO org.apache.hadoop.yarn.util.RackResolver:
Resolved my_user to /default-rack
2013-07-11 07:46:11,283 INFO
org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService:
NodeManager from node my_user(cmPort: 59267 httpPort: 8042) registered with
capability: 8192, assigned nodeId my_user:59267
...

Though from the stack trace it's hard to tell where this error came from.

Let me know if you need any more information.

On Thu, Jul 11, 2013 at 1:00 AM, Andrei <fa...@gmail.com> wrote:

> Hi Omkar,
>
> I'm out of office now, so I'll post it as soon as I get back there.
>
> Thanks
>
>
> On Thu, Jul 11, 2013 at 12:39 AM, Omkar Joshi <oj...@hortonworks.com> wrote:
>
>> Can you post RM/NM logs too?
>>
>> Thanks,
>> Omkar Joshi
>> *Hortonworks Inc.* <http://www.hortonworks.com>
>>
>>
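
A note on the ClosedChannelException in the RM log quoted above: the sun.nio.ch.SocketChannelImpl.ensureWriteOpen frame means the IPC Responder tried to write a reply on a connection that had already been closed -- consistent with the AM attempt being gone by then (hence "AppAttemptId doesnt exist in cache"). A contrived, self-contained Java sketch that raises the same exception from the same frame:

    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.ServerSocketChannel;
    import java.nio.channels.SocketChannel;

    public class ClosedChannelDemo {
      public static void main(String[] args) throws Exception {
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(0));           // any free port
        SocketChannel client = SocketChannel.open(server.getLocalAddress());
        SocketChannel accepted = server.accept();
        accepted.close();                                // connection torn down first...
        // ...then a late write, as in Server$Responder.processResponse:
        accepted.write(ByteBuffer.wrap(new byte[8]));    // ClosedChannelException
      }
    }

So the exception itself is secondary; the interesting question is why the AM attempt disappeared from the RM's cache in the first place.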


Re: ConnectionException in container, happens only sometimes

Posted by Andrei <fa...@gmail.com>.
Hi Omkar,

I'm out of office now, so I'll post it as soon as I get back there.

Thanks


On Thu, Jul 11, 2013 at 12:39 AM, Omkar Joshi <oj...@hortonworks.com> wrote:

> Can you post RM/NM logs too?
>
> Thanks,
> Omkar Joshi
> *Hortonworks Inc.* <http://www.hortonworks.com>
>
>


Re: ConnectionException in container, happens only sometimes

Posted by Omkar Joshi <oj...@hortonworks.com>.
Can you post RM/NM logs too?

Thanks,
Omkar Joshi
*Hortonworks Inc.* <http://www.hortonworks.com>


On Wed, Jul 10, 2013 at 6:42 AM, Andrei <fa...@gmail.com> wrote:

> If it helps, full log of AM can be found here: http://pastebin.com/zXTabyvv
>
>
> On Wed, Jul 10, 2013 at 4:21 PM, Andrei <fa...@gmail.com> wrote:
>
>> Hi Devaraj,
>>
>> thanks for your answer. Yes, I suspected it could be because of host
>> mapping, so I have already checked (and have just re-checked) settings in
>> /etc/hosts of each machine, and they all are ok. I use both fully-qualified
>> names (e.g. `master-host.company.com`) and their shortcuts (e.g.
>> `master-host`), so it shouldn't depend on notation either.
>>
>> I have also checked AM syslog. There's nothing about network, but there
>> are several messages like the following:
>>
>> ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Container complete event for unknown container id container_1373460572360_0001_01_000088
>>
>>
>> I understand the container just doesn't get registered in the AM (probably
>> because of the same issue), is that correct? So I wonder who sends the
>> "container complete event" to the ApplicationMaster?
>>
>> On Wed, Jul 10, 2013 at 3:19 PM, Devaraj k <de...@huawei.com> wrote:
>>
>>>  >1. I assume this is the task (container) that tries to establish
>>> connection, but what it wants to connect to?
>>>
>>> It is trying to connect to MRAppMaster for executing the actual task.
>>>
>>>  >2. Why this error happens and how can I fix it?
>>>
>>> It seems Container is not getting the correct MRAppMaster address due to
>>> some reason or AM is crashing before giving the task to Container. Probably
>>> it is coming due to invalid host mapping.  Can you check the host mapping
>>> is proper in both the machines and also check the AM log that time for any
>>> clue.
>>>
>>> Thanks
>>> Devaraj k
>>>
>>> From: Andrei [mailto:faithlessfriend@gmail.com]
>>> Sent: 10 July 2013 17:32
>>> To: user@hadoop.apache.org
>>> Subject: ConnectionException in container, happens only sometimes
>>>
>>> Hi,
>>>
>>> I'm running CDH4.3 installation of Hadoop with the following simple
>>> setup:
>>>
>>> master-host: runs NameNode, ResourceManager and JobHistoryServer
>>> slave-1-host and slave-2-host: DataNodes and NodeManagers.
>>>
>>> When I run simple MapReduce job (both - using streaming API or Pi
>>> example from distribution) on client I see that some tasks fail:
>>>
>>> 13/07/10 14:40:10 INFO mapreduce.Job:  map 60% reduce 0%
>>> 13/07/10 14:40:14 INFO mapreduce.Job: Task Id :
>>> attempt_1373454026937_0005_m_000003_0, Status : FAILED
>>> 13/07/10 14:40:14 INFO mapreduce.Job: Task Id :
>>> attempt_1373454026937_0005_m_000005_0, Status : FAILED
>>> ...
>>> 13/07/10 14:40:23 INFO mapreduce.Job:  map 60% reduce 20%
>>> ...
>>>
>>> Every time different set of tasks/attempts fails. In some cases number
>>> of failed attempts becomes critical, and the whole job fails, in other
>>> cases job is finished successfully. I can't see any dependency, but I
>>> noticed the following.
>>>
>>> Let's say, ApplicationMaster runs on _slave-1-host_. In this case on
>>> _slave-2-host_ there will be corresponding syslog with the following
>>> contents:
>>>
>>> ...
>>> 2013-07-10 11:06:10,986 INFO [main] org.apache.hadoop.ipc.Client:
>>> Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried
>>> 0 time(s); retry policy is
>>> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
>>> 2013-07-10 11:06:11,989 INFO [main] org.apache.hadoop.ipc.Client:
>>> Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried
>>> 1 time(s); retry policy is
>>> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
>>> ...
>>> 2013-07-10 11:06:20,013 INFO [main] org.apache.hadoop.ipc.Client:
>>> Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried
>>> 9 time(s); retry policy is
>>> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
>>> 2013-07-10 11:06:20,019 WARN [main] org.apache.hadoop.mapred.YarnChild:
>>> Exception running child : java.net.ConnectException: Call From slave-2-host/
>>> 127.0.0.1 to slave-2-host:11812 failed on connection exception:
>>> java.net.ConnectException: Connection refused; For more details see:
>>> http://wiki.apache.org/hadoop/ConnectionRefused
>>>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>         at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>>         at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:782)
>>>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:729)
>>>         at org.apache.hadoop.ipc.Client.call(Client.java:1229)
>>>         at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:225)
>>>         at com.sun.proxy.$Proxy6.getTask(Unknown Source)
>>>         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:131)
>>> Caused by: java.net.ConnectException: Connection refused
>>>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>>         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:708)
>>>         at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:207)
>>>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:528)
>>>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:492)
>>>         at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:499)
>>>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:593)
>>>         at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:241)
>>>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1278)
>>>         at org.apache.hadoop.ipc.Client.call(Client.java:1196)
>>>         ... 3 more
>>>
>>> Notice several things:
>>>
>>> 1. This exception always happens on the different host than
>>> ApplicationMaster runs on.
>>> 2. It always tries to connect to localhost, not other host in cluster.
>>> 3. Port number (11812 in this case) is always different.
>>>
>>> My questions are:
>>>
>>> 1. I assume this is the task (container) that tries to establish
>>> connection, but what it wants to connect to?
>>> 2. Why this error happens and how can I fix it?
>>>
>>> Any suggestions are welcome.
>>>
>>> Thanks,
>>> Andrei
>>>
>>
>>
>
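
One detail in the task log quoted above is worth spelling out: "slave-2-host/127.0.0.1:11812" means the JVM resolved the hostname slave-2-host to 127.0.0.1, so the task connects to its own loopback interface, where no AM is listening, and the connection is refused. That fits Devaraj's host-mapping theory. The classic cause is an /etc/hosts line that binds the machine's own hostname to the loopback address. A hypothetical illustration (the IP addresses are made up):

    # Problematic: the node's hostname resolves to loopback, so any
    # address advertised by this name comes out as 127.0.0.1.
    127.0.0.1   localhost slave-2-host slave-2-host.company.com

    # Safer: loopback only for localhost, real interface addresses
    # for every node of the cluster, on every node.
    127.0.0.1   localhost
    10.0.0.10   master-host.company.com   master-host
    10.0.0.11   slave-1-host.company.com  slave-1-host
    10.0.0.12   slave-2-host.company.com  slave-2-host

If the first layout is present on any node, checking `hostname -f` and `getent hosts $(hostname)` on each machine is a quick way to confirm what the JVM will actually see.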


EBADF: Bad file descriptor

Posted by Sanjay Subramanian <Sa...@wizecommerce.com>.
2013-07-10 07:11:50,131 WARN [Readahead Thread #1] org.apache.hadoop.io.ReadaheadPool: Failed readahead on ifile
EBADF: Bad file descriptor
at org.apache.hadoop.io.nativeio.NativeIO.posix_fadvise(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO.posixFadviseIfPossible(NativeIO.java:145)
at org.apache.hadoop.io.ReadaheadPool$ReadaheadRequestImpl.run(ReadaheadPool.java:205)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

Hi

I have an Oozie workflow that runs an MR job, and for the past two days I have started getting this error in one of the MR jobs being processed.
However, if I run it again, it succeeds :-(  but about 1 hr is wasted in the process.

Any clues?

Or should I post this issue in the Oozie postings?

Thanks

sanjay

Configuration
Name                                                Value
impression.log.record.cached.tag                    cached=
impression.log.record.end.tag                       [end
impressions.mapreduce.conf.file.full.path           /workflows/impressions/config/aggregations.conf
mapred.job.queue.name                               default
mapred.mapper.new-api                               true
mapred.reducer.new-api                              true
mapreduce.input.fileinputformat.inputdir            /data/input/impressionlogs/outpdirlogs/9999-99-99
mapreduce.job.inputformat.class                     com.wizecommerce.utils.mapred.ZipMultipleLineRecordInputFormat
mapreduce.job.map.class                             com.wizecommerce.parser.mapred.OutpdirImpressionLogMapper
mapreduce.job.maps                                  500
mapreduce.job.name                                  OutpdirImpressions_0000475-130611151004460-oozie-oozi-W
mapreduce.job.output.value.class                    org.apache.hadoop.io.Text
mapreduce.job.outputformat.class                    com.wizecommerce.utils.mapred.NextagTextOutputFormat
mapreduce.job.reduce.class                          com.wizecommerce.parser.mapred.OutpdirImpressionLogReducer
mapreduce.job.reduces                               8
mapreduce.map.output.compress                       true
mapreduce.map.output.compress.codec                 org.apache.hadoop.io.compress.SnappyCodec
mapreduce.map.output.key.class                      org.apache.hadoop.io.Text
mapreduce.map.output.value.class                    com.wizecommerce.parser.dao.OutpdirLogRecord
mapreduce.output.fileoutputformat.compress          true
mapreduce.output.fileoutputformat.compress.codec    com.hadoop.compression.lzo.LzopCodec
mapreduce.output.fileoutputformat.outputdir         /data/output/impressions/outpdir/9999-99-99/0000475-130611151004460-oozie-oozi-W/outpdir_impressions_ptitle
mapreduce.tasktracker.map.tasks.maximum             12
mapreduce.tasktracker.reduce.tasks.maximum          8
outpdir.log.exclude.processing.datatypes            header,sellerhidden
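
As Colin points out in the reply above, the readahead warning is benign in itself. Loosely, what happens is a race: ReadaheadPool queues an asynchronous posix_fadvise() on a raw file descriptor, and if the stream owning that descriptor is closed before the pooled thread gets to run (for example, because the task is already tearing down after a failure), the native call sees a stale descriptor and returns EBADF. A contrived plain-Java sketch of the same shape of race -- not Hadoop's actual code path:

    import java.io.FileDescriptor;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ReadaheadRace {
      public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        FileInputStream in = new FileInputStream("/etc/hosts");
        final FileDescriptor fd = in.getFD();
        pool.submit(new Runnable() {
          public void run() {
            try {
              Thread.sleep(100);               // the background task runs late
              new FileInputStream(fd).read();  // fd is closed by now: IOException,
                                               // analogous to fadvise hitting EBADF
            } catch (IOException e) {
              System.out.println("harmless, like the ReadaheadPool warning: " + e);
            } catch (InterruptedException ignored) {
            }
          }
        });
        in.close();   // stream closed before the background task touches the fd
        pool.shutdown();
      }
    }

So the EBADF line is a symptom of the task shutting down mid-I/O; the real failure is whatever killed the task or job first.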






Re: ConnectionException in container, happens only sometimes

Posted by Andrei <fa...@gmail.com>.
If it helps, full log of AM can be found here: <http://pastebin.com/zXTabyvv>.


On Wed, Jul 10, 2013 at 4:21 PM, Andrei <fa...@gmail.com> wrote:

> Hi Devaraj,
>
> thanks for your answer. Yes, I suspected it could be because of host
> mapping, so I have already checked (and have just re-checked) settings in
> /etc/hosts of each machine, and they all are ok. I use both fully-qualified
> names (e.g. `master-host.company.com`) and their shortcuts (e.g.
> `master-host`), so it shouldn't depend on notation too.
>
> I have also checked AM syslog. There's nothing about network, but there
> are several messages like the following:
>
> ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Container complete event for unknown container id container_1373460572360_0001_01_000088
>
>
> I understand the container just doesn't get registered in the AM (probably
> because of the same issue); is that correct? So I wonder who sends the
> "container complete event" to the ApplicationMaster?
>
>
>
>
>
> On Wed, Jul 10, 2013 at 3:19 PM, Devaraj k <de...@huawei.com> wrote:
>
>>  >1. I assume this is the task (container) that tries to establish
>> connection, but what it wants to connect to?
>>
>> It is trying to connect to MRAppMaster for executing the actual task.
>>
>> >2. Why this error happens and how can I fix it?
>>
>> It seems Container is not getting the correct MRAppMaster address due to
>> some reason or AM is crashing before giving the task to Container. Probably
>> it is coming due to invalid host mapping.  Can you check the host mapping
>> is proper in both the machines and also check the AM log that time for any
>> clue.
>>
>> Thanks
>> Devaraj k
>>
>> From: Andrei [mailto:faithlessfriend@gmail.com]
>> Sent: 10 July 2013 17:32
>> To: user@hadoop.apache.org
>> Subject: ConnectionException in container, happens only sometimes
>>
>> Hi,
>>
>> I'm running CDH4.3 installation of Hadoop with the following simple
>> setup:
>>
>> master-host: runs NameNode, ResourceManager and JobHistoryServer
>> slave-1-host and slave-2-hosts: DataNodes and NodeManagers.
>>
>> When I run simple MapReduce job (both - using streaming API or Pi example
>> from distribution) on client I see that some tasks fail:
>>
>> 13/07/10 14:40:10 INFO mapreduce.Job:  map 60% reduce 0%
>> 13/07/10 14:40:14 INFO mapreduce.Job: Task Id :
>> attempt_1373454026937_0005_m_000003_0, Status : FAILED
>> 13/07/10 14:40:14 INFO mapreduce.Job: Task Id :
>> attempt_1373454026937_0005_m_000005_0, Status : FAILED
>> ...
>> 13/07/10 14:40:23 INFO mapreduce.Job:  map 60% reduce 20%
>> ...
>>
>> Every time different set of tasks/attempts fails. In some cases number of
>> failed attempts becomes critical, and the whole job fails, in other cases
>> job is finished successfully. I can't see any dependency, but I noticed the
>> following.
>>
>> Let's say, ApplicationMaster runs on _slave-1-host_. In this case on
>> _slave-2-host_ there will be corresponding syslog with the following
>> contents:
>>
>> ...
>> 2013-07-10 11:06:10,986 INFO [main] org.apache.hadoop.ipc.Client:
>> Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried
>> 0 time(s); retry policy is
>> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
>> 2013-07-10 11:06:11,989 INFO [main] org.apache.hadoop.ipc.Client:
>> Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried
>> 1 time(s); retry policy is
>> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
>> ...
>> 2013-07-10 11:06:20,013 INFO [main] org.apache.hadoop.ipc.Client:
>> Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried
>> 9 time(s); retry policy is
>> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
>> 2013-07-10 11:06:20,019 WARN [main] org.apache.hadoop.mapred.YarnChild:
>> Exception running child : java.net.ConnectException: Call From slave-2-host/
>> 127.0.0.1 to slave-2-host:11812 failed on connection exception:
>> java.net.ConnectException: Connection refused; For more details see:
>> http://wiki.apache.org/hadoop/ConnectionRefused
>>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>> Method)
>>         at
>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>         at
>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>         at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>         at
>> org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:782)
>>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:729)
>>         at org.apache.hadoop.ipc.Client.call(Client.java:1229)
>>         at
>> org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:225)
>>         at com.sun.proxy.$Proxy6.getTask(Unknown Source)
>>         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:131)
>> Caused by: java.net.ConnectException: Connection refused
>>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>         at
>> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:708)
>>         at
>> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:207)
>>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:528)
>>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:492)
>>         at
>> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:499)
>>         at
>> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:593)
>>         at
>> org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:241)
>>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1278)
>>         at org.apache.hadoop.ipc.Client.call(Client.java:1196)
>>         ... 3 more
>>
>> Notice several things:
>>
>> 1. This exception always happens on the different host than
>> ApplicationMaster runs on.
>> 2. It always tries to connect to localhost, not other host in cluster.
>> 3. Port number (11812 in this case) is always different.
>>
>> My questions are:
>>
>> 1. I assume this is the task (container) that tries to establish
>> connection, but what it wants to connect to?
>> 2. Why this error happens and how can I fix it?
>>
>> Any suggestions are welcome.
>>
>> Thanks,
>> Andrei
>>
>
>

Re: ConnectionException in container, happens only sometimes

Posted by Andrei <fa...@gmail.com>.
Hi Devaraj,

thanks for your answer. Yes, I suspected it could be because of host
mapping, so I have already checked (and have just re-checked) settings in
/etc/hosts of each machine, and they all are ok. I use both fully-qualified
names (e.g. `master-host.company.com`) and their shortcuts (e.g.
`master-host`), so it shouldn't depend on notation too.
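
For reference, the pattern I was checking for: the broken variant puts the node's own hostname on the loopback line, which makes that name resolve to 127.0.0.1 locally and would match the slave-2-host/127.0.0.1 resolution in the container log. The addresses below are illustrative, not the real ones:

    # broken: the node's own name resolves to loopback
    127.0.0.1    localhost slave-2-host slave-2-host.company.com

    # expected: loopback only for localhost, real names on the LAN address
    127.0.0.1    localhost
    192.168.1.11 slave-1-host.company.com slave-1-host
    192.168.1.12 slave-2-host.company.com slave-2-host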

I have also checked AM syslog. There's nothing about network, but there are
several messages like the following:

ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Container complete event for unknown container id container_1373460572360_0001_01_000088


I understand the container just doesn't get registered in the AM (probably
because of the same issue); is that correct? So I wonder who sends the
"container complete event" to the ApplicationMaster?
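
For context on that event flow (as far as I can tell from the YARN code, so treat this as an assumption): the NodeManager reports a container's exit to the ResourceManager, and the RM hands completed container statuses to the AM in the next heartbeat (allocate) response; RMContainerAllocator raises the error above when a reported id was never registered on its side. A minimal sketch of where an AM sees this, using the Hadoop 2.x AMRMClient API (the createAMRMClient factory is from later 2.x releases and may differ in CDH4.3):

    import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
    import org.apache.hadoop.yarn.api.records.ContainerStatus;
    import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class CompletedContainerProbe {
        public static void main(String[] args) throws Exception {
            AMRMClient<AMRMClient.ContainerRequest> rm = AMRMClient.createAMRMClient();
            rm.init(new YarnConfiguration());
            rm.start();
            rm.registerApplicationMaster("", 0, ""); // host, RPC port, tracking URL elided
            AllocateResponse response = rm.allocate(0.0f); // the AM -> RM heartbeat
            for (ContainerStatus status : response.getCompletedContainersStatuses()) {
                // Ids the AM never registered are what trigger the
                // "Container complete event for unknown container id" error.
                System.out.println(status.getContainerId() + " exited " + status.getExitStatus());
            }
            rm.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
            rm.stop();
        }
    }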





On Wed, Jul 10, 2013 at 3:19 PM, Devaraj k <de...@huawei.com> wrote:

>  >1. I assume this is the task (container) that tries to establish
> connection, but what it wants to connect to?
>
> It is trying to connect to MRAppMaster for executing the actual task.
>
> >2. Why this error happens and how can I fix it?
>
> It seems Container is not getting the correct MRAppMaster address due to
> some reason or AM is crashing before giving the task to Container. Probably
> it is coming due to invalid host mapping.  Can you check the host mapping
> is proper in both the machines and also check the AM log that time for any
> clue.
>
> Thanks
> Devaraj k
>
> From: Andrei [mailto:faithlessfriend@gmail.com]
> Sent: 10 July 2013 17:32
> To: user@hadoop.apache.org
> Subject: ConnectionException in container, happens only sometimes
>
> Hi,
>
> I'm running CDH4.3 installation of Hadoop with the following simple setup:
>
> master-host: runs NameNode, ResourceManager and JobHistoryServer
> slave-1-host and slave-2-hosts: DataNodes and NodeManagers.
>
> When I run simple MapReduce job (both - using streaming API or Pi example
> from distribution) on client I see that some tasks fail:
>
> 13/07/10 14:40:10 INFO mapreduce.Job:  map 60% reduce 0%
> 13/07/10 14:40:14 INFO mapreduce.Job: Task Id :
> attempt_1373454026937_0005_m_000003_0, Status : FAILED
> 13/07/10 14:40:14 INFO mapreduce.Job: Task Id :
> attempt_1373454026937_0005_m_000005_0, Status : FAILED
> ...
> 13/07/10 14:40:23 INFO mapreduce.Job:  map 60% reduce 20%
> ...
>
> Every time different set of tasks/attempts fails. In some cases number of
> failed attempts becomes critical, and the whole job fails, in other cases
> job is finished successfully. I can't see any dependency, but I noticed the
> following.
>
> Let's say, ApplicationMaster runs on _slave-1-host_. In this case on
> _slave-2-host_ there will be corresponding syslog with the following
> contents:
>
> ...
> 2013-07-10 11:06:10,986 INFO [main] org.apache.hadoop.ipc.Client: Retrying
> connect to server: slave-2-host/127.0.0.1:11812. Already tried 0 time(s);
> retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
> sleepTime=1 SECONDS)
> 2013-07-10 11:06:11,989 INFO [main] org.apache.hadoop.ipc.Client: Retrying
> connect to server: slave-2-host/127.0.0.1:11812. Already tried 1 time(s);
> retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
> sleepTime=1 SECONDS)
> ...
> 2013-07-10 11:06:20,013 INFO [main] org.apache.hadoop.ipc.Client: Retrying
> connect to server: slave-2-host/127.0.0.1:11812. Already tried 9 time(s);
> retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
> sleepTime=1 SECONDS)
> 2013-07-10 11:06:20,019 WARN [main] org.apache.hadoop.mapred.YarnChild:
> Exception running child : java.net.ConnectException: Call From slave-2-host/
> 127.0.0.1 to slave-2-host:11812 failed on connection exception:
> java.net.ConnectException: Connection refused; For more details see:
> http://wiki.apache.org/hadoop/ConnectionRefused
>
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>         at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>         at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>         at
> org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:782)
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:729)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1229)
>         at
> org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:225)
>         at com.sun.proxy.$Proxy6.getTask(Unknown Source)
>         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:131)
> Caused by: java.net.ConnectException: Connection refused
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>         at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:708)
>         at
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:207)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:528)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:492)
>         at
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:499)
>         at
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:593)
>         at
> org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:241)
>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1278)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1196)
>         ... 3 more
>
> Notice several things:
>
> 1. This exception always happens on the different host than
> ApplicationMaster runs on.
> 2. It always tries to connect to localhost, not other host in cluster.
> 3. Port number (11812 in this case) is always different.
>
> My questions are:
>
> 1. I assume this is the task (container) that tries to establish
> connection, but what it wants to connect to?
> 2. Why this error happens and how can I fix it?
>
> Any suggestions are welcome.
>
> Thanks,
> Andrei
>

RE: ConnectionException in container, happens only sometimes

Posted by Devaraj k <de...@huawei.com>.
>1. I assume this is the task (container) that tries to establish connection, but what it wants to connect to?
It is trying to connect to MRAppMaster for executing the actual task.

>2. Why this error happens and how can I fix it?
It seems Container is not getting the correct MRAppMaster address due to some reason or AM is crashing before giving the task to Container. Probably it is coming due to invalid host mapping.  Can you check the host mapping is proper in both the machines and also check the AM log that time for any clue.
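
A quick way to test that mapping from each node, with nothing but the JDK (a sketch for checking, not a fix in itself):

    // Prints how this node's own hostname resolves. A loopback result
    // (127.x.x.x) means the name comes back as 127.0.0.1 on this machine,
    // which matches the slave-2-host/127.0.0.1 resolution in the log below.
    import java.net.InetAddress;

    public class HostMappingCheck {
        public static void main(String[] args) throws Exception {
            InetAddress self = InetAddress.getLocalHost();
            System.out.println(self.getHostName() + " -> " + self.getHostAddress());
            if (self.isLoopbackAddress()) {
                System.out.println("WARNING: hostname resolves to loopback; check /etc/hosts");
            }
        }
    }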

Thanks
Devaraj k

From: Andrei [mailto:faithlessfriend@gmail.com]
Sent: 10 July 2013 17:32
To: user@hadoop.apache.org
Subject: ConnectionException in container, happens only sometimes

Hi,

I'm running CDH4.3 installation of Hadoop with the following simple setup:

master-host: runs NameNode, ResourceManager and JobHistoryServer
slave-1-host and slave-2-hosts: DataNodes and NodeManagers.

When I run a simple MapReduce job (either via the streaming API or the Pi example from the distribution) from the client, I see that some tasks fail:

13/07/10 14:40:10 INFO mapreduce.Job:  map 60% reduce 0%
13/07/10 14:40:14 INFO mapreduce.Job: Task Id : attempt_1373454026937_0005_m_000003_0, Status : FAILED
13/07/10 14:40:14 INFO mapreduce.Job: Task Id : attempt_1373454026937_0005_m_000005_0, Status : FAILED
...
13/07/10 14:40:23 INFO mapreduce.Job:  map 60% reduce 20%
...

Every time a different set of tasks/attempts fails. In some cases the number of failed attempts becomes critical and the whole job fails; in other cases the job finishes successfully. I can't see any pattern, but I noticed the following.

Let's say the ApplicationMaster runs on _slave-1-host_. In this case, on _slave-2-host_ there will be a corresponding syslog with the following contents:

...
2013-07-10 11:06:10,986 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2013-07-10 11:06:11,989 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
...
2013-07-10 11:06:20,013 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2013-07-10 11:06:20,019 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.net.ConnectException: Call From slave-2-host/127.0.0.1 to slave-2-host:11812 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:782)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:729)
        at org.apache.hadoop.ipc.Client.call(Client.java:1229)
        at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:225)
        at com.sun.proxy.$Proxy6.getTask(Unknown Source)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:131)
Caused by: java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:708)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:207)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:528)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:492)
        at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:499)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:593)
        at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:241)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1278)
        at org.apache.hadoop.ipc.Client.call(Client.java:1196)
        ... 3 more


Notice several things:

1. This exception always happens on a different host than the one the ApplicationMaster runs on.
2. It always tries to connect to localhost, not to another host in the cluster.
3. The port number (11812 in this case) is different every time.

My questions are:

1. I assume it is the task (container) that tries to establish the connection, but what does it want to connect to?
2. Why does this error happen, and how can I fix it?

Any suggestions are welcome.

Thanks,
Andrei
