Posted to dev@nutch.apache.org by "Rafael (Created) (JIRA)" <ji...@apache.org> on 2012/03/19 19:21:38 UTC

[jira] [Created] (NUTCH-1315) reduce speculation on but ParseOutputFormat doesn't name output files correctly?

reduce speculation on but ParseOutputFormat doesn't name output files correctly?
--------------------------------------------------------------------------------

                 Key: NUTCH-1315
                 URL: https://issues.apache.org/jira/browse/NUTCH-1315
             Project: Nutch
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.4
         Environment: ubuntu 64bit, hadoop 1.0.1, 3 Node Cluster, segment size 1.5M urls
             Reporter: Rafael
               Labels: hadoop, hdfs


From time to time the Reducer log contains the following and one tasktracker gets blacklisted.

org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /user/test/crawl/segments/20120316065507/parse_text/part-00001/data for DFSClient_attempt_201203151054_0028_r_000001_1 on client xx.x.xx.xx.10, because this file is already being created by DFSClient_attempt_201203151054_0028_r_000001_0 on xx.xx.xx.9
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:1404)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1244)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1186)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.create(NameNode.java:628)
	at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)

	at org.apache.hadoop.ipc.Client.call(Client.java:1066)
	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
	at $Proxy2.create(Unknown Source)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
	at $Proxy2.create(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:3245)
	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:713)
	at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:182)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:555)
	at org.apache.hadoop.io.SequenceFile$RecordCompressWriter.<init>(SequenceFile.java:1132)
	at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:397)
	at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:354)
	at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:476)
	at org.apache.hadoop.io.MapFile$Writer.<init>(MapFile.java:157)
	at org.apache.hadoop.io.MapFile$Writer.<init>(MapFile.java:134)
	at org.apache.hadoop.io.MapFile$Writer.<init>(MapFile.java:92)
	at org.apache.nutch.parse.ParseOutputFormat.getRecordWriter(ParseOutputFormat.java:110)
	at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.<init>(ReduceTask.java:448)
	at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:490)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
	at org.apache.hadoop.mapred.Child.main(Child.java:249)

I asked the hdfs-user mailing list and got the following answer:

"Looks like you have reduce speculation turned on, but the
ParseOutputFormat you're using doesn't properly name its output files
distinctly based on the task attempt ID. As a workaround you can
probably turn off speculative execution for reduces, but you should
also probably file a Nutch bug."
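
For reference, in Hadoop 1.x (the version used here) reduce-side speculation can be switched off cluster-wide in mapred-site.xml, or per job via the equivalent -D generic option. This is a minimal sketch using the Hadoop 1.0.x property names:

```xml
<!-- mapred-site.xml (Hadoop 1.x): disable speculative reduce attempts -->
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>
<!-- optionally also disable speculative map attempts -->
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
</property>
```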



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1315) reduce speculation on but ParseOutputFormat doesn't name output files correctly?

Posted by "Lewis John McGibbney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13232836#comment-13232836 ] 

Lewis John McGibbney commented on NUTCH-1315:
---------------------------------------------

Regarding your comment, i.e. whether Nutch itself turns reduce speculation on, my initial thought is no. I will try to confirm/iron this out. Do you have any speculation settings configured for Hadoop at all?
                

        

[jira] [Commented] (NUTCH-1315) reduce speculation on but ParseOutputFormat doesn't name output files correctly?

Posted by "Markus Jelsma (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13232914#comment-13232914 ] 

Markus Jelsma commented on NUTCH-1315:
--------------------------------------

Speculative task execution is enabled by default, but the fetch and index jobs disable it. At some point we disabled speculative execution altogether, only because we need those slots to be free for other jobs.

Should extended OutputFormats take care of this? It isn't clear from MapRed's API docs whether this is a problem. The name parameter is supposed to be unique for the task's part of the output across the entire job, which it is.

Wouldn't including a task ID in the output name cause a mess in the final output?
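
For what it's worth, the usual way Hadoop's own FileOutputFormat avoids that mess is to keep the attempt ID out of the final name: each attempt writes under a per-attempt temporary directory, and the OutputCommitter promotes only the winning attempt's files into place. A minimal sketch of that naming scheme (illustrative only, not Nutch's actual code; the helper methods and directory layout here are assumptions):

```java
// Sketch: attempt-scoped output paths. Two speculative attempts of the
// same reduce task get distinct temporary paths, so neither collides on
// the HDFS lease; a commit step later renames the winner into place.
public class AttemptScopedName {

    // Temporary path an attempt would write to. "attemptId" is
    // illustrative; in Hadoop it comes from the task attempt context.
    static String tempPath(String outputDir, String attemptId, String partName) {
        return outputDir + "/_temporary/" + attemptId + "/" + partName;
    }

    // Final path after commit: no attempt ID, so the output stays tidy.
    static String finalPath(String outputDir, String partName) {
        return outputDir + "/" + partName;
    }

    public static void main(String[] args) {
        String dir = "crawl/segments/20120316065507/parse_text";
        String p0 = tempPath(dir, "attempt_201203151054_0028_r_000001_0", "part-00001");
        String p1 = tempPath(dir, "attempt_201203151054_0028_r_000001_1", "part-00001");
        System.out.println(p0);
        System.out.println(p1);
        // Distinct while in flight, identical after commit.
        System.out.println(!p0.equals(p1));
        System.out.println(finalPath(dir, "part-00001"));
    }
}
```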

In the meantime I would indeed disable speculative execution. In my opinion and experience with Nutch and other jobs it's not really worth it: it takes empty slots that you could use for other jobs, and if there are no other jobs it still costs additional CPU cycles, RAM, and disk I/O for a few seconds. I must add that our network is homogeneous (a fallacy, I know) and all nodes have almost equal load.
                