Posted to dev@nutch.apache.org by "julien nioche (JIRA)" <ji...@apache.org> on 2009/02/18 13:31:04 UTC

[jira] Created: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

AlreadyBeingCreatedException with Hadoop 0.19
---------------------------------------------

                 Key: NUTCH-692
                 URL: https://issues.apache.org/jira/browse/NUTCH-692
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 1.0.0
            Reporter: julien nioche


I have been using the SVN version of Nutch on an EC2 cluster and got several AlreadyBeingCreatedExceptions during the reduce phase of a parse. For some reason one of my tasks crashed, and I then ran into this AlreadyBeingCreatedException when other nodes tried to pick it up.

There was recently a discussion on the Hadoop user list on similar issues with Hadoop 0.19 (see http://markmail.org/search/after+upgrade+to+0%2E19%2E0). I have not tried using 0.18.2 yet but will do so if the problems persist with 0.19.

I was wondering whether anyone else had experienced the same problem. Do you think 0.19 is stable enough to use for Nutch 1.0?
I will be running a crawl on a very large cluster in the next couple of weeks and I will confirm this issue then.

J.  



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695122#action_12695122 ] 

Doğacan Güney commented on NUTCH-692:
-------------------------------------

Thanks for the patch.

The patch looks good to me. Can you confirm that it fixes the problem (or tell me how to trigger the problem without the patch)?



[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717058#action_12717058 ] 

Doğacan Güney commented on NUTCH-692:
-------------------------------------

Sorry for the late answer.

Since Julien confirmed that it fixes the problem, I will commit this patch if there are no objections.



[jira] Updated: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

Posted by "Cosmin Lehene (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Lehene updated NUTCH-692:
--------------------------------

    Attachment: NUTCH-692.patch

This just checks whether the destination files exist before attempting to create a new output MapFile for the reduce task in FetcherOutputFormat and ParseOutputFormat. If the destination files exist, it deletes them.
The AlreadyBeingCreatedException is thrown when a retried task attempts to create the same MapFile that the previous, failed attempt left behind.
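
The idea of deleting leftovers before creating the output can be illustrated with a small local-filesystem analogy. This is only a sketch and not the contents of NUTCH-692.patch: it uses java.nio instead of the Hadoop FileSystem API, FileAlreadyExistsException stands in for AlreadyBeingCreatedException, and the class, method and file names (RetryOutputDemo, createOutput, part-00000) are made up for the illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Local-filesystem analogy only: java.nio stands in for HDFS, and
// FileAlreadyExistsException stands in for AlreadyBeingCreatedException.
class RetryOutputDemo {

    // Creates the output file; fails if a previous attempt left one behind.
    static void createOutput(Path out) throws IOException {
        Files.createFile(out);
    }

    // The approach described for the patch: delete leftovers, then create.
    static void createOutputAfterCleanup(Path out) throws IOException {
        Files.deleteIfExists(out);
        Files.createFile(out);
    }

    // Returns true if a retried attempt fails when the leftover is not removed.
    static boolean retryFailsWithoutCleanup() {
        try {
            Path out = Files.createTempDirectory("reduce-out").resolve("part-00000");
            createOutput(out);              // first attempt, which then "crashes"
            try {
                createOutput(out);          // second attempt hits the leftover
                return false;
            } catch (FileAlreadyExistsException e) {
                return true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns true if a retried attempt succeeds once leftovers are deleted first.
    static boolean retrySucceedsWithCleanup() {
        try {
            Path out = Files.createTempDirectory("reduce-out").resolve("part-00000");
            createOutput(out);              // first attempt, which then "crashes"
            createOutputAfterCleanup(out);  // second attempt cleans up first
            return Files.exists(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("retry fails without cleanup: " + retryFailsWithoutCleanup());
        System.out.println("retry succeeds with cleanup: " + retrySucceedsWithCleanup());
    }
}
```

On HDFS the failure is tied to the lease held on the half-written file rather than to the mere existence of a path, so the analogy covers the control flow only.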




[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

Posted by "julien nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674607#action_12674607 ] 

julien nioche commented on NUTCH-692:
-------------------------------------

I have seen this only in a multinode setup and on EC2.



[jira] Assigned: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche reassigned NUTCH-692:
-----------------------------------

    Assignee: Julien Nioche



[jira] Resolved: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche resolved NUTCH-692.
---------------------------------

       Resolution: Cannot Reproduce
    Fix Version/s: 1.1

I cannot reproduce the issue since we moved to Hadoop 0.20, which is good news.



[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

Posted by "Sami Siren (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674603#action_12674603 ] 

Sami Siren commented on NUTCH-692:
----------------------------------

Have you seen this outside of EC2? Only in a multinode setup?



[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783612#action_12783612 ] 

Julien Nioche commented on NUTCH-692:
-------------------------------------

OK, let's leave it open for now.



[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696958#action_12696958 ] 

Julien Nioche commented on NUTCH-692:
-------------------------------------

I haven't had time to try it on the SVN version. I will try to do so when I have more time. Thanks!



[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

Posted by "Cosmin Lehene (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694703#action_12694703 ] 

Cosmin Lehene commented on NUTCH-692:
-------------------------------------

The AlreadyBeingCreatedException appears when a reduce task fails at its first attempt and leaves the output files open for the next one. I have a patch for it. With the patch, the reduce task won't stop with an AlreadyBeingCreatedException on the second run. However, the initial failure is sometimes caused by other bugs, one of them being the regexp match hang caused by a Java regex bug, so even if you no longer get the AlreadyBeingCreatedException you still need to deal with the regexp infinite loop.



[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

Posted by "Cosmin Lehene (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695300#action_12695300 ] 

Cosmin Lehene commented on NUTCH-692:
-------------------------------------

The AlreadyBeingCreatedException appears at the second attempt of a reduce task. In our case, in the reduce phase of the fetch process, we sometimes had hanging processes that initially reported something like "Task attempt_200903031109_0007_r_000002_0 failed to report status for 603 seconds. Killing!". When the task is retried, ParseOutputFormat and FetcherOutputFormat try to create the MapFile, but it already exists, so the task fails again with AlreadyBeingCreatedException and there is no way to recover unless the files are deleted.

The patch fixes the issue with the second reduce attempt, and yes, it works. Without the patch there is effectively no second reduce attempt, since it stops at
new MapFile.Writer(job,...) with the AlreadyBeingCreatedException.

It's not trivial to reproduce the "failed to report status for X seconds. Killing!" problem, unless you have some bad regexp to feed the crawler with :). However, I believe it could be reproduced by stopping and starting the tasktracker with hadoop-daemon.sh stop/start tasktracker.

Another way to reproduce just the HDFS exception is to try to create the same file twice.

However, note that there are many possible reasons for "failed to report status for 603 seconds. Killing!". One of them is the regex problem mentioned above, where the regex match loops forever, taking 100% of the CPU. If the reduce task hits a problem like this, the patch won't help much, except that it lets the retried attempt run through the entire reducer process again rather than fail as soon as it starts.

    



[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783302#action_12783302 ] 

Andrzej Bialecki  commented on NUTCH-692:
-----------------------------------------

We should review this issue after the upgrade to Hadoop 0.20: task output management differs there, and the problem may no longer exist.



[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695346#action_12695346 ] 

Julien Nioche commented on NUTCH-692:
-------------------------------------

Setting mapred.task.timeout to a small value (e.g. 60000) and trying to parse a page containing a stupidly long link, with for instance 2000 \ chars, should be sufficient to get the basic-normalizer into trouble during parsing. This should cause the task to fail and illustrate the issue.



[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

Posted by "Cosmin Lehene (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696429#action_12696429 ] 

Cosmin Lehene commented on NUTCH-692:
-------------------------------------

Julien, have you tried it with the patch? Can you confirm the behavior with both unpatched and patched Nutch?



[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702412#action_12702412 ] 

Julien Nioche commented on NUTCH-692:
-------------------------------------

OK, I had the same problem again on my main cluster: one of the nodes lost contact with the master during parsing and the subsequent attempts failed with AlreadyBeingCreatedException.

I managed to reproduce the problem locally using a fresh copy from SVN by hacking the BasicURLNormalizer to make it sleep for 5 minutes every time it gets a URL, which gave me plenty of time to fail a reduce task with

./hadoop job -fail-task attempt_200904241525_0007_r_000000_0

As expected, the following attempts failed with AlreadyBeingCreatedException.

I did the same experiment using your patch and can confirm that it solves the problem.
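
For anyone wanting to repeat the experiment, the "make the normalizer sleep" hack could look roughly like the sketch below. It is a stand-alone illustration and not the actual change made to BasicURLNormalizer: the normalizer is modelled as a plain UnaryOperator<String>, and SlowNormalizerDemo and delayed are hypothetical names. In the real experiment the delay was 5 minutes; the demo uses a few milliseconds.

```java
import java.util.function.UnaryOperator;

// Stand-alone sketch: a "normalizer" is modelled as String -> String.
class SlowNormalizerDemo {

    // Wraps a normalizer so that every call sleeps before delegating,
    // which keeps the task busy long enough to be failed by hand.
    static UnaryOperator<String> delayed(UnaryOperator<String> delegate, long millis) {
        return url -> {
            try {
                Thread.sleep(millis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and carry on with the delegate.
                Thread.currentThread().interrupt();
            }
            return delegate.apply(url);
        };
    }

    public static void main(String[] args) {
        UnaryOperator<String> slow = delayed(String::trim, 50);
        long start = System.nanoTime();
        String result = slow.apply(" http://example.com/ ");
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(result + " after about " + elapsedMs + " ms");
    }
}
```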

Thanks

J.



[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

Posted by "julien nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675518#action_12675518 ] 

julien nioche commented on NUTCH-692:
-------------------------------------

I have been investigating this a bit more. Same problem: some reduce tasks fail during parsing, and when mapred.task.timeout is reached the new tasks can't get a lease for the files and we get the AlreadyBeingCreatedException.

This is clearly a Hadoop issue; I have not tried with a previous version and don't know whether this will be fixed in the 0.19.1 release. Could this be due to the fact that the RecordWriter in ParseOutputFormat holds multiple Writers internally?

I had a look at the other side of the problem and found that for some documents the tasks were blocking on:

	at org.apache.oro.text.regex.Util.substitute(Unknown Source)
	at org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer.substituteUnnecessaryRelativePaths(BasicURLNormalizer.java:166)
	at org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer.normalize(BasicURLNormalizer.java:125)
	at org.apache.nutch.net.URLNormalizers.normalize(URLNormalizers.java:286)
	at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:223)
	at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:114)

and the following regex used in the regexurlfilter 

-.*(/[^/]+)/[^/]+\1/[^/]+\1/

I haven't dumped the actual URLs in the logs, but I suspect that they come from the JSParser. I will remove both the regex-urlfilter and the BasicURLNormalizer and see what I get.
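
To make it clearer what the filter rule quoted above is meant to catch, here is a small sketch using java.util.regex. It makes two assumptions: that the leading '-' is the exclude marker of the filter file and not part of the expression, and that the rule is applied with find() semantics. LoopRuleDemo and excluded are made-up names, and the sketch does not reproduce the hang, which was observed in the ORO-based normalizer.

```java
import java.util.regex.Pattern;

class LoopRuleDemo {

    // The rule from the message with the leading '-' (exclude marker) removed.
    // It looks for a path segment that occurs three times, each occurrence
    // separated from the next by exactly one other segment.
    static final Pattern LOOP = Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

    // Assumption: the filter rejects a URL when the pattern is found in it.
    static boolean excluded(String url) {
        return LOOP.matcher(url).find();
    }

    public static void main(String[] args) {
        // "/a" repeats three times with one segment in between each time.
        System.out.println(excluded("http://example.com/a/b/a/c/a/"));
        // All segments are distinct.
        System.out.println(excluded("http://example.com/a/b/c/d/e/"));
    }
}
```

The backreference \1 combined with the greedy .* is what makes this kind of pattern expensive on very long or repetitive URLs.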

J.







[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694942#action_12694942 ] 

Julien Nioche commented on NUTCH-692:
-------------------------------------

As I pointed out in my previous message, the root of the problem in my case was some dodgy URLs coming from the JavaScript parser, which put the basic normalizer into a spin. This would indeed repeat in subsequent attempts.

However, the AlreadyBeingCreatedException should not happen and we should not have output files left open. If your patch fixes that, I am sure it will be a very welcome contribution.
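
To make the point about files left open concrete, here is a minimal sketch of a writer that wraps several underlying writers and closes every one of them even when one close() call fails. This is not the attached patch and not the ParseOutputFormat code; the class names (MultiWriterSketch, Recording) are hypothetical and plain java.io.Closeable stands in for the real writers.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

class MultiWriterSketch implements Closeable {
    private final List<Closeable> writers;

    MultiWriterSketch(Closeable... writers) {
        this.writers = Arrays.asList(writers);
    }

    // Close every underlying writer; remember the first failure and
    // rethrow it only after all writers have been attempted.
    @Override
    public void close() throws IOException {
        IOException first = null;
        for (Closeable writer : writers) {
            try {
                writer.close();
            } catch (IOException e) {
                if (first == null) {
                    first = e;
                } else {
                    first.addSuppressed(e);
                }
            }
        }
        if (first != null) {
            throw first;
        }
    }

    // Hypothetical stand-in for a real writer: records whether close() was called.
    static class Recording implements Closeable {
        final boolean failOnClose;
        boolean closed;

        Recording(boolean failOnClose) {
            this.failOnClose = failOnClose;
        }

        @Override
        public void close() throws IOException {
            closed = true;
            if (failOnClose) {
                throw new IOException("simulated close failure");
            }
        }
    }

    public static void main(String[] args) {
        Recording a = new Recording(false);
        Recording b = new Recording(true);
        Recording c = new Recording(false);
        try {
            new MultiWriterSketch(a, b, c).close();
        } catch (IOException e) {
            System.out.println("close reported: " + e.getMessage());
        }
        System.out.println("all closed: " + (a.closed && b.closed && c.closed));
    }
}
```

A sketch like this only covers a task that reaches close(); a task that is killed on timeout never gets that far.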



[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755956#action_12755956 ] 

Julien Nioche commented on NUTCH-692:
-------------------------------------

I've been using this patch for a while now and confirm that it fixes the problem. Could someone have a look at it and commit it so that we can close this issue?
Thanks
J.

> AlreadyBeingCreatedException with Hadoop 0.19
> ---------------------------------------------
>
>                 Key: NUTCH-692
>                 URL: https://issues.apache.org/jira/browse/NUTCH-692
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.0.0
>            Reporter: Julien Nioche
>         Attachments: NUTCH-692.patch
>
>
> I have been using the SVN version of Nutch on an EC2 cluster and got some AlreadyBeingCreatedException during the reduce phase of a parse. For some reason one of my tasks crashed and then I ran into this AlreadyBeingCreatedException when other nodes tried to pick it up.
> There was recently a discussion on the Hadoop user list on similar issues with Hadoop 0.19 (see http://markmail.org/search/after+upgrade+to+0%2E19%2E0). I have not tried using 0.18.2 yet but will do if the problems persist with 0.19
> I was wondering whether anyone else had experienced the same problem. Do you think 0.19 is stable enough to use it for Nutch 1.0?
> I will be running a crawl on a super large cluster in the next couple of weeks and I will confirm this issue  
> J.  
