
Posted to dev@nutch.apache.org by "Mike Baranczak (JIRA)" <ji...@apache.org> on 2010/06/14 01:19:13 UTC

[jira] Created: (NUTCH-829) duplicate hadoop temp files

duplicate hadoop temp files
---------------------------

                 Key: NUTCH-829
                 URL: https://issues.apache.org/jira/browse/NUTCH-829
             Project: Nutch
          Issue Type: Bug
          Components: generator
    Affects Versions: 1.0.0, 1.1
            Reporter: Mike Baranczak
            Priority: Minor


When two crawls are started at exactly the same time, I see the following error: 
{quote}
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/tmp/hadoop-mike/mapred/temp/generate-temp-1276463469075 already exists
	at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:111)
	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:793)
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
	at org.apache.nutch.crawl.Generator.generate(Generator.java:472)
	at org.apache.nutch.crawl.Generator.generate(Generator.java:409)
        [...]
{quote}

I traced it down to this code in Generator (I'm using Nutch 1.0, but this is still in the trunk):

{quote}
Path tempDir =
      new Path(getConf().get("mapred.temp.dir", ".") +
               "/generate-temp-"+ System.currentTimeMillis());
{quote}

I admit that this is an unlikely scenario for most users, but I happened to run into it. To make a collision on the temp directory name practically impossible, I suggest changing System.currentTimeMillis() to java.util.UUID.randomUUID().toString().
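
To illustrate the difference between the two naming schemes, here is a minimal standalone sketch. It is not the Nutch Generator code: the class name TempDirNames and both helper methods are hypothetical, and plain strings stand in for Hadoop's Path so that the example runs without Hadoop on the classpath.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

// Hypothetical helper class, for illustration only (not part of Nutch).
final class TempDirNames {

    // Mirrors the current scheme: the suffix is a millisecond timestamp,
    // so two callers that read the same clock value get the same name.
    public static String timestampName(String baseDir, long nowMillis) {
        return baseDir + "/generate-temp-" + nowMillis;
    }

    // Suggested scheme: the suffix is a random (version 4) UUID,
    // which makes a name collision extremely unlikely.
    public static String uuidName(String baseDir) {
        return baseDir + "/generate-temp-" + UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        String base = "/tmp/hadoop-mike/mapred/temp";

        // Two "crawls" that observe the same timestamp produce identical names.
        long now = System.currentTimeMillis();
        String first = timestampName(base, now);
        String second = timestampName(base, now);
        System.out.println("timestamp names equal: " + first.equals(second)); // true

        // Generate many UUID-based names and count the distinct ones.
        Set<String> names = new HashSet<>();
        for (int i = 0; i < 1000; i++) {
            names.add(uuidName(base));
        }
        System.out.println("distinct UUID names out of 1000: " + names.size());
    }
}
```

In Generator itself the change would presumably be limited to the expression that builds the suffix; the rest of the Path construction would stay as it is.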

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (NUTCH-829) duplicate hadoop temp files

Posted by "Alex McLintock (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882554#action_12882554 ] 

Alex McLintock commented on NUTCH-829:
--------------------------------------

java.util.UUID was only introduced in Java 1.5, but the FAQ says:

    What Java version is required to run Nutch?
    Nutch 0.7 will run with Java 1.4 and up. Nutch 1.0 with Java 6. 

So UUID should be fine for Nutch 1.0, which already requires Java 6.
