Posted to user@nutch.apache.org by Gal Nitzan <gn...@usa.net> on 2005/09/29 13:56:45 UTC
java.io.IOException: Cannot create file (in reduce task)
Hello,
I'm testing mapred on one machine only.
Everything worked fine from the start until I got the exception in the
reduce task:
Diagnostic Text
java.io.IOException: Cannot create file /user/root/crawl-20050927142856/segments/20050928075732/crawl_fetch/part-00000/data
	at org.apache.nutch.ipc.Client.call(Client.java:294)
	at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
	at $Proxy1.create(Unknown Source)
	at org.apache.nutch.ndfs.NDFSClient$NDFSOutputStream.nextBlockOutputStream(NDFSClient.java:574)
	at org.apache.nutch.ndfs.NDFSClient$NDFSOutputStream.<init>(NDFSClient.java:549)
	at org.apache.nutch.ndfs.NDFSClient.create(NDFSClient.java:83)
	at org.apache.nutch.fs.NDFSFileSystem.create(NDFSFileSystem.java:76)
	at org.apache.nutch.fs.NDFSFileSystem.create(NDFSFileSystem.java:71)
	at org.apache.nutch.io.SequenceFile$Writer.<init>(SequenceFile.java:94)
	at org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:108)
	at org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:76)
	at org.apache.nutch.crawl.FetcherOutputFormat.getRecordWriter(FetcherOutputFormat.java:48)
	at org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:245)
	at org.apache.nutch.mapred.TaskTracker$Child.main(TaskTracker.java:580)
In the jobtracker log:
050928 155253 Server connection on port 8011 from 127.0.0.1: exiting
050928 160814 Server connection on port 8011 from 127.0.0.1: starting
050928 160814 parsing file:/mapred/conf/nutch-default.xml
050928 160814 parsing file:/mapred/conf/mapred-default.xml
050928 160814 parsing /nutch/mapred/local/job_s4isvd.xml
050928 160814 parsing file:/mapred/conf/nutch-site.xml
050928 160814 parsing file:/mapred/conf/nutch-default.xml
050928 160815 parsing file:/mapred/conf/mapred-default.xml
050928 160815 parsing /nutch/mapred/local/job_s4isvd.xml
050928 160815 parsing file:/mapred/conf/nutch-site.xml
050928 160815 Adding task 'task_m_ax7n90' to set for tracker 'tracker_41883'
050928 160821 Task 'task_m_ax7n90' has finished successfully.
050928 160821 Adding task 'task_m_vl2bge' to set for tracker 'tracker_41883'
050928 160827 Task 'task_m_vl2bge' has finished successfully.
050928 160827 Adding task 'task_m_i54kht' to set for tracker 'tracker_41883'
050928 160830 Task 'task_m_i54kht' has finished successfully.
050928 160830 Adding task 'task_m_1eymym' to set for tracker 'tracker_41883'
050928 160833 Task 'task_m_1eymym' has finished successfully.
050928 160833 Adding task 'task_r_w9azpi' to set for tracker 'tracker_41883'
050928 160839 Task 'task_r_w9azpi' has finished successfully.
050928 160839 Server connection on port 8011 from 127.0.0.1: exiting
050928 171406 Task 'task_m_klo24y' has finished successfully.
050928 171406 Adding task 'task_r_x48xa3' to set for tracker 'tracker_41883'
050928 171434 Task 'task_r_x48xa3' has been lost.
050928 171434 Adding task 'task_r_x48xa3' to set for tracker 'tracker_41883'
050928 171501 Task 'task_r_x48xa3' has been lost.
050928 171501 Adding task 'task_r_x48xa3' to set for tracker 'tracker_41883'
050928 171520 Task 'task_r_x48xa3' has been lost.
050928 171520 Adding task 'task_r_x48xa3' to set for tracker 'tracker_41883'
050928 171551 Task 'task_r_x48xa3' has been lost.
050928 171551 Task task_r_x48xa3 has failed 4 times. Aborting owning
job job_mtzp7h
050928 171552 Server connection on port 8011 from 127.0.0.1: exiting
In namenode log
050928 171547 Server handler on 8009 call error: java.io.IOException: Cannot create file /user/root/crawl-20050927142856/segments/20050928075732/crawl_fetch/part-00000/data
java.io.IOException: Cannot create file /user/root/crawl-20050927142856/segments/20050928075732/crawl_fetch/part-00000/data
	at org.apache.nutch.ndfs.NameNode.create(NameNode.java:98)
	at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:324)
	at org.apache.nutch.ipc.RPC$1.call(RPC.java:186)
	at org.apache.nutch.ipc.Server$Handler.run(Server.java:198)
In fetch log
050928 171526 reduce 47%
050928 171538 reduce 50%
050928 171551 reduce 100%
Exception in thread "main" java.io.IOException: Job failed!
	at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:309)
	at org.apache.nutch.crawl.Fetcher.fetch(Fetcher.java:335)
	at org.apache.nutch.crawl.Fetcher.main(Fetcher.java:364)
Any idea, anyone?
Thanks, Gal
Re: java.io.IOException: Cannot create file (in reduce task)
Posted by Rod Taylor <rb...@sitesell.com>.
On Thu, 2005-09-29 at 11:43 -0700, Doug Cutting wrote:
> Gal Nitzan wrote:
> > I believe you are right. When I checked the file I noticed it exists.
> >
> > However, I ran the fetcher only once on that segment.
>
> Please try the attached patch and tell me if it fixes this for you.
The patch in this thread eliminated a few errors I was having.
--
Rod Taylor <rb...@sitesell.com>
Re: java.io.IOException: Cannot create file (in reduce task)
Posted by Doug Cutting <cu...@nutch.org>.
Gal Nitzan wrote:
> I believe you are right. When I checked the file I noticed it exists.
>
> However, I ran the fetcher only once on that segment.
Please try the attached patch and tell me if it fixes this for you.
Doug
Re: New plugin
Posted by Gal Nitzan <gn...@usa.net>.
John X wrote:
> Hi, Gal,
>
> Yes, I am interested. You can post the tarball to
> http://issues.apache.org/jira/browse/Nutch
>
> Thanks,
>
> John
>
> On Thu, Sep 29, 2005 at 09:53:42PM +0200, Gal Nitzan wrote:
>
>> [original plugin announcement quoted in full; snipped]
>
Done. Enjoy: http://issues.apache.org/jira/browse/NUTCH-100
Regards, Gal
Re: New plugin
Posted by John X <jo...@neasys.com>.
Hi, Gal,
Yes, I am interested. You can post the tarball to
http://issues.apache.org/jira/browse/Nutch
Thanks,
John
On Thu, Sep 29, 2005 at 09:53:42PM +0200, Gal Nitzan wrote:
> [original plugin announcement quoted in full; snipped]
__________________________________________
http://www.neasys.com - A Good Place to Be
Come to visit us today!
New plugin
Posted by Gal Nitzan <gn...@usa.net>.
Hi,
I have written a small new plugin based on the URLFilter
interface: urlfilter-db.
The purpose of this plugin is to filter by domain, i.e. I would like to
crawl the whole web but fetch only certain domains.
The plugin uses a caching system (SwarmCache, easier to deploy than JCS)
with a database on the back end.
for each url:
    filter(url)
end for

filter(url):
    get the domain name from the url
    call cache.get(domain)
    if not in the cache, try the database
    if found in the database, cache it and return it
    return null
end filter
The plugin reads the cache size, JDBC driver, connection string, table
to use, and domain field from nutch-site.xml.
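A rough Java sketch of that lookup path (hypothetical class and method names; the actual urlfilter-db plugin uses SwarmCache and a JDBC back end configured via nutch-site.xml — here an in-memory map stands in for the database):

```java
import java.net.URL;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * Sketch of the plugin's cache-then-database lookup (hypothetical names).
 * Returns the url if its domain is allowed, else null -- the URLFilter
 * convention for rejecting a url.
 */
public class DbUrlFilterSketch {
    private final Map<String, Boolean> cache = new HashMap<>();
    private final Function<String, Boolean> database; // stands in for the JDBC query

    public DbUrlFilterSketch(Function<String, Boolean> database) {
        this.database = database;
    }

    public String filter(String url) throws Exception {
        String domain = new URL(url).getHost();      // get the domain name from the url
        if (Boolean.TRUE.equals(cache.get(domain)))  // cache hit: domain is allowed
            return url;
        if (database.apply(domain)) {                // not in cache: try the database
            cache.put(domain, Boolean.TRUE);         // found: cache it and return it
            return url;
        }
        return null;                                 // unknown domain: filter the url out
    }

    public static void main(String[] args) throws Exception {
        Map<String, Boolean> table = new HashMap<>();
        table.put("nutch.org", true);
        DbUrlFilterSketch f =
            new DbUrlFilterSketch(d -> table.getOrDefault(d, false));
        System.out.println(f.filter("http://nutch.org/docs")); // kept
        System.out.println(f.filter("http://example.com/"));   // rejected: prints null
    }
}
```

As in the pseudocode above, only positive answers are cached, so a domain added to the database later is picked up on the next miss.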
Since I do not have commit access to add it to the svn, if someone
is interested, let me know and I can mail it.
Regards,
Gal
Re: java.io.IOException: Cannot create file (in reduce task)
Posted by Gal Nitzan <gn...@usa.net>.
Doug Cutting wrote:
> Thanks for the detailed report. This is a bug. The problem is that
> the default is not to permit files to be overwritten. But when a
> reduce task re-executes (because something failed) it needs to
> overwrite data. My guess is that the cause of the initial failure
> might have been the same: that this was not your first attempt to
> fetch this segment, that you were overwriting the last attempt. Is
> that right, or did something else first cause the reduce task to fail?
>
> I think the fix is to change the filesystem code (local and NDFS) so
> that overwriting is permitted by default. With MapReduce, tasks may
> be re-executed, so overwriting is normal. Application code should add
> error checking code at the start to check that output files do not
> already exist if we wish to prevent unintentional overwriting.
>
> If there are no objections, I will make this change in the mapred branch.
>
> Doug
>
> Gal Nitzan wrote:
>> [original report quoted in full; snipped]
I believe you are right. When I checked the file I noticed it exists.
However, I ran the fetcher only once on that segment.
Gal
Re: java.io.IOException: Cannot create file (in reduce task)
Posted by Doug Cutting <cu...@nutch.org>.
Thanks for the detailed report. This is a bug. The problem is that the
default is not to permit files to be overwritten. But when a reduce
task re-executes (because something failed) it needs to overwrite data.
My guess is that the cause of the initial failure might have been the
same: that this was not your first attempt to fetch this segment, that
you were overwriting the last attempt. Is that right, or did something
else first cause the reduce task to fail?
I think the fix is to change the filesystem code (local and NDFS) so
that overwriting is permitted by default. With MapReduce, tasks may be
re-executed, so overwriting is normal. Application code should add
error checking code at the start to check that output files do not
already exist if we wish to prevent unintentional overwriting.
If there are no objections, I will make this change in the mapred branch.
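The intended semantics can be sketched like this (hypothetical names, not the actual NutchFileSystem API): create() succeeds on an existing path when overwriting is permitted, which is what a re-executed reduce task needs, while the old default throws the "Cannot create file" error seen above.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of the overwrite semantics described above (hypothetical names).
 * A re-executed reduce task calls create() on a path that its failed
 * predecessor already wrote, so create() must succeed by default instead
 * of refusing to replace the existing file.
 */
public class OverwriteSketch {
    // Stand-in for the namenode's file table.
    private final Map<String, byte[]> files = new HashMap<>();

    public void create(String path, boolean overwrite) throws IOException {
        if (files.containsKey(path) && !overwrite) {
            throw new IOException("Cannot create file " + path);
        }
        files.put(path, new byte[0]); // replaces any previous attempt's output
    }

    public static void main(String[] args) throws IOException {
        OverwriteSketch fs = new OverwriteSketch();
        String part = "/user/root/segments/crawl_fetch/part-00000/data"; // example path
        fs.create(part, true); // first reduce attempt
        fs.create(part, true); // re-executed attempt: succeeds when overwrite is allowed
        try {
            fs.create(part, false); // old default: the retry fails instead
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Applications that want the old safety would check for the output path themselves before starting the job.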
Doug
Gal Nitzan wrote:
> [original report quoted in full; snipped]