Posted to user@nutch.apache.org by Olive g <ol...@hotmail.com> on 2006/04/18 17:08:42 UTC

Could someone please share your experience with 0.8 step-by-step crawl??

Hi,

Are you guys able to run step-by-step crawl on 0.8 successfully?

I am using Nutch 0.8 (3/31 build) with DFS. I followed the 0.8 tutorial for 
step-by-step crawling and got errors for updatedb. I used two reduce tasks 
and two map tasks. Here are the exact steps that I did:

1. bin/nutch inject test/crawldb urls
2. bin/nutch generate test/crawldb test/segments
3. bin/nutch fetch test/segments/20060415143555
4. bin/nutch updatedb test/crawldb test/segments/20060415143555

Fetch one more round:
5. bin/nutch generate test/crawldb test/segments -topN 100
6. bin/nutch fetch test/segments/20060415150130
7. bin/nutch updatedb test/crawldb test/segments/20060415150130

Fetch one more round:
8. bin/nutch generate test/crawldb test/segments -topN 100
9. bin/nutch fetch test/segments/20060415151309
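
In other words, each extra round is just the same three commands again, with 
the new segment directory filled in by hand after checking what generate 
created under test/segments. Roughly, using round two's numbers as the example:

bin/nutch generate test/crawldb test/segments -topN 100
# look under test/segments for the newly created directory
SEGMENT=test/segments/20060415150130
bin/nutch fetch $SEGMENT
bin/nutch updatedb test/crawldb $SEGMENT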

The steps above ran successfully. In between I kept checking the directories 
in DFS and running nutch readdb, and everything appeared to be fine.
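
Those checks were nothing fancy, roughly the following (from memory, so 
treat the exact flags as approximate):

bin/hadoop dfs -ls /user/root/test/crawldb
bin/hadoop dfs -ls /user/root/test/segments
bin/nutch readdb test/crawldb -stats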

Then:
10. bin/nutch updatedb test/crawldb test/segments/20060415151309

It failed with the following error for both reduce tasks (the log below is 
from one of the two tasks):

java.rmi.RemoteException: java.io.IOException: Cannot create file /user/root/test/crawldb/670052811/part-00000/data on client DFSClient_-1133147307
        at org.apache.hadoop.dfs.NameNode.create(NameNode.java:137)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:615)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:237)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:216)
        at org.apache.hadoop.ipc.Client.call(Client.java:303)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:141)
        at org.apache.hadoop.dfs.$Proxy1.create(Unknown Source)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:587)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:554)
        at org.apache.hadoop.dfs.DFSClient.create(DFSClient.java:99)
        at org.apache.hadoop.dfs.DistributedFileSystem.createRaw(DistributedFileSystem.java:83)
        at org.apache.hadoop.fs.FSDataOutputStream$Summer.<init>(FSDataOutputStream.java:39)
        at org.apache.hadoop.fs.FSDataOutputStream.<init>(FSDataOutputStream.java:128)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:180)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:168)
        at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:96)
        at org.apache.hadoop.io.MapFile$Writer.<init>(MapFile.java:101)
        at org.apache.hadoop.io.MapFile$Writer.<init>(MapFile.java:76)
        at org.apache.hadoop.mapred.MapFileOutputFormat.getRecordWriter(MapFileOutputFormat.java:38)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:265)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:709)


Anything wrong with my steps? Is this a known bug?

Thank you for your help.

Olive



Re: Could someone please share your experience with 0.8 step-by-step crawl??

Posted by mo...@richmondinformatics.com.
Hello Olive,

Quoting Olive g <ol...@hotmail.com>:

> Hi Monu,
>
> Thank you for your help. I double checked and I had plenty of disk 
> space and /tmp was not filled up either. For my test case, I tested 
> with only 200 URLs.

No problem.  Don't let me pretend to be an expert, I'm just on a 
different part
of the same steep learning curve :)

> Also, is the string "670052811" in the path right? I did not see any 
> directory /user/root/test/crawldb/670052811/, while 
> /user/root/test/crawldb/part-00000/data was there. Or is it just 
> some temp directory used by Nutch? If so, why would it fail when 
> I have plenty of free space?

The sequence of generate, fetch, updatedb, invertlinks works for me.  I index
later.
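
Concretely, one round for me looks roughly like this (a sketch from memory; 
the linkdb path is just an example name, and you should check the exact 
arguments against bin/nutch's usage output for your build):

bin/nutch generate crawl/db segments -topN 1250000
bin/nutch fetch segments/20060330035131
bin/nutch updatedb crawl/db segments/20060330035131
bin/nutch invertlinks crawl/linkdb -dir segments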

The structure of the segments "tree" looks like this in my case:

segments/20060330035131
segments/20060330035131/content
segments/20060330035131/crawl_fetch
segments/20060330035131/crawl_generate
segments/20060330035131/crawl_parse
segments/20060330035131/parse_data
segments/20060330035131/parse_text

Here, the name of each segment is derived from the date and time it was 
generated (e.g. 20060330035131 = 2006-03-30 03:51:31), and this seems to be 
the default behaviour of Nutch 0.8 with Hadoop 0.1.

segments/20060330035131/parse_text/part-00000
segments/20060330035131/parse_text/part-00001
segments/20060330035131/parse_text/part-00002
segments/20060330035131/parse_text/part-00003
segments/20060330035131/parse_text/part-00004
segments/20060330035131/parse_text/part-00005
segments/20060330035131/parse_text/part-00006
segments/20060330035131/parse_text/part-00007
segments/20060330035131/parse_text/part-00008
segments/20060330035131/parse_text/part-00009
segments/20060330035131/parse_text/part-00010
segments/20060330035131/parse_text/part-00011
segments/20060330035131/parse_text/part-00012

As you see above, I haven't had a problem with the number of "parts".  Indeed,
the above was created with the default behaviour, using commands such as:

# bin/nutch generate crawl/db segments -topN 1250000

and

# bin/nutch fetch segments/20060330035131

I don't know where this error comes from and maybe someone else can shed some
light on it.

> java.rmi.RemoteException: java.io.IOException: Cannot create file 
> /user/root/test/crawldb/670052811/part-00000/data on client 
> DFSClient_-1133147307 at
>
> How many reduce and map tasks did you use? I have been struggling 
> with this issue for a while, and it seems that Nutch can't 
> handle more than 5 parts.

I am using a cluster of 1 x jobtracker and 6 x tasktrackers. Each has a single
Xeon 3 GHz processor, 2 GB RAM, Gigabit Ethernet (over copper) and twin 400 GB
WD4000KD disks LVM'ed together.

In this configuration I've had the best performance using:

mapred.map.tasks - 61 (because the book says approx 10 x tasktrackers)
mapred.reduce.tasks - 6 (because it seems to work faster than 2 x tasktrackers)
mapred.tasktracker.tasks.maximum - 1 (because that's how many processors I have)
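
In hadoop-site.xml property form, that is roughly the following (a sketch; 
the values are from my notes, and which config file is the right home for 
them may depend on your setup):

<property>
  <name>mapred.map.tasks</name>
  <value>61</value>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>6</value>
</property>

<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>1</value>
</property>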

BTW, I got the last two figures from a conversation between YOU and Doug! :)

Good luck,

Monu





Re: Could someone please share your experience with 0.8 step-by-step crawl??

Posted by Olive g <ol...@hotmail.com>.
Hi Monu,

Thank you for your help. I double checked and I had plenty of disk space and 
/tmp was not filled up either. For my test case, I tested with only 200 
URLs.

Also, is the string "670052811" in the path right? I did not see any 
directory /user/root/test/crawldb/670052811/, while 
/user/root/test/crawldb/part-00000/data was there. Or is it just some temp 
directory used by Nutch? If so, why would it fail when I have plenty of 
free space?

java.rmi.RemoteException: java.io.IOException: Cannot create file 
/user/root/test/crawldb/670052811/part-00000/data on client 
DFSClient_-1133147307 at

How many reduce and map tasks did you use? I have been struggling with this 
issue for a while, and it seems that Nutch can't handle more than 5 
parts.

Because of this, I am not able to run incremental crawling. Please see my 
previous message:
http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg04150.html

Anybody has any insight?
Thanks!

Olive





Re: Could someone please share your experience with 0.8 step-by-step crawl??

Posted by mo...@richmondinformatics.com.
Hello Olive,

IIRC I got a similar message when the /tmp partition on my disks filled up.  I
then reconfigured the locations of all the directories in hadoop-site.xml to a
more spacious area of my disk.

Hope that helps; see below:

<property>
  <name>dfs.name.dir</name>
  <value>/home/nutch/hadoop/dfs/name</value>
</property>

<property>
  <name>dfs.data.dir</name>
  <value>/home/nutch/hadoop/dfs/data</value>
</property>

<property>
  <name>mapred.local.dir</name>
  <value>/home/nutch/hadoop/mapred/local</value>
</property>

<property>
  <name>mapred.system.dir</name>
  <value>/home/nutch/hadoop/mapred/system</value>
</property>

<property>
  <name>mapred.temp.dir</name>
  <value>/home/nutch/hadoop/mapred/temp</value>
</property>
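
For what it's worth, a quick way to confirm whether /tmp (or the new 
locations) really is short on space before relocating everything is plain 
df, nothing Nutch-specific:

df -h /tmp
df -h /home/nutch/hadoop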






Re: Could someone please share your experience with 0.8 step-by-step crawl??

Posted by mo...@richmondinformatics.com.
Hello Olive/Team,

Good news Olive (depending on your point of view); you are no longer alone.

I have had identical errors during the reduce phase of a fetch on a segment
of 1,250,000 URLs.

I will raise a separate request for help with this, containing full details.

TTFN

Monu
