Posted to user@nutch.apache.org by Olive g <ol...@hotmail.com> on 2006/04/18 17:08:42 UTC
Could someone please share your experience with 0.8 step-by-step crawl??
Hi,
Are you guys able to run step-by-step crawl on 0.8 successfully?
I am using Nutch 0.8 (3/31 build) with DFS. I followed the 0.8 tutorial
for step-by-step crawling and got errors from updatedb. I used two reduce
tasks and two map tasks. Here are the exact steps that I did:
1. bin/nutch inject test/crawldb urls
2. bin/nutch generate test/crawldb test/segments
3. bin/nutch fetch test/segments/20060415143555
4. bin/nutch updatedb test/crawldb test/segments/20060415143555
Fetch one more round:
5. bin/nutch generate test/crawldb test/segments -topN 100
6. bin/nutch fetch test/segments/20060415150130
7. bin/nutch updatedb test/crawldb test/segments/20060415150130
Fetch one more round:
8. bin/nutch generate test/crawldb test/segments -topN 100
9. bin/nutch fetch test/segments/20060415151309
The steps above ran successfully; I kept checking the directories in DFS
and running nutch readdb (roughly as below), and everything appeared to be fine.
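For example (nothing fancy; the exact readdb options may vary by build):
bin/hadoop dfs -ls test/crawldb
bin/nutch readdb test/crawldb -stats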
Then:
10. bin/nutch updatedb test/crawldb test/segments/20060415151309
It failed with the following error in both reduce tasks (the log below is
from one of the two):
java.rmi.RemoteException: java.io.IOException: Cannot create file /user/root/test/crawldb/670052811/part-00000/data on client DFSClient_-1133147307
  at org.apache.hadoop.dfs.NameNode.create(NameNode.java:137)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:615)
  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:237)
  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:216)
  at org.apache.hadoop.ipc.Client.call(Client.java:303)
  at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:141)
  at org.apache.hadoop.dfs.$Proxy1.create(Unknown Source)
  at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:587)
  at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:554)
  at org.apache.hadoop.dfs.DFSClient.create(DFSClient.java:99)
  at org.apache.hadoop.dfs.DistributedFileSystem.createRaw(DistributedFileSystem.java:83)
  at org.apache.hadoop.fs.FSDataOutputStream$Summer.<init>(FSDataOutputStream.java:39)
  at org.apache.hadoop.fs.FSDataOutputStream.<init>(FSDataOutputStream.java:128)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:180)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:168)
  at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:96)
  at org.apache.hadoop.io.MapFile$Writer.<init>(MapFile.java:101)
  at org.apache.hadoop.io.MapFile$Writer.<init>(MapFile.java:76)
  at org.apache.hadoop.mapred.MapFileOutputFormat.getRecordWriter(MapFileOutputFormat.java:38)
  at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:265)
  at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:709)
Anything wrong with my steps? Is this a known bug?
Thank you for your help.
Olive
Re: Could someone please share your experience with 0.8 step-by-step crawl??
Posted by mo...@richmondinformatics.com.
Hello Olive,
Quoting Olive g <ol...@hotmail.com>:
> Hi Monu,
>
> Thank you for your help. I double checked and I had plenty of disk
> space and /tmp was not filled up either. For my test case, I tested
> with only 200 urls.
No problem. Don't let me pretend to be an expert; I'm just on a different part
of the same steep learning curve :)
> Also, is the string "670052811" in the path right? I did not see any
> directory /user/root/test/crawldb/670052811/ while
> /user/root/test/crawldb/part-00000/data was there, or it was just
> some temp directory used by Nutch, and if that was the case, why
> would it fail if I had a lot of free space?
The sequence of generate, fetch, updatedb, invertlinks works for me. I index
later.
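For what it's worth, one round here looks roughly like this (the linkdb and
index paths are just examples of what I use, the segment name is whatever
generate has just created, and the indexing only happens after the last round;
check each command's usage message if the argument order differs on your build):
# bin/nutch generate crawl/db segments -topN 1250000
# bin/nutch fetch segments/20060330035131
# bin/nutch updatedb crawl/db segments/20060330035131
# bin/nutch invertlinks crawl/linkdb segments/20060330035131
# bin/nutch index crawl/indexes crawl/db crawl/linkdb segments/20060330035131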
The structure of the segments "tree" looks like this in my case:
segments/20060330035131
segments/20060330035131/content
segments/20060330035131/crawl_fetch
segments/20060330035131/crawl_generate
segments/20060330035131/crawl_parse
segments/20060330035131/parse_data
segments/20060330035131/parse_text
Here, the name of each segment is derived from the date and time
(yyyyMMddHHmmss); this seems to be the default behaviour of Nutch 0.8 with
Hadoop 0.1.
segments/20060330035131/parse_text/part-00000
segments/20060330035131/parse_text/part-00001
segments/20060330035131/parse_text/part-00002
segments/20060330035131/parse_text/part-00003
segments/20060330035131/parse_text/part-00004
segments/20060330035131/parse_text/part-00005
segments/20060330035131/parse_text/part-00006
segments/20060330035131/parse_text/part-00007
segments/20060330035131/parse_text/part-00008
segments/20060330035131/parse_text/part-00009
segments/20060330035131/parse_text/part-00010
segments/20060330035131/parse_text/part-00011
segments/20060330035131/parse_text/part-00012
As you see above, I haven't had a problem with the number of "parts". Indeed,
the above was again created with the default behaviour, e.g.:
# bin/nutch generate crawl/db segments -topN 1250000
and
# bin/nutch fetch segments/20060330035131
I don't know where this error comes from; maybe someone else can shed some
light on it.
> java.rmi.RemoteException: java.io.IOException: Cannot create file
> /user/root/test/crawldb/670052811/part-00000/data on client
> DFSClient_-1133147307 at
>
> How many reduce and map tasks did you use? I have been struggling
> with this issue for a while and it seems to be that Nutch can't
> handle more than 5 parts.
I am using a cluster of 1 x jobtracker and 6 x tasktrackers. Each has a single
3 GHz Xeon processor, 2 GB RAM, Gigabit Ethernet (over copper) and twin 400 GB
WD4000KD disks LVM'd together.
In this configuration I've had the best performance using (see the
hadoop-site.xml sketch below):
mapred.map.tasks - 61 (because the book says approx 10 x tasktrackers)
mapred.reduce.tasks - 6 (because it seems to work faster than 2 x tasktrackers)
mapred.tasktracker.tasks.maximum - 1 (because that's how many processors I have)
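In hadoop-site.xml terms that is simply (values obviously tuned to my 6-node
setup, not a recommendation):

<property>
<name>mapred.map.tasks</name>
<value>61</value>
</property>

<property>
<name>mapred.reduce.tasks</name>
<value>6</value>
</property>

<property>
<name>mapred.tasktracker.tasks.maximum</name>
<value>1</value>
</property>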
BTW, I got the last two figures from a conversation between YOU and Doug! :)
Good luck,
Monu
> Because of this, I am not able to run incremental crawling. Please
> see my previous message:
> http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg04150.html
>
> Anybody has any insight?
> Thanks!
>
> Olive
>
>
>
>> From: monu.ogbe@richmondinformatics.com
>> Reply-To: nutch-user@lucene.apache.org
>> To: nutch-user@lucene.apache.org, Olive g <ol...@hotmail.com>
>> Subject: Re: Could someone please share your experience with
>> 0.8step-by-step crawl??
>> Date: Tue, 18 Apr 2006 16:36:24 +0100
>>
>> Hello Olive,
>>
>> IIRC I got a similar message when the /tmp partition on my disks
>> filled up. I
>> then reconfigured the locations of all the directories in
>> hadoop-site.xml to a
>> more spacious area of my disk.
>>
>> Hope that helps; see below:
>>
>> <property>
>> <name>dfs.name.dir</name>
>> <value>/home/nutch/hadoop/dfs/name</value>
>> </property>
>>
>> <property>
>> <name>dfs.data.dir</name>
>> <value>/home/nutch/hadoop/dfs/data</value>
>> </property>
>>
>> <property>
>> <name>mapred.local.dir</name>
>> <value>/home/nutch/hadoop/mapred/local</value>
>> </property>
>>
>> <property>
>> <name>mapred.system.dir</name>
>> <value>/home/nutch/hadoop/mapred/system</value>
>> </property>
>>
>> <property>
>> <name>mapred.temp.dir</name>
>> <value>/home/nutch/hadoop/mapred/temp</value>
>> </property>
>>
>>
>> Quoting Olive g <ol...@hotmail.com>:
>>
>>> Hi,
>>>
>>> Are you guys able to run step-by-step crawl on 0.8 successfully?
>>>
>>> I am using Nutch 0.8 (3/31 build) and using DFS. I followed the 0.8
>>> tutorial for step-by-step crawling and got errors for updatedb. I
>>> used two reduce tasks and two map tasks. Here are the exact steps
>>> that I did:
>>>
>>> 1. bin/nutch inject test/crawldb urls
>>> 2. bin/nutch generate test/crawldb test/segments
>>> 3. bin/nutch fetch test/segments/20060415143555
>>> 4. bin/nutch updatedb test/crawldb test/segments/20060415143555
>>>
>>> Fetch one more round:
>>> 5. bin/nutch generate test/crawldb test/segments -topN 100
>>> 6. bin/nutch fetch test/segments/20060415150130
>>> 7. bin/nutch updatedb test/crawldb test/segments/20060415150130
>>>
>>> Fetch one more round:
>>> 8. bin/nutch generate test/crawldb test/segments -topN 100
>>> 9. bin/nutch fetch test/segments/20060415151309
>>>
>>> The steps above ran successfully and I kept checking the
>>> directories in DFS
>>> and doing nutch readdb and everything appeared to be fine.
>>>
>>> Then:
>>> 10. bin/nutch updatedb test/crawldb test/segments/20060415151309
>>>
>>> It failed with the following error for the two reduce tasks (the
>>> following log was for one
>>> of the two tasks):
>>>
>>> java.rmi.RemoteException: java.io.IOException: Cannot create file
>>> /user/root/test/crawldb/670052811/part-00000/data on client
>>> DFSClient_-1133147307 at
>>> [stack trace frames snipped; identical to the trace quoted above]
>>>
>>>
>>> Anything wrong with my steps? Is this a known bug?
>>>
>>> Thank you for your help.
>>>
>>> Olive
>>>
Re: Could someone please share your experience with 0.8 step-by-step crawl??
Posted by Olive g <ol...@hotmail.com>.
Hi Monu,
Thank you for your help. I double-checked and I had plenty of disk space, and
/tmp was not filled up either. For my test case, I tested with only 200 URLs.
Also, is the string "670052811" in the path right? I did not see any
directory /user/root/test/crawldb/670052811/, while
/user/root/test/crawldb/part-00000/data was there. Or is it just some temp
directory used by Nutch? And if so, why would it fail when I had plenty of
free space?
java.rmi.RemoteException: java.io.IOException: Cannot create file
/user/root/test/crawldb/670052811/part-00000/data on client
DFSClient_-1133147307
How many reduce and map tasks did you use? I have been struggling with this
issue for a while, and it seems that Nutch can't handle more than 5 parts.
Because of this, I am not able to run incremental crawling. Please see my
previous message:
http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg04150.html
Does anybody have any insight?
Thanks!
Olive
>From: monu.ogbe@richmondinformatics.com
>Reply-To: nutch-user@lucene.apache.org
>To: nutch-user@lucene.apache.org, Olive g <ol...@hotmail.com>
>Subject: Re: Could someone please share your experience with
>0.8step-by-step crawl??
>Date: Tue, 18 Apr 2006 16:36:24 +0100
>
>Hello Olive,
>
>IIRC I got a similar message when the /tmp partition on my disks filled up.
> I
>then reconfigured the locations of all the directories in hadoop-site.xml
>to a
>more spacious area of my disk.
>
>Hope that helps; see below:
>
><property>
> <name>dfs.name.dir</name>
> <value>/home/nutch/hadoop/dfs/name</value>
></property>
>
><property>
> <name>dfs.data.dir</name>
> <value>/home/nutch/hadoop/dfs/data</value>
></property>
>
><property>
> <name>mapred.local.dir</name>
> <value>/home/nutch/hadoop/mapred/local</value>
></property>
>
><property>
> <name>mapred.system.dir</name>
> <value>/home/nutch/hadoop/mapred/system</value>
></property>
>
><property>
> <name>mapred.temp.dir</name>
> <value>/home/nutch/hadoop/mapred/temp</value>
></property>
>
>
>Quoting Olive g <ol...@hotmail.com>:
>
>>Hi,
>>
>>Are you guys able to run step-by-step crawl on 0.8 successfully?
>>
>>I am using Nutch 0.8 (3/31 build) and using DFS. I followed the 0.8
>>tutorial for step-by-step crawling and got errors for updatedb. I used
>>two reduce tasks and two map tasks. Here are the exact steps that I did:
>>
>>1. bin/nutch inject test/crawldb urls
>>2. bin/nutch generate test/crawldb test/segments
>>3. bin/nutch fetch test/segments/20060415143555
>>4. bin/nutch updatedb test/crawldb test/segments/20060415143555
>>
>>Fetch one more round:
>>5. bin/nutch generate test/crawldb test/segments -topN 100
>>6. bin/nutch fetch test/segments/20060415150130
>>7. bin/nutch updatedb test/crawldb test/segments/20060415150130
>>
>>Fetch one more round:
>>8. bin/nutch generate test/crawldb test/segments -topN 100
>>9. bin/nutch fetch test/segments/20060415151309
>>
>>The steps above ran successfully and I kept checking the directories
>>in DFS
>>and doing nutch readdb and everything appeared to be fine.
>>
>>Then:
>>10. bin/nutch updatedb test/crawldb test/segments/20060415151309
>>
>>It failed with the following error for the two reduce tasks (the following
>>log was for one
>>of the two tasks):
>>
>>java.rmi.RemoteException: java.io.IOException: Cannot create file
>>/user/root/test/crawldb/670052811/part-00000/data on client
>>DFSClient_-1133147307 at
>>[stack trace frames snipped; identical to the trace above]
>>
>>
>>Anything wrong with my steps? Is this a known bug?
>>
>>Thank you for your help.
>>
>>Olive
>>
Re: Could someone please share your experience with 0.8 step-by-step crawl??
Posted by mo...@richmondinformatics.com.
Hello Olive,
IIRC I got a similar message when the /tmp partition on my disks filled up. I
then reconfigured the locations of all the directories in hadoop-site.xml to a
more spacious area of my disk.
Hope that helps; see below:
<property>
<name>dfs.name.dir</name>
<value>/home/nutch/hadoop/dfs/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/nutch/hadoop/dfs/data</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/home/nutch/hadoop/mapred/local</value>
</property>
<property>
<name>mapred.system.dir</name>
<value>/home/nutch/hadoop/mapred/system</value>
</property>
<property>
<name>mapred.temp.dir</name>
<value>/home/nutch/hadoop/mapred/temp</value>
</property>
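And to rule out the full-/tmp case quickly before changing anything, a plain
df on the relevant paths is enough (the paths below are just my layout):
# df -h /tmp /home/nutch/hadoop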
Quoting Olive g <ol...@hotmail.com>:
> Hi,
>
> Are you guys able to run step-by-step crawl on 0.8 successfully?
>
> I am using Nutch 0.8 (3/31 build) and using DFS. I followed the 0.8
> tutorial for step-by-step crawling and got errors for updatedb. I
> used two reduce tasks and two map tasks. Here are the exact steps
> that I did:
>
> 1. bin/nutch inject test/crawldb urls
> 2. bin/nutch generate test/crawldb test/segments
> 3. bin/nutch fetch test/segments/20060415143555
> 4. bin/nutch updatedb test/crawldb test/segments/20060415143555
>
> Fetch one more round:
> 5. bin/nutch generate test/crawldb test/segments -topN 100
> 6. bin/nutch fetch test/segments/20060415150130
> 7. bin/nutch updatedb test/crawldb test/segments/20060415150130
>
> Fetch one more round:
> 8. bin/nutch generate test/crawldb test/segments -topN 100
> 9. bin/nutch fetch test/segments/20060415151309
>
> The steps above ran successfully and I kept checking the
> directories in DFS
> and doing nutch readdb and everything appeared to be fine.
>
> Then:
> 10. bin/nutch updatedb test/crawldb test/segments/20060415151309
>
> It failed with the following error for the two reduce tasks (the
> following log was for one
> of the two tasks):
>
> java.rmi.RemoteException: java.io.IOException: Cannot create file
> /user/root/test/crawldb/670052811/part-00000/data on client
> DFSClient_-1133147307 at
> [stack trace frames snipped; identical to the trace above]
>
>
> Anything wrong with my steps? Is this a known bug?
>
> Thank you for your help.
>
> Olive
>
Re: Could someone please share your experience with 0.8 step-by-step crawl??
Posted by mo...@richmondinformatics.com.
Hello Olive/Team,
Good news, Olive (depending on your point of view): you are no longer alone.
I have had identical errors during the reduce phase of a fetch on a segment
of 1250000 URLs.
I will raise a separate request for help with this, containing full details.
TTFN
Monu
Quoting Olive g <ol...@hotmail.com>:
> Hi,
>
> Are you guys able to run step-by-step crawl on 0.8 successfully?
>
> I am using Nutch 0.8 (3/31 build) and using DFS. I followed the 0.8
> tutorial for step-by-step crawling and got errors for updatedb. I
> used two reduce tasks and two map tasks. Here are the exact steps
> that I did:
>
> 1. bin/nutch inject test/crawldb urls
> 2. bin/nutch generate test/crawldb test/segments
> 3. bin/nutch fetch test/segments/20060415143555
> 4. bin/nutch updatedb test/crawldb test/segments/20060415143555
>
> Fetch one more round:
> 5. bin/nutch generate test/crawldb test/segments -topN 100
> 6. bin/nutch fetch test/segments/20060415150130
> 7. bin/nutch updatedb test/crawldb test/segments/20060415150130
>
> Fetch one more round:
> 8. bin/nutch generate test/crawldb test/segments -topN 100
> 9. bin/nutch fetch test/segments/20060415151309
>
> The steps above ran successfully and I kept checking the
> directories in DFS
> and doing nutch readdb and everything appeared to be fine.
>
> Then:
> 10. bin/nutch updatedb test/crawldb test/segments/20060415151309
>
> It failed with the following error for the two reduce tasks (the
> following log was for one
> of the two tasks):
>
> java.rmi.RemoteException: java.io.IOException: Cannot create file
> /user/root/test/crawldb/670052811/part-00000/data on client
> DFSClient_-1133147307 at
> [stack trace frames snipped; identical to the trace above]
>
>
> Anything wrong with my steps? Is this a known bug?
>
> Thank you for your help.
>
> Olive
>