Posted to user@nutch.apache.org by "Kevin.Y" <02...@163.com> on 2008/01/22 21:21:38 UTC

Need some advice about updating crawl data

I'm using Nutch 0.9 to crawl a specific set of "content" URLs, such as
http://xxxxx/art/1.htm
http://xxxxx/art/2.htm
http://xxxxx/art/3.htm
....

Here is what I'm doing:
I put these "content" URLs into a url.txt file, then run a crawl with the
"bin/nutch crawl" command.
That gives me a crawl directory; let me call it crawl_A.
I set crawl_A as the searcher.dir of the webapp.
So far, searching works normally.
Then I crawl another set of "content" URLs to get crawl_B, and merge it with
crawl_A using the script here:
http://wiki.apache.org/nutch/MergeCrawl
The merge produces a new directory, crawl_C.
Finally I stop Tomcat, replace crawl_A with crawl_C, and restart Tomcat.

That's how I "update" my crawl data, and I don't think it's a smart way to
do it, especially since I have to stop and restart Tomcat; otherwise I get
file problems.

Is there a better way? Any advice would be appreciated!


-- 
View this message in context: http://www.nabble.com/Need-some-advise-about-updating-crawl-data-tp15027375p15027375.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Need some advice about updating crawl data

Posted by bhupal <bh...@research.iiit.ac.in>.

Hi Kevin,

After you replace the crawl folder, just touch the webapp's web.xml instead
of restarting Tomcat.

Use this command:
touch your_webapp_folder/WEB-INF/web.xml
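
The two steps (replace the crawl folder, then touch web.xml) could be wrapped
in a small shell function. This is only a sketch: the function name
swap_crawl, the ".old" backup directory and all paths are made up for
illustration, and whether Tomcat actually reloads the webapp when web.xml
changes depends on how your Tomcat is configured.

```shell
#!/bin/sh
# Sketch only: swap in a freshly merged crawl directory, then touch web.xml
# so that Tomcat can notice the change. All names and paths are hypothetical.

swap_crawl() {
    live="$1"      # directory the webapp searches, e.g. crawl_A
    merged="$2"    # freshly merged crawl, e.g. crawl_C
    webapp="$3"    # deployed webapp folder (your_webapp_folder above)

    # Refuse to do anything if an input is missing.
    [ -d "$merged" ] || { echo "missing $merged" >&2; return 1; }
    [ -f "$webapp/WEB-INF/web.xml" ] || { echo "missing web.xml" >&2; return 1; }

    # Keep the previous crawl around in case the new one turns out broken.
    if [ -d "$live" ]; then
        rm -rf "$live.old"
        mv "$live" "$live.old"
    fi
    mv "$merged" "$live"

    # Same command as above: update the timestamp on web.xml.
    touch "$webapp/WEB-INF/web.xml"
}
```

It would be called as, for example,
swap_crawl crawl_A crawl_C your_webapp_folder
The old crawl is moved aside rather than deleted, so it can be put back by
hand if the merged data turns out to be bad.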

bye,
bhupal



-- 
View this message in context: http://www.nabble.com/Need-some-advise-about-updating-crawl-data-tp15027375p15155283.html