Posted to user@nutch.apache.org by Pratik Garg <sa...@gmail.com> on 2012/12/04 17:28:10 UTC

CrawlData and seed url structure for nutch

Hi ,

I have been trying to use/implement Nutch for our client. I have gone
through the forums and online documentation, but I am not clear on the
structure of the crawldb and the URLs one should keep.

*What I did*
I have a crawldb and a corresponding crawlurl folder containing seed.txt,
which I provide to Nutch for crawling and later use for indexing into Solr
via Nutch.

*What I cannot do*
If I add new URLs to the seed.txt, they are not picked up; instead, just
the old pages are crawled again and updated.

*Solution*
Add a new folder for the crawl data and URLs, ask Nutch to crawl them, and
then index them. Which basically means that every time I have to add a new
URL, I have to go through this process??

Any help will be appreciated.

Thanks,
Pratik

RE: CrawlData and seed url structure for nutch

Posted by Markus Jelsma <ma...@openindex.io>.

 
 
-----Original message-----
> From:Pratik Garg <sa...@gmail.com>
> Sent: Wed 05-Dec-2012 19:18
> To: user@nutch.apache.org
> Cc: Chirag Goel <go...@gmail.com>
> Subject: CrawlData and seed url structure for nutch
> 
> Hi ,
> 
> I have been trying to use/implement Nutch for our client. I have gone
> through the forums and online documentation, but I am not clear on the
> structure of the crawldb and the URLs one should keep.
> 
> *What I did*
> I have a crawldb and a corresponding crawlurl folder containing seed.txt,
> which I provide to Nutch for crawling and later use for indexing into Solr
> via Nutch.
> 
> *What I cannot do*
> If I add new URLs to the seed.txt, they are not picked up; instead, just
> the old pages are crawled again and updated.

Correct, you need to inject (add) them into the crawldb via the injector tool. It works the same way as when you created the crawldb in the first place.
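For example, a minimal sketch of that workflow with the Nutch 1.x command-line tools might look like the following (the paths `crawl/crawldb` and `urls/` are just placeholders; substitute your own crawldb and seed directory):

```
# Add the new URLs from urls/seed.txt into the existing crawldb.
# Already-known URLs are left alone; only new ones are added.
bin/nutch inject crawl/crawldb urls/

# Then run a normal crawl cycle so the newly injected URLs get fetched:
bin/nutch generate crawl/crawldb crawl/segments
bin/nutch fetch crawl/segments/<segment>
bin/nutch parse crawl/segments/<segment>
bin/nutch updatedb crawl/crawldb crawl/segments/<segment>
```

So there is no need to create a new crawldb or a new folder each time; injecting into the existing crawldb merges the new seeds in, and the next generate/fetch cycle picks them up.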

> 
> *Solution*
> Add a new folder for the crawl data and URLs, ask Nutch to crawl them, and
> then index them. Which basically means that every time I have to add a new
> URL, I have to go through this process??
> 
> Any help will be appreciated.
> 
> Thanks,
> Pratik
>