Posted to user@nutch.apache.org by Ismael <kr...@gmail.com> on 2007/08/24 23:10:16 UTC

How to get the crawl database free of links to recrawl only from seed URL?

Hello,

I'm using the Nutch 0.9 jar to program crawls in Java with a predefined
depth, and I'm having a problem when trying to recrawl. I don't know if
I'm solving it in the right way:

The first crawl gives me no problems, but when I recrawl, the crawl
database still contains the pages and links from the previous operation,
so a depth-1 crawl followed by another depth-1 recrawl behaves like a
depth-2 crawl. For example:

I make a depth-1 crawl of www.fgfgfgfgfgfgf.com; it fetches that page,
and the fetched content contains a link to www.vbvbvbvbvbvbvbvb.com.
When I recrawl with depth 1 again, it crawls both the first site and the
second one, which was added during the first crawl. So it is as if I had
made a depth-2 crawl of the first site, not a depth-1 recrawl.

To solve this, when I recrawl I build a temporary crawl database starting
from my seed URLs, run the crawl to the desired depth on that one, and
then update my original database with the information fetched during that
temporary recrawl. To make it clear:
Recrawl:
	inject seed URLs into <temp database>
	cycle (once per depth level):
		generate from <temp database>
		fetch the generated segment
		update both databases, <original database> and <temp database>, with the fetched information
	end cycle
	build an index from the <original database> information, delete duplicates and merge with the old index
	delete <temp database>
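
In terms of the Nutch 0.9 command-line tools the same procedure would look
roughly like this (I actually drive it through the Java API; the paths, the
depth variable and the index layout below are only illustrative):

# Untested sketch of the recrawl above. Assumes the seed URLs are in
# seed_urls/, the permanent db lives under crawl/ and the temporary one
# under tmp/.
DEPTH=1

bin/nutch inject tmp/crawldb seed_urls

for i in `seq 1 $DEPTH`; do
    bin/nutch generate tmp/crawldb tmp/segments
    segment=`ls -d tmp/segments/2* | tail -1`
    bin/nutch fetch $segment
    # update both databases with the segment that was just fetched
    bin/nutch updatedb tmp/crawldb $segment
    bin/nutch updatedb crawl/crawldb $segment
done

# index the new segments, remove duplicates, merge with the existing indexes
bin/nutch invertlinks crawl/linkdb -dir tmp/segments
bin/nutch index crawl/indexes_new crawl/crawldb crawl/linkdb tmp/segments/*
bin/nutch dedup crawl/indexes_new
bin/nutch merge crawl/index_merged crawl/indexes_new crawl/indexes

# throw the temporary database away
rm -r tmp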


I would like to know if there is a better way to do recrawling (without
making a temp database; something like removing the discovered links from
my database so that the next recrawl starts only from the seed URLs, but I
didn't find a way to do that in the Nutch 0.9 API), and whether the way I
solved the problem is correct or has some hidden bug that I will regret
when the application is almost done.

Thank you for reading!

Re: How to get the crawl database free of links to recrawl only from seed URL?

Posted by Gabriele Kahlout <ga...@mysimpatico.com>.
Ismael, I've been having a similar issue, i.e. I want to fetch all the
seeds first, and only then maybe other urls. In my script I therefore fetch
with depth 1 and set the db.update.additions.allowed property to false.
Setting the property helps if you plan on restarting the script, since when
it restarts and calls generate on the crawldb it will not find the urls
discovered during the previous depth-1 crawl. However, you will not have
collected any urls other than the seeds.

Solutions to the issue:

1. Fetch all your seeds with depth 1 in one go (running bin/nutch generate
only once). Once done with that, you can crawl with greater depth.

If you cannot tolerate having to fetch all seeds at once:
2. Partition the seeds into chunks you can tolerate fetching at once. Then
apply 1. to each chunk in a dedicated per-chunk crawldb (so that the chunk
crawldb has only that chunk's urls injected). At the end, merge the results
of the per-chunk crawldbs into a global crawldb, as in the sketch below.
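
Something like this (untested; the chunk layout is just an example, and it
assumes bin/nutch mergedb, i.e. CrawlDbMerger, is available in your
version):

# each seeds/chunk_N/ is a directory holding one seed list file
for chunk in seeds/chunk_*; do
    name=`basename $chunk`
    db=chunkdbs/$name/crawldb
    segs=chunkdbs/$name/segments
    bin/nutch inject $db $chunk
    bin/nutch generate $db $segs
    segment=`ls -d $segs/2* | tail -1`
    bin/nutch fetch $segment
    bin/nutch updatedb $db $segment
done
# merge all the per-chunk crawldbs into the global one
bin/nutch mergedb crawl/crawldb_merged crawl/crawldb chunkdbs/*/crawldb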

3. Use db.update.additions.allowed in your seed crawl. Then, before you
discard an indexed segment, read the urls discovered at that depth into a
discovered-urls list. In your 2nd crawl you inject the discovered-urls list
to crawl the urls discovered in the previous crawl (and possibly new ones,
if you don't set db.update.additions.allowed). Your script code would be
something like:
 
        # dump only the crawl_parse part of the segment(s)
        $nutch readseg -dump $it_segs crawl_parse_dump -nocontent -nofetch \
            -nogenerate -noparsedata -noparsetext
        tstamp=`date -u +%s`
        # extract the discovered urls from the dump into a local file
        $hdfs -cat crawl_parse_dump/dump | grep 'http://' | sed 's/URL:: //' \
            > $tstamp
        # store the list on the hdfs for the next crawl to pick up
        $hdfs -put $tstamp $crawl/discovered-urls/
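
In the 2nd crawl the collected list can then be fed back with inject (same
$nutch and $crawl variables as above, assuming the crawldb sits at
$crawl/crawldb):

        $nutch inject $crawl/crawldb $crawl/discovered-urls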

The simplest, but expensive:
4. Crawl with db.update.additions.allowed set to false, and then crawl
again with it set to true and db.fetch.interval.default set to 0 (or just
wipe out the crawldb; I'm not sure what information in there remains
useful).

I prefer 3 and 2, but I couldn't test 3, since readseg -dump expects a
single segment path, and I have an issue with that
(https://issues.apache.org/jira/browse/NUTCH-1001). I intended to patch
SegmentReader to accept a dir of segments and output a single dump file,
but since hadoop doesn't support append I'm not sure how to do it w/o much
complication (PATCH WELCOME). Simpler would be outputting multiple dumps,
but then one must process each dump in turn too, and having the files on
the hdfs makes manipulating them more complicated.
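
Until that's fixed, a workaround is to loop over the segments and dump them
one at a time, appending the urls to a single local list, e.g. (untested,
same variables as the snippet above):

        # segment dirs are named by timestamp, hence the grep
        tstamp=`date -u +%s`
        for seg in `$hdfs -ls $it_segs | grep '/20' | awk '{print $NF}'`; do
            $nutch readseg -dump $seg dump_$tstamp -nocontent -nofetch \
                -nogenerate -noparsedata -noparsetext
            $hdfs -cat dump_$tstamp/dump | grep 'http://' \
                | sed 's/URL:: //' >> $tstamp
            $hdfs -rmr dump_$tstamp
        done
        $hdfs -put $tstamp $crawl/discovered-urls/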



Re: How to get the crawl database free of links to recrawl only from seed URL?

Posted by Ismael <kr...@gmail.com>.
Thank you for answering, John:

That property won't help me. When it is set to false, updatedb only
updates information about the injected URLs and doesn't add the URLs
discovered in the generate-fetch-update cycle, so a depth-5 crawl with this
property set to false is like five depth-1 crawls. I'm looking for a way to
tell the crawl database to generate pages only from the injected URLs and
the URLs reached during the current crawl operation, and, when a new crawl
is launched, to start the cycle again only from the injected seed URLs,
expanding to the desired depth from those seeds.

Again, thank you for answering.

Ismael

2007/8/25, John Mendenhall <jo...@surfutopia.net>:
> > The first crawl gives me no problems, but when I recrawl, the crawl
> > database still contains the pages and links from the previous operation,
> > so a depth-1 crawl followed by another depth-1 recrawl behaves like a
> > depth-2 crawl. For example:
> >
> > I make a depth-1 crawl of www.fgfgfgfgfgfgf.com; it fetches that page,
> > and the fetched content contains a link to www.vbvbvbvbvbvbvbvb.com.
> > When I recrawl with depth 1 again, it crawls both the first site and the
> > second one, which was added during the first crawl. So it is as if I had
> > made a depth-2 crawl of the first site, not a depth-1 recrawl.
>
> I think you are looking for this property setting:
>
> <property>
>   <name>db.update.additions.allowed</name>
>   <value>false</value>
>   <description>If true, updatedb will add newly discovered URLs, if false
>   only already existing URLs in the CrawlDb will be updated and no new
>   URLs will be added.
>   </description>
> </property>
>
> I hope that helps.
>
> JohnM
>
> --
> john mendenhall
> john@surfutopia.net
> surf utopia
> internet services
>

Re: How to get the crawl database free of links to recrawl only from seed URL?

Posted by John Mendenhall <jo...@surfutopia.net>.
> The first crawl gives me no problems, but when I recrawl, the crawl
> database still contains the pages and links from the previous operation,
> so a depth-1 crawl followed by another depth-1 recrawl behaves like a
> depth-2 crawl. For example:
> 
> I make a depth-1 crawl of www.fgfgfgfgfgfgf.com; it fetches that page,
> and the fetched content contains a link to www.vbvbvbvbvbvbvbvb.com.
> When I recrawl with depth 1 again, it crawls both the first site and the
> second one, which was added during the first crawl. So it is as if I had
> made a depth-2 crawl of the first site, not a depth-1 recrawl.

I think you are looking for this property setting:

<property>
  <name>db.update.additions.allowed</name>
  <value>false</value>
  <description>If true, updatedb will add newly discovered URLs, if false
  only already existing URLs in the CrawlDb will be updated and no new
  URLs will be added.
  </description>
</property>
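
One quick way to check the effect (the path below is just an example) is to
look at the CrawlDb statistics after updatedb runs; with additions disabled
the URL count should stay equal to the number of injected seeds:

  bin/nutch readdb crawl/crawldb -stats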

I hope that helps.

JohnM

-- 
john mendenhall
john@surfutopia.net
surf utopia
internet services