Posted to user@nutch.apache.org by eyal edri <ey...@gmail.com> on 2007/09/19 17:46:43 UTC

freegen handles duplicate (recurrent) URLs in crawldb?

Hi,

I've been advised to use the 'freegen' tool to generate and fetch a
fresh URL list (from a text file) while disregarding any depth-x URLs
injected into the crawldb as a result of previous fetches.

I ran a small test and noticed that when using the freegen tool, Nutch
doesn't check for duplicate URLs that have already been fetched and are
in the crawldb. Meaning, I can fetch a number of URLs from a certain
segment, update them into the db, and then use generate/freegen with the
same URLs and fetch them again; it will not check whether those URLs
were already fetched (resulting in an unnecessary fetch action).
I'm quite convinced that a subsequent updatedb will remove the dups, but
it's still not efficient.
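
For illustration, the test sequence looked roughly like this (paths and
segment names are placeholders):

  bin/nutch inject crawl/crawldb urls/
  bin/nutch generate crawl/crawldb crawl/segments
  bin/nutch fetch crawl/segments/<segment>
  bin/nutch updatedb crawl/crawldb crawl/segments/<segment>
  bin/nutch freegen urls/ crawl/segments
  bin/nutch fetch crawl/segments/<new segment>

The last fetch happily re-fetches everything in urls/, even though those
URLs are already marked as fetched in the crawldb.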

Anyway, I would be glad if someone could help with that (or perhaps even
contradict me).


-- 
Eyal Edri

Re: freegen handles duplicate (recurrent) URLs in crawldb?

Posted by Andrzej Bialecki <ab...@getopt.org>.
eyal edri wrote:
> Hi,
> 
> I've been advised to use the 'freegen' tool to generate and fetch a
> fresh URL list (from a text file) while disregarding any depth-x URLs
> injected into the crawldb as a result of previous fetches.
> 
> I ran a small test and noticed that when using the freegen tool, Nutch
> doesn't check for duplicate URLs that have already been fetched and are
> in the crawldb. Meaning, I can fetch a number of URLs from a certain
> segment, update them into the db, and then use generate/freegen with
> the same URLs and fetch them again; it will not check whether those
> URLs were already fetched (resulting in an unnecessary fetch action).
> I'm quite convinced that a subsequent updatedb will remove the dups,
> but it's still not efficient.
> 
> Anyway, I would be glad if someone could help with that (or perhaps
> even contradict me).

You are correct, that's the way this tool works. When you use generate, 
it does check the crawldb. freegen, as the name implies, allows you to 
create arbitrary fetchlists, without checking the crawldb.
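
Roughly, the difference looks like this (paths are just examples):

  # generate consults the crawldb and skips URLs that don't need fetching
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000

  # freegen builds a fetchlist directly from a directory of text files,
  # without ever looking at the crawldb
  bin/nutch freegen urls/ crawl/segments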

If you want to generate arbitrary fetchlists which contain only those 
URLs that are absent from the crawldb or still unfetched, then you need 
to write another tool to do that.
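
As a stopgap, here is an untested sketch using standard tools (assuming
the readdb -dump text format, where the URL is the first tab-separated
field on each record's first line):

  bin/nutch readdb crawl/crawldb -dump dbdump
  grep '^http' dbdump/part-00000 | cut -f1 | sort -u > known.txt
  sort -u urls.txt | comm -23 - known.txt > fresh/urls.txt
  bin/nutch freegen fresh/ crawl/segments

Note that this drops every URL already present in the crawldb, including
db_unfetched ones; a proper tool would also look at the CrawlDatum
status.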


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com