Posted to user@nutch.apache.org by Ferdy Galema <fe...@kalooga.com> on 2011/11/01 10:56:32 UTC

Re: Removing urls from crawl db

As for reading the crawldb, you can use
org.apache.nutch.crawl.CrawlDbReader. This allows for dumping the
crawldb into a readable text file as well as querying individual urls.
Run without args to see its usage.
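
For instance (a rough sketch of the command-line equivalents; the
"crawl/crawldb" path and the output directories are just example
locations):

  # overall statistics for the crawldb
  bin/nutch readdb crawl/crawldb -stats

  # dump the whole crawldb as readable text
  bin/nutch readdb crawl/crawldb -dump crawldb_dump

  # query the status of a single url
  bin/nutch readdb crawl/crawldb -url http://www.example.com/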

On 10/31/2011 08:47 PM, Markus Jelsma wrote:
> Hi
>
> Write a regex URL filter and use it the next time you update the db; it
> will disappear. Be sure to back up the db first in case your regex catches
> valid URLs. Nutch 1.5 will have an option to keep the previous version of
> the DB after update.
>
> cheers
>
>> We accidentally injected some urls into the crawl database and I need to go
>> remove them.  From what I understand, in 1.4 I can view and modify the urls
>> and indexes.  But I can't seem to find any information on how to do this.
>>
>> Is there anything regarding this available?

Re: Removing urls from crawl db

Posted by Bai Shen <ba...@gmail.com>.
No idea where I got that impression from.  I just thought it was one of the
reasons to move to 1.4 even though it's still in dev.

On Tue, Nov 1, 2011 at 4:54 PM, Markus Jelsma <ma...@openindex.io> wrote:

> > It seems like there would be a better way to do that.
>
> The problem is that there are many files storing URLs: CrawlDB, LinkDB,
> WebGraph DBs, and segment data. In Nutch 1.x there is no single place
> where you can find a URL.
>
> For example, if we find URL patterns we don't want, we write additional
> filters for them and have to update all DBs again, which can take
> minutes, hours or days depending on size and cluster capacity.
>
> > I thought 1.4 was going to have a Luke-style capability in regards to
> > its data?
>
> Where did you read that? That is, unfortunately, not the case :)
>
> > On Tue, Nov 1, 2011 at 4:45 PM, Markus Jelsma <ma...@openindex.io> wrote:
> > > > I think you must add a regex to regex-urlfilter.txt. In that case
> > > > those urls will not be fetched by the fetcher.
> > >
> > > Yes, but if you use it when doing updatedb it will disappear from the
> > > crawldb entirely.
> > >
> > > > -----Original Message-----
> > > > From: Bai Shen <ba...@gmail.com>
> > > > To: user <us...@nutch.apache.org>
> > > > Sent: Tue, Nov 1, 2011 10:35 am
> > > > Subject: Re: Removing urls from crawl db
> > > >
> > > > Already did that. But it doesn't allow me to delete urls from the
> > > > list to be crawled.
> > > >
> > > > On Tue, Nov 1, 2011 at 5:56 AM, Ferdy Galema <fe...@kalooga.com> wrote:
> > > > > As for reading the crawldb, you can use
> > > > > org.apache.nutch.crawl.CrawlDbReader. This allows for dumping the
> > > > > crawldb into a readable text file as well as querying individual
> > > > > urls. Run without args to see its usage.
> > > > >
> > > > > On 10/31/2011 08:47 PM, Markus Jelsma wrote:
> > > > > > Hi
> > > > > >
> > > > > > Write a regex URL filter and use it the next time you update
> > > > > > the db; it will disappear. Be sure to back up the db first in
> > > > > > case your regex catches valid URLs. Nutch 1.5 will have an
> > > > > > option to keep the previous version of the DB after update.
> > > > > >
> > > > > > cheers
> > > > > >
> > > > > > > We accidentally injected some urls into the crawl database
> > > > > > > and I need to go remove them. From what I understand, in 1.4
> > > > > > > I can view and modify the urls and indexes. But I can't seem
> > > > > > > to find any information on how to do this.
> > > > > > >
> > > > > > > Is there anything regarding this available?

Re: Removing urls from crawl db

Posted by Markus Jelsma <ma...@openindex.io>.
> It seems like there would be a better way to do that.

The problem is that there are many files storing URLs: CrawlDB, LinkDB,
WebGraph DBs, and segment data. In Nutch 1.x there is no single place where
you can find a URL.

For example, if we find URL patterns we don't want, we write additional
filters for them and have to update all DBs again, which can take minutes,
hours or days depending on size and cluster capacity.
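
To give an idea of the work, applying a new filter everywhere means
rewriting each store through the url filters, roughly like this (a sketch;
the paths are examples, and each tool picks up the filters configured in
regex-urlfilter.txt):

  # rewrite the crawldb through the url filters
  bin/nutch mergedb crawl/crawldb_filtered crawl/crawldb -filter

  # same for the linkdb
  bin/nutch mergelinkdb crawl/linkdb_filtered crawl/linkdb -filter

  # and for the segment data
  bin/nutch mergesegs crawl/segments_filtered -dir crawl/segments -filter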

> 
> I thought 1.4 was going to have a Luke-style capability in regards to its
> data?

Where did you read that? That is, unfortunately, not the case :)

> 
> On Tue, Nov 1, 2011 at 4:45 PM, Markus Jelsma <ma...@openindex.io> wrote:
> > > I think you must add a regex to regex-urlfilter.txt. In that case
> > > those urls will not be fetched by the fetcher.
> >
> > Yes, but if you use it when doing updatedb it will disappear from the
> > crawldb entirely.
> >
> > > -----Original Message-----
> > > From: Bai Shen <ba...@gmail.com>
> > > To: user <us...@nutch.apache.org>
> > > Sent: Tue, Nov 1, 2011 10:35 am
> > > Subject: Re: Removing urls from crawl db
> > >
> > > Already did that. But it doesn't allow me to delete urls from the
> > > list to be crawled.
> > >
> > > On Tue, Nov 1, 2011 at 5:56 AM, Ferdy Galema <fe...@kalooga.com> wrote:
> > > > As for reading the crawldb, you can use
> > > > org.apache.nutch.crawl.CrawlDbReader. This allows for dumping the
> > > > crawldb into a readable text file as well as querying individual
> > > > urls. Run without args to see its usage.
> > > >
> > > > On 10/31/2011 08:47 PM, Markus Jelsma wrote:
> > > > > Hi
> > > > >
> > > > > Write a regex URL filter and use it the next time you update the
> > > > > db; it will disappear. Be sure to back up the db first in case
> > > > > your regex catches valid URLs. Nutch 1.5 will have an option to
> > > > > keep the previous version of the DB after update.
> > > > >
> > > > > cheers
> > > > >
> > > > > > We accidentally injected some urls into the crawl database and
> > > > > > I need to go remove them. From what I understand, in 1.4 I can
> > > > > > view and modify the urls and indexes. But I can't seem to find
> > > > > > any information on how to do this.
> > > > > >
> > > > > > Is there anything regarding this available?

Re: Removing urls from crawl db

Posted by Bai Shen <ba...@gmail.com>.
It seems like there would be a better way to do that.

I thought 1.4 was going to have a Luke-style capability in regards to its
data?

On Tue, Nov 1, 2011 at 4:45 PM, Markus Jelsma <ma...@openindex.io> wrote:

> > I think you must add a regex to regex-urlfilter.txt. In that case those
> > urls will not be fetched by the fetcher.
>
> Yes, but if you use it when doing updatedb it will disappear from the
> crawldb entirely.
>
> > -----Original Message-----
> > From: Bai Shen <ba...@gmail.com>
> > To: user <us...@nutch.apache.org>
> > Sent: Tue, Nov 1, 2011 10:35 am
> > Subject: Re: Removing urls from crawl db
> >
> > Already did that. But it doesn't allow me to delete urls from the list
> > to be crawled.
> >
> > On Tue, Nov 1, 2011 at 5:56 AM, Ferdy Galema <fe...@kalooga.com> wrote:
> > > As for reading the crawldb, you can use
> > > org.apache.nutch.crawl.CrawlDbReader. This allows for dumping the
> > > crawldb into a readable text file as well as querying individual urls.
> > > Run without args to see its usage.
> > >
> > > On 10/31/2011 08:47 PM, Markus Jelsma wrote:
> > > > Hi
> > > >
> > > > Write a regex URL filter and use it the next time you update the db;
> > > > it will disappear. Be sure to back up the db first in case your
> > > > regex catches valid URLs. Nutch 1.5 will have an option to keep the
> > > > previous version of the DB after update.
> > > >
> > > > cheers
> > > >
> > > > > We accidentally injected some urls into the crawl database and I
> > > > > need to go remove them. From what I understand, in 1.4 I can view
> > > > > and modify the urls and indexes. But I can't seem to find any
> > > > > information on how to do this.
> > > > >
> > > > > Is there anything regarding this available?

Re: Removing urls from crawl db

Posted by Markus Jelsma <ma...@openindex.io>.
> I think you must add a regex to regex-urlfilter.txt. In that case those
> urls will not be fetched by the fetcher.

Yes, but if you use it when doing updatedb it will disappear from the
crawldb entirely.
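
For example (a sketch; the paths and the segment timestamp are illustrative,
and depending on your configuration you may need to enable filtering during
the update explicitly with -filter):

  # update the crawldb from the latest segment, applying the url filters
  bin/nutch updatedb crawl/crawldb crawl/segments/20111101103500 -filter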

> -----Original Message-----
> From: Bai Shen <ba...@gmail.com>
> To: user <us...@nutch.apache.org>
> Sent: Tue, Nov 1, 2011 10:35 am
> Subject: Re: Removing urls from crawl db
>
> Already did that. But it doesn't allow me to delete urls from the list
> to be crawled.
>
> On Tue, Nov 1, 2011 at 5:56 AM, Ferdy Galema <fe...@kalooga.com> wrote:
> > As for reading the crawldb, you can use
> > org.apache.nutch.crawl.CrawlDbReader. This allows for dumping the
> > crawldb into a readable text file as well as querying individual urls.
> > Run without args to see its usage.
> >
> > On 10/31/2011 08:47 PM, Markus Jelsma wrote:
> > > Hi
> > >
> > > Write a regex URL filter and use it the next time you update the db;
> > > it will disappear. Be sure to back up the db first in case your regex
> > > catches valid URLs. Nutch 1.5 will have an option to keep the
> > > previous version of the DB after update.
> > >
> > > cheers
> > >
> > > > We accidentally injected some urls into the crawl database and I
> > > > need to go remove them. From what I understand, in 1.4 I can view
> > > > and modify the urls and indexes. But I can't seem to find any
> > > > information on how to do this.
> > > >
> > > > Is there anything regarding this available?

Re: Removing urls from crawl db

Posted by al...@aim.com.
I think you must add a regex to regex-urlfilter.txt. In that case those urls will not be fetched by the fetcher.
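
For example, an exclusion rule in regex-urlfilter.txt (a sketch; the host is
made up, and the rule must appear before the final catch-all "+." line,
since the first matching pattern wins):

  # deny everything under the accidentally injected host
  -^http://badhost\.example\.com/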
 

-----Original Message-----
From: Bai Shen <ba...@gmail.com>
To: user <us...@nutch.apache.org>
Sent: Tue, Nov 1, 2011 10:35 am
Subject: Re: Removing urls from crawl db


Already did that.  But it doesn't allow me to delete urls from the list to
be crawled.

On Tue, Nov 1, 2011 at 5:56 AM, Ferdy Galema <fe...@kalooga.com> wrote:

> As for reading the crawldb, you can use org.apache.nutch.crawl.CrawlDbReader.
> This allows for dumping the crawldb into a readable text file as well as
> querying individual urls. Run without args to see its usage.
>
> On 10/31/2011 08:47 PM, Markus Jelsma wrote:
>
>> Hi
>>
>> Write a regex URL filter and use it the next time you update the db; it
>> will disappear. Be sure to back up the db first in case your regex catches
>> valid URLs. Nutch 1.5 will have an option to keep the previous version of
>> the DB after update.
>>
>> cheers
>>
>>> We accidentally injected some urls into the crawl database and I need to
>>> go remove them. From what I understand, in 1.4 I can view and modify the
>>> urls and indexes. But I can't seem to find any information on how to do
>>> this.
>>>
>>> Is there anything regarding this available?

 

Re: Removing urls from crawl db

Posted by Bai Shen <ba...@gmail.com>.
Already did that.  But it doesn't allow me to delete urls from the list to
be crawled.

On Tue, Nov 1, 2011 at 5:56 AM, Ferdy Galema <fe...@kalooga.com> wrote:

> As for reading the crawldb, you can use org.apache.nutch.crawl.CrawlDbReader.
> This allows for dumping the crawldb into a readable text file as well as
> querying individual urls. Run without args to see its usage.
>
> On 10/31/2011 08:47 PM, Markus Jelsma wrote:
>
>> Hi
>>
>> Write a regex URL filter and use it the next time you update the db; it
>> will disappear. Be sure to back up the db first in case your regex catches
>> valid URLs. Nutch 1.5 will have an option to keep the previous version of
>> the DB after update.
>>
>> cheers
>>
>>> We accidentally injected some urls into the crawl database and I need to
>>> go remove them. From what I understand, in 1.4 I can view and modify the
>>> urls and indexes. But I can't seem to find any information on how to do
>>> this.
>>>
>>> Is there anything regarding this available?