Posted to user@nutch.apache.org by remi tassing <ta...@gmail.com> on 2012/01/19 14:43:13 UTC

Partly remove already crawled urls

Hi,

Let's say my filters in regex-urlfilter.txt weren't well written and I
crawled outside my wanted boundaries. Now I noticed it and want to remove
those urls.

what would you recommend to do?

Remi

Re: Partly remove already crawled urls

Posted by Markus Jelsma <ma...@openindex.io>.
You can remove the URLs via filters from the webgraph in trunk. We added
filter and normalize options.
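
Roughly, it would look like this (the option names are from memory and the
webgraphdb/segment paths are placeholders, so please check the usage printed
by bin/nutch webgraph; -filter and -normalize are the options we added in
trunk):

  # rebuild the webgraph from a segment, applying the configured URL
  # filters and normalizers so unwanted URLs are dropped on the way
  bin/nutch webgraph -webgraphdb crawl/webgraphdb \
    -segment crawl/segments/20120119123456 \
    -filter -normalize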

On Friday 20 January 2012 17:59:23 Marek Bachmann wrote:
> On 20.01.2012 17:55, Markus Jelsma wrote:
> > On Friday 20 January 2012 17:36:28 Lewis John Mcgibbney wrote:
> >> Hi Marek,
> >> 
> >>> What happens with the data in the segments, I guess the data of the
> >>> crawled urls are still in the segments after filtering out from the
> >>> crawldb, aren't they?
> >> 
> >> Well additionally you can merge several segments and again use
> >> urlfilters to get rid of urls you don't wish to have. Again Andrzej has
> >> provided some excellent accompanying documentation with this class so
> >> most of it should be in there. This does sound awfully like a chicken
> >> and egg scenario though!
> >> 
> >>> I have similar problems, as I often notice that there are broken pages
> >>> which generate infinite urls.
> >> 
> >> This is interesting, can you provide an example please.
> > 
> > This usually happens if bad relative URLs are used. They continue to
> > grow in length and still return proper content with even longer URLs.
> > Quite a horror, but easy to filter using regex.
> 
> Yep you got it exactly :)
> 
> No problem with Nutch at all, as I said to Lewis a few seconds ago.
> 
> The Nutch-related problem is that I want to remove them from segs AND db
> since I think otherwise they will have an influence on the webgraph.
> 
> >> But I think the data of the urls will remain in the segments.
> >> 
> >> See above, does this begin to answer?
> >> 
> >>> This won't be so bad but I wonder if this will make problems in the
> >>> webgraph since this tool only uses the segment directory??
> >>> 
> >>> ...
> >>> 
> >>> 
> >>> Cheers,

-- 
Markus Jelsma - CTO - Openindex

Re: Partly remove already crawled urls

Posted by Marek Bachmann <m....@uni-kassel.de>.
On 20.01.2012 17:55, Markus Jelsma wrote:
>
>
> On Friday 20 January 2012 17:36:28 Lewis John Mcgibbney wrote:
>> Hi Marek,
>>
>>> What happens with the data in the segments, I guess the data of the
>>> crawled urls are still in the segments after filtering out from the
>>> crawldb, aren't they?
>>
>> Well additionally you can merge several segments and again use urlfilters
>> to get rid of urls you don't wish to have. Again Andrzej has provided some
>> excellent accompanying documentation with this class so most of it should
>> be in there. This does sound awfully like a chicken and egg scenario
>> though!
>>
>>> I have similar problems, as I often notice that there are broken pages
>>> which generate infinite urls.
>>
>> This is interesting, can you provide an example please.
>
> This usually happens if bad relative URLs are used. They continue to grow in
> length and still return proper content with even longer URLs. Quite a horror,
> but easy to filter using regex.

Yep you got it exactly :)

No problem with Nutch at all, as I said to Lewis a few seconds ago.

The Nutch-related problem is that I want to remove them from segs AND db
since I think otherwise they will have an influence on the webgraph.

>>
>> But I think the data of the urls will remain in the segments.
>>
>> See above, does this begin to answer?
>>
>>> This won't be so bad but I wonder if this will make problems in the
>>> webgraph since this tool only uses the segment directory??
>>>
>>> ...
>>>
>>>
>>> Cheers,
>


Re: Partly remove already crawled urls

Posted by Markus Jelsma <ma...@openindex.io>.

On Friday 20 January 2012 17:36:28 Lewis John Mcgibbney wrote:
> Hi Marek,
> 
> > What happens with the data in the segments, I guess the data of the
> > crawled urls are still in the segments after filtering out from the
> > crawldb, aren't they?
> 
> Well additionally you can merge several segments and again use urlfilters
> to get rid of urls you don't wish to have. Again Andrzej has provided some
> excellent accompanying documentation with this class so most of it should
> be in there. This does sound awfully like a chicken and egg scenario
> though!
> 
> > I have similar problems, as I often notice that there are broken pages
> > which generate infinite urls.
> 
> This is interesting, can you provide an example please.

This usually happens if bad relative URLs are used. They continue to grow in
length and still return proper content with even longer URLs. Quite a horror,
but easy to filter using regex.
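
For example, the default regex-urlfilter.txt that ships with Nutch already has
a rule intended to break exactly this kind of loop (worth checking against your
own copy of the file):

  # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
  -.*(/[^/]+)/[^/]+\1/[^/]+\1/
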
> 
> But I think the data of the urls will remain in the segments.
> 
> See above, does this begin to answer?
> 
> > This won't be so bad but I wonder if this will make problems in the
> > webgraph since this tool only uses the segment directory??
> > 
> > ...
> > 
> > 
> > Cheers,

-- 
Markus Jelsma - CTO - Openindex

Re: Partly remove already crawled urls

Posted by Marek Bachmann <m....@uni-kassel.de>.
Hi Lewis,

thanks for your reply.

On 20.01.2012 17:36, Lewis John Mcgibbney wrote:
> Hi Marek,
>
>
>> What happens with the data in the segments, I guess the data of the
>> crawled urls are still in the segments after filtering out from the
>> crawldb, aren't they?
>>
>
> Well additionally you can merge several segments and again use urlfilters
> to get rid of urls you don't wish to have. Again Andrzej has provided some
> excellent accompanying documentation with this class so most of it should
> be in there. This does sound awfully like a chicken and egg scenario though!
>
Yes, but I have the problem that I always get EOFExceptions or heap space
errors when trying to merge and filter the segments. :-/
I think my cluster is too weak for the amount of data.
>
>> I have similar problems, as I often notice that there are broken pages
>> which generate infinite urls.
>>
>
> This is interesting, can you provide an example please.
This is not a problem with Nutch itself: these urls are discovered normally,
but the webserver generates new urls for the same content, like sites with
session ids. (Actually these pages serve something like http://xyz.com/a and
then link to http://xyz.com/a/a/.../a and so on.) I need a way to delete
them from the db and the segments after discovering the problem.

Right now I'm trying to solve it by deleting all old segs and then
re-fetching everything, because I don't want to lose all the discovered urls.
But for some reason this approach doesn't work, even if I set -addDays to
a value higher than the db.fetch.interval.max value (1209600 s = 14 days
in my case, and I ran ./nutch generate uniall/crawldb uniall/segs
-addDays 30).

Still debugging it.
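
For reference, the generate usage I'm checking against looks roughly like this
(the flag spelling is taken from the 1.x usage message as I remember it, so it
is worth verifying with bin/nutch generate):

  # -adddays shifts the current time forward so that URLs which are not
  # yet due for refetch are still selected into the new segment
  ./nutch generate uniall/crawldb uniall/segs -adddays 30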

>
> But I think the data of the urls will remain in the segments.
>
> See above, does this begin to answer?
>
>
>> This won't be so bad but I wonder if this will make problems in the
>> webgraph since this tool only uses the segment directory??
>>
>> ...
>
>
>> Cheers,
>>
>


Re: Partly remove already crawled urls

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Marek,


> What happens with the data in the segments, I guess the data of the
> crawled urls are still in the segments after filtering out from the
> crawldb, aren't they?
>

Well additionally you can merge several segments and again use urlfilters
to get rid of urls you don't wish to have. Again Andrzej has provided some
excellent accompanying documentation with this class so most of it should
be in there. This does sound awfully like a chicken and egg scenario though!
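
A minimal sketch of that, assuming the 1.x mergesegs wrapper around
SegmentMerger (the output and segments paths below are placeholders):

  # merge the existing segments into one new segment, applying the
  # configured URL filters so unwanted URLs are dropped on the way
  bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments -filter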


> I have similar problems, as I often notice that there are broken pages
> which generate infinite urls.
>

This is interesting, can you provide an example please.

But I think the data of the urls will remain in the segments.

See above, does this begin to answer?


> This won't be so bad but I wonder if this will make problems in the
> webgraph since this tool only uses the segment directory??
>
> ...


> Cheers,
>

Re: Partly remove already crawled urls

Posted by Marek Bachmann <m....@uni-kassel.de>.
Hello,

What happens with the data in the segments, I guess the data of the 
crawled urls are still in the segments after filtering out from the 
crawldb, aren't they?

I have similar problems, as I often notice that there are broken pages 
which generate infinite urls.
If I find them I make filter-rules to avoid fetching them and filter my 
crawldb.
But I think the data of the urls will remain in the segments. This won't 
be so bad but I wonder if this will make problems in the webgraph since 
this tool only uses the segment directory??

Cheers,

Marek

On 19.01.2012 21:35, Lewis John Mcgibbney wrote:
> You may try to use the CrawlDbMerger [1] with the -filter switch to remove
> urls from the resulting merge.
>
> Please read Andrzej's comments carefully to avoid any unintended behavior.
>
> [1]
> http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbMerger.java?view=markup
>
> On Thu, Jan 19, 2012 at 8:26 PM, remi tassing<ta...@gmail.com>  wrote:
>
>> The main purpose is to remove urls matching a certain pattern from the
>> Nutch segments (or database).
>>
>> Remi
>>
>> On Thursday, January 19, 2012, Lewis John Mcgibbney<
>> lewis.mcgibbney@gmail.com>  wrote:
>>> Maintenance tool for what? You still haven't explicitly mentioned what it
>>> is that you're trying to do.
>>>
>>> On Thu, Jan 19, 2012 at 2:19 PM, remi tassing<ta...@gmail.com>
>> wrote:
>>>
>>>> Please advise on a maintenance tool for Nutch.
>>>>
>>>> I heard of Luke for Solr, I'll try it.
>>>>
>>>> Remi
>>>>
>>>> On Thu, Jan 19, 2012 at 4:00 PM, Lewis John Mcgibbney<
>>>> lewis.mcgibbney@gmail.com>  wrote:
>>>>
>>>>> It depends where you are wanting to remove the urls from... your Nutch
>>>>> crawldb or your Solr index?
>>>>>
>>>>> We offer and maintain quite a number of tools to enable you to
>> maintain a
>>>>> healthy crawldb e.g. purge, filtering, etc, we also maintain some
>> tools
>>>> to
>>>>> help you maintain your Solr index e.g. delete duplicates, solr clean,
>>>> etc,
>>>>> but in terms of explicitly deleting URLs from either I am not sure
>> about
>>>>> this.
>>>>>
>>>>> Any thoughts?
>>>>>
>>>>> On Thu, Jan 19, 2012 at 1:43 PM, remi tassing<ta...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Let's say my filters in regex-urlfilter.txt weren't well written and
>> I
>>>>>> crawled outside my wanted boundaries. Now I noticed it and want to
>>>> remove
>>>>>> those urls.
>>>>>>
>>>>>> what would you recommend to do?
>>>>>>
>>>>>> Remi
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> *Lewis*
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> *Lewis*
>>>
>>
>
>
>


Re: Partly remove already crawled urls

Posted by Lewis John Mcgibbney <le...@gmail.com>.
You may try to use the CrawlDbMerger [1] with the -filter switch to remove
urls from the resulting merge.

Please read Andrzej's comments carefully to avoid any unintended behavior.

[1]
http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbMerger.java?view=markup
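
A minimal sketch of the corresponding command, assuming the 1.x mergedb
wrapper around CrawlDbMerger (the paths are placeholders):

  # write a filtered copy of the crawldb; URLs rejected by the
  # configured URL filters are left out of the merged output
  bin/nutch mergedb crawl/crawldb_filtered crawl/crawldb -filter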

On Thu, Jan 19, 2012 at 8:26 PM, remi tassing <ta...@gmail.com> wrote:

> The main purpose is to remove urls matching a certain pattern from the
> Nutch segments (or database).
>
> Remi
>
> On Thursday, January 19, 2012, Lewis John Mcgibbney <
> lewis.mcgibbney@gmail.com> wrote:
> > Maintenance tool for what? You still haven't explicitly mentioned what it
> > is that you're trying to do.
> >
> > On Thu, Jan 19, 2012 at 2:19 PM, remi tassing <ta...@gmail.com>
> wrote:
> >
> >> Please advise on a maintenance tool for Nutch.
> >>
> >> I heard of Luke for Solr, I'll try it.
> >>
> >> Remi
> >>
> >> On Thu, Jan 19, 2012 at 4:00 PM, Lewis John Mcgibbney <
> >> lewis.mcgibbney@gmail.com> wrote:
> >>
> >> > It depends where you are wanting to remove the urls from... your Nutch
> >> > crawldb or your Solr index?
> >> >
> >> > We offer and maintain quite a number of tools to enable you to
> maintain a
> >> > healthy crawldb e.g. purge, filtering, etc, we also maintain some
> tools
> >> to
> >> > help you maintain your Solr index e.g. delete duplicates, solr clean,
> >> etc,
> >> > but in terms of explicitly deleting URLs from either I am not sure
> about
> >> > this.
> >> >
> >> > Any thoughts?
> >> >
> >> > On Thu, Jan 19, 2012 at 1:43 PM, remi tassing <ta...@gmail.com>
> >> > wrote:
> >> >
> >> > > Hi,
> >> > >
> >> > > Let's say my filters in regex-urlfilter.txt weren't well written and
> I
> >> > > crawled outside my wanted boundaries. Now I noticed it and want to
> >> remove
> >> > > those urls.
> >> > >
> >> > > what would you recommend to do?
> >> > >
> >> > > Remi
> >> > >
> >> >
> >> >
> >> >
> >> > --
> >> > *Lewis*
> >> >
> >>
> >
> >
> >
> > --
> > *Lewis*
> >
>



-- 
*Lewis*

Re: Partly remove already crawled urls

Posted by remi tassing <ta...@gmail.com>.
The main purpose is to remove urls matching a certain pattern from the
Nutch segments (or database).

Remi

On Thursday, January 19, 2012, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:
> Maintenance tool for what? You still haven't explicitly mentioned what it
> is that you're trying to do.
>
> On Thu, Jan 19, 2012 at 2:19 PM, remi tassing <ta...@gmail.com>
wrote:
>
>> Please advise on a maintenance tool for Nutch.
>>
>> I heard of Luke for Solr, I'll try it.
>>
>> Remi
>>
>> On Thu, Jan 19, 2012 at 4:00 PM, Lewis John Mcgibbney <
>> lewis.mcgibbney@gmail.com> wrote:
>>
>> > It depends where you are wanting to remove the urls from... your Nutch
>> > crawldb or your Solr index?
>> >
>> > We offer and maintain quite a number of tools to enable you to
maintain a
>> > healthy crawldb e.g. purge, filtering, etc, we also maintain some tools
>> to
>> > help you maintain your Solr index e.g. delete duplicates, solr clean,
>> etc,
>> > but in terms of explicitly deleting URLs from either I am not sure
about
>> > this.
>> >
>> > Any thoughts?
>> >
>> > On Thu, Jan 19, 2012 at 1:43 PM, remi tassing <ta...@gmail.com>
>> > wrote:
>> >
>> > > Hi,
>> > >
>> > > Let's say my filters in regex-urlfilter.txt weren't well written and
I
>> > > crawled outside my wanted boundaries. Now I noticed it and want to
>> remove
>> > > those urls.
>> > >
>> > > what would you recommend to do?
>> > >
>> > > Remi
>> > >
>> >
>> >
>> >
>> > --
>> > *Lewis*
>> >
>>
>
>
>
> --
> *Lewis*
>

Re: Partly remove already crawled urls

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Maintenance tool for what? You still haven't explicitly mentioned what it
is that you're trying to do.

On Thu, Jan 19, 2012 at 2:19 PM, remi tassing <ta...@gmail.com> wrote:

> Please advise on a maintenance tool for Nutch.
>
> I heard of Luke for Solr, I'll try it.
>
> Remi
>
> On Thu, Jan 19, 2012 at 4:00 PM, Lewis John Mcgibbney <
> lewis.mcgibbney@gmail.com> wrote:
>
> > It depends where you are wanting to remove the urls from... your Nutch
> > crawldb or your Solr index?
> >
> > We offer and maintain quite a number of tools to enable you to maintain a
> > healthy crawldb e.g. purge, filtering, etc, we also maintain some tools
> to
> > help you maintain your Solr index e.g. delete duplicates, solr clean,
> etc,
> > but in terms of explicitly deleting URLs from either I am not sure about
> > this.
> >
> > Any thoughts?
> >
> > On Thu, Jan 19, 2012 at 1:43 PM, remi tassing <ta...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > Let's say my filters in regex-urlfilter.txt weren't well written and I
> > > crawled outside my wanted boundaries. Now I noticed it and want to
> remove
> > > those urls.
> > >
> > > what would you recommend to do?
> > >
> > > Remi
> > >
> >
> >
> >
> > --
> > *Lewis*
> >
>



-- 
*Lewis*

Re: Partly remove already crawled urls

Posted by remi tassing <ta...@gmail.com>.
Please advise on a maintenance tool for Nutch.

I heard of Luke for Solr, I'll try it.

Remi

On Thu, Jan 19, 2012 at 4:00 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> It depends where you are wanting to remove the urls from... your Nutch
> crawldb or your Solr index?
>
> We offer and maintain quite a number of tools to enable you to maintain a
> healthy crawldb e.g. purge, filtering, etc, we also maintain some tools to
> help you maintain your Solr index e.g. delete duplicates, solr clean, etc,
> but in terms of explicitly deleting URLs from either I am not sure about
> this.
>
> Any thoughts?
>
> On Thu, Jan 19, 2012 at 1:43 PM, remi tassing <ta...@gmail.com>
> wrote:
>
> > Hi,
> >
> > Let's say my filters in regex-urlfilter.txt weren't well written and I
> > crawled outside my wanted boundaries. Now I noticed it and want to remove
> > those urls.
> >
> > what would you recommend to do?
> >
> > Remi
> >
>
>
>
> --
> *Lewis*
>

Re: Partly remove already crawled urls

Posted by Lewis John Mcgibbney <le...@gmail.com>.
It depends where you are wanting to remove the urls from... your Nutch
crawldb or your Solr index?

We offer and maintain quite a number of tools to enable you to maintain a
healthy crawldb e.g. purge, filtering, etc, we also maintain some tools to
help you maintain your Solr index e.g. delete duplicates, solr clean, etc,
but in terms of explicitly deleting URLs from either I am not sure about
this.
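
On the Solr side, the tools I have in mind look roughly like this (the command
names and argument order are from memory for the 1.x line, and the crawldb
path and Solr URL are placeholders, so please verify against the usage
messages):

  # remove duplicate documents from the Solr index
  bin/nutch solrdedup http://localhost:8983/solr

  # remove documents whose URLs are marked gone (404 etc.) in the crawldb
  bin/nutch solrclean crawl/crawldb http://localhost:8983/solr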

Any thoughts?

On Thu, Jan 19, 2012 at 1:43 PM, remi tassing <ta...@gmail.com> wrote:

> Hi,
>
> Let's say my filters in regex-urlfilter.txt weren't well written and I
> crawled outside my wanted boundaries. Now I noticed it and want to remove
> those urls.
>
> what would you recommend to do?
>
> Remi
>



-- 
*Lewis*