Posted to user@nutch.apache.org by Stefan Groschupf <sg...@media-style.com> on 2006/02/03 21:26:53 UTC

Re: takes too long to remove a page from WEBDB

And it also makes no sense, since the page will come back as soon as a
link to it is found on some page.
Use a URL filter instead, and remove the page from the index.
Removing it from the WebDB makes no sense.
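[Stefan's suggestion maps, in Nutch's stock setup, to the regex URL filter. A hedged sketch of what such an exclude rule could look like in conf/regex-urlfilter.txt — the host and path below are placeholders, and the exact file and plugin depend on your Nutch version and configuration:

```
# reject the page (and anything under it) so it is never fetched again
-^http://www\.example\.com/unwanted/
# accept everything else
+.
```

Rules are applied top to bottom; the first matching "-" (reject) or "+" (accept) prefix wins.]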

On 03.02.2006 at 21:27, Keren Yu wrote:

> Hi everyone,
>
> It took about 10 minutes to remove a page from the WebDB
> using WebDBWriter. Does anyone know of another, faster
> way to remove a page?
>
> Thanks,
> Keren
>
> __________________________________________________
> Do You Yahoo!?
> Tired of spam?  Yahoo! Mail has the best spam protection around
> http://mail.yahoo.com
>


Re: ProtocolStatus.MOVED

Posted by Stefan Groschupf <sg...@media-style.com>.
This is just for the case where a page forwards the parser to a new URL.
The real URL filtering is done (in Nutch 0.8) in
ParseOutputFormat, line 100.
HTH,
Stefan

On 04.02.2006 at 06:31, Fuad Efendi wrote:

> I am also checking the Fetcher; it seems strange to me:
>
> case ProtocolStatus.MOVED:
> case ProtocolStatus.TEMP_MOVED:
> 	handleFetch(fle, output);
> 	String newurl = pstat.getMessage();
> 	newurl = URLFilters.filter(newurl);
>
> So we are calling "handleFetch" before "filter"... Error?
>
> EnabledHost sends a redirect to DisabledHost.
> DisabledHost is parsed(!), and links to unknown hosts are probably
> stored (if not disabled explicitly).

---------------------------------------------------------------
company:        http://www.media-style.com
forum:        http://www.text-mining.org
blog:            http://www.find23.net



ProtocolStatus.MOVED

Posted by Fuad Efendi <fu...@efendi.ca>.
I am also checking the Fetcher; it seems strange to me:

case ProtocolStatus.MOVED:
case ProtocolStatus.TEMP_MOVED:
	handleFetch(fle, output);
	String newurl = pstat.getMessage();
	newurl = URLFilters.filter(newurl);

So we are calling "handleFetch" before "filter"... Error?

EnabledHost sends a redirect to DisabledHost.
DisabledHost is parsed(!), and links to unknown hosts are probably stored
(if not disabled explicitly).
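[If the ordering is indeed the issue, the fix Fuad hints at — check the redirect target against the filters before handling it — can be sketched as below. This is a toy stand-in, not Nutch's actual Fetcher or URLFilters API; the class, pattern, and method names are hypothetical:

```java
import java.util.regex.Pattern;

// Hypothetical sketch: check a redirect target against the URL filters
// *before* handling the fetch, so a redirect from an enabled host to a
// disabled host is dropped instead of parsed.
public class RedirectFilterSketch {

    // Stand-in for Nutch's URLFilters.filter(): returns the URL if it
    // passes, or null if it is rejected.
    private static final Pattern DISABLED =
            Pattern.compile("^https?://(www\\.)?disabled-host\\.com/.*");

    public static String filter(String url) {
        return DISABLED.matcher(url).matches() ? null : url;
    }

    // Stand-in for the MOVED / TEMP_MOVED branch of the Fetcher.
    public static String handleRedirect(String newUrl) {
        String filtered = filter(newUrl);   // filter first
        if (filtered == null) {
            return null;                    // rejected: do not fetch or parse
        }
        // ... only now hand the page over for fetching and parsing ...
        return filtered;
    }
}
```

With this ordering, a redirect pointing at a filtered host is dropped before any fetching or parsing happens.]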




RE: takes too long to remove a page from WEBDB

Posted by Fuad Efendi <fu...@efendi.ca>.
We have the following code:

org.apache.nutch.parse.ParseOutputFormat.java
...
[94]	toUrl = urlNormalizer.normalize(toUrl);
[95]	toUrl = URLFilters.filter(toUrl);
...


It normalizes, then filters the normalized URL, and then writes it to
crawl_parse.

In some cases the normalized URL is not the same as the raw URL, and it is
not filtered.
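[As a self-contained illustration of that ordering (the normalizer and filter below are toy stand-ins, not Nutch's actual urlNormalizer/URLFilters implementations): a filter rule written against the raw form of a URL can stop matching once the URL has been normalized, so the link passes through.

```java
import java.util.regex.Pattern;

// Toy illustration of the normalize-then-filter ordering in
// ParseOutputFormat. Both the normalizer and the filter are
// simplified stand-ins for Nutch's pluggable implementations.
public class NormalizeThenFilter {

    // Toy normalizer: lower-case the URL and drop the default ":80" port.
    public static String normalize(String url) {
        return url.toLowerCase().replace(":80/", "/");
    }

    // Toy filter rule written against the *raw* form "EXAMPLE.COM:80";
    // returns null (rejected) on a match, the URL otherwise.
    private static final Pattern BLOCKED =
            Pattern.compile(".*EXAMPLE\\.COM:80.*");

    public static String filter(String url) {
        return BLOCKED.matcher(url).matches() ? null : url;
    }

    public static String process(String toUrl) {
        toUrl = normalize(toUrl);   // [94] in ParseOutputFormat
        return filter(toUrl);       // [95] - sees only the normalized form
    }
}
```

Here filter() rejects the raw URL, but process() lets it through, because the filter only ever sees the normalized form.]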








RE: takes too long to remove a page from WEBDB

Posted by Fuad Efendi <fu...@efendi.ca>.
It will also be generated if a non-filtered page sends a redirect to
another page (one that should be filtered)...

I have the same problem in my modified DOMContentUtils.java:
...
if (url.getHost().equals(base.getHost())) { outlinks.add(..........); }
...

It doesn't help; I still see some URLs from "filtered" hosts...





Re: takes too long to remove a page from WEBDB

Posted by Keren Yu <ke...@yahoo.com>.
Hi Stefan,

As I understand it, when you use 'nutch generate' to
generate the fetch list, it doesn't call the URL filters.
Only 'nutch updatedb' and 'nutch fetch' call them. So the
page will be generated again after 30 days, even if you
use a URL filter to filter it.

Best regards,
Keren


Re: takes too long to remove a page from WEBDB

Posted by Stefan Groschupf <sg...@media-style.com>.
Not if you filter it in the URL filter.
There is a database-based URL filter somewhere in Jira, I think;
it can help with filtering larger lists of URLs.

On 03.02.2006 at 21:35, Keren Yu wrote:

> Hi Stefan,
>
> Thank you. You are right. I have to use a URL filter
> and remove the page from the index. But after 30 days,
> the page will be generated again when generating the
> fetch list.
>
> Thanks,
> Keren
>


Re: takes too long to remove a page from WEBDB

Posted by Keren Yu <ke...@yahoo.com>.
Hi Stefan,

Thank you. You are right. I have to use a URL filter
and remove the page from the index. But after 30 days,
the page will be generated again when generating the
fetch list.

Thanks,
Keren
