Posted to user@nutch.apache.org by "Naess, Ronny" <Ro...@avinor.no> on 2007/05/16 15:34:50 UTC

Regex-urlfilter

I am running Nutch 0.9 and I have a website where some of the URLs
should be omitted.

I have added the following exceptions in regex-urlfilter.txt
-.*forside$
-.*frontpage$
-.*/js/.*
-.*/resources/.*
-.*/text/.*
-.*sdc.arena.no.*
-.*error.*
-.*/framework/.*
-.*/tridion.*
-.*/sitemap.*
-.*/nettsidekart.*
-.*/airport/.*/airports.*
-.*/lufthavn/.*/lufthavner.*
-.*://$

I have tested the filter by running this command:
$ cat /cygdrive/c/tmp/nutch_urls | bin/nutch \
    org/apache/nutch/net/URLFilterChecker \
    -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter \
    | grep -e -http
-http://sgm634.lv.no:13101/avinor/text/javascript
-http://sgm634.lv.no:13101/avinor/sdc.arena.no
-http://sgm634.lv.no:13101/framework/skins/avinor/js/showHide.js
-http://sgm634.lv.no:13101/avinor/://
-http://sgm634.lv.no:13101/framework/skins/avinor/js/dojo.js
-http://sgm634.lv.no:13101/framework/skins/avinor/js/common.js
-http://sgm634.lv.no:13101/avinor/text/css

And as you can see, it filters out the stuff I do not want.
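
For readers who want to reproduce the check without a Nutch install, here is a minimal sketch in Python. It is not Nutch's code. It assumes the filter tries the rules top to bottom, lets the first rule whose pattern is found anywhere in the URL decide ("+" accepts, "-" rejects), and rejects URLs that match no rule. The trailing "+." catch-all is an assumption as well, since the list above only shows the exclusions. The names parse_rules and accepts are made up for this illustration, and Python's re module stands in for Java's regex engine (the patterns here are simple enough to behave the same in both).

```python
import re

# Exclusion rules copied from the regex-urlfilter.txt excerpt in this thread.
# The final "+." (accept everything else) is an ASSUMPTION: the post does not
# show the rest of the file.
RULES = r"""
-.*forside$
-.*frontpage$
-.*/js/.*
-.*/resources/.*
-.*/text/.*
-.*sdc.arena.no.*
-.*error.*
-.*/framework/.*
-.*/tridion.*
-.*/sitemap.*
-.*/nettsidekart.*
-.*/airport/.*/airports.*
-.*/lufthavn/.*/lufthavner.*
-.*://$
+.
"""


def parse_rules(text):
    """Turn rule lines into (allow, compiled_pattern) pairs (hypothetical helper)."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule decides; a URL that matches nothing is rejected."""
    for allow, pattern in rules:
        if pattern.search(url):  # unanchored "find" semantics (assumed)
            return allow
    return False


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in [
        "http://sgm634.lv.no:13101/avinor/text/javascript",
        "http://sgm634.lv.no:13101/avinor/sdc.arena.no",
        "http://sgm634.lv.no:13101/lufthavn/namsos",
    ]:
        print(("+" if accepts(rules, url) else "-") + url)
```

Under those assumptions the rules themselves behave as intended, which points at the crawl reading a different filter file rather than at the patterns.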

The only problem is that whenever I run the nutch crawl command, or if I
recrawl, the URLs seem to pop up after all.

Example snippet (the ones that should not be there are marked with a
minus in front):
fetching http://sgm634.lv.no:13101/avinor/trafikk
fetching http://sgm634.lv.no:13101/lufthavn/gressholmen
fetching http://sgm634.lv.no:13101/lufthavn/namsos
fetching http://sgm634.lv.no:13101/lufthavn/haugesund
fetching http://sgm634.lv.no:13101/lufthavn/rost
fetching http://sgm634.lv.no:13101/lufthavn/rorvik
fetching http://sgm634.lv.no:13101/lufthavn/lista
fetching http://sgm634.lv.no:13101/lufthavn/kristiansand
fetching http://sgm634.lv.no:13101/avinor/karriere
fetching http://sgm634.lv.no:13101/lufthavn/bardufoss
fetching http://sgm634.lv.no:13101/lufthavn/kirkenes
fetching http://sgm634.lv.no:13101/lufthavn/harstad
fetching http://sgm634.lv.no:13101/lufthavn/stokmarknes
fetching http://sgm634.lv.no:13101/lufthavn/lakselv
fetching http://sgm634.lv.no:13101/avinor/omavinor
fetching http://sgm634.lv.no:13101/lufthavn/fagernes
fetching http://sgm634.lv.no:13101/lufthavn/mehamn
fetching http://sgm634.lv.no:13101/avinor/rapporter
-fetching http://sgm634.lv.no:13101/avinor/text/javascript
fetching http://sgm634.lv.no:13101/lufthavn/stavanger
fetching http://sgm634.lv.no:13101/lufthavn/roros
fetching http://sgm634.lv.no:13101/lufthavn/sorkjosen
-fetching http://sgm634.lv.no:13101/avinor/sdc.arena.no
fetching http://sgm634.lv.no:13101/lufthavn/vardo
fetching http://sgm634.lv.no:13101/avinor/miljo
fetching http://sgm634.lv.no:13101/lufthavn/bronnoysund
fetching http://sgm634.lv.no:13101/avinor/sikkerhet
fetching http://sgm634.lv.no:13101/avinor/omavinor/Kontakt oss
fetching http://sgm634.lv.no:13101/lufthavn/hammerfest
fetching http://sgm634.lv.no:13101/avinor/sporsmal
fetching http://sgm634.lv.no:13101/lufthavn/sogndal
-fetching http://sgm634.lv.no:13101/avinor/://
fetching http://sgm634.lv.no:13101/lufthavn/bodo
fetching http://sgm634.lv.no:13101/lufthavn/vadso
fetching http://sgm634.lv.no:13101/lufthavn/sandnessjoen
fetching http://sgm634.lv.no:13101/lufthavn/narvik
fetching http://sgm634.lv.no:13101/lufthavn/honningsvag
-fetching http://sgm634.lv.no:13101/avinor/text/css
fetching http://sgm634.lv.no:13101/lufthavn/alesund
fetching http://sgm634.lv.no:13101/lufthavn/varoy
fetching http://sgm634.lv.no:13101/lufthavn/andoya
fetching http://sgm634.lv.no:13101/lufthavn/trondheim
fetching http://sgm634.lv.no:13101/avinor/forside
fetching http://sgm634.lv.no:13101/lufthavn/tromso
fetching http://sgm634.lv.no:13101/lufthavn/sandane
fetching http://sgm634.lv.no:13101/lufthavn/kristiansund
fetching http://sgm634.lv.no:13101/avinor/pressesenter
fetching http://sgm634.lv.no:13101/lufthavn/leknes
fetching http://sgm634.lv.no:13101/lufthavn/floro
fetching http://sgm634.lv.no:13101/avinor/lufthavner
fetching http://sgm634.lv.no:13101/lufthavn/moirana


Can anyone please tell me what I am doing wrong?

It struck me that I might be using the wrong file and that all regex
exceptions should be in crawl-urlfilter.txt, but I do not think that is
correct.

Thanks, 

Ronny



-----Original message-----
From: Naess, Ronny [mailto:Ronny.Naess@avinor.no] 
Sent: 16 May 2007 15:18
To: nutch-user@lucene.apache.org
Subject: Re: Reindex and initialization

I found this script and modified it slightly.

http://wiki.apache.org/nutch/IntranetRecrawl#head-b16709cbbd77ae6c80d742ee69383142cefb8683

The script takes care of the reinstallation of the index by touching the
web.xml file in the webapp. 
Doing this reloads the whole webapp, doesn't it? Is that the only way to
reload the index, by reloading the webapp?

-Ronny


-----Original message-----
From: Naess, Ronny [mailto:Ronny.Naess@avinor.no]
Sent: 15 May 2007 12:13
To: nutch-user@lucene.apache.org
Subject: Re: Reindex and initialization

It turned out that I had some issues with JDK versions after all. I added
NUTCH_JAVA_HOME pointing at JDK 1.5 and that seemed to do the trick.

Some other issues popped up, but I reckon that has to do with me using
Nutch 0.9.

I am still wondering about initialization of the index while the client
is running. I do not want to restart the webclient.

-Ronny

-----Original message-----
From: Naess, Ronny [mailto:Ronny.Naess@avinor.no]
Sent: 15 May 2007 10:26
To: nutch-user@lucene.apache.org
Subject: Reindex and initialization

 
Hi.

I want to have a script for reindexing. I copied the recrawl script made
by the author of this tutorial:
http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
I am running into some trouble with UnsupportedClassVersionError
(Unsupported major.minor version 49.0). I have tried both java 1.5 and
1.4. I am using Nutch 0.9. 

If there are other or better ways to reindex, I would be grateful for any
hints or help in that area.

Also, is reinitialization after building a new index still a problem, as
it was earlier (0.7), where one solution was to reinit the webclient?
Restarting the webclient is not an option for us since we must have high
uptime/availability. Does anyone know if this is fixed or if there is a
solution for reinit in Nutch 0.9?

-Ronny









Re: Regex-urlfilter

Posted by Sami Siren <ss...@gmail.com>.
Naess, Ronny wrote:
> Can anyone please tell me what I am doing wrong?
>
> It struck me that I might be using the wrong file and that all regex
> exceptions should be in crawl-urlfilter.txt, but I do not think that is
> correct.
>
>   
Yes, when using the crawl command you should put the rules in
crawl-urlfilter.txt, or configure crawl to use regex-urlfilter.txt via
crawl-tool.xml.
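
As a rough illustration of the second option: the stock conf/crawl-tool.xml is understood to set the urlfilter.regex.file property to crawl-urlfilter.txt for the crawl command. A sketch of the override, assuming that property name and a default conf/ layout, might look like this (check it against your own copy of the file before relying on it):

```xml
<!-- conf/crawl-tool.xml (sketch; property name and default value are
     assumed from the stock Nutch 0.9 configuration) -->
<property>
  <name>urlfilter.regex.file</name>
  <!-- the stock value is assumed to be crawl-urlfilter.txt -->
  <value>regex-urlfilter.txt</value>
</property>
```

With this in place, the crawl command and the step-by-step tools would read the same rule file, so results from URLFilterChecker should line up with what the crawl fetches.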

-- 
 Sami Siren