Posted to user@nutch.apache.org by Israel <we...@gmail.com> on 2010/08/19 03:02:09 UTC

Configure crawl-urlfilter file

Hello, I have run into a situation that produces an error when I run the
crawler. For example, in the url directory I have a *.txt file that contains
http://www.opentechlearning.com/, and the 'crawl-urlfilter' file in the conf
folder must have:
+^http://([a-z0-9]*\.)*opentechlearning.com/

My question is: what should I put in the crawl-urlfilter file for the
following pages?

http://cnx.org/lenses/ccotp/endorsements/atom

http://ocw.nd.edu/courselist/rss

http://openlearn.open.ac.uk/file.php/1/learningspace.xml

None of these start with www., and that causes me problems.
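A quick way to check whether the missing www. is really the problem is to try the pattern shape from the message, `^http://([a-z0-9]*\.)*DOMAIN/`, against each URL. The sketch below uses Python's `re` module as a stand-in for the Java regular expressions Nutch uses; the helper names `host_rule` and `accepts` are hypothetical, and the unanchored search is an assumption about how the filter applies its patterns.

```python
import re

def host_rule(domain):
    # Same shape as the rule in the message: an optional run of
    # "label." prefixes, then the domain, then a slash.
    # (Hypothetical helper, not part of Nutch.)
    return re.compile(r"^http://([a-z0-9]*\.)*" + re.escape(domain) + "/")

def accepts(domain, url):
    # Assumption: the filter searches for the pattern in the URL
    # rather than requiring a full-string match.
    return host_rule(domain).search(url) is not None

checks = [
    ("opentechlearning.com", "http://www.opentechlearning.com/"),
    ("opentechlearning.com", "http://opentechlearning.com/"),
    ("cnx.org", "http://cnx.org/lenses/ccotp/endorsements/atom"),
    ("nd.edu", "http://ocw.nd.edu/courselist/rss"),
    ("open.ac.uk", "http://openlearn.open.ac.uk/file.php/1/learningspace.xml"),
]

for domain, url in checks:
    print(domain, url, accepts(domain, url))
```

Because the `([a-z0-9]*\.)*` group may repeat zero times, the same pattern shape matches a host with or without a www. prefix. If a URL is still rejected, stray spaces or capital letters in the rule (regular expressions are case-sensitive by default) are a more likely cause.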

Re: Configure crawl-urlfilter file

Posted by Israel <we...@gmail.com>.
I put this:
+^http://cnx.org/lenses/ccotp/endorsements/atom
but when I run a search, nothing appears.


Hi Peter, I read your tutorial on installing Nutch. I installed it and
everything works great, but I have one big question.

When I run the crawler, for example, in the url directory I have a *.txt
file that contains:
http://www.opentechlearning.com/

And inside the 'conf' folder there is a file, 'crawl-urlfilter', which must
have:
+^http://([a-z0-9]*\.)*opentechlearning.com/

My question is: what should I put in the crawl-urlfilter file for the
following pages?

http://cnx.org/lenses/ccotp/endorsements/atom

http://ocw.nd.edu/courselist/rss

http://openlearn.open.ac.uk/file.php/1/learningspace.xml

None of these start with www., and that causes me problems.

For example, I put:
http://cnx.org/lenses/ccotp/endorsements/atom
and
+^http://([a-z0-9]*\.)*cnx.org/lenses/ccotp/endorsements/atom

but when I run a search, nothing appears.


+^http://cnx.org/lenses/ccotp/endorsements/atom

Re: Configure crawl-urlfilter file

Posted by 孙兆玉 <sz...@gmail.com>.
I guess you could add three lines like these:

+^http://cnx.org/lenses/ccotp/endorsements/atom
+^http://ocw.nd.edu/courselist/rss
+^http://openlearn.open.ac.uk/file.php/1/learningspace.xml
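To see what these three lines would and would not let through, here is a minimal sketch of a first-match-wins filter in Python. It assumes that rules are tried top to bottom, that the first matching rule decides, that a URL matching no rule is rejected, and that the file ends with a catch-all `-.` line; the function names `parse_rules` and `accept` are hypothetical, and Python's `re` stands in for Java regular expressions.

```python
import re

# The three suggested rules, followed by an assumed catch-all reject line.
RULES = """\
+^http://cnx.org/lenses/ccotp/endorsements/atom
+^http://ocw.nd.edu/courselist/rss
+^http://openlearn.open.ac.uk/file.php/1/learningspace.xml
-.
"""

def parse_rules(text):
    # Turn each "+regex" or "-regex" line into (allow, compiled_pattern).
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, regex = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(regex)))
    return rules

def accept(url, rules):
    # The first rule whose pattern is found in the URL decides the outcome.
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False  # no rule matched

rules = parse_rules(RULES)
for url in [
    "http://cnx.org/lenses/ccotp/endorsements/atom",
    "http://ocw.nd.edu/courselist/rss",
    "http://openlearn.open.ac.uk/file.php/1/learningspace.xml",
    "http://cnx.org/content/",
]:
    print(url, accept(url, rules))
```

Under these assumptions the three feed URLs pass, but any other page on the same hosts (the last URL in the loop) is rejected, so links found inside those feeds would not be followed. Widening a rule to the host, for example `+^http://cnx.org/`, would admit them.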