Posted to user@nutch.apache.org by Israel <we...@gmail.com> on 2010/08/19 03:02:09 UTC
Configure crawl-urlfilter file
Hello, I ran into a situation that produces an error when I run the crawler. For
example, in the url directory I have a *.txt file that contains
http://www.opentechlearning.com/ and, in the conf folder, the file
'crawl-urlfilter' must have:
+^http://([a-z0-9]*\.)*opentechlearning.com/
My question is how to write the entries in the crawl-urlfilter file for the
following pages:
http://cnx.org/lenses/ccotp/endorsements/atom
http://ocw.nd.edu/courselist/rss
http://openlearn.open.ac.uk/file.php/1/learningspace.xml
These URLs do not start with www., and that causes me problems.
Re: Configure crawl-urlfilter file
Posted by Israel <we...@gmail.com>.
I put this:
+^http://cnx.org/lenses/ccotp/endorsements/atom
but when I run the search, nothing appears.
Hi Peter, I read your tutorial on installing Nutch. I installed it and
everything works great, but I have one big question.
When I run the crawler, for example, in the url directory I have a *.txt file
that contains:
http://www.opentechlearning.com/
And in the 'conf' folder there is a file, 'crawl-urlfilter', that must have:
+^http://([a-z0-9]*\.)*opentechlearning.com/
My question is how to write the entries in the crawl-urlfilter file for the
following pages:
http://cnx.org/lenses/ccotp/endorsements/atom
http://ocw.nd.edu/courselist/rss
http://openlearn.open.ac.uk/file.php/1/learningspace.xml
These URLs do not start with www., and that causes me problems.
I put, for example:
http://cnx.org/lenses/ccotp/endorsements/atom
and
+^http://([a-z0-9]*\.)*cnx.org/lenses/ccotp/endorsements/atom
but when I run the search, nothing appears.
+^http://cnx.org/lenses/ccotp/endorsements/atom
Re: Configure crawl-urlfilter file
Posted by 孙兆玉 <sz...@gmail.com>.
I guess you should add three lines like these:
+^http://cnx.org/lenses/ccotp/endorsements/atom
+^http://ocw.nd.edu/courselist/rss
+^http://openlearn.open.ac.uk/file.php/1/learningspace.xml
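
To check rules like these before running a crawl, a small sketch along the following lines may help. It is not Nutch code: it only approximates the filter's behaviour in Python, assuming that rules are tried top to bottom, that the first matching pattern decides ('+' accepts, '-' rejects), and that a URL matching no rule is rejected. The names RULES, parse_rules and accepts are made up for this illustration, and the final "-." catch-all line is an assumption about the rest of the file.

```python
import re

# Hypothetical rule text in crawl-urlfilter style (not read from a real file).
RULES = r"""
+^http://cnx.org/lenses/ccotp/endorsements/atom
+^http://ocw.nd.edu/courselist/rss
+^http://openlearn.open.ac.uk/file.php/1/learningspace.xml
+^http://([a-z0-9]*\.)*opentechlearning.com/
-.
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """Assumed semantics: the first rule whose pattern is found in the URL decides."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumed: no matching rule means the URL is rejected


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in (
        "http://cnx.org/lenses/ccotp/endorsements/atom",
        "http://www.opentechlearning.com/",
        "http://opentechlearning.com/",
        "http://example.org/",
    ):
        print(url, "->", "accepted" if accepts(url, rules) else "rejected")
```

Note that the group ([a-z0-9]*\.)* may match zero times, so a pattern of that form also covers hosts without a www. prefix. If the patterns check out here and the search still returns nothing, the cause may lie elsewhere than the filter lines themselves.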