Posted to user@nutch.apache.org by Parini Gianni <gi...@gmail.com> on 2006/09/18 02:13:10 UTC

Nutch can't crawl web site redirected from port 80 to 8080?

Hi everybody,

I'm an Italian student of computer engineering,
and I'm writing my thesis on search engines.
As the final part of it, I'm trying to configure Nutch
to work with my university's web server,
but I'm having some problems.

The whole site is built in Java (JSP/JSPX) and the server is Apache
Tomcat, reachable on port 8080.

The site is http://www.alice.unibo.it and its first page redirects the browser
to http://www.alice.unibo.it:8080/index.jspx

The problem is that I need to index that web site.
I set up Nutch on the web server by configuring:
- conf/nutch-site.xml
- url/nutch, containing http://www.alice.unibo.it
- conf/crawl-urlfilter.txt, with alice.unibo.it as the domain
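
To make the filter setup concrete, here is a minimal Python sketch (not Nutch code) that checks the start URL and the redirect target against a domain rule. Both rule strings are assumptions, modelled on the usual MY.DOMAIN.NAME template line in crawl-urlfilter.txt rather than copied from the actual file; the trailing slash on the start URL and the helper name `accepts` are also just for illustration.

```python
import re

# Hypothetical filter rules (assumptions, not the real file contents),
# modelled on a "+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/" style line.
RULE_NO_PORT = r"^http://([a-z0-9]*\.)*alice\.unibo\.it/"
RULE_OPTIONAL_PORT = r"^http://([a-z0-9]*\.)*alice\.unibo\.it(:[0-9]+)?/"

# Start URL (trailing slash added for the illustration) and redirect target.
SEED = "http://www.alice.unibo.it/"
REDIRECT_TARGET = "http://www.alice.unibo.it:8080/index.jspx"


def accepts(rule: str, url: str) -> bool:
    """Return True if the URL is matched by the given regex rule."""
    return re.search(rule, url) is not None


for name, rule in [("no port", RULE_NO_PORT), ("optional port", RULE_OPTIONAL_PORT)]:
    print(name, accepts(rule, SEED), accepts(rule, REDIRECT_TARGET))
```

In this sketch the rule without a port accepts the start URL but rejects the :8080 redirect target, while the rule with an optional port accepts both. Whether the real filter behaves the same way depends on the exact rules in the file, so this only shows one thing worth checking.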

But Nutch can't do it. I have tried many approaches: putting the port in the
filter domain, putting the redirected URL in the url/nutch file,
and modifying the plugin properties and other settings
in conf/nutch-default.xml.

Nutch works perfectly with other domains that have no redirection,
but with the redirection to port 8080 it can't fetch any pages.
I hope to find a solution in time to finish and defend my thesis.

Thank you all. 

Parini Gianni, Bologna, Italy



System env:
Nutch 0.8
Tomcat 5
Java 1.5

System hardware:
Mac OS X 10.4.7 on an iBook G4