Posted to user@nutch.apache.org by ad...@interfree.it on 2005/09/14 16:54:28 UTC
crawl-urlfilter.txt
Hi,
thank you for your hints, but I didn't give you the following information:
I modified the file crawl-urlfilter.txt as follows:
#start crawl-urlfilter
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]
# accept anything else
+.
#end crawl-urlfilter
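One way to check how rules like these treat a particular URL is to replay them in order outside Nutch. The Python sketch below is only an illustration, not Nutch's own code. It assumes the behaviour described in the header comments of the stock crawl-urlfilter.txt: each line is a sign followed by a regular expression, and the first pattern found in the URL decides whether it is accepted. The rule that appears above as "[EMAIL PROTECTED]" is obscured, so it is left out of POSTED_RULES; ASSUMED_QUERY_RULE is the rule the stock file ships at that position and is an assumption about this particular file. The example.com URLs are hypothetical.

```python
import re

# Rules copied from the post, in file order. The obscured rule is omitted.
POSTED_RULES = [
    r"-^(file|ftp|mailto):",
    r"-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$",
    r"+.",
]

# ASSUMPTION: the rule found in the stock file at the obscured position.
# The file from the post may differ.
ASSUMED_QUERY_RULE = r"-[?*!@=]"


def decide(rules, url):
    """Return '+' or '-' from the first rule whose regex is found in url.

    Returns None when no rule matches.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign
    return None


def with_assumed_rule(rules):
    """Return a copy of rules with the assumed query rule before the last rule."""
    return rules[:-1] + [ASSUMED_QUERY_RULE] + rules[-1:]


if __name__ == "__main__":
    # Hypothetical URLs standing in for the two frame targets.
    for url in ("http://example.com/welcome.html",
                "http://example.com/app/MyServlet?menu=1"):
        print(url,
              decide(POSTED_RULES, url),
              decide(with_assumed_rule(POSTED_RULES), url))
```

If the obscured line really is the stock rule, a URL containing "?" or "=" would be rejected before the final "+." rule is reached.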
I started Nutch with this command line:
bin/nutch crawl urls -dir /home/paul/nutch-searcher.dir -depth 3 >& crawl.log
The file "urls" contains the URL of the following page:
<HTML>
<HEAD>
<TITLE> TitleOfSite </TITLE>
</HEAD>
<FRAMESET ROWS="14%, *">
<FRAME NORESIZE NAME="MENU" SRC="MyServlet?menu=1" SCROLLING =AUTO">
<FRAME NAME="PAGE" SRC="../welcome.html" SCROLLING=AUTO">
</FRAMESET>
</HTML>
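For reference, the two FRAME SRC values are relative, so a crawler has to resolve them against the address of the frameset page before fetching. The short Python sketch below shows the resolution using the standard library. The base URL is hypothetical, since the real address of the page is not given here.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical address of the frameset page (not given in the post).
BASE = "http://example.com/app/index.html"

# SRC values copied from the two FRAME tags.
FRAME_SOURCES = ["MyServlet?menu=1", "../welcome.html"]


def resolve_frames(base, sources):
    """Resolve each relative SRC value against the page address."""
    return [urljoin(base, src) for src in sources]


if __name__ == "__main__":
    for url in resolve_frames(BASE, FRAME_SOURCES):
        # Show the absolute URL and its query string (empty if there is none).
        print(url, repr(urlsplit(url).query))
```

Only the servlet URL carries a query string; welcome.html resolves to a plain path.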
Nutch crawls and fetches "welcome.html" but does not fetch MyServlet?menu=1.
The servlet "MyServlet?menu=1" displays some links, but according to the log Nutch doesn't
fetch any of them.
I hope the question is clear and am looking forward to receiving your answer.
Adriano