Posted to user@nutch.apache.org by ad...@interfree.it on 2005/09/14 16:54:28 UTC

crawl-urlfilter.txt

Hi,
thank you for your hints, but I didn't give you the following information:

I modified the file crawl-urlfilter.txt as follows:
#start crawl-urlfilter
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$

# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]

# accept anything else
+.
#end crawl-urlfilter
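
For reference, here is a minimal sketch of how a rule list like the one above could be evaluated. It is written in Python for illustration only and is not Nutch's own code. It assumes that rules are tried in file order, that the first pattern found anywhere in the URL decides ('+' accepts, '-' rejects), and that a URL matching no rule is rejected. The rule shown above as "[EMAIL PROTECTED]" was obscured by the archive; the sketch substitutes the pattern from the stock Nutch filter file, -[?*!@=], purely as an assumption. The example.com URLs are hypothetical.

```python
import re

# Rules in file order: (sign, pattern). '+' accepts, '-' rejects.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg"
          r"|xls|gz|rpm|tgz|mov|MOV|exe)$"),
    # ASSUMPTION: this rule is obscured in the post; the stock Nutch
    # pattern is used here only for illustration.
    ("-", r"[?*!@=]"),
    ("+", r"."),
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern occurs in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # No rule matched: treat the URL as rejected.
    return False


if __name__ == "__main__":
    # Hypothetical URLs, chosen to exercise each rule.
    for url in (
        "http://example.com/welcome.html",
        "http://example.com/MyServlet?menu=1",
        "http://example.com/logo.gif",
        "ftp://example.com/file.txt",
    ):
        print("accept" if accepts(url) else "reject", url)
```

Under these assumptions, any URL containing '?' or '=' is rejected before the final "+." rule is reached. Whether that applies here depends on what the obscured rule really contains.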


I started Nutch with this command line:
bin/nutch crawl urls -dir /home/paul/nutch-searcher.dir -depth 3 >& crawl.log

The file "urls" contains the URL of the following page:

<HTML>

<HEAD>
<TITLE>  TitleOfSite </TITLE>
</HEAD>

<FRAMESET ROWS="14%, *">

<FRAME NORESIZE NAME="MENU" SRC="MyServlet?menu=1" SCROLLING =AUTO">

<FRAME NAME="PAGE"  SRC="../welcome.html" SCROLLING=AUTO">

</FRAMESET>

</HTML>


Nutch crawls and fetches "welcome.html" but doesn't seem to process MyServlet?menu=1.
The servlet page "MyServlet?menu=1" contains some links, but according to the log
Nutch doesn't fetch any of them.
I hope the question is clear and am looking forward to receiving your answer.

                                         Adriano
