Posted to user@nutch.apache.org by ad...@interfree.it on 2005/10/01 16:46:11 UTC
Fwd: problem about the fetch of dynamic page
Hi, I have a question about the Nutch crawler:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
I want to set up document search on a site that is accessed with authentication (user/password).
Once the login is done, the first page displayed by the application is composed of two frames:
<HTML>
<HEAD>
<TITLE>Sistema Provvedimenti - SUPER</TITLE>
</HEAD> <FRAMESET ROWS="14%,*">
<FRAME NORESIZE NAME="MENU" SRC="Servlet1?menu=1" SCROLLING="AUTO">
<FRAME NAME="PAGE" SRC="../a.html" SCROLLING="AUTO">
</FRAMESET>
</HTML>
The servlet "Servlet1" publishes a table with 1 row and N columns,
where every column contains an href with the URL of another servlet (Servlet2 to ServletN).
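To make explicit which URLs a crawler has to discover from the frameset above, here is a minimal sketch using only the Python standard library. It is not how Nutch parses pages; it only resolves the two FRAME SRC values (and any href) against a base URL. BASE_URL and the example.com host are assumptions, since the real address of the application is not given.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical address of the frameset page; the real one is not in the post.
BASE_URL = "http://example.com/app/main/index.html"

FRAMESET = """<HTML>
<HEAD>
<TITLE>Sistema Provvedimenti - SUPER</TITLE>
</HEAD> <FRAMESET ROWS="14%,*">
<FRAME NORESIZE NAME="MENU" SRC="Servlet1?menu=1" SCROLLING="AUTO">
<FRAME NAME="PAGE" SRC="../a.html" SCROLLING="AUTO">
</FRAMESET>
</HTML>"""


class LinkCollector(HTMLParser):
    """Collects absolute URLs from <frame src=...> and <a href=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser hands us tag and attribute names already lowercased.
        wanted = {"frame": "src", "a": "href"}.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative references such as "../a.html".
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute frame/anchor URLs found in html, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links
```

Calling extract_links(FRAMESET, BASE_URL) gives the two frame targets as absolute URLs. Running the same function on the HTML that Servlet1 actually returns to a non-browser client would show whether the Servlet2 to ServletN links are present in that response at all.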
DESCRIPTION OF THE PROBLEM:
My problem is that I see the crawler fetch the login page, the static page a.html, and the servlet Servlet1, but it does not fetch any of the other servlets (Servlet2 to ServletN).
If I put the hrefs in the page a.html instead, Nutch succeeds in fetching the URLs and everything works.
DESCRIPTION OF OUR CONFIGURATION OF NUTCH:
I installed Nutch 0.6. I launch Nutch this way:
/usr/nutch-0.6/bin/nutch crawl url -dir index -depth 10 -threads 8 >& crawl.log
where the file "url" contains only the URL of the site, with just the login and password.
I modified the Nutch configuration file "crawl-urlfilter.txt" as follows:
-^(ftp|mailto):
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
+[?&=]
+.
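As a quick check of whether these rules could be what blocks the servlet URLs, the sketch below replays them in Python. It assumes, as I understand the regex URL filter, that rules are tried in order, the first pattern found anywhere in the URL decides, and a URL matching no rule is ignored. The wrapped extension list is joined back into one line, and the example.com URLs are made up for illustration.

```python
import re

# Rules copied from the crawl-urlfilter.txt in the post:
# a leading '+' accepts the URL, a leading '-' rejects it.
RULES = [
    r"-^(ftp|mailto):",
    r"-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$",
    r"+[?&=]",
    r"+.",
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule at all is ignored.
    return False


if __name__ == "__main__":
    # Hypothetical URLs, only to exercise the rules.
    for url in (
        "http://example.com/app/Servlet2?id=5",
        "http://example.com/img/logo.gif",
        "mailto:someone@example.com",
    ):
        print(url, accepts(url))
```

Under these assumptions a servlet URL with a query string is accepted, which suggests the filter file is not the reason the Servlet2 to ServletN pages are missing.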
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Please, somebody help me! It is very important for me.
Adriano Palombo
http error: 400
Posted by Michael Ji <fj...@yahoo.com>.
Hi,
When I run a fetch, I get the following error message.
Could anyone tell me what causes it? Is it due to the
Nutch configuration?
thanks,
Michael Ji
----------------------------
070601 152106 fetch of http://www.flmatchmaker.com/ ,flmatchmaker.com/, 1, Flmatchmaker , 3, 2 failed with: java.lang.Exception: org.apache.nutch.protocol.http.HttpError: HTTP Error: 400
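For anyone reading a log like this, here is a small sketch that splits a "fetch of ... failed with: ..." line into its parts and flags fetch targets containing whitespace or commas. The sample line is the message above with the wrapped pieces joined; the single space after the first URL is my assumption. The "suspicious" flag is only a heuristic for inspection, not a diagnosis of the 400.

```python
import re

# Timestamp, fetch target, and error text of a failed-fetch log line.
FAILED = re.compile(r"^(\d{6} \d{6}) fetch of (.+?) failed with: (.+)$")

# The log message from the post, re-joined (spacing after the URL is assumed).
SAMPLE = (
    "070601 152106 fetch of http://www.flmatchmaker.com/ ,flmatchmaker.com/, "
    "1, Flmatchmaker , 3, 2 failed with: java.lang.Exception: "
    "org.apache.nutch.protocol.http.HttpError: HTTP Error: 400"
)


def parse_failure(line):
    """Return a dict describing a failed fetch, or None if the line does not match."""
    match = FAILED.match(line.strip())
    if match is None:
        return None
    timestamp, target, error = match.groups()
    return {
        "timestamp": timestamp,
        "target": target,
        "error": error,
        # Heuristic: a plain URL normally has no whitespace or commas in it.
        "suspicious": bool(re.search(r"[\s,]", target)),
    }


if __name__ == "__main__":
    print(parse_failure(SAMPLE))
```

On the sample, the fetch target is not just the URL but also ",flmatchmaker.com/, 1, Flmatchmaker , 3, 2", so the input the URL came from may be worth a look.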
Re: Fwd: problem about the fetch of dynamic page
Posted by Piotr Kosiorowski <pk...@gmail.com>.
You can use the "nutch readdb" command to check whether the URLs you are
interested in were added to the WebDB. If they were, check whether the
segments contain these URLs. Please also review the fetch logs to see
whether there was an attempt to fetch these URLs (you might have a problem
with authentication). Right now the description is too generic for me to
help in more detail.
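The log review can be done with a few lines of script. The sketch below only looks for URL fragments as substrings, so it does not depend on the exact log format. The sample lines and the "fetching" wording in them are made up for illustration; crawl.log is the file name from the original command.

```python
def mentioned_urls(log_lines, fragments):
    """Map each URL fragment to the log lines that mention it."""
    hits = {fragment: [] for fragment in fragments}
    for line in log_lines:
        for fragment in fragments:
            if fragment in line:
                hits[fragment].append(line.rstrip("\n"))
    return hits


# Hypothetical log lines; a real run would read them from crawl.log, e.g.
#   with open("crawl.log") as log: hits = mentioned_urls(log, ["Servlet2"])
SAMPLE_LOG = [
    "051001 100000 fetching http://example.com/app/Servlet1?menu=1\n",
    "051001 100001 fetching http://example.com/a.html\n",
]

if __name__ == "__main__":
    for fragment, lines in mentioned_urls(SAMPLE_LOG, ["Servlet1", "Servlet2"]).items():
        print(fragment, len(lines))
```

A fragment with no matching lines means the log never mentions that URL, which would point at link discovery rather than at a failed fetch.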
Regards
Piotr