Posted to user@nutch.apache.org by mu xiaofeng <he...@gmail.com> on 2005/09/08 08:50:58 UTC

How can I use Nutch 0.7 to crawl the Dynamic news?

Hi,

I'm using the Nutch 0.7 crawler to fetch my site,
but it only fetches the static files, like:
xxx.htm, xxx.html, xxx.asp, xxx.php, xxx.js

How can I make it fetch the dynamic news pages,
e.g. http://mysite.com/news.asp?id=12345 ?
My crawl-urlfilter.txt contains:
-----------------------------------------
# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept hosts in MY.DOMAIN.NAME
+^http://mysite.com/

# skip everything else
-.
-----------------------------------------

Thx all,
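
The comments in the filter file describe a first-match rule: the first pattern that matches a URL decides whether it is fetched. Below is a rough Python sketch of that rule, not Nutch's own code. It assumes each pattern is searched for anywhere in the URL (as rules like "-." imply) and that Python's re module treats these particular patterns the same way Nutch's regex engine does. The names parse_rules and decide are made up for this illustration.

```python
import re

# Rules copied from the crawl-urlfilter.txt above (comments dropped).
RULES = r"""
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
-[?*!@=]
+^http://mysite.com/
-.
"""


def parse_rules(text):
    """Return (accept, compiled_pattern) pairs in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # The first character is '+' (accept) or '-' (reject).
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def decide(rules, url):
    """First matching rule wins; if nothing matches, the URL is ignored."""
    for accept, pattern in rules:
        if pattern.search(url):  # assumption: match anywhere in the URL
            return accept, pattern.pattern
    return False, None


rules = parse_rules(RULES)
for url in ("http://mysite.com/test_sample/test.html",
            "http://mysite.com/news.asp?id=12345"):
    accepted, rule = decide(rules, url)
    print(url, "->", "fetch" if accepted else "skip", "| decided by:", rule)
```

Under these assumptions the static URL reaches the "+^http://mysite.com/" rule and is fetched, while the news URL is stopped earlier by "-[?*!@=]" because it contains '?' and '='.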

Re: RE: Antwort: RE: How can I use Nutch 0.7 to crawl the Dynamic news?

Posted by mu xiaofeng <he...@gmail.com>.
Hi all,

Sorry for my unclear description. I know that Nutch can index text/html files.

My problem is that the Nutch crawler only fetches URLs like
http://mysite.com/test_sample/test.html and skips all URLs like
http://mysite.com/test_news/news.asp?newsid=123xx .
How can I make it fetch these URLs?


Antwort: RE: Antwort: RE: How can I use Nutch 0.7 to crawl the Dynamic news?

Posted by Ro...@wuestenrot.at.
Hi,

Sorry, that was my mistake.

Of course Nutch indexes all URLs and pages, and reads the file as text/html.
So you are right :-)

I'm quite new to Nutch (my first day was yesterday :-).

regards
robert




Sébastien LE CALLONNEC <sl...@yahoo.ie>

08.09.2005 12:35
Please reply to nutch-user
 
        To:      nutch-user@lucene.apache.org
        Cc: 
        Subject: RE: Antwort: RE: How can I use Nutch 0.7 to crawl the Dynamic news?


Hi,


I am not too sure what you're saying...  The ASP pages may be built
from data pulled out from a database, but at the end of the day, what
the browser displays is of text/html content-type, which can be indexed
by Nutch.

Or is your question related to another matter altogether?


Regards,
Sebastien.

RE: Antwort: RE: How can I use Nutch 0.7 to crawl the Dynamic news?

Posted by Sébastien LE CALLONNEC <sl...@yahoo.ie>.
Hi,


I am not too sure what you're saying...  The ASP pages may be built
from data pulled out from a database, but at the end of the day, what
the browser displays is of text/html content-type, which can be indexed
by Nutch.

Or is your question related to another matter altogether?


Regards,
Sebastien.


Antwort: RE: How can I use Nutch 0.7 to crawl the Dynamic news?

Posted by Ro...@wuestenrot.at.
Hi,

I think the problem is that the content comes from a database and not from
a file?
So the question is how to index a database with Nutch?

regards,
robert





Sébastien LE CALLONNEC <sl...@yahoo.ie>

08.09.2005 10:46
Please reply to nutch-user
 
        To:      nutch-user@lucene.apache.org, hetao3@gmail.com
        Cc: 
        Subject: RE: How can I use Nutch 0.7 to crawl the Dynamic news?


Hi, 

You need to remove the '?' and the '=' from the following pattern:
-[?*!@=]

Regards,
Sebastien.





RE: How can I use Nutch 0.7 to crawl the Dynamic news?

Posted by Sébastien LE CALLONNEC <sl...@yahoo.ie>.
Hi, 

You need to remove the '?' and the '=' from the following pattern:
-[?*!@=]

Regards,
Sebastien.
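
To illustrate the suggestion above: with '?' and '=' removed, the rule would read "-[*!@]". The short Python check below is only a sketch, assuming the pattern is searched for anywhere in the URL and that Python's re module behaves like Nutch's regex engine for these simple character classes. The variable names are invented for the example.

```python
import re

url = "http://mysite.com/news.asp?id=12345"

original_class = re.compile(r"[?*!@=]")  # rule as posted: -[?*!@=]
edited_class = re.compile(r"[*!@]")      # '?' and '=' removed: -[*!@]

# A "-" rule rejects the URL as soon as its pattern is found in it.
blocked_before = original_class.search(url) is not None
blocked_after = edited_class.search(url) is not None

print(blocked_before, blocked_after)  # prints: True False
```

Once the edited rule no longer matches, the URL falls through to the next rule in the file, "+^http://mysite.com/", and should be accepted. Note that this also lets through every other query-string URL on the site, so the crawl may grow considerably.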

