Posted to user@nutch.apache.org by mu xiaofeng <he...@gmail.com> on 2005/09/08 08:50:58 UTC
How can I use Nutch 0.7 to crawl the Dynamic news?
Hi,
I'm using the Nutch 0.7 crawler to fetch my site,
but it only fetches static HTML files like:
xxx.htm, xxx.html, xxx.asp, xxx.php, xxx.js
How can I get it to fetch the dynamic news pages,
e.g. http://mysite.com/news.asp?id=12345 ?
My crawl-urlfilter.txt content is:
-----------------------------------------
# The url filter file used by the crawl command.
# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept hosts in MY.DOMAIN.NAME
+^http://mysite.com/
# skip everything else
-.
-----------------------------------------
Thanks, all.
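Editorial note: the comments in the filter file say that the first matching pattern decides whether a URL is kept. The sketch below replays that rule order in Python to show which line rejects the dynamic URL. It is only an illustration: it assumes each pattern is searched for anywhere in the URL, it uses Python's `re` module as a stand-in for whatever regex engine Nutch 0.7 uses, and the `decide` helper is a hypothetical name, not part of Nutch.

```python
import re

# Rules copied from the crawl-urlfilter.txt in the post, as (sign, regex) pairs.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$"),
    ("-", r"[?*!@=]"),
    ("+", r"^http://mysite.com/"),
    ("-", r"."),
]


def decide(url, rules=RULES):
    """Return (accepted, deciding_regex) with first-match-wins semantics."""
    for sign, regex in rules:
        if re.search(regex, url):
            return sign == "+", regex
    # No pattern matched: per the file's comments, the URL is ignored.
    return False, None


for url in ("http://mysite.com/test_sample/test.html",
            "http://mysite.com/news.asp?id=12345"):
    accepted, regex = decide(url)
    print(url, "->", "accept" if accepted else "reject", "by", regex)
```

Under these assumptions the static page reaches the `+^http://mysite.com/` rule and is accepted, while the news URL is stopped earlier by `-[?*!@=]` because it contains `?` and `=`.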
Re: RE: Antwort: RE: How can I use Nutch 0.7 to crawl the Dynamic news?
Posted by mu xiaofeng <he...@gmail.com>.
Hi all,
I'm sorry for my unclear description. I know that Nutch can index text/html files.
My problem is that the Nutch crawler only fetches URLs like
http://mysite.com/test_sample/test.html and skips all URLs like
http://mysite.com/test_news/news.asp?newsid=123xx .
How can I make it fetch these URLs?
2005/9/8, Robert.Guggenberger@wuestenrot.at <Ro...@wuestenrot.at>:
> Hi,
>
> Sorry, it was my fault.
>
> Of course Nutch indexes all URLs and pages, and reads the file as text/html.
> So you are right :-)
>
> I'm quite new to Nutch (my first day was yesterday :-).
>
> Regards,
> Robert
Antwort: RE: Antwort: RE: How can I use Nutch 0.7 to crawl the Dynamic news?
Posted by Ro...@wuestenrot.at.
Hi,
Sorry, it was my fault.
Of course Nutch indexes all URLs and pages, and reads the file as text/html.
So you are right :-)
I'm quite new to Nutch (my first day was yesterday :-).
Regards,
Robert
Sébastien LE CALLONNEC <sl...@yahoo.ie>
08.09.2005 12:35
Please reply to nutch-user
To: nutch-user@lucene.apache.org
Cc:
Subject: RE: Antwort: RE: How can I use Nutch 0.7 to crawl the Dynamic news?
Hi,
I am not too sure what you're saying... The ASP pages may be built
from data pulled out from a database, but at the end of the day, what
the browser displays is of text/html content-type, which can be indexed
by Nutch.
Or is your question related to another matter altogether?
Regards,
Sebastien.
RE: Antwort: RE: How can I use Nutch 0.7 to crawl the Dynamic news?
Posted by Sébastien LE CALLONNEC <sl...@yahoo.ie>.
Hi,
I am not too sure what you're saying... The ASP pages may be built
from data pulled out from a database, but at the end of the day, what
the browser displays is of text/html content-type, which can be indexed
by Nutch.
Or is your question related to another matter altogether?
Regards,
Sebastien.
--- Robert.Guggenberger@wuestenrot.at wrote:
> Hi,
>
> I think the problem is that the content comes from a database and not
> from a file?
> So the question is how to index a database with Nutch?
>
> Regards,
> Robert
Antwort: RE: How can I use Nutch 0.7 to crawl the Dynamic news?
Posted by Ro...@wuestenrot.at.
Hi,
I think the problem is that the content comes from a database and not from
a file?
So the question is how to index a database with Nutch?
Regards,
Robert
Sébastien LE CALLONNEC <sl...@yahoo.ie>
08.09.2005 10:46
Please reply to nutch-user
To: nutch-user@lucene.apache.org, hetao3@gmail.com
Cc:
Subject: RE: How can I use Nutch 0.7 to crawl the Dynamic news?
Hi,
You need to remove the '?' and the '=' from the following pattern:
-[?*!@=]
Regards,
Sebastien.
RE: How can I use Nutch 0.7 to crawl the Dynamic news?
Posted by Sébastien LE CALLONNEC <sl...@yahoo.ie>.
Hi,
You need to remove the '?' and the '=' from the following pattern:
-[?*!@=]
Regards,
Sebastien.
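Editorial note: a small check of the suggested change, comparing the original character class with one that has '?' and '=' removed. This is a sketch only: Python's `re` stands in for Nutch's regex engine, the names `ORIGINAL` and `REVISED` are made up for the illustration, and the URL is the example from the thread.

```python
import re

ORIGINAL = r"[?*!@=]"  # body of the '-' rule in the posted file
REVISED = r"[*!@]"     # same class with '?' and '=' removed

url = "http://mysite.com/test_news/news.asp?newsid=123xx"

# The original '-' rule matches, so the URL is skipped.
print(bool(re.search(ORIGINAL, url)))  # True

# The revised '-' rule no longer matches, so later rules get a chance.
print(bool(re.search(REVISED, url)))  # False

# The '+' rule for the site then accepts the URL.
print(bool(re.search(r"^http://mysite.com/", url)))  # True
```

In the filter file itself this corresponds to changing the line `-[?*!@=]` to `-[*!@]`. URLs with query strings on other hosts are still rejected by the final `-.` rule.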