Posted to user@nutch.apache.org by nutch_newbie <ka...@hotmail.com> on 2008/10/05 21:29:26 UTC

Re: Nutch and its Growing Capabilities

here it is:
# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*healthline.com/
+^http://([a-z0-9]*\.)*healthfind.com/
+^http://([a-z0-9]*\.)*omnimedicalsearch.com/
+^http://([a-z0-9]*\.)*nih.gov
+^http://([a-z0-9]*\.)*cdc.gov/
+^http://([a-z0-9]*\.)*cancer.gov
+^http://([a-z0-9]*\.)*medpagetoday.com/
+^http://([a-z0-9]*\.)*fda.gov
+^http://([a-z0-9]*\.)*ovid.com
+^http://([a-z0-9]*\.)*intute.ac.uk
+^http://([a-z0-9]*\.)*guideline.gov
+^http://([a-z0-9]*\.)*jwatch.org
+^http://([a-z0-9]*\.)*clinicaltrials.gov
+^http://([a-z0-9]*\.)*centerwatch.com
+^http://([a-z0-9]*\.)*eMedicine.com
+^http://([a-z0-9]*\.)*rxlist.com
+^http://([a-z0-9]*\.)*oncolink.com
+^http://([a-z0-9]*\.)*omnimedicalsearch.com
+^http://([a-z0-9]*\.)*mwsearch.com/
+^http://([a-z0-9]*\.)*hon.ch/MedHunt/
+^http://([a-z0-9]*\.)*medicinenet.com
+^http://([a-z0-9]*\.)*webmd.com/
+^http://([a-z0-9]*\.)*medlineplus.gov/
+^http://([a-z0-9]*\.)*emedisearch.com
+^http://([a-z0-9]*\.)*diabetes-experts.com
+^http://([a-z0-9]*\.)*obesity-experts.com
+^http://([a-z0-9]*\.)*insomnia-treatment101.com
+^http://([a-z0-9]*\.)*bursitis101.com
+^http://([a-z0-9]*\.)*prostate-experts.com
+^http://([a-z0-9]*\.)*cystic-fibrosis101.com
+^http://([a-z0-9]*\.)*acid-reflux101.com
+^http://([a-z0-9]*\.)*addiction-treatment101.com
+^http://([a-z0-9]*\.)*medicalndx.com/
+^http://([a-z0-9]*\.)*mwsearch.com
+^http://([a-z0-9]*\.)*ncbi.nlm.nih.gov/pubmed
+^http://([a-z0-9]*\.)*sumsearch.uthscsa.edu/
+^http://([a-z0-9]*\.)*health.flexfinder.com
+^http://([a-z0-9]*\.)*medic8.com
+^http://([a-z0-9]*\.)*healthatoz.com
+^http://([a-z0-9]*\.)*kmle.com
+^http://([a-z0-9]*\.)*medworld.stanford.edu/medbot/
+^http://([a-z0-9]*\.)*lib.uiowa.edu/hardin/md/
+^http://([a-z0-9]*\.)*HealthAtoZ.com/
+^http://([a-z0-9]*\.)*healthfinder.gov 
+^http://([a-z0-9]*\.)*unmc.edu/library/education/internet/medsearch.
+^http://([a-z0-9]*\.)*mdlinx.com
+^http://([a-z0-9]*\.)*unmc.edu/library/education/internet/medsearch.html#medical
+^http://([a-z0-9]*\.)*hon.ch
+^http://([a-z0-9]*\.)*medbioworld.com
+^http://([a-z0-9]*\.)*medlineplus.gov
+^http://([a-z0-9]*\.)*medscape.com
+^http://([a-z0-9]*\.)*scirus.com
+^http://([a-z0-9]*\.)*metacrawler.com
+^http://([a-z0-9]*\.)*vivisimo.com/
+^http://([a-z0-9]*\.)*livegrandrounds.com
+^http://([a-z0-9]*\.)*nlm.nih.gov/
+^http://([a-z0-9]*\.)*nih.gov/
+^http://([a-z0-9]*\.)*os.dhhs.gov/
+^http://([a-z0-9]*\.)*pubmedcentral.nih.gov/
+^http://([a-z0-9]*\.)*emedicine.com/EMERG/
+^http://([a-z0-9]*\.)*emedmag.com/
+^http://([a-z0-9]*\.)*aep.org/
+^http://([a-z0-9]*\.)*aaem.org/
+^http://([a-z0-9]*\.)*abem.org/public/
+^http://([a-z0-9]*\.)*ncemi.org/
+^http://([a-z0-9]*\.)*embbs.com
+^http://([a-z0-9]*\.)*emedhome.com
+^http://([a-z0-9]*\.)*biomedcentral.com/bmcemergmed/ 
+^http://([a-z0-9]*\.)*emj.bmj.com/
+^http://([a-z0-9]*\.)*emedicine.com/emerg/index.shtml
# skip everything else
-.

and here is another version that I tried:
# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
#-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*\S* 

# skip everything else
-.
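
A quick way to sanity-check which rule fires first for a given URL is a tiny
standalone harness like the one below. This is only an illustration of the
first-match logic described in the comments, not Nutch's actual RegexURLFilter,
and the rule list is just a trimmed copy of the file above:

// RegexFilterCheck.java - standalone illustration of first-match URL
// filtering. Not Nutch code; the rules are a trimmed copy of the filter
// file above, in file order.
import java.util.regex.Pattern;

public class RegexFilterCheck {

    // each rule is '+' or '-' followed by a regular expression
    private static final String[] RULES = {
        "-^(file|ftp|mailto):",
        "-\\.(gif|GIF|jpg|JPG|ico|ICO|css|zip|exe|png)$",
        "+^http://([a-z0-9]*\\.)*\\S*",
        "-."
    };

    // the first rule whose regex is found in the URL decides the outcome;
    // if nothing matches, the URL is ignored
    static boolean accept(String url) {
        for (String rule : RULES) {
            Pattern p = Pattern.compile(rule.substring(1)); // compiled on the fly for brevity
            if (p.matcher(url).find()) {
                return rule.charAt(0) == '+';
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(accept("http://www.webmd.com/default.htm")); // true
        System.out.println(accept("mailto:someone@example.org"));       // false
        System.out.println(accept("http://www.cdc.gov/logo.png"));      // false
    }
}

Compile and run it with javac RegexFilterCheck.java && java RegexFilterCheck;
swapping in the full rule list from the first file shows which pattern is
responsible when a URL you expected gets dropped.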
 


-- 
View this message in context: http://www.nabble.com/Nutch-and-its-Growing-Capabilities-tp19597372p19828279.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Nutch and its Growing Capabilities

Posted by Kevin MacDonald <ke...@hautesecure.com>.
The second version looks like it should work. I would look at
Fetcher.handleRedirect() and put extra log lines around both the normalizers
and the urlfilters (something like the sketch below); it's possible that one
of those is filtering out URLs that you expect to have crawled. I don't use
Nutch in the same way you do, so I can't offer more advice than that. Good luck.
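
Roughly the kind of thing I mean, inside handleRedirect() -- note that the
field names (normalizers, urlFilters, LOG), the newUrl variable and the
SCOPE_FETCHER constant are my guesses from the Fetcher source of that era,
so adjust them to whatever your checkout actually uses:

// sketch only: extra logging around the normalizer and filter calls in
// Fetcher.handleRedirect(); names are assumptions, adapt to your source
LOG.info("redirect before normalize: " + newUrl);
newUrl = normalizers.normalize(newUrl, URLNormalizers.SCOPE_FETCHER);
LOG.info("redirect after normalize: " + newUrl);

newUrl = urlFilters.filter(newUrl);
if (newUrl == null) {
  LOG.info("redirect rejected by urlfilters");
} else {
  LOG.info("redirect accepted: " + newUrl);
}

If the "rejected" line shows up for URLs you expected to keep, the filter file
is the culprit; if the URL changes unexpectedly after normalization, look at
the normalizer plugins instead.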


Extensive web crawl

Posted by Webmaster <we...@axismedia.ca>.
Ok..

So I want to index the web..  All of it..

Any thoughts on how to automate this so I can just point the spider off on
its merry way and have it return 20 billion pages?

So far I've been injecting random portions of the DMOZ mixed with other URLs
like directory.yahoo.com and wiki.org.  I was hoping this would give me a
good return with an unrestricted URL filter where MY.DOMAIN.NAME was replaced
with *.* --  Perhaps this is my error, and that line should be left as-is and
the last line should be +. instead of -. ?
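
(For reference, the +. variant I'm asking about would look roughly like this --
keep the protocol/suffix/query excludes and then accept whatever survives them:)

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept everything else
+.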

Anyhow, after injecting 2000 URLs and a few of my own I still only get back
minimal results, in the range of 500 to 600k URLs.

Right now I have a new crawl going with 1 million injected URLs from the
DMOZ; I'm thinking that this should return a 20 million page index at
least..  No?

Anyhow..  I have more HD space on the way and would like to get the indexing
up to 1 billion by the end of the week..

Any examples of how to set up crawl-urlfilter.txt and regex-urlfilter.txt would
be helpful..

Thanks..

Axel..