Posted to user@nutch.apache.org by nutch_newbie <ka...@hotmail.com> on 2008/09/21 21:05:27 UTC

Nutch and its Growing Capabilities

Hello everyone,
I'm having some trouble with something in my Nutch application, but I'm
having a hard time finding it, let alone fixing it. So if anyone can help
with any suggestions, I would really appreciate it.
My Nutch is working just fine; it fetches, crawls, and displays results.
Only it's not growing.
My understanding is that when it has a URL, as time goes by, it also somehow
starts including links found on that URL. For example, if I give it
samplewebsite.com, and that website has a link on it (e.g.
someotherwebsite.com), it will eventually display it during a search. So
over time, instead of 6 results per keyword, I would get 7 or 8 or 9, and it
would continue growing. Well, it's not doing so. By the way, I set my Nutch
up for whole-web crawling. So is there some sort of trick or plugin?
I've spent a good few months experimenting and trying different configs,
etc., but no luck.
I would really appreciate it if someone could help me out.
Thank you in advance.
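For context on why a whole-web crawl does or does not grow: new URLs only
enter the crawldb when the generate/fetch/updatedb cycle is repeated,
because updatedb is the step that folds newly discovered outlinks back into
the crawldb, and the active URL filters are applied to those outlinks as
well. A rough sketch of one round, assuming the directory layout from the
Nutch whole-web tutorial (the paths, the seeds/ directory, and the -topN
value are placeholders, not taken from this thread):

# one whole-web round; repeat it so the crawldb can grow beyond the seeds
bin/nutch inject crawl/crawldb seeds/            # run once to seed the crawldb
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
segment=`ls -d crawl/segments/* | tail -1`       # the segment just generated
bin/nutch fetch $segment
bin/nutch updatedb crawl/crawldb $segment        # outlinks that pass the URL
                                                 # filters are added here
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

If the cycle stops after one round, or if every outlink is rejected by the
filter file in use (crawl-urlfilter.txt is typically used by the one-step
crawl command, regex-urlfilter.txt by the individual tools), the index stays
the same size no matter how long the crawl runs.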


Re: Nutch and its Growing Capabilities

Posted by Kevin MacDonald <ke...@hautesecure.com>.
The second version looks like it should work. I would look at
Fetcher.handleRedirect() and put extra log lines around both the normalizers
and the URL filters. It's possible that one of those is filtering out URLs
that you expect to have crawled. I don't use Nutch in the same way you do,
so I can't offer more advice than that. Good luck.
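
To make that suggestion concrete, below is a minimal, illustrative sketch of
the kind of logging Kevin describes, written as a standalone class against
Nutch's URLNormalizers and URLFilters APIs. The class name UrlFilterDebug
and its main() are made up for illustration; the real Fetcher.handleRedirect()
differs between Nutch versions, so treat this as a debugging aid rather than
the actual Fetcher source:

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilters;
import org.apache.nutch.net.URLNormalizers;
import org.apache.nutch.util.NutchConfiguration;

/** Logs what the configured normalizers and URL filters do to one URL. */
public class UrlFilterDebug {
  private static final Log LOG = LogFactory.getLog(UrlFilterDebug.class);

  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    URLNormalizers normalizers =
        new URLNormalizers(conf, URLNormalizers.SCOPE_FETCHER);
    URLFilters filters = new URLFilters(conf);

    String url = args[0];
    LOG.info("before normalization: " + url);

    String normalized = normalizers.normalize(url, URLNormalizers.SCOPE_FETCHER);
    LOG.info("after normalization:  " + normalized);

    // filter() returns null when a '-' pattern matches before any '+' pattern
    String accepted = filters.filter(normalized);
    LOG.info("after urlfilters:     " + (accepted == null ? "REJECTED" : accepted));
  }
}

Running it against a redirect target (or any URL you expected to see
crawled) shows immediately whether a '-' rule in the active filter file is
the reason it never reaches the crawldb.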

On Sun, Oct 5, 2008 at 12:29 PM, nutch_newbie <ka...@hotmail.com> wrote:

>
> here it is:
> # The url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'.  The first matching pattern in the file
> # determines whether a URL is included or ignored.  If no pattern
> # matches, the URL is ignored.
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/.+?)/.*?\1/.*?\1/
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*healthline.com/
> +^http://([a-z0-9]*\.)*healthfind.com/
> +^http://([a-z0-9]*\.)*omnimedicalsearch.com/
> +^http://([a-z0-9]*\.)*nih.gov
> +^http://([a-z0-9]*\.)*cdc.gov/
> +^http://([a-z0-9]*\.)*cancer.gov
> +^http://([a-z0-9]*\.)*medpagetoday.com/
> +^http://([a-z0-9]*\.)*fda.gov
> +^http://([a-z0-9]*\.)*ovid.com
> +^http://([a-z0-9]*\.)*intute.ac.uk
> +^http://([a-z0-9]*\.)*guideline.gov
> +^http://([a-z0-9]*\.)*jwatch.org
> +^http://([a-z0-9]*\.)*clinicaltrials.gov
> +^http://([a-z0-9]*\.)*centerwatch.com
> +^http://([a-z0-9]*\.)*eMedicine.com
> +^http://([a-z0-9]*\.)*rxlist.com
> +^http://([a-z0-9]*\.)*oncolink.com
> +^http://([a-z0-9]*\.)*omnimedicalsearch.com
> +^http://([a-z0-9]*\.)*mwsearch.com/
> +^http://([a-z0-9]*\.)*hon.ch/MedHunt/
> +^http://([a-z0-9]*\.)*medicinenet.com
> +^http://([a-z0-9]*\.)*webmd.com/
> +^http://([a-z0-9]*\.)*medlineplus.gov/
> +^http://([a-z0-9]*\.)*emedisearch.com
> +^http://([a-z0-9]*\.)*diabetes-experts.com
> +^http://([a-z0-9]*\.)*obesity-experts.com
> +^http://([a-z0-9]*\.)*insomnia-treatment101.com
> +^http://([a-z0-9]*\.)*bursitis101.com
> +^http://([a-z0-9]*\.)*prostate-experts.com
> +^http://([a-z0-9]*\.)*cystic-fibrosis101.com
> +^http://([a-z0-9]*\.)*acid-reflux101.com
> +^http://([a-z0-9]*\.)*addiction-treatment101.com
> +^http://([a-z0-9]*\.)*medicalndx.com/
> +^http://([a-z0-9]*\.)*mwsearch.com
> +^http://([a-z0-9]*\.)*ncbi.nlm.nih.gov/pubmed
> +^http://([a-z0-9]*\.)*sumsearch.uthscsa.edu/
> +^http://([a-z0-9]*\.)*health.flexfinder.com
> +^http://([a-z0-9]*\.)*medic8.com
> +^http://([a-z0-9]*\.)*healthatoz.com
> +^http://([a-z0-9]*\.)*kmle.com
> +^http://([a-z0-9]*\.)*medworld.stanford.edu/medbot/
> +^http://([a-z0-9]*\.)*lib.uiowa.edu/hardin/md/
> +^http://([a-z0-9]*\.)*HealthAtoZ.com/
> +^http://([a-z0-9]*\.)*healthfinder.gov
> +^http://([a-z0-9]*\.)*unmc.edu/library/education/internet/medsearch.
> +^http://([a-z0-9]*\.)*mdlinx.com
> +^http://([a-z0-9]*\.)*unmc.edu/library/education/internet/medsearch.html#medical
> +^http://([a-z0-9]*\.)*hon.ch
> +^http://([a-z0-9]*\.)*medbioworld.com
> +^http://([a-z0-9]*\.)*medlineplus.gov
> +^http://([a-z0-9]*\.)*medscape.com
> +^http://([a-z0-9]*\.)*scirus.com
> +^http://([a-z0-9]*\.)*metacrawler.com
> +^http://([a-z0-9]*\.)*vivisimo.com/
> +^http://([a-z0-9]*\.)*livegrandrounds.com
> +^http://([a-z0-9]*\.)*nlm.nih.gov/
> +^http://([a-z0-9]*\.)*nih.gov/
> +^http://([a-z0-9]*\.)*os.dhhs.gov/
> +^http://([a-z0-9]*\.)*pubmedcentral.nih.gov/
> +^http://([a-z0-9]*\.)*emedicine.com/EMERG/
> +^http://([a-z0-9]*\.)*emedmag.com/
> +^http://([a-z0-9]*\.)*aep.org/
> +^http://([a-z0-9]*\.)*aaem.org/
> +^http://([a-z0-9]*\.)*abem.org/public/
> +^http://([a-z0-9]*\.)*ncemi.org/
> +^http://([a-z0-9]*\.)*embbs.com
> +^http://([a-z0-9]*\.)*emedhome.com
> +^http://([a-z0-9]*\.)*biomedcentral.com/bmcemergmed/
> +^http://([a-z0-9]*\.)*emj.bmj.com/
> +^http://([a-z0-9]*\.)*emedicine.com/emerg/index.shtml
> # skip everything else
> -.
>
> and here is another version that I tried:
> # The url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'.  The first matching pattern in the file
> # determines whether a URL is included or ignored.  If no pattern
> # matches, the URL is ignored.
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
>
> # skip URLs containing certain characters as probable queries, etc.
> #-[?*!@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> #-.*(/.+?)/.*?\1/.*?\1/
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*\S*
>
> # skip everything else
> -.
>
>
>

Extensive web crawl

Posted by Webmaster <we...@axismedia.ca>.
Ok..

So I want to index the web..  All of it..

Any thoughts on how to automate this so I can just point the spider off on
its merry way and have it return 20 billion pages?

So far I've been injecting random portions of the DMOZ mixed with other URLs
like directory.yahoo.com and wiki.org. I was hoping this would give me a
good return with an unrestricted URL filter where MY.DOMAIN.COM was replaced
with *.* --  Perhaps this is my error, and that should be left as is and the
last line should be +. instead of -. ?

Anyhow, after injecting 2000 URLs and a few of my own, I still only get back
minimal results in the range of 500 to 600k URLs.

Right now I have a new crawl going with 1 million injected URLs from the
DMOZ; I'm thinking that this should return a 20 million page index at
least..  No?

Anyhow..  I have more HD space on the way and would like to get the indexing
up to 1 billion by the end of the week..

Any examples of how to set up crawl-urlfilter.txt and regex-urlfilter.txt
would be helpful..

Thanks..

Axel..
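
For what it's worth, an unrestricted whole-web filter usually looks like the
stock regex-urlfilter.txt rather than a domain-restricted crawl-urlfilter.txt:
keep the skip rules, drop the per-domain +^http:// lines, and end with +. so
anything not explicitly skipped is accepted. A sketch along those lines,
assembled from the filter files quoted elsewhere in this thread rather than
taken from any particular Nutch release:

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept everything else
+.

With a filter like this, the size of the crawl is governed mainly by how
many generate/fetch/updatedb rounds are run and by the -topN setting, not by
the filter file itself.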


Re: Nutch and its Growing Capabilities

Posted by nutch_newbie <ka...@hotmail.com>.
here it is:
# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*healthline.com/
+^http://([a-z0-9]*\.)*healthfind.com/
+^http://([a-z0-9]*\.)*omnimedicalsearch.com/
+^http://([a-z0-9]*\.)*nih.gov
+^http://([a-z0-9]*\.)*cdc.gov/
+^http://([a-z0-9]*\.)*cancer.gov
+^http://([a-z0-9]*\.)*medpagetoday.com/
+^http://([a-z0-9]*\.)*fda.gov
+^http://([a-z0-9]*\.)*ovid.com
+^http://([a-z0-9]*\.)*intute.ac.uk
+^http://([a-z0-9]*\.)*guideline.gov
+^http://([a-z0-9]*\.)*jwatch.org
+^http://([a-z0-9]*\.)*clinicaltrials.gov
+^http://([a-z0-9]*\.)*centerwatch.com
+^http://([a-z0-9]*\.)*eMedicine.com
+^http://([a-z0-9]*\.)*rxlist.com
+^http://([a-z0-9]*\.)*oncolink.com
+^http://([a-z0-9]*\.)*omnimedicalsearch.com
+^http://([a-z0-9]*\.)*mwsearch.com/
+^http://([a-z0-9]*\.)*hon.ch/MedHunt/
+^http://([a-z0-9]*\.)*medicinenet.com
+^http://([a-z0-9]*\.)*webmd.com/
+^http://([a-z0-9]*\.)*medlineplus.gov/
+^http://([a-z0-9]*\.)*emedisearch.com
+^http://([a-z0-9]*\.)*diabetes-experts.com
+^http://([a-z0-9]*\.)*obesity-experts.com
+^http://([a-z0-9]*\.)*insomnia-treatment101.com
+^http://([a-z0-9]*\.)*bursitis101.com
+^http://([a-z0-9]*\.)*prostate-experts.com
+^http://([a-z0-9]*\.)*cystic-fibrosis101.com
+^http://([a-z0-9]*\.)*acid-reflux101.com
+^http://([a-z0-9]*\.)*addiction-treatment101.com
+^http://([a-z0-9]*\.)*medicalndx.com/
+^http://([a-z0-9]*\.)*mwsearch.com
+^http://([a-z0-9]*\.)*ncbi.nlm.nih.gov/pubmed
+^http://([a-z0-9]*\.)*sumsearch.uthscsa.edu/
+^http://([a-z0-9]*\.)*health.flexfinder.com
+^http://([a-z0-9]*\.)*medic8.com
+^http://([a-z0-9]*\.)*healthatoz.com
+^http://([a-z0-9]*\.)*kmle.com
+^http://([a-z0-9]*\.)*medworld.stanford.edu/medbot/
+^http://([a-z0-9]*\.)*lib.uiowa.edu/hardin/md/
+^http://([a-z0-9]*\.)*HealthAtoZ.com/
+^http://([a-z0-9]*\.)*healthfinder.gov 
+^http://([a-z0-9]*\.)*unmc.edu/library/education/internet/medsearch.
+^http://([a-z0-9]*\.)*mdlinx.com
+^http://([a-z0-9]*\.)*unmc.edu/library/education/internet/medsearch.html#medical
+^http://([a-z0-9]*\.)*hon.ch
+^http://([a-z0-9]*\.)*medbioworld.com
+^http://([a-z0-9]*\.)*medlineplus.gov
+^http://([a-z0-9]*\.)*medscape.com
+^http://([a-z0-9]*\.)*scirus.com
+^http://([a-z0-9]*\.)*metacrawler.com
+^http://([a-z0-9]*\.)*vivisimo.com/
+^http://([a-z0-9]*\.)*livegrandrounds.com
+^http://([a-z0-9]*\.)*nlm.nih.gov/
+^http://([a-z0-9]*\.)*nih.gov/
+^http://([a-z0-9]*\.)*os.dhhs.gov/
+^http://([a-z0-9]*\.)*pubmedcentral.nih.gov/
+^http://([a-z0-9]*\.)*emedicine.com/EMERG/
+^http://([a-z0-9]*\.)*emedmag.com/
+^http://([a-z0-9]*\.)*aep.org/
+^http://([a-z0-9]*\.)*aaem.org/
+^http://([a-z0-9]*\.)*abem.org/public/
+^http://([a-z0-9]*\.)*ncemi.org/
+^http://([a-z0-9]*\.)*embbs.com
+^http://([a-z0-9]*\.)*emedhome.com
+^http://([a-z0-9]*\.)*biomedcentral.com/bmcemergmed/ 
+^http://([a-z0-9]*\.)*emj.bmj.com/
+^http://([a-z0-9]*\.)*emedicine.com/emerg/index.shtml
# skip everything else
-.

and here is another version that I tried:
# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
#-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*\S* 

# skip everything else
-.
 




Re: Nutch and its Growing Capabilities

Posted by Kevin MacDonald <ke...@hautesecure.com>.
What does your crawl-urlfilter.txt file look like?

On Sun, Sep 21, 2008 at 12:05 PM, nutch_newbie <ka...@hotmail.com> wrote:

>
> Hello everyone,
> I'm having some trouble with something in my Nutch application, but I'm
> having a hard time finding it, let alone fixing it. So if anyone can help
> with any suggestions, I would really appreciate it.
> My Nutch is working just fine; it fetches, crawls, and displays results.
> Only it's not growing.
> My understanding is that when it has a URL, as time goes by, it also
> somehow starts including links found on that URL. For example, if I give
> it samplewebsite.com, and that website has a link on it (e.g.
> someotherwebsite.com), it will eventually display it during a search. So
> over time, instead of 6 results per keyword, I would get 7 or 8 or 9, and
> it would continue growing. Well, it's not doing so. By the way, I set my
> Nutch up for whole-web crawling. So is there some sort of trick or plugin?
> I've spent a good few months experimenting and trying different configs,
> etc., but no luck.
> I would really appreciate it if someone could help me out.
> Thank you in advance.