Posted to user@nutch.apache.org by Amitabha Banerjee <hi...@gmail.com> on 2008/09/11 05:29:11 UTC

Unable to crawl all links

Hi folks,
I am unable to crawl all the links in my website. For some reason, only one
or two links are picked up by nutch.

Here is the website I am trying to index: http://www.knowmydestination.com

All links on this website are internal.

My crawl-urlfilter does not block any internal links. It looks as follows:

# accept hosts in MY.DOMAIN.NAME
+^http://www.knowmydestination.com/

# skip everything else
-.

My urls are:  http://www.knowmydestination.com/

When I run:
bin/nutch crawl urls -dir crawl.kmd -depth 3 -topN 100

nutch only crawls one link:
http://www.knowmydestination.com/articles/cheapfares.html

Can anyone help me figure this out?

/Amitab

Re: Unable to crawl all links

Posted by vishal vachhani <vi...@gmail.com>.
Hi Chetan,
Check the properties called "db.ignore.external.links" and "db.ignore.internal.links" in nutch-default.xml. Their descriptions in the file explain what they mean; you can act accordingly if not all pages from the crawl URL are being crawled.
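
For example, to make sure neither kind of outlink is dropped, you could override both properties inside the <configuration> element of conf/nutch-site.xml. This is only a sketch; set the values according to which outlinks you actually want followed:

<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
</property>
<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
</property>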


--vishal

On Sat, Sep 27, 2008 at 3:18 PM, Chetan Patel <ch...@webmail.aruhat.com> wrote:

>
> Hi,
>
> Here is my crawl-urlfilter.txt file
> =============================
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
>
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # accept everything else
> +.
> =============================
>
> Please let me know how I can crawl all URLs.
>
> Thank you.
>
> Regards,
> Chetan Patel
>
>
> Edward Quick wrote:
> >
> >
> > Chetan, if you haven't already done this, check your crawl-urlfilter.txt
> > (or regex-urlfilter.txt if you're running nutch fetch) and comment out
> the
> > line below:
> >
> > # skip URLs containing certain characters as probable queries, etc.
> > #-[?*!@=]
> >
> > Ed.
> >
> >
> >> Date: Fri, 26 Sep 2008 23:18:29 -0700
> >> From: chetan@webmail.aruhat.com
> >> To: nutch-user@lucene.apache.org
> >> Subject: Re: Unable to crawl all links
> >>
> >>
> >> Hi All,
> >>
> >> I have read the db file and it displays the result below.
> >> ==========================================
> >> $ bin/nutch readdb mytest/crawldb -stats
> >> CrawlDb statistics start: mytest/crawldb
> >> Statistics for CrawlDb: mytest/crawldb
> >> TOTAL urls:     345
> >> retry 0:        345
> >> min score:      0.0
> >> avg score:      0.028
> >> max score:      1.055
> >> status 1 (db_unfetched):        285
> >> status 2 (db_fetched):  48
> >> status 3 (db_gone):     5
> >> status 4 (db_redir_temp):       4
> >> status 5 (db_redir_perm):       3
> >> CrawlDb statistics: done
> >> ==========================================
> >>
> >> You can see there are 345 URLs in total. Of those, nutch has fetched only 48.
> >> I need to fetch all 345 URLs.
> >>
> >> I have also tried Kevin's solution.
> >>
> >> Please help me; I am new to nutch.
> >>
> >> Thanks.
> >>
> >> Regards,
> >> Chetan Patel
> >>
> >>
> >>
> >
> >
>
>
>

RE: Unable to crawl all links

Posted by Edward Quick <ed...@hotmail.com>.

> 
> 
> Hi,
> 
> Here is my crawl-urlfilter.txt file
> =============================
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
> 
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
> 
> # accept everything else
> +.
> =============================

Well, that should cover most of it! I'm not one of the experts here, but I think the 285 unfetched URLs in your stats below mean that you need to run 'nutch fetch' on your most recent segment to fetch those 285 links, e.g. try something like this:

nutch fetch crawl/segments/200809251253 > crawl.log

Then go through crawl.log to see which fetches are failing, reconfigure, and re-run.
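
For reference, the manual generate/fetch/update cycle looks roughly like this. This is only a sketch, assuming your crawl directory is called mytest and a 0.9/1.0-style bin/nutch; adjust the paths and the -topN value to taste:

# generate a new segment of up to 1000 unfetched URLs
bin/nutch generate mytest/crawldb mytest/segments -topN 1000
# pick the newest segment
segment=`ls -d mytest/segments/* | tail -1`
# fetch it, then fold the results back into the crawldb
bin/nutch fetch $segment
bin/nutch updatedb mytest/crawldb $segment
# check whether db_unfetched is going down
bin/nutch readdb mytest/crawldb -stats

Repeat the cycle until db_unfetched stops shrinking.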



> 
> Please let me know how I can crawl all URLs.
> 
> Thank you.
> 
> Regards,
> Chetan Patel
> 
> 
> Edward Quick wrote:
> > 
> > 
> > Chetan, if you haven't already done this, check your crawl-urlfilter.txt
> > (or regex-urlfilter.txt if you're running nutch fetch) and comment out the
> > line below:
> > 
> > # skip URLs containing certain characters as probable queries, etc.
> > #-[?*!@=]
> > 
> > Ed.
> > 
> > 
> >> Date: Fri, 26 Sep 2008 23:18:29 -0700
> >> From: chetan@webmail.aruhat.com
> >> To: nutch-user@lucene.apache.org
> >> Subject: Re: Unable to crawl all links
> >> 
> >> 
> >> Hi All,
> >> 
> >> I have read the db file and it displays the result below.
> >> ==========================================
> >> $ bin/nutch readdb mytest/crawldb -stats
> >> CrawlDb statistics start: mytest/crawldb
> >> Statistics for CrawlDb: mytest/crawldb
> >> TOTAL urls:     345
> >> retry 0:        345
> >> min score:      0.0
> >> avg score:      0.028
> >> max score:      1.055
> >> status 1 (db_unfetched):        285
> >> status 2 (db_fetched):  48
> >> status 3 (db_gone):     5
> >> status 4 (db_redir_temp):       4
> >> status 5 (db_redir_perm):       3
> >> CrawlDb statistics: done
> >> ==========================================
> >> 
> >> You can see there are 345 URLs in total. Of those, nutch has fetched only 48.
> >> I need to fetch all 345 URLs.
> >>
> >> I have also tried Kevin's solution.
> >>
> >> Please help me; I am new to nutch.
> >> 
> >> Thanks.
> >> 
> >> Regards,
> >> Chetan Patel
> >> 
> >> 
> >> 
> > 
> > 
> 
> 


RE: Unable to crawl all links

Posted by Chetan Patel <ch...@webmail.aruhat.com>.
Hi,

Here is my crawl-urlfilter.txt file
=============================
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# accept everything else
+.
=============================

Please let me know how I can crawl all URLs.

Thank you.

Regards,
Chetan Patel


Edward Quick wrote:
> 
> 
> Chetan, if you haven't already done this, check your crawl-urlfilter.txt
> (or regex-urlfilter.txt if you're running nutch fetch) and comment out the
> line below:
> 
> # skip URLs containing certain characters as probable queries, etc.
> #-[?*!@=]
> 
> Ed.
> 
> 
>> Date: Fri, 26 Sep 2008 23:18:29 -0700
>> From: chetan@webmail.aruhat.com
>> To: nutch-user@lucene.apache.org
>> Subject: Re: Unable to crawl all links
>> 
>> 
>> Hi All,
>> 
>> I have read the db file and it displays the result below.
>> ==========================================
>> $ bin/nutch readdb mytest/crawldb -stats
>> CrawlDb statistics start: mytest/crawldb
>> Statistics for CrawlDb: mytest/crawldb
>> TOTAL urls:     345
>> retry 0:        345
>> min score:      0.0
>> avg score:      0.028
>> max score:      1.055
>> status 1 (db_unfetched):        285
>> status 2 (db_fetched):  48
>> status 3 (db_gone):     5
>> status 4 (db_redir_temp):       4
>> status 5 (db_redir_perm):       3
>> CrawlDb statistics: done
>> ==========================================
>> 
>> You can see there are 345 URLs in total. Of those, nutch has fetched only 48.
>> I need to fetch all 345 URLs.
>>
>> I have also tried Kevin's solution.
>>
>> Please help me; I am new to nutch.
>> 
>> Thanks.
>> 
>> Regards,
>> Chetan Patel
>> 
>> 
>> 
> 
> 



RE: Unable to crawl all links

Posted by Edward Quick <ed...@hotmail.com>.
Chetan, if you haven't already done this, check your crawl-urlfilter.txt (or regex-urlfilter.txt if you're running nutch fetch) and comment out the line below:

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]
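
(To be clear, that means: if your file currently contains the active line

-[?*!@=]

change it to the commented form shown above, so that URLs containing ?, =, and the other listed characters are no longer skipped.)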

Ed.


> Date: Fri, 26 Sep 2008 23:18:29 -0700
> From: chetan@webmail.aruhat.com
> To: nutch-user@lucene.apache.org
> Subject: Re: Unable to crawl all links
> 
> 
> Hi All,
> 
> I have read the db file and it displays the result below.
> ==========================================
> $ bin/nutch readdb mytest/crawldb -stats
> CrawlDb statistics start: mytest/crawldb
> Statistics for CrawlDb: mytest/crawldb
> TOTAL urls:     345
> retry 0:        345
> min score:      0.0
> avg score:      0.028
> max score:      1.055
> status 1 (db_unfetched):        285
> status 2 (db_fetched):  48
> status 3 (db_gone):     5
> status 4 (db_redir_temp):       4
> status 5 (db_redir_perm):       3
> CrawlDb statistics: done
> ==========================================
> 
> You can see there are 345 URLs in total. Of those, nutch has fetched only 48.
> I need to fetch all 345 URLs.
>
> I have also tried Kevin's solution.
>
> Please help me; I am new to nutch.
> 
> Thanks.
> 
> Regards,
> Chetan Patel
> 
> 
> 


Re: Unable to crawl all links

Posted by Chetan Patel <ch...@webmail.aruhat.com>.
Hi All,

I have read the db file and it displays the result below.
==========================================
$ bin/nutch readdb mytest/crawldb -stats
CrawlDb statistics start: mytest/crawldb
Statistics for CrawlDb: mytest/crawldb
TOTAL urls:     345
retry 0:        345
min score:      0.0
avg score:      0.028
max score:      1.055
status 1 (db_unfetched):        285
status 2 (db_fetched):  48
status 3 (db_gone):     5
status 4 (db_redir_temp):       4
status 5 (db_redir_perm):       3
CrawlDb statistics: done
==========================================

You can see there are 345 URLs in total. Of those, nutch has fetched only 48.
I need to fetch all 345 URLs.

I have also tried Kevin's solution.

Please help me; I am new to nutch.

Thanks.

Regards,
Chetan Patel




Re: Unable to crawl all links

Posted by Kevin MacDonald <ke...@hautesecure.com>.
Dig into the code. Look at Fetcher.run() and Fetcher.handleRedirect(). Put
extra logging lines around the filters and normalizers to see if your urls
are showing up but being either removed or altered. You can also disable all
Normalizers by modifying the 'plugin.includes' property. Copy the property
from nutch-default.xml to nutch-site.xml and remove the normalizers.
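
For example, if your nutch-default.xml has the usual 0.9-style default (yours may differ, so start from whatever value is actually in your file), the overridden property in nutch-site.xml with the urlnormalizer entries dropped would look something like:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
</property>

The stock value normally also ends with |urlnormalizer-(pass|regex|basic); leaving that part out is what disables the normalizers.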

On Fri, Sep 26, 2008 at 6:16 AM, Chetan Patel <ch...@webmail.aruhat.com> wrote:

>
> Hi Vishal,
>
> I have the same problem as Amitabha.
>
> I have done as per your instructions, but nutch still does not crawl all URLs.
>
> Please help me.
>
> Thanks in advance.
>
> Regards,
> Chetan Patel
>
>
> vishal vachhani wrote:
> >
> > Hi Amitabha,
> > Look at nutch-default.xml. Change the following property in order to crawl the whole site.
> >
> > db.ignore.internal.links: by default it is true; it should be false in your case.
> >
> > --Vishal
> >
> >
> > On Thu, Sep 11, 2008 at 8:59 AM, Amitabha Banerjee
> > <hi...@gmail.com> wrote:
> >
> >> Hi folks,
> >> I am unable to crawl all the links in my website. For some reason, only
> >> one
> >> or two links are picked up by nutch.
> >>
> >> Here is the website I am trying to index:
> >> http://www.knowmydestination.com
> >>
> >> All links on this website are internal.
> >>
> >> My crawl-urlfilter does not block any internal links. It looks as follows:
> >>
> >> # accept hosts in MY.DOMAIN.NAME
> >> +^http://www.knowmydestination.com/
> >>
> >> # skip everything else
> >> -.
> >>
> >> My urls are:  http://www.knowmydestination.com/
> >>
> >> When I run:
> >> bin/nutch crawl urls -dir crawl.kmd -depth 3 -topN 100
> >>
> >> nutch only crawls one link:
> >> http://www.knowmydestination.com/articles/cheapfares.html
> >>
> >> Can anyone help me figure this out?
> >>
> >> /Amitab
> >>
> >
> >
> >
> > --
> > Thanks and Regards,
> > Vishal Vachhani
> > M.tech, CSE dept
> > Indian Institute of Technology, Bombay
> > http://www.cse.iitb.ac.in/~vishalv
> >
> >
>
>
>

Re: Unable to crawl all links

Posted by Chetan Patel <ch...@webmail.aruhat.com>.
Hi Vishal,

I have the same problem as Amitabha.

I have done as per your instructions, but nutch still does not crawl all URLs.

Please help me.

Thanks in advance.

Regards,
Chetan Patel


vishal vachhani wrote:
> 
> Hi Amitabha,
> Look at nutch-default.xml. Change the following property in order to crawl the whole site.
>
> db.ignore.internal.links: by default it is true; it should be false in your case.
> 
> --Vishal
> 
> 
> On Thu, Sep 11, 2008 at 8:59 AM, Amitabha Banerjee
> <hi...@gmail.com> wrote:
> 
>> Hi folks,
>> I am unable to crawl all the links in my website. For some reason, only
>> one
>> or two links are picked up by nutch.
>>
>> Here is the website I am trying to index:
>> http://www.knowmydestination.com
>>
>> All links on this website are internal.
>>
>> My crawl-urlfilter does not block any internal links. It looks as follows:
>>
>> # accept hosts in MY.DOMAIN.NAME
>> +^http://www.knowmydestination.com/
>>
>> # skip everything else
>> -.
>>
>> My urls are:  http://www.knowmydestination.com/
>>
>> When I run:
>> bin/nutch crawl urls -dir crawl.kmd -depth 3 -topN 100
>>
>> nutch only crawls one link:
>> http://www.knowmydestination.com/articles/cheapfares.html
>>
>> Can anyone help me figure this out?
>>
>> /Amitab
>>
> 
> 
> 
> -- 
> Thanks and Regards,
> Vishal Vachhani
> M.tech, CSE dept
> Indian Institute of Technology, Bombay
> http://www.cse.iitb.ac.in/~vishalv
> 
> 



Re: Unable to crawl all links

Posted by vishal vachhani <vi...@gmail.com>.
Hi Amitabha,
Look at nutch-default.xml. Change the following property in order to crawl the whole site.

db.ignore.internal.links: by default it is true; it should be false in your case.

--Vishal


On Thu, Sep 11, 2008 at 8:59 AM, Amitabha Banerjee <hi...@gmail.com> wrote:

> Hi folks,
> I am unable to crawl all the links in my website. For some reason, only one
> or two links are picked up by nutch.
>
> Here is the website I am trying to index: http://www.knowmydestination.com
>
> All links on this website are internal.
>
> My crawl-urlfilter does not block any internal links. It looks as follows:
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://www.knowmydestination.com/
>
> # skip everything else
> -.
>
> My urls are:  http://www.knowmydestination.com/
>
> When I run:
> bin/nutch crawl urls -dir crawl.kmd -depth 3 -topN 100
>
> nutch only crawls one link:
> http://www.knowmydestination.com/articles/cheapfares.html
>
> Can anyone help me figure this out?
>
> /Amitab
>



-- 
Thanks and Regards,
Vishal Vachhani
M.tech, CSE dept
Indian Institute of Technology, Bombay
http://www.cse.iitb.ac.in/~vishalv

Re: Unable to crawl all links

Posted by Kevin MacDonald <ke...@hautesecure.com>.
Your crawl-urlfilter is too specific. It is a regular expression that needs
to match on every url you want to hit. Try

+^http://([a-z0-9]*\.)*\S*

That will allow most any url.
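
If you would rather stay on the one site while still matching every internal page, a more targeted filter (an untested sketch, using the poster's host) would be:

+^http://([a-z0-9-]+\.)*knowmydestination\.com/
-.

i.e. escape the dots, allow any subdomain of the host, and reject everything else.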

Kevin

On Wed, Sep 10, 2008 at 8:29 PM, Amitabha Banerjee <hi...@gmail.com> wrote:

> Hi folks,
> I am unable to crawl all the links in my website. For some reason, only one
> or two links are picked up by nutch.
>
> Here is the website I am trying to index: http://www.knowmydestination.com
>
> All links on this website are internal.
>
> My crawl-urlfilter does not block any internal links. It looks as follows:
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://www.knowmydestination.com/
>
> # skip everything else
> -.
>
> My urls are:  http://www.knowmydestination.com/
>
> When I run:
> bin/nutch crawl urls -dir crawl.kmd -depth 3 -topN 100
>
> nutch only crawls one link:
> http://www.knowmydestination.com/articles/cheapfares.html
>
> Can anyone help me figure this out?
>
> /Amitab
>