Posted to user@nutch.apache.org by ahammad <ah...@gmail.com> on 2009/01/12 18:03:52 UTC

Crawler not fetching all the links

I just started using Nutch to crawl an intranet site. In my urls file, I have
a single link that refers to a jhtml page, which contains roughly 2000 links
in it. The links contain characters like '?' and '=', so I removed the
following from the crawl-urlfilter.txt file:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
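
For reference, since the filters in crawl-urlfilter.txt are applied in order and the first matching pattern wins, another option would be to keep the skip rule and add an accept rule for the intranet host above it (the host name below is only a placeholder):

# accept everything from the intranet host, query strings included
+^http://intranet.example.com/
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]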

I finally got the crawl to work, but I only see 111 results under "TOTAL
urls:" when I run the following command:

bin/nutch readdb crawlTest/crawldb -stats 

I'm not sure where to look at this point. Any ideas?

BTW, what's the command that dumps all the links? None of the ones I found
online seem to work...

Cheers
-- 
View this message in context: http://www.nabble.com/Crawler-not-fetching-all-the-links-tp21418679p21418679.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Crawler not fetching all the links

Posted by ahammad <ah...@gmail.com>.
Hello,

The links are all the same format, they are not redirects. Is there
something significant I need to know about redirects other than the
http.redirect.max property?
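
From what I can tell, it defaults to 0 in nutch-default.xml, meaning the fetcher
records redirected URLs for a later round rather than following them immediately;
raising it would be a nutch-site.xml override along these lines (the value here is
just an example):

<property>
  <name>http.redirect.max</name>
  <value>3</value>
</property>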

In any case, I figured out the issue. As Eric suggested, it was the
file.content.limit property. I increased the value a hundredfold and it
fetched every link. Thank you all for your advice.
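
For anyone hitting the same limit, the override belongs in nutch-site.xml; a
sketch, assuming a hundredfold increase over the 65536-byte default (adjust the
value to the size of your page):

<property>
  <name>file.content.limit</name>
  <value>6553600</value>
</property>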

Cheers



Doğacan Güney-3 wrote:
> 
> On Wed, Jan 14, 2009 at 8:44 PM, ahammad <ah...@gmail.com> wrote:
>>
>> Hello,
>>
>> I'm still unable to find why Nutch is unable to fetch and index all the
>> links that are on the page. To recap, the Nutch urls file contains a link
>> to
>> a jhtml file that contains roughly 2000 links, all hosted on the same
>> server
>> in the same folder.
>>
>> Previously, I only got 111 links when I crawled. This was due to this:
>>
>> <property>
>>  <name>db.max.outlinks.per.page</name>
>>  <value>100</value>
>>  <description>The maximum number of outlinks that we'll process for a
>> page.
>>  If this value is nonnegative (>=0), at most db.max.outlinks.per.page
>> outlinks
>>  will be processed for a page; otherwise, all outlinks will be processed.
>>  </description>
>> </property>
>>
>> I changed the value to 2000, but I only got back 719 results. I also
>> tried
>> to make the value -1, and I still get 719 results.
>>
>> What other settings can affect this? I've been trying to tweak
>> nutch-default.xml, but I couldn't improve the number of results. Any help
>> with this would be appreciated.
>>
> 
> What do the urls that are not fetched look like? Are they redirects?
> 
>> Thank you.
>>
>> Cheers
>>
>>
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Crawler-not-fetching-all-the-links-tp21418679p21462474.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>>
> 
> 
> 
> -- 
> Doğacan Güney
> 
> 

-- 
View this message in context: http://www.nabble.com/Crawler-not-fetching-all-the-links-tp21418679p21482360.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Crawler not fetching all the links

Posted by Doğacan Güney <do...@gmail.com>.
On Wed, Jan 14, 2009 at 8:44 PM, ahammad <ah...@gmail.com> wrote:
>
> Hello,
>
> I'm still unable to find why Nutch is unable to fetch and index all the
> links that are on the page. To recap, the Nutch urls file contains a link to
> a jhtml file that contains roughly 2000 links, all hosted on the same server
> in the same folder.
>
> Previously, I only got 111 links when I crawled. This was due to this:
>
> <property>
>  <name>db.max.outlinks.per.page</name>
>  <value>100</value>
>  <description>The maximum number of outlinks that we'll process for a page.
>  If this value is nonnegative (>=0), at most db.max.outlinks.per.page
> outlinks
>  will be processed for a page; otherwise, all outlinks will be processed.
>  </description>
> </property>
>
> I changed the value to 2000, but I only got back 719 results. I also tried
> to make the value -1, and I still get 719 results.
>
> What other settings can affect this? I've been trying to tweak
> nutch-default.xml, but I couldn't improve the number of results. Any help
> with this would be appreciated.
>

What do the urls that are not fetched look like? Are they redirects?

> Thank you.
>
> Cheers
>
>
>
> --
> View this message in context: http://www.nabble.com/Crawler-not-fetching-all-the-links-tp21418679p21462474.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>



-- 
Doğacan Güney

Re: Crawler not fetching all the links

Posted by "Eric J. Christeson" <Er...@ndsu.edu>.
On Jan 14, 2009, at 12:44 PM, ahammad wrote:

>
> Hello,
>
> I'm still unable to find why Nutch is unable to fetch and index all the
> links that are on the page. To recap, the Nutch urls file contains a link to
> a jhtml file that contains roughly 2000 links, all hosted on the same server
> in the same folder.
>
> Previously, I only got 111 links when I crawled. This was due to this:
>
> <property>
>   <name>db.max.outlinks.per.page</name>
>   <value>100</value>
>   <description>The maximum number of outlinks that we'll process for a page.
>   If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
>   will be processed for a page; otherwise, all outlinks will be processed.
>   </description>
> </property>

You may also want to change this one:

<property>
   <name>file.content.limit</name>
   <value>65536</value>
   <description>The length limit for downloaded content, in bytes.
   If this value is nonnegative (>=0), content longer than it will be  
truncated;
   otherwise, no truncation at all.
   </description>
</property>
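
If the page is being fetched over HTTP rather than from the local filesystem, the
analogous http.content.limit property (same 65536-byte default) is probably the
one that actually applies; a possible nutch-site.xml override, with the value
picked only as an example:

<property>
   <name>http.content.limit</name>
   <value>6553600</value>
</property>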

Eric
--
Eric J. Christeson                        <Er...@ndsu.edu>
Enterprise Computing and Infrastructure    (701) 231-8693 (Voice)
North Dakota State University

nutch setup

Posted by Alex Basa <al...@yahoo.com>.
I have 6 blade servers (2-socket, quad-core Intel running at 3.0 GHz, with 32 GB RAM) set up on a SAN for crawling.  Since they all use the same SAN, what is the most efficient way to set Nutch up?  Should I be using a Hadoop cluster?

I plan to incrementally build the indexes.  If anyone has some docs or can point me to some, I'd appreciate it.

Thanks in advance,

Alex




Re: Crawler not fetching all the links

Posted by ahammad <ah...@gmail.com>.
Hello,

I'm still unable to figure out why Nutch won't fetch and index all the
links on the page. To recap, the Nutch urls file contains a link to a jhtml
file that contains roughly 2000 links, all hosted on the same server in the
same folder.

Previously, I only got 111 links when I crawled. This was due to this
property:

<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page
outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>

I changed the value to 2000, but I only got back 719 results. I also tried
setting the value to -1, and I still got 719 results.

What other settings can affect this? I've been trying to tweak
nutch-default.xml, but I couldn't improve the number of results. Any help
with this would be appreciated.

Thank you.

Cheers


ahammad wrote:
> 
> I just started using Nutch to crawl an intranet site. In my urls file, I
> have a single link that refers to a jhtml page, which contains roughly
> 2000 links in it. The links contain characters like '?' and '=', so I
> removed the following from the crawl-urlfilter.txt file:
> 
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
> 
> I finally got the crawl to work, but I only see 111 results under "TOTAL
> urls:" when I run the following command:
> 
> bin/nutch readdb crawlTest/crawldb -stats 
> 
> I'm not sure where to look at this point. Any ideas?
> 
> BTW what's the command that dumps all the links? Every one that I found
> online doesn't work...
> 
> Cheers
> 

-- 
View this message in context: http://www.nabble.com/Crawler-not-fetching-all-the-links-tp21418679p21462474.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Crawler not fetching all the links

Posted by ahammad <ah...@gmail.com>.


Doğacan Güney-3 wrote:
> 
> Hi,
> 
> Nutch only considers the first 100 links from a page by default. You can
> change this with this option:
> 
> <property>
>   <name>db.max.outlinks.per.page</name>
>   <value>100</value>
>   <description>The maximum number of outlinks that we'll process for a
> page.
>   If this value is nonnegative (>=0), at most db.max.outlinks.per.page
> outlinks
>   will be processed for a page; otherwise, all outlinks will be processed.
>   </description>
> </property>
> 
> You can dump inverted links with command "readlinkdb". To see all links
> from a page you can do:
> 
> bin/nutch readseg -get <segment> <url> -nocontent -nofetch -noparse
> -nogenerate -noparsetext
> 
>> Cheers
>> --
>> View this message in context:
>> http://www.nabble.com/Crawler-not-fetching-all-the-links-tp21418679p21418679.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
> 
> -- 
> Doğacan Güney
> 
> 



Thank you for pointing me in the right direction. I changed the value from
100 to 2000. Now I get 719 results. It certainly is an improvement, but it
is still a lot less than the actual number of links on the jhtml page.

What other settings can affect this (i.e. file size, etc.)? Would you have any
suggestions?

Thank you very much for your time.

Cheers
-- 
View this message in context: http://www.nabble.com/Crawler-not-fetching-all-the-links-tp21418679p21420769.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Crawler not fetching all the links

Posted by Doğacan Güney <do...@gmail.com>.
Hi,

On Mon, Jan 12, 2009 at 7:03 PM, ahammad <ah...@gmail.com> wrote:
>
> I just started using Nutch to crawl an intranet site. In my urls file, I have
> a single link that refers to a jhtml page, which contains roughly 2000 links
> in it. The links contain characters like '?' and '=', so I removed the
> following from the crawl-urlfilter.txt file:
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> I finally got the crawl to work, but I only see 111 results under "TOTAL
> urls:" when I run the following command:
>
> bin/nutch readdb crawlTest/crawldb -stats
>
> I'm not sure where to look at this point. Any ideas?
>
> BTW what's the command that dumps all the links? Every one that I found
> online doesn't work...
>

Nutch only considers the first 100 links from a page by default. You can
change this with this option:

<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>
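
If you want no limit at all, a nutch-site.xml override with a negative value
should do it (a sketch; per the description above, a negative value means all
outlinks are processed):

<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
</property>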

You can dump inverted links with the "readlinkdb" command. To see all the
links from a page you can do:

bin/nutch readseg -get <segment> <url> -nocontent -nofetch -noparse
-nogenerate -noparsetext
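
To dump the whole link database instead, something like this should work (paths
follow the crawlTest example from earlier in the thread):

bin/nutch readlinkdb crawlTest/linkdb -dump linkdump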

> Cheers
> --
> View this message in context: http://www.nabble.com/Crawler-not-fetching-all-the-links-tp21418679p21418679.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>



-- 
Doğacan Güney