Posted to user@nutch.apache.org by spamsucks <sp...@rhoderunner.com> on 2006/12/06 16:05:01 UTC

page1 is crawled, but not pages in page1

My subject is a pretty good summary.  I see the first "details.pa?id=123" in 
my results, but can't search or find any "details.pa?id=456" links that are 
in that 1st page that was a hit.

Backgrounder:
I have a site that includes a lot of dynamic pages.  I edited the 
crawl-urlfilter.txt and added the following regex and did
a crawl (bin/nutch crawl urls -dir crawl -depth 30 -topN 30000):

+^http://([a-z0-9]*\.)*www.visitpa.com/visitpa/details.pa\?id=

Now the search will return hits on the dynamic details page.  For example,
here is a search that returns hits on my dynamic pages.
http://prhodes.r-effects.com/nutch/search.jsp?query=sunnyledge&hitsPerPage=10&lang=en

If you look at the details.pa page that nutch had a hit on, it contains 
several links of the same format ( details.pa )
My problem is that these other detail links are not being crawled/indexed.

I set the depth to "30" so that should not be a limiting factor.  I also set 
a "topN" of 30000, because we have around 16K details.pa pages

Any clues on how to proceed and figure out what I need to do to get Nutch to 
crawl these missing "details.pa" links?

Re: page1 is crawled, but not pages in page1

Posted by Nitin Borwankar <ni...@borwankar.com>.

Hi Philip,

You have www.visitpa.com in your crawl-url-filter regexp.
If some of your other pages have <something else>.visitpa.com as host 
name they will be filtered out.
You may want to have just (....)visitpa.com in the regexp in that case.
Just a thought.
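
The host-name point can be checked outside Nutch with a short sketch. It assumes the filter applies the rule with "find" semantics, and it uses Python's re module as a stand-in for the Java regexes Nutch uses (the two behave the same for this pattern). The LOOSE pattern and the two non-www URLs are hypothetical, made up for illustration only.

```python
import re

# Rule copied from the crawl-urlfilter.txt line in the original post,
# without the leading "+" that marks it as an accept rule.
RULE = r"^http://([a-z0-9]*\.)*www.visitpa.com/visitpa/details.pa\?id="

# Hypothetical looser variant: no literal "www." and the dots are escaped.
LOOSE = r"^http://([a-z0-9]*\.)*visitpa\.com/visitpa/details\.pa\?id="


def accepted(pattern, url):
    """Return True if the pattern is found in the URL (assumed find semantics)."""
    return re.search(pattern, url) is not None


urls = [
    "http://www.visitpa.com/visitpa/details.pa?id=65851",   # taken from the crawl log
    "http://visitpa.com/visitpa/details.pa?id=65851",       # hypothetical: no "www."
    "http://maps.visitpa.com/visitpa/details.pa?id=65851",  # hypothetical host name
]

for url in urls:
    print(accepted(RULE, url), accepted(LOOSE, url), url)
```

If the links inside the details pages use a host other than www.visitpa.com, the original rule would reject them while a looser one would not. Whether that is the case here has to be checked against the actual page source.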

Nitin Borwankar
http://tagschema.com

spamsucks wrote:

> Hi Yoni,
>
> That was a good thought, however, according to the logging output of 
> the crawl, I see the following...
>
> fetching http://www.visitpa.com/visitpa/details.pa?id=65851
> fetching http://www.visitpa.com/visitpa/details.pa?id=246139
> fetching http://www.visitpa.com/visitpa/details.pa?id=8427
>
> There are at least 100+ of these (too many to count) so it appears 
> that nutch is fetching these url's although the url is not unique 
> without the query string.
>
> Building upon your thought, perhaps the other "details.pa" pages are 
> coming from other pages being indexed, and only one "details.pa" page 
> is being used in the sense of a crawl.  That could be what is 
> happening here and your point is correct.
>
> I appreciate your response!
> Phillip
>
>
>
> ----- Original Message ----- From: "Yoni Amir" <yo...@targetize.com>
> To: <nu...@lucene.apache.org>
> Sent: Wednesday, December 06, 2006 10:47 AM
> Subject: Re: page1 is crawled, but not pages in page1
>
>
> I think that in crawldb and linkdb, the actual url, without the
> query string, serves as primary key (i.e. a url is determined as unique
> just by looking at the url, without the query string). Thus, after your
> first page is fetched, and you run updatedb, nutch doesn't think that it
> needs to fetch it again because it already sees an entry for it in the
> database.
>
> I am also new to nutch, so I don't know if there is a solution to your
> problem.
>
> Yoni
>
> On Wed, 2006-12-06 at 10:05 -0500, spamsucks wrote:
>
>> My subject is a pretty good summary.  I see the first 
>> "details.pa?id=123" in
>> my results, but can't search or find any "details.pa?id=456" links 
>> that are
>> in that 1st page that was a hit.
>>
>> Backgrounder:
>> I have a site that includes a lot of dynamic pages.  I edited the
>> crawl-urlfilter.txt and added the following regex and did
>> a crawl (bin/nutch crawl urls -dir crawl -depth 30 -topN 30000):
>>
>> +^http://([a-z0-9]*\.)*www.visitpa.com/visitpa/details.pa\?id=
>>
>> Now the search will return hits on the dynamic details page.  For 
>> example,
>> here is a search that returns hits on my dynamic pages.
>> http://prhodes.r-effects.com/nutch/search.jsp?query=sunnyledge&hitsPerPage=10&lang=en
>>
>>
>> If you look at the details.pa page that nutch had a hit on, it contains
>> several links of the same format ( details.pa )
>> My problem is that these other detail links are not being 
>> crawled/indexed.
>>
>> I set the depth to "30" so that should not be a limiting factor.  I 
>> also set
>> a "topN" of 30000, because we have around 16K details.pa pages
>>
>> Any clues on how to proceed and figure out what I need to do to get 
>> Nutch to
>> crawl these missing "details.pa" links

Re: page1 is crawled, but not pages in page1

Posted by spamsucks <sp...@rhoderunner.com>.

Hi Yoni,

That was a good thought, however, according to the logging output of the 
crawl, I see the following...

fetching http://www.visitpa.com/visitpa/details.pa?id=65851
fetching http://www.visitpa.com/visitpa/details.pa?id=246139
fetching http://www.visitpa.com/visitpa/details.pa?id=8427

There are at least 100+ of these (too many to count) so it appears that 
nutch is fetching these url's although the url is not unique without the 
query string.
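
Rather than counting by eye, the "fetching" lines can be tallied with a small script. This is only a sketch: the three sample lines are the ones quoted above, the function name is made up, and in practice the lines would be read from wherever the crawl output was saved.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Sample lines copied from the crawl output quoted above.
SAMPLE = [
    "fetching http://www.visitpa.com/visitpa/details.pa?id=65851",
    "fetching http://www.visitpa.com/visitpa/details.pa?id=246139",
    "fetching http://www.visitpa.com/visitpa/details.pa?id=8427",
]


def distinct_detail_ids(log_lines):
    """Collect the distinct id values of details.pa URLs seen in 'fetching' lines."""
    ids = set()
    for line in log_lines:
        match = re.match(r"fetching (\S+)", line.strip())
        if match is None:
            continue  # not a fetch line
        parts = urlsplit(match.group(1))
        if parts.path.endswith("/details.pa"):
            ids.update(parse_qs(parts.query).get("id", []))
    return ids


print(len(distinct_detail_ids(SAMPLE)))
```

Comparing that count with the roughly 16K details.pa pages expected would show how much of the site the crawl actually reached.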

Building upon your thought, perhaps the other "details.pa" pages are coming 
from other pages being indexed, and only one "details.pa" page is being used 
in the sense of a crawl.  That could be what is happening here and your 
point is correct.

I appreciate your response!
Phillip

----- Original Message ----- 
From: "Yoni Amir" <yo...@targetize.com>
To: <nu...@lucene.apache.org>
Sent: Wednesday, December 06, 2006 10:47 AM
Subject: Re: page1 is crawled, but not pages in page1

I think that in crawldb and linkdb, the actual url, without the
query string, serves as primary key (i.e. a url is determined as unique
just by looking at the url, without the query string). Thus, after your
first page is fetched, and you run updatedb, nutch doesn't think that it
needs to fetch it again because it already sees an entry for it in the
database.

I am also new to nutch, so I don't know if there is a solution to your
problem.

Yoni

On Wed, 2006-12-06 at 10:05 -0500, spamsucks wrote:
> My subject is a pretty good summary.  I see the first "details.pa?id=123" 
> in
> my results, but can't search or find any "details.pa?id=456" links that 
> are
> in that 1st page that was a hit.
>
> Backgrounder:
> I have a site that includes a lot of dynamic pages.  I edited the
> crawl-urlfilter.txt and added the following regex and did
> a crawl (bin/nutch crawl urls -dir crawl -depth 30 -topN 30000):
>
> +^http://([a-z0-9]*\.)*www.visitpa.com/visitpa/details.pa\?id=
>
> Now the search will return hits on the dynamic details page.  For example,
> here is a search that returns hits on my dynamic pages.
> http://prhodes.r-effects.com/nutch/search.jsp?query=sunnyledge&hitsPerPage=10&lang=en
>
> If you look at the details.pa page that nutch had a hit on, it contains
> several links of the same format ( details.pa )
> My problem is that these other detail links are not being crawled/indexed.
>
> I set the depth to "30" so that should not be a limiting factor.  I also 
> set
> a "topN" of 30000, because we have around 16K details.pa pages
>
> Any clues on how to proceed and figure out what I need to do to get Nutch 
> to
> crawl these missing "details.pa" links

Re: page1 is crawled, but not pages in page1

Posted by Yoni Amir <yo...@targetize.com>.

I think that in crawldb and linkdb, the actual url, without the
query string, serves as primary key (i.e. a url is determined as unique
just by looking at the url, without the query string). Thus, after your
first page is fetched, and you run updatedb, nutch doesn't think that it
needs to fetch it again because it already sees an entry for it in the
database.
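
To make the idea concrete, here is a sketch of what keying on the URL without its query string would mean. It does not show how Nutch actually keys crawldb; it only illustrates the hypothesis. The strip_query helper is made up for the example, and the URLs follow the details.pa?id= pattern from the original post.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Return the URL with its query string and fragment removed."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


urls = [
    "http://www.visitpa.com/visitpa/details.pa?id=65851",
    "http://www.visitpa.com/visitpa/details.pa?id=246139",
    "http://www.visitpa.com/visitpa/details.pa?id=8427",
]

# Keyed on the full URL, each page is a separate entry.
by_full = set(urls)

# Keyed on the URL without the query string, all pages collapse into one entry.
by_stripped = {strip_query(u) for u in urls}

print(len(by_full), len(by_stripped))  # 3 1
```

If the database really worked the second way, only one details.pa page would ever be fetched, so the crawl log is the place to confirm or rule this out.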

I am also new to nutch, so I don't know if there is a solution to your
problem.

Yoni

On Wed, 2006-12-06 at 10:05 -0500, spamsucks wrote:
> My subject is a pretty good summary.  I see the first "details.pa?id=123" in 
> my results, but can't search or find any "details.pa?id=456" links that are 
> in that 1st page that was a hit.
> 
> Backgrounder:
> I have a site that includes a lot of dynamic pages.  I edited the 
> crawl-urlfilter.txt and added the following regex and did
> a crawl (bin/nutch crawl urls -dir crawl -depth 30 -topN 30000):
> 
> +^http://([a-z0-9]*\.)*www.visitpa.com/visitpa/details.pa\?id=
> 
> Now the search will return hits on the dynamic details page.  For example,
> here is a search that returns hits on my dynamic pages.
> http://prhodes.r-effects.com/nutch/search.jsp?query=sunnyledge&hitsPerPage=10&lang=en
> 
> If you look at the details.pa page that nutch had a hit on, it contains 
> several links of the same format ( details.pa )
> My problem is that these other detail links are not being crawled/indexed.
> 
> I set the depth to "30" so that should not be a limiting factor.  I also set 
> a "topN" of 30000, because we have around 16K details.pa pages
> 
> Any clues on how to proceed and figure out what I need to do to get Nutch to 
> crawl these missing "details.pa" links