Posted to user@nutch.apache.org by Sol Lederman <so...@gmail.com> on 2017/11/15 19:22:59 UTC

Why do I only get 28 records when I crawl the tutorial example of nutch.apache.org?

If I search Google for site:nutch.apache.org I get ~12,500 results. When I
crawl the site with Nutch I get only 28 records in the Solr index.

Here's the relevant piece of my regex-urlfilter.txt file. It's just the
default that comes with nutch.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
# +.
+^http://([a-z0-9]*\.)*nutch.apache.org/
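
To make the effect of these rules easier to follow, here is a minimal Python sketch of how such a filter is evaluated. It assumes the behaviour described in the header of the default regex-urlfilter.txt (the first rule whose pattern is found anywhere in the URL decides, and a URL that matches no rule is dropped). The function name accepts and the sample URLs other than the javadoc and miredot pages are made up for illustration; this is not Nutch code.

```python
import re

# Rules copied from the filter quoted above, with the wrapped suffix rule
# rejoined onto one line. Each entry is (sign, pattern).
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|"
          r"zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|"
          r"exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$"),
    ("-", r"[?*!@=]"),
    ("-", r".*(/[^/]+)/[^/]+\1/[^/]+\1/"),
    ("+", r"^http://([a-z0-9]*\.)*nutch.apache.org/"),
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule.

    Assumption: partial matching (re.search) and first-match-wins, as the
    default filter file describes. Hypothetical helper, not part of Nutch.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched, so the URL is dropped


if __name__ == "__main__":
    for url in [
        "http://nutch.apache.org/javadoc.html",
        "http://nutch.apache.org/assets/img/logo.png",  # made-up URL
        "http://nutch.apache.org/search?q=solr",         # made-up URL
        "https://nutch.apache.org/miredot/1.12/index.html",
    ]:
        print(accepts(url), url)
```

Under these assumptions the https form of a page is not rejected by any "-" rule; it simply never reaches a matching "+" rule, which has the same effect.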


I'm sure I can find a number of examples of files that should be crawled
and aren't. Here's one example.

https://nutch.apache.org/javadoc.html has links to a number of
apidocs pages that are picked up by nutch. But, this page,
https://nutch.apache.org/miredot/1.12/index.html, is not picked up. It's
referenced like this:

    <li><a href="./miredot/1.12/index.html">1.13 (1.X branch)</a></li>

I wouldn't imagine that relative links would be a problem as other relative
links are handled fine. And, I did click on that link and it doesn't stray
from nutch.apache.org.
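
As a quick check of the relative-link point, the sketch below resolves the href from the list item above against the page that contains it, using Python's standard urljoin. It only illustrates standard URL resolution and says nothing about how Nutch itself resolves outlinks.

```python
from urllib.parse import urljoin

# Page that contains the link, and the href exactly as quoted above.
base = "https://nutch.apache.org/javadoc.html"
href = "./miredot/1.12/index.html"

# The resolved URL keeps the scheme and host of the containing page.
resolved = urljoin(base, href)
print(resolved)

# The same href found on the http version of the page resolves to http.
resolved_http = urljoin("http://nutch.apache.org/javadoc.html", href)
print(resolved_http)
```

So the link stays on nutch.apache.org either way, and its scheme follows the scheme of the page it was found on.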

I thought the problem might have to do with http vs. https. So, I changed
the last line of the filter to be this:

+^(http|https)://([a-z0-9]*\.)*nutch.apache.org/


After that change the /miredot/ URL got fetched and parsed, but the URLs
indexed into Solr were the same as before I added https.
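
For reference, the difference between the two accept rules can be checked in isolation. The sketch below uses Python's re module as a stand-in for the Java regex engine Nutch uses; for patterns this simple the two should behave the same, but that is an assumption. The names HTTP_ONLY, HTTP_OR_HTTPS and matches are invented for the example.

```python
import re

# The original accept rule and the modified one, without the leading '+'.
HTTP_ONLY = r"^http://([a-z0-9]*\.)*nutch.apache.org/"
HTTP_OR_HTTPS = r"^(http|https)://([a-z0-9]*\.)*nutch.apache.org/"


def matches(pattern, url):
    """True if pattern is found in url (hypothetical helper)."""
    return re.search(pattern, url) is not None


if __name__ == "__main__":
    for scheme in ("http", "https"):
        url = scheme + "://nutch.apache.org/miredot/1.12/index.html"
        print(url)
        print("  http-only rule:    ", matches(HTTP_ONLY, url))
        print("  http-or-https rule:", matches(HTTP_OR_HTTPS, url))
```

The modified rule accepts both schemes, while the original accepts only http.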

What am I missing?

Thanks.

Sol

Re: Why do I only get 28 records when I crawl the tutorial example of nutch.apache.org?

Posted by Sol Lederman <so...@gmail.com>.
Thanks. Including https didn't make a difference. Anyway, I've moved on to
other sites where I am getting lots more hits.

Sol

Re: Why do I only get 28 records when I crawl the tutorial example of nutch.apache.org?

Posted by Sebastian Nagel <wa...@googlemail.com>.
... and the miredot URL is indexed but as http:// URL:

{
  "responseHeader":{
    "status":0,
    "QTime":4,
    "params":{
      "q":"url:\"http://nutch.apache.org/miredot/1.12/index.html\"",
      "indent":"on",
      "wt":"json",
      "_":"1510851832903"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "date":"2016-06-18T17:54:39Z",
        "type":["text/html",
          "text",
          "html"],
        "url":"http://nutch.apache.org/miredot/1.12/index.html",


But I've tried the recent master branch.
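
A lookup like the one above can be reproduced by building the select URL by hand. The sketch below only constructs the request URL; the Solr host, port and core name (localhost:8983, core "nutch") are assumptions and need to be adjusted to the local setup. The function name solr_url_query is made up for the example.

```python
from urllib.parse import urlencode

# Assumed location of the Solr core; change to match your installation.
SOLR_SELECT = "http://localhost:8983/solr/nutch/select"


def solr_url_query(page_url, base=SOLR_SELECT):
    """Build a select URL that looks up one document by its url field."""
    params = {
        "q": 'url:"{}"'.format(page_url),  # quoted so ':' and '/' are literal
        "indent": "on",
        "wt": "json",
    }
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    print(solr_url_query("http://nutch.apache.org/miredot/1.12/index.html"))
    print(solr_url_query("https://nutch.apache.org/miredot/1.12/index.html"))
```

Running both lookups against the index shows under which scheme, if any, the page was stored.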


On 11/16/2017 06:28 PM, Sebastian Nagel wrote:
> Hi Sol,
> 
>> +^(http|https)://([a-z0-9]*\.)*nutch.apache.org/
> 
> Thanks, I've updated the wiki patch to include https as well.
> 
> 
> How many cycles did you run the crawl? I got 28 pages after 3 cycles
> starting from http://nutch.apache.org/ ...
> 
> Best,
> Sebastian
> 
> 


Re: Why do I only get 28 records when I crawl the tutorial example of nutch.apache.org?

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Sol,

> +^(http|https)://([a-z0-9]*\.)*nutch.apache.org/

Thanks, I've updated the wiki patch to include https as well.


How many cycles did you run the crawl? I got 28 pages after 3 cycles
starting from http://nutch.apache.org/ ...

Best,
Sebastian
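
The question about cycles matters because each crawl cycle can only fetch URLs that were discovered in earlier cycles, so the number of cycles bounds how many links away from the seed the crawl can reach. The toy model below illustrates that idea only. The link graph is invented, and the model ignores URL filters, topN limits and everything else a real Nutch crawl does.

```python
def crawl(seeds, links, cycles):
    """Toy crawl: each cycle fetches every known but unfetched URL, then
    adds the outlinks of the pages just fetched to the set of known URLs."""
    known = set(seeds)
    fetched = set()
    for _ in range(cycles):
        batch = known - fetched           # roughly: generate
        if not batch:
            break                         # nothing left to fetch
        fetched |= batch                  # roughly: fetch
        for url in batch:                 # roughly: parse and update the db
            known.update(links.get(url, []))
    return fetched


# Hypothetical link graph: page -> outlinks.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a/1"],
    "/a/1": ["/a/1/x"],
    "/a/1/x": ["/a/1/x/deep"],
}

if __name__ == "__main__":
    for n in range(1, 7):
        print(n, "cycles:", len(crawl(["/"], LINKS, n)), "pages fetched")
```

In this model a page three links from the seed is first fetched in the fourth cycle, so a small index after few cycles does not by itself mean that URLs are being filtered out.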

