Posted to user@nutch.apache.org by Sol Lederman <so...@gmail.com> on 2017/11/15 19:22:59 UTC
Why do I only get 28 records when I crawl the tutorial example of nutch.apache.org?
If I google for site:nutch.apache.org I get ~12,500 results, but when I crawl
the site with Nutch I get only 28 records in the Solr index.
Here's the relevant piece of my regex-urlfilter.txt file. It's just the
default that comes with nutch.
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
# +.
+^http://([a-z0-9]*\.)*nutch.apache.org/
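To see which rule fires for a given URL, here is a small Python sketch. It mimics my understanding of Nutch's RegexURLFilter semantics (rules tried top to bottom, the first pattern found anywhere in the URL decides, and a URL no rule matches is rejected), but it is not Nutch's actual code:

```python
import re

# The rules above, as (sign, pattern) pairs, in order.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|"
          r"wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|"
          r"tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$"),
    ("-", r"[?*!@=]"),
    ("-", r".*(/[^/]+)/[^/]+\1/[^/]+\1/"),
    ("+", r"^http://([a-z0-9]*\.)*nutch.apache.org/"),
]

def accepts(url):
    # First rule whose pattern is found in the URL decides; "+" accepts,
    # "-" rejects. If nothing matches, the URL is rejected.
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == "+"
    return False

print(accepts("http://nutch.apache.org/javadoc.html"))             # True
print(accepts("https://nutch.apache.org/miredot/1.12/index.html")) # False: https never matches +^http://
```

Note how the https URL falls through every rule, including the final accept rule, and is therefore dropped.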
I'm sure I can find a number of pages that should be crawled but aren't.
Here's one example.
https://nutch.apache.org/javadoc.html has links to a number of
apidocs pages that are picked up by Nutch. But this page,
https://nutch.apache.org/miredot/1.12/index.html, is not picked up. It's
referenced like this:
<li><a href="./miredot/1.12/index.html">1.13 (1.X branch)</a></li>
I wouldn't expect relative links to be a problem, since other relative
links are handled fine. And I did click on that link; it doesn't stray
from nutch.apache.org.
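For reference, a relative href resolves against the URL of the page it was found on, so if javadoc.html was fetched over https, the outlink is an https URL. Plain Python shows the resolution (nothing Nutch-specific here):

```python
from urllib.parse import urljoin

# Resolve the relative href against the page it appears on.
base = "https://nutch.apache.org/javadoc.html"
href = "./miredot/1.12/index.html"
print(urljoin(base, href))  # https://nutch.apache.org/miredot/1.12/index.html
```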
I thought the problem might have to do with http vs. https, so I changed
the last line of the filter to this:
+^(http|https)://([a-z0-9]*\.)*nutch.apache.org/
When I did that, the /miredot/ URL was fetched and parsed, but the
URLs indexed into Solr were the same as before, https included.
What am I missing?
Thanks.
Sol
Re: Why do I only get 28 records when I crawl the tutorial example of nutch.apache.org?
Posted by Sol Lederman <so...@gmail.com>.
Thanks. Including https didn't make a difference. Anyway, I've moved on to
other sites where I am getting lots more hits.
Sol
Re: Why do I only get 28 records when I crawl the tutorial example of nutch.apache.org?
Posted by Sebastian Nagel <wa...@googlemail.com>.
... and the miredot URL is indexed, but as an http:// URL:
{
  "responseHeader":{
    "status":0,
    "QTime":4,
    "params":{
      "q":"url:\"http://nutch.apache.org/miredot/1.12/index.html\"",
      "indent":"on",
      "wt":"json",
      "_":"1510851832903"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "date":"2016-06-18T17:54:39Z",
        "type":["text/html",
          "text",
          "html"],
        "url":"http://nutch.apache.org/miredot/1.12/index.html",
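The response above comes from Solr's select endpoint; a sketch of building the same query (the host and the "nutch" core name are assumptions, substitute your own Solr URL):

```python
from urllib.parse import urlencode

# Build the select query shown above. localhost:8983 and the core name
# "nutch" are assumptions; adjust to your install.
params = {
    "q": 'url:"http://nutch.apache.org/miredot/1.12/index.html"',
    "wt": "json",
    "indent": "on",
}
query = "http://localhost:8983/solr/nutch/select?" + urlencode(params)
print(query)
```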
But note: I've tried this with the recent master branch.
On 11/16/2017 06:28 PM, Sebastian Nagel wrote:
> Hi Sol,
>
>> +^(http|https)://([a-z0-9]*\.)*nutch.apache.org/
>
> Thanks, I've updated the wiki patch to include https as well.
>
>
> How many cycles did you run the crawl? I got 28 pages after 3 cycles
> starting from http://nutch.apache.org/ ...
>
> Best,
> Sebastian
Re: Why do I only get 28 records when I crawl the tutorial example of nutch.apache.org?
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Sol,
> +^(http|https)://([a-z0-9]*\.)*nutch.apache.org/
Thanks, I've updated the wiki patch to include https as well.
How many cycles did you run the crawl? I got 28 pages after 3 cycles
starting from http://nutch.apache.org/ ...
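For anyone following along, the number of rounds is the last argument to the bin/crawl script. A hypothetical invocation running 6 rounds instead of 3 (flag names follow Nutch 1.x usage, so verify against your version; the Solr URL and directory names are assumptions):

```python
# Assemble a bin/crawl command line. All paths, the Solr URL, and the
# round count here are illustrative assumptions, not values from the thread.
cmd = [
    "bin/crawl",
    "-i",                                                   # index each round
    "-D", "solr.server.url=http://localhost:8983/solr/nutch",
    "-s", "urls/",                                          # seed directory
    "crawl/",                                               # crawl directory
    "6",                                                    # number of rounds
]
print(" ".join(cmd))
# e.g. subprocess.run(cmd, check=True) to actually launch the crawl
```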
Best,
Sebastian