Posted to dev@nutch.apache.org by Siddhartha Sandhu <si...@icloud.com> on 2015/03/10 10:40:25 UTC
Filter rejecting url
Hi,
I am running the command:
root@ubuntu:/usr/lib/nutch/nutch/runtime/local/bin# ./nutch inject ../../../urls/
InjectorJob: starting at 2015-03-10 02:24:40
InjectorJob: Injecting urlDir: ../../../urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 1
InjectorJob: total number of urls injected after normalization and filtering: 0
Injector: finished at 2015-03-10 02:24:48, elapsed: 00:00:08
My "../../../urls/" directory contains a text file with a single line:
http://www.yahoo.com
My regex-urlfilter.txt is:
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|js|JS)$
+\.(JPG|jpg|PNG|png|jpeg|JPEG|BMP|bmp)
# skip URLs containing certain characters as probable queries, etc.
-.*[*!@].*
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
+.*
My nutch-site.xml contains:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>Default class for storing data</description>
</property>
</configuration>
Log entry for corresponding run in nutch/runtime/local/logs/hadoop.log is:
2015-03-10 02:24:46,429 WARN snappy.LoadSnappy - Snappy native library not loaded
2015-03-10 02:24:47,884 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2015-03-10 02:24:47,900 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2015-03-10 02:24:48,949 INFO crawl.InjectorJob - InjectorJob: total number of urls rejected by filters: 1
2015-03-10 02:24:48,951 INFO crawl.InjectorJob - InjectorJob: total number of urls injected after normalization and filtering: 0
2015-03-10 02:24:48,952 INFO crawl.InjectorJob - Injector: finished at 2015-03-10 02:24:48, elapsed: 00:00:08
HBase scan at this point:
> scan 'hbase'
ROW COLUMN+CELL
0 row(s) in 0.0090 seconds
Also, I am using Ubuntu and Nutch 2.3.
I need help identifying whether I am missing some critical information in the documentation, or a pointer to where things could be going wrong.
Thank You!
Sid.
Re: Filter rejecting url
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,
The cause is clearly the URL filters. The single injected
URL does not pass them:
> InjectorJob: total number of urls rejected by filters: 1
> InjectorJob: total number of urls injected after normalization and filtering: 0
Please check which URL filters are activated via the property
plugin.includes, and check all configuration files of the active URL filters.
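The value of plugin.includes is a regular expression matched against plugin ids, so one way to see which URL filters it activates is to test candidate ids against it. A minimal Python sketch, with two assumptions: the candidate list below is not exhaustive, and the example value is hypothetical (substitute the value from your own nutch-site.xml or nutch-default.xml). Python's regex dialect only approximates Java's, which is fine for simple alternations:

```python
import re

# Candidate URL filter plugin ids to test (an assumed, non-exhaustive list).
CANDIDATES = [
    "urlfilter-regex",
    "urlfilter-suffix",
    "urlfilter-prefix",
    "urlfilter-domain",
    "urlfilter-automaton",
    "urlfilter-validator",
]


def active_url_filters(plugin_includes, candidates=CANDIDATES):
    """Return the candidate ids that fully match the plugin.includes regex."""
    pattern = re.compile(plugin_includes)
    return [pid for pid in candidates if pattern.fullmatch(pid)]


# Hypothetical value for illustration only; substitute your own.
example = "protocol-http|urlfilter-(regex|validator)|parse-(html|tika)"
print(active_url_filters(example))  # ['urlfilter-regex', 'urlfilter-validator']
```

Each id that comes back has its own configuration file that may be rejecting the URL.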
There is also a useful tool:
bin/nutch org.apache.nutch.net.URLFilterChecker
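To check the posted regex-urlfilter.txt rules offline, the sketch below approximates them in Python. It assumes first-match-wins evaluation with partial (find-style) matching, and it uses Python's regex engine rather than Java's. The function names are hypothetical, and it is an approximation, not the plugin's actual code:

```python
import re

# Rules copied from the regex-urlfilter.txt posted above (comments dropped).
RULES = r"""
-^(file|ftp|mailto):
-\.(ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|js|JS)$
+\.(JPG|jpg|PNG|png|jpeg|JPEG|BMP|bmp)
-.*[*!@].*
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
+.*
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line[0] not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def first_match(url, rules):
    """Return (accepted, pattern) for the first matching rule, else (False, None)."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept, pattern.pattern
    return False, None


rules = parse_rules(RULES)
print(first_match("http://www.yahoo.com", rules))  # (True, '.*')
```

Under these assumptions the posted rules accept http://www.yahoo.com via the final `+.*` rule. That suggests the rejection comes from another active filter, or from a different copy of the rules file than the one shown.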
Cheers,
Sebastian