Posted to user@nutch.apache.org by Robert Irribarren <ro...@algorithms.io> on 2012/08/18 09:07:58 UTC

Nutch fetching a lot but Solr doesn't include all the fetches

I run this:
nutch inject urls
nutch generate
bin/nutch crawl urls -depth 3 -topN 100
bin/nutch solrindex http://127.0.0.1:8983/solr/ -reindex
echo Crawling completed
dir

Then I see a lot of URLs being fetched during the crawl phase.
When I run solrindex, it doesn't add all the URLs I see being fetched:

54 URLs in 5 queues
fetching http://www.tarpits.org/join-us
fetching http://www.leonisadobemuseum.org/history-leonis.asp
fetching http://az.wikipedia.org/wiki/Quercus_prinus

It doesn't add the Wikipedia page or the others.

ADDITIONAL INFO:
My regex-urlfilter.txt:
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.
#################################################################
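To sanity-check these rules, the combined filter chain can (I believe) be fed URLs on stdin via the URLFilterChecker helper class, assuming that class is present in this Nutch build:

echo "http://az.wikipedia.org/wiki/Quercus_prinus" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

It prints each URL back prefixed with + (accepted) or - (rejected).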

ADDITIONAL INFO: Running Solr 4.0 and Nutch 2.0.
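For reference, bin/nutch crawl already chains inject, generate, fetch, parse and updatedb internally (the Crawler stack trace later in this thread shows it driving those jobs), so the separate inject and generate calls above are probably redundant. Run step by step, the same cycle would look roughly like this in 2.x (exact flags vary between releases):

bin/nutch inject urls
bin/nutch generate -topN 100
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb
bin/nutch solrindex http://127.0.0.1:8983/solr/ -reindex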

Re: Nutch fetching a lot but Solr doesn't include all the fetches

Posted by Robert Irribarren <ro...@algorithms.io>.
I fixed the errors, thanks.


Re: Nutch fetching a lot but Solr doesn't include all the fetches

Posted by Robert Irribarren <ro...@algorithms.io>.
And here is my hadoop.log:
2012-08-18 08:30:13,069 INFO  solr.SolrIndexerJob - SolrIndexerJob: starting
2012-08-18 08:30:13,658 INFO  plugin.PluginRepository - Plugins: looking in: /usr/share/nutch/runtime/local/plugins
2012-08-18 08:30:13,866 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2012-08-18 08:30:13,866 INFO  plugin.PluginRepository - Registered Plugins:
2012-08-18 08:30:13,866 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
2012-08-18 08:30:13,866 INFO  plugin.PluginRepository -         Basic URL Normalizer (urlnormalizer-basic)
2012-08-18 08:30:13,866 INFO  plugin.PluginRepository -         Html Parse Plug-in (parse-html)
2012-08-18 08:30:13,866 INFO  plugin.PluginRepository -         Basic Indexing Filter (index-basic)
2012-08-18 08:30:13,866 INFO  plugin.PluginRepository -         HTTP Framework (lib-http)
2012-08-18 08:30:13,866 INFO  plugin.PluginRepository -         Pass-through URL Normalizer (urlnormalizer-pass)
2012-08-18 08:30:13,866 INFO  plugin.PluginRepository -         Regex URL Filter (urlfilter-regex)
2012-08-18 08:30:13,866 INFO  plugin.PluginRepository -         Http Protocol Plug-in (protocol-http)
2012-08-18 08:30:13,866 INFO  plugin.PluginRepository -         Regex URL Normalizer (urlnormalizer-regex)
2012-08-18 08:30:13,866 INFO  plugin.PluginRepository -         Tika Parser Plug-in (parse-tika)
2012-08-18 08:30:13,866 INFO  plugin.PluginRepository -         OPIC Scoring Plug-in (scoring-opic)
2012-08-18 08:30:13,866 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)
2012-08-18 08:30:13,866 INFO  plugin.PluginRepository -         Anchor Indexing Filter (index-anchor)
2012-08-18 08:30:13,866 INFO  plugin.PluginRepository -         Regex URL Filter Framework (lib-regex-filter)
2012-08-18 08:30:13,866 INFO  plugin.PluginRepository - Registered Extension-Points:
2012-08-18 08:30:13,866 INFO  plugin.PluginRepository -         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2012-08-18 08:30:13,866 INFO  plugin.PluginRepository -         Nutch Protocol (org.apache.nutch.protocol.Protocol)
2012-08-18 08:30:13,866 INFO  plugin.PluginRepository -         Parse Filter (org.apache.nutch.parse.ParseFilter)
2012-08-18 08:30:13,867 INFO  plugin.PluginRepository -         Nutch URL Filter (org.apache.nutch.net.URLFilter)
2012-08-18 08:30:13,867 INFO  plugin.PluginRepository -         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2012-08-18 08:30:13,867 INFO  plugin.PluginRepository -         Nutch Content Parser (org.apache.nutch.parse.Parser)
2012-08-18 08:30:13,867 INFO  plugin.PluginRepository -         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2012-08-18 08:30:13,881 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2012-08-18 08:30:13,883 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
2012-08-18 08:30:13,883 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2012-08-18 08:30:14,946 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2012-08-18 08:30:15,960 INFO  mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
2012-08-18 08:30:16,091 INFO  solr.SolrMappingReader - source: content dest: content
2012-08-18 08:30:16,091 INFO  solr.SolrMappingReader - source: site dest: site
2012-08-18 08:30:16,091 INFO  solr.SolrMappingReader - source: title dest: title
2012-08-18 08:30:16,091 INFO  solr.SolrMappingReader - source: host dest: host
2012-08-18 08:30:16,092 INFO  solr.SolrMappingReader - source: segment dest: segment
2012-08-18 08:30:16,092 INFO  solr.SolrMappingReader - source: boost dest: boost
2012-08-18 08:30:16,092 INFO  solr.SolrMappingReader - source: digest dest: digest
2012-08-18 08:30:16,092 INFO  solr.SolrMappingReader - source: tstamp dest: tstamp
2012-08-18 08:30:16,094 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2012-08-18 08:30:16,094 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
2012-08-18 08:30:16,094 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2012-08-18 08:30:16,957 INFO  solr.SolrWriter - Adding 36 documents
2012-08-18 08:30:19,859 INFO  solr.SolrIndexerJob - SolrIndexerJob: done.



Re: Nutch fetching a lot but Solr doesn't include all the fetches

Posted by Robert Irribarren <ro...@algorithms.io>.
WebTable statistics start
Statistics for WebTable:
min score:      0.0
status 2 (status_fetched):      1053
jobs:   {db_stats-job_local_0001={jobID=job_local_0001, jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=211, MAP_INPUT_RECORDS=1234, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=24, MAP_OUTPUT_BYTES=65418, COMMITTED_HEAP_BYTES=504635392, CPU_MILLISECONDS=0, SPLIT_RAW_BYTES=1046, COMBINE_INPUT_RECORDS=4936, REDUCE_INPUT_RECORDS=12, REDUCE_INPUT_GROUPS=12, COMBINE_OUTPUT_RECORDS=12, PHYSICAL_MEMORY_BYTES=0, REDUCE_OUTPUT_RECORDS=12, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=4936}, FileSystemCounters={FILE_BYTES_READ=878225, FILE_BYTES_WRITTEN=991145}, File Output Format Counters ={BYTES_WRITTEN=375}}}}
retry 0:        1233
retry 1:        1
TOTAL urls:     1234
status 4 (status_redir_temp):   32
status 5 (status_redir_perm):   47
max score:      1.0
status 34 (status_retry):       16
status 3 (status_gone): 17
status 0 (null):        69
avg score:      0.01614992
WebTable statistics: done


This is what the DB says, but it's not what I actually see in Solr. Perhaps I didn't point the indexer at the right Solr location? Please help.
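Maybe dumping the table would show what status the missing URLs ended up with. If I read the 2.x options right (they may differ by release), something like:

bin/nutch readdb -dump crawl_dump
grep -r "wikipedia" crawl_dump

should print the stored row for each URL, including its fetch status.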


Re: Nutch fetching a lot but Solr doesn't include all the fetches

Posted by Robert Irribarren <ro...@algorithms.io>.
Update: I get this after the crawl finishes:

Parsing http://www.brainpop.co.uk/
Exception in thread "main" java.lang.RuntimeException: job failed: name=parse, jobid=job_local_0004
        at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:47)
        at org.apache.nutch.parse.ParserJob.run(ParserJob.java:249)
        at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
        at org.apache.nutch.crawl.Crawler.run(Crawler.java:171)
        at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)
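To dig out the root cause I can rerun just the parse step and check the log; from memory the 2.x invocation is roughly:

bin/nutch parse -all
tail -n 100 logs/hadoop.log

The RuntimeException above only says the job failed; the underlying parser exception should be in hadoop.log.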



Re: Nutch fetching a lot but Solr doesn't include all the fetches

Posted by Robert Irribarren <ro...@algorithms.io>.
That actually didn't fix it; I still get the same number of results. I run

bin/nutch solrindex http://127.0.0.1:8983/solr/ -reindex

but the URL I actually search against is
http://127.0.0.1:8983/solr/#/~cores/collection1

So should I run

bin/nutch solrindex http://127.0.0.1:8983/solr/#/~cores/collection1
or
bin/nutch solrindex http://127.0.0.1:8983/solr/collection1
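The #/~cores/collection1 part looks like an admin-UI fragment, so presumably the index itself answers at /solr/collection1. A quick way to check the document count against that endpoint (the core name here is just what the UI shows, so an assumption):

curl "http://127.0.0.1:8983/solr/collection1/select?q=*:*&rows=0&wt=json"

numFound in the response is the number of documents actually in the index.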


Re: Nutch fetching a lot but Solr doesn't include all the fetches

Posted by Robert Irribarren <ro...@algorithms.io>.
I actually didn't have it specified. I've added it now; my nutch-site.xml looks like this:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
<name>http.agent.name</name>
<value>Balsa  Crawler</value>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>

<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available: ..
</description>
</property>

</configuration>
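For completeness: SqlStore also needs a matching conf/gora.properties. Mine follows the 2.0 tutorial defaults; the HSQLDB values below are an assumption, adjust them for your database:

gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchtest
gora.sqlstore.jdbc.user=sa
gora.sqlstore.jdbc.password=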



Re: Nutch fetching a lot but Solr doesn't include all the fetches

Posted by Stefan Scheffler <ss...@avantgarde-labs.de>.
Did you set db.ignore.external.links in *conf/nutch-site.xml*?
When set to true, it prevents external links from being fetched.
Another possible problem is that the servers' robots.txt forbids the crawler from fetching.
You can check this with *bin/nutch readdb*; there you can see whether the sites were really fetched.
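For example (2.x options from memory, so double-check against your build):

bin/nutch readdb -stats
bin/nutch readdb -url http://az.wikipedia.org/wiki/Quercus_prinus

The -url form prints the stored row for a single page, including its fetch status.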
regards
Stefan
