Posted to user@nutch.apache.org by Bayu Widyasanyata <bw...@gmail.com> on 2014/06/05 06:38:24 UTC

Crawling local file system - file not parse

Hi,

I'm sure this is an "old" topic, but I still have no luck with it.
Crawling the local file system is a little bit harder than crawling the web over HTTP :(

The following are the important files I configured:

(1) urls/seed.txt
file://opt/searchengine/test/

which contains one file:
-rw-r--r-- 1 bayu bayu 3272 Jun  5 10:02 Testdocumentsaja.pdf

(2) regex-urlfilter.txt: allow the file: protocol and accept the path URL
-^(ftp|mailto):
+^file://opt/searchengine/test

(3) nutch-site.xml: enable protocol-file
<property>
  <name>plugin.includes</name>
  <value>protocol-(http|file)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>
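
Note: a minimal sketch, in Python rather than Nutch's own Java, of how the
two rules in (2) behave under first-match-wins filtering, where a URL that
matches no rule is rejected. The accepts helper and the candidate URLs (other
than the seed) are hypothetical and only for illustration; Python's re module
stands in here for the Java regex engine that Nutch uses.

```python
import re

# Rules copied from the regex-urlfilter.txt shown in (2).
RULES = [
    ("-", re.compile(r"^(ftp|mailto):")),
    ("+", re.compile(r"^file://opt/searchengine/test")),
]


def accepts(url, rules=RULES):
    """Return True if the first matching rule is '+'; reject if none match."""
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    return False


# Hypothetical spellings of the same location, to see which ones pass.
for candidate in (
    "file://opt/searchengine/test/",
    "file://opt/searchengine/test/Testdocumentsaja.pdf",
    "file:/opt/searchengine/test/Testdocumentsaja.pdf",
    "file:///opt/searchengine/test/Testdocumentsaja.pdf",
):
    print(candidate, "->", "accepted" if accepts(candidate) else "rejected")
```

With only these two rules, any URL that is not spelled with the exact prefix
file://opt/searchengine/test is dropped, so it may be worth checking which
form the URLs of the directory entries actually take.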

The crawl script runs the common Nutch steps (inject - generate - fetch -
parse - updatedb - solrindex - solrdedup).
From the hadoop.log below, Nutch can fetch the file: protocol path, but it
never parses the file inside /opt/searchengine/test/.
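
Note: one thing that may be worth checking is how the seed URL itself is
read. Under generic URL syntax, the text between file:// and the next slash
is the authority (host) part, so the two-slash form above names a host called
opt and the path /searchengine/test/. The sketch below only uses Python's
standard urllib.parse to show that split; split_file_url is a hypothetical
helper, and whether Nutch's protocol-file plugin interprets the URL the same
way is an assumption that is not verified here.

```python
from urllib.parse import urlsplit


def split_file_url(url):
    """Return (host, path) as seen by a generic URL parser."""
    parts = urlsplit(url)
    return parts.netloc, parts.path


# The seed as configured (two slashes) and the three-slash spelling.
for url in ("file://opt/searchengine/test/", "file:///opt/searchengine/test/"):
    host, path = split_file_url(url)
    print(f"{url}: host={host!r} path={path!r}")
```

In the three-slash form the host is empty and the path is the full
/opt/searchengine/test/ directory.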

hadoop.log:

2014-06-05 10:33:33,274 INFO  crawl.Injector - Injector: starting at 2014-06-05 10:33:33
2014-06-05 10:33:33,276 INFO  crawl.Injector - Injector: crawlDb: /opt/searchengine/nutch/BWCrawl/crawldb
2014-06-05 10:33:33,276 INFO  crawl.Injector - Injector: urlDir: /opt/searchengine/nutch/urls/seed.txt
2014-06-05 10:33:33,277 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
2014-06-05 10:33:33,714 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2014-06-05 10:33:33,807 WARN  snappy.LoadSnappy - Snappy native library not loaded
2014-06-05 10:33:34,717 INFO  regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2014-06-05 10:33:35,127 INFO  crawl.Injector - Injector: total number of urls rejected by filters: 0
2014-06-05 10:33:35,131 INFO  crawl.Injector - Injector: total number of urls injected after normalization and filtering: 1
2014-06-05 10:33:35,132 INFO  crawl.Injector - Injector: Merging injected urls into crawl db.
2014-06-05 10:33:35,396 INFO  crawl.Injector - Injector: overwrite: false
2014-06-05 10:33:35,397 INFO  crawl.Injector - Injector: update: false
2014-06-05 10:33:36,357 INFO  crawl.Injector - Injector: finished at 2014-06-05 10:33:36, elapsed: 00:00:03
2014-06-05 10:33:37,857 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2014-06-05 10:33:37,863 INFO  crawl.Generator - Generator: starting at 2014-06-05 10:33:37
2014-06-05 10:33:37,863 INFO  crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2014-06-05 10:33:37,864 INFO  crawl.Generator - Generator: filtering: true
2014-06-05 10:33:37,865 INFO  crawl.Generator - Generator: normalizing: true
2014-06-05 10:33:37,876 INFO  crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition.
2014-06-05 10:33:38,915 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2014-06-05 10:33:38,916 INFO  crawl.AbstractFetchSchedule - defaultInterval=129600
2014-06-05 10:33:38,917 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
2014-06-05 10:33:38,929 INFO  regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
2014-06-05 10:33:39,006 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2014-06-05 10:33:39,007 INFO  crawl.AbstractFetchSchedule - defaultInterval=129600
2014-06-05 10:33:39,007 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
2014-06-05 10:33:39,015 INFO  regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
2014-06-05 10:33:39,384 INFO  crawl.Generator - Generator: Partitioning selected urls for politeness.
2014-06-05 10:33:40,386 INFO  crawl.Generator - Generator: segment: /opt/searchengine/nutch/BWCrawl/segments/20140605103340
2014-06-05 10:33:40,593 INFO  regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
2014-06-05 10:33:41,540 INFO  crawl.Generator - Generator: finished at 2014-06-05 10:33:41, elapsed: 00:00:03
2014-06-05 10:33:42,634 INFO  fetcher.Fetcher - Fetcher: starting at 2014-06-05 10:33:42
2014-06-05 10:33:42,635 INFO  fetcher.Fetcher - Fetcher: segment: /opt/searchengine/nutch/BWCrawl/segments/20140605103340
2014-06-05 10:33:43,056 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2014-06-05 10:33:43,719 INFO  fetcher.Fetcher - Using queue mode : byHost
2014-06-05 10:33:43,720 INFO  fetcher.Fetcher - Fetcher: threads: 10
2014-06-05 10:33:43,720 INFO  fetcher.Fetcher - Fetcher: time-out divisor: 4
2014-06-05 10:33:43,739 INFO  fetcher.Fetcher - QueueFeeder finished: total 1 records + hit by time limit :0
2014-06-05 10:33:44,102 INFO  fetcher.Fetcher - Using queue mode : byHost
2014-06-05 10:33:44,103 INFO  fetcher.Fetcher - Using queue mode : byHost
2014-06-05 10:33:44,104 INFO  fetcher.Fetcher - fetching file://opt/searchengine/test/ (queue crawl delay=5000ms)
2014-06-05 10:33:44,106 INFO  fetcher.Fetcher - Using queue mode : byHost
2014-06-05 10:33:44,107 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2014-06-05 10:33:44,111 INFO  fetcher.Fetcher - Using queue mode : byHost
2014-06-05 10:33:44,111 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2014-06-05 10:33:44,118 INFO  fetcher.Fetcher - Using queue mode : byHost
2014-06-05 10:33:44,120 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2014-06-05 10:33:44,121 INFO  fetcher.Fetcher - Using queue mode : byHost
2014-06-05 10:33:44,122 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2014-06-05 10:33:44,122 INFO  fetcher.Fetcher - Using queue mode : byHost
2014-06-05 10:33:44,127 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2014-06-05 10:33:44,129 INFO  fetcher.Fetcher - Using queue mode : byHost
2014-06-05 10:33:44,130 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2014-06-05 10:33:44,131 INFO  fetcher.Fetcher - Using queue mode : byHost
2014-06-05 10:33:44,132 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2014-06-05 10:33:44,133 INFO  fetcher.Fetcher - Using queue mode : byHost
2014-06-05 10:33:44,146 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2014-06-05 10:33:44,149 INFO  fetcher.Fetcher - Fetcher: throughput threshold: -1
2014-06-05 10:33:44,149 INFO  fetcher.Fetcher - Fetcher: throughput threshold retries: 5
2014-06-05 10:33:44,150 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2014-06-05 10:33:44,423 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
2014-06-05 10:33:45,151 INFO  fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
2014-06-05 10:33:45,153 INFO  fetcher.Fetcher - -activeThreads=0
2014-06-05 10:33:45,497 INFO  fetcher.Fetcher - Fetcher: finished at 2014-06-05 10:33:45, elapsed: 00:00:02
2014-06-05 10:33:46,660 INFO  parse.ParseSegment - ParseSegment: starting at 2014-06-05 10:33:46
2014-06-05 10:33:46,661 INFO  parse.ParseSegment - ParseSegment: segment: /opt/searchengine/nutch/BWCrawl/segments/20140605103340
2014-06-05 10:33:47,094 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2014-06-05 10:33:48,527 INFO  parse.ParseSegment - ParseSegment: finished at 2014-06-05 10:33:48, elapsed: 00:00:01
2014-06-05 10:33:49,949 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2014-06-05 10:33:49,995 INFO  crawl.CrawlDb - CrawlDb update: starting at 2014-06-05 10:33:49
2014-06-05 10:33:49,996 INFO  crawl.CrawlDb - CrawlDb update: db: /opt/searchengine/nutch/BWCrawl/crawldb
2014-06-05 10:33:49,997 INFO  crawl.CrawlDb - CrawlDb update: segments: [/opt/searchengine/nutch/BWCrawl/segments/20140605103340]
2014-06-05 10:33:50,002 INFO  crawl.CrawlDb - CrawlDb update: additions allowed: true
2014-06-05 10:33:50,003 INFO  crawl.CrawlDb - CrawlDb update: URL normalizing: true
2014-06-05 10:33:50,003 INFO  crawl.CrawlDb - CrawlDb update: URL filtering: true
2014-06-05 10:33:50,003 INFO  crawl.CrawlDb - CrawlDb update: 404 purging: false
2014-06-05 10:33:50,006 INFO  crawl.CrawlDb - CrawlDb update: Merging segment data into db.
2014-06-05 10:33:51,150 INFO  regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
2014-06-05 10:33:51,242 INFO  regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
2014-06-05 10:33:51,399 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2014-06-05 10:33:51,399 INFO  crawl.AbstractFetchSchedule - defaultInterval=129600
2014-06-05 10:33:51,399 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
2014-06-05 10:33:51,537 INFO  crawl.CrawlDb - CrawlDb update: finished at 2014-06-05 10:33:51, elapsed: 00:00:01
2014-06-05 10:33:53,008 INFO  indexer.IndexingJob - Indexer: starting at 2014-06-05 10:33:53
2014-06-05 10:33:53,024 INFO  indexer.IndexingJob - Indexer: deleting gone documents: false
2014-06-05 10:33:53,025 INFO  indexer.IndexingJob - Indexer: URL filtering: false
2014-06-05 10:33:53,027 INFO  indexer.IndexingJob - Indexer: URL normalizing: false
2014-06-05 10:33:53,373 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2014-06-05 10:33:53,385 INFO  indexer.IndexingJob - Active IndexWriters :
SOLRIndexWriter
        solr.server.url : URL of the SOLR instance (mandatory)
        solr.commit.size : buffer size when sending to SOLR (default 1000)
        solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
        solr.auth : use authentication (default false)
        solr.auth.username : use authentication (default false)
        solr.auth : username for authentication
        solr.auth.password : password for authentication


2014-06-05 10:33:53,396 INFO  indexer.IndexerMapReduce - IndexerMapReduce: crawldb: /opt/searchengine/nutch/BWCrawl/crawldb
2014-06-05 10:33:53,396 INFO  indexer.IndexerMapReduce - IndexerMapReduces: adding segment: file:/opt/searchengine/nutch/BWCrawl/segments/20140605103340
2014-06-05 10:33:53,464 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2014-06-05 10:33:54,214 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
2014-06-05 10:33:54,532 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2014-06-05 10:33:54,589 INFO  solr.SolrMappingReader - source: content dest: content
2014-06-05 10:33:54,589 INFO  solr.SolrMappingReader - source: title dest: title
2014-06-05 10:33:54,589 INFO  solr.SolrMappingReader - source: author dest: author
2014-06-05 10:33:54,589 INFO  solr.SolrMappingReader - source: host dest: host
2014-06-05 10:33:54,589 INFO  solr.SolrMappingReader - source: segment dest: segment
2014-06-05 10:33:54,589 INFO  solr.SolrMappingReader - source: boost dest: boost
2014-06-05 10:33:54,589 INFO  solr.SolrMappingReader - source: digest dest: digest
2014-06-05 10:33:54,589 INFO  solr.SolrMappingReader - source: tstamp dest: tstamp
2014-06-05 10:33:54,589 INFO  solr.SolrMappingReader - source: url dest: id
2014-06-05 10:33:54,589 INFO  solr.SolrMappingReader - source: url dest: url
2014-06-05 10:33:54,941 INFO  solr.SolrMappingReader - source: content dest: content
2014-06-05 10:33:54,941 INFO  solr.SolrMappingReader - source: title dest: title
2014-06-05 10:33:54,941 INFO  solr.SolrMappingReader - source: author dest: author
2014-06-05 10:33:54,941 INFO  solr.SolrMappingReader - source: host dest: host
2014-06-05 10:33:54,941 INFO  solr.SolrMappingReader - source: segment dest: segment
2014-06-05 10:33:54,941 INFO  solr.SolrMappingReader - source: boost dest: boost
2014-06-05 10:33:54,941 INFO  solr.SolrMappingReader - source: digest dest: digest
2014-06-05 10:33:54,941 INFO  solr.SolrMappingReader - source: tstamp dest: tstamp
2014-06-05 10:33:54,941 INFO  solr.SolrMappingReader - source: url dest: id
2014-06-05 10:33:54,941 INFO  solr.SolrMappingReader - source: url dest: url
2014-06-05 10:33:55,063 INFO  indexer.IndexingJob - Indexer: finished at 2014-06-05 10:33:55, elapsed: 00:00:02

Result of nutch readdb:
CrawlDb statistics start: BWCrawl/crawldb/
Statistics for CrawlDb: BWCrawl/crawldb/
TOTAL urls:     1
retry 0:        1
min score:      1.0
avg score:      1.0
max score:      1.0
status 3 (db_gone):     1
CrawlDb statistics: done
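
Note: the statistics above show a single entry, the seed URL, and it is in
status db_gone, which as far as I understand is the status Nutch records for
a page it considers no longer available. No entry for the PDF was added. A
small sketch for pulling the per-status counts out of such output follows;
status_counts and the READDB_STATS constant are hypothetical names, and the
line format is assumed from the output pasted above, not from any Nutch
specification.

```python
import re

# The statistics lines copied from the readdb output above.
READDB_STATS = """\
TOTAL urls:     1
retry 0:        1
min score:      1.0
avg score:      1.0
max score:      1.0
status 3 (db_gone):     1
"""

# Matches lines such as "status 3 (db_gone):     1".
STATUS_LINE = re.compile(r"^status (\d+) \((\w+)\):\s+(\d+)$")


def status_counts(stats_text):
    """Map each status name found in the text to its URL count."""
    counts = {}
    for line in stats_text.splitlines():
        match = STATUS_LINE.match(line.strip())
        if match:
            counts[match.group(2)] = int(match.group(3))
    return counts


print(status_counts(READDB_STATS))
```

If the directory entry had been picked up, one would expect at least one more
status line here in addition to db_gone.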

The following are some of the documents I've read:

   - http://wiki.apache.org/nutch/IntranetDocumentSearch
   - http://wiki.apache.org/nutch/FAQ#How_do_I_index_my_local_file_system.3F
   - http://lucene.472066.n3.nabble.com/Crawling-the-local-file-system-with-Nutch-Document-td607747.html

System: Ubuntu 14.04, Nutch 1.8, Solr 4.8.0.
I would really appreciate it if someone could share some hints or any
proven, working references on this subject.

Thank you.

-- 
wassalam,
[bayu]

Re: Crawling local file system - file not parse

Posted by Bayu Widyasanyata <bw...@gmail.com>.

Hi Sebastian,

Thank you for the info.
I'll try the workaround suggested in the comments.


-- 
wassalam,
[bayu]

Re: Crawling local file system - file not parse

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi Bayu,

There is an open issue with file URLs, see
https://issues.apache.org/jira/browse/NUTCH-1483

Hope the information helps,
Sebastian


> for scope 'generate_host_count', using default
> 2014-06-05 10:33:39,384 INFO  crawl.Generator - Generator: Partitioning
> selected urls for politeness.
> 2014-06-05 10:33:40,386 INFO  crawl.Generator - Generator: segment:
> /opt/searchengine/nutch/BWCrawl/segments/20140605103340
> 2014-06-05 10:33:40,593 INFO  regex.RegexURLNormalizer - can't find rules
> for scope 'partition', using default
> 2014-06-05 10:33:41,540 INFO  crawl.Generator - Generator: finished at
> 2014-06-05 10:33:41, elapsed: 00:00:03
> 2014-06-05 10:33:42,634 INFO  fetcher.Fetcher - Fetcher: starting at
> 2014-06-05 10:33:42
> 2014-06-05 10:33:42,635 INFO  fetcher.Fetcher - Fetcher: segment:
> /opt/searchengine/nutch/BWCrawl/segments/20140605103340
> 2014-06-05 10:33:43,056 WARN  util.NativeCodeLoader - Unable to load
> native-hadoop library for your platform... using builtin-java classes where
> applicable
> 2014-06-05 10:33:43,719 INFO  fetcher.Fetcher - Using queue mode : byHost
> 2014-06-05 10:33:43,720 INFO  fetcher.Fetcher - Fetcher: threads: 10
> 2014-06-05 10:33:43,720 INFO  fetcher.Fetcher - Fetcher: time-out divisor: 4
> 2014-06-05 10:33:43,739 INFO  fetcher.Fetcher - QueueFeeder finished: total
> 1 records + hit by time limit :0
> 2014-06-05 10:33:44,102 INFO  fetcher.Fetcher - Using queue mode : byHost
> 2014-06-05 10:33:44,103 INFO  fetcher.Fetcher - Using queue mode : byHost
> 2014-06-05 10:33:44,104 INFO  fetcher.Fetcher - fetching
> file://opt/searchengine/test/ (queue crawl delay=5000ms)
> 2014-06-05 10:33:44,106 INFO  fetcher.Fetcher - Using queue mode : byHost
> 2014-06-05 10:33:44,107 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=1
> 2014-06-05 10:33:44,111 INFO  fetcher.Fetcher - Using queue mode : byHost
> 2014-06-05 10:33:44,111 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=1
> 2014-06-05 10:33:44,118 INFO  fetcher.Fetcher - Using queue mode : byHost
> 2014-06-05 10:33:44,120 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=1
> 2014-06-05 10:33:44,121 INFO  fetcher.Fetcher - Using queue mode : byHost
> 2014-06-05 10:33:44,122 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=1
> 2014-06-05 10:33:44,122 INFO  fetcher.Fetcher - Using queue mode : byHost
> 2014-06-05 10:33:44,127 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=1
> 2014-06-05 10:33:44,129 INFO  fetcher.Fetcher - Using queue mode : byHost
> 2014-06-05 10:33:44,130 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=1
> 2014-06-05 10:33:44,131 INFO  fetcher.Fetcher - Using queue mode : byHost
> 2014-06-05 10:33:44,132 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=1
> 2014-06-05 10:33:44,133 INFO  fetcher.Fetcher - Using queue mode : byHost
> 2014-06-05 10:33:44,146 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=1
> 2014-06-05 10:33:44,149 INFO  fetcher.Fetcher - Fetcher: throughput
> threshold: -1
> 2014-06-05 10:33:44,149 INFO  fetcher.Fetcher - Fetcher: throughput
> threshold retries: 5
> 2014-06-05 10:33:44,150 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=1
> 2014-06-05 10:33:44,423 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=0
> 2014-06-05 10:33:45,151 INFO  fetcher.Fetcher - -activeThreads=0,
> spinWaiting=0, fetchQueues.totalSize=0
> 2014-06-05 10:33:45,153 INFO  fetcher.Fetcher - -activeThreads=0
> 2014-06-05 10:33:45,497 INFO  fetcher.Fetcher - Fetcher: finished at
> 2014-06-05 10:33:45, elapsed: 00:00:02
> 2014-06-05 10:33:46,660 INFO  parse.ParseSegment - ParseSegment: starting
> at 2014-06-05 10:33:46
> 2014-06-05 10:33:46,661 INFO  parse.ParseSegment - ParseSegment: segment:
> /opt/searchengine/nutch/BWCrawl/segments/20140605103340
> 2014-06-05 10:33:47,094 WARN  util.NativeCodeLoader - Unable to load
> native-hadoop library for your platform... using builtin-java classes where
> applicable
> 2014-06-05 10:33:48,527 INFO  parse.ParseSegment - ParseSegment: finished
> at 2014-06-05 10:33:48, elapsed: 00:00:01
> 2014-06-05 10:33:49,949 WARN  util.NativeCodeLoader - Unable to load
> native-hadoop library for your platform... using builtin-java classes where
> applicable
> 2014-06-05 10:33:49,995 INFO  crawl.CrawlDb - CrawlDb update: starting at
> 2014-06-05 10:33:49
> 2014-06-05 10:33:49,996 INFO  crawl.CrawlDb - CrawlDb update: db:
> /opt/searchengine/nutch/BWCrawl/crawldb
> 2014-06-05 10:33:49,997 INFO  crawl.CrawlDb - CrawlDb update: segments:
> [/opt/searchengine/nutch/BWCrawl/segments/20140605103340]
> 2014-06-05 10:33:50,002 INFO  crawl.CrawlDb - CrawlDb update: additions
> allowed: true
> 2014-06-05 10:33:50,003 INFO  crawl.CrawlDb - CrawlDb update: URL
> normalizing: true
> 2014-06-05 10:33:50,003 INFO  crawl.CrawlDb - CrawlDb update: URL
> filtering: true
> 2014-06-05 10:33:50,003 INFO  crawl.CrawlDb - CrawlDb update: 404 purging:
> false
> 2014-06-05 10:33:50,006 INFO  crawl.CrawlDb - CrawlDb update: Merging
> segment data into db.
> 2014-06-05 10:33:51,150 INFO  regex.RegexURLNormalizer - can't find rules
> for scope 'crawldb', using default
> 2014-06-05 10:33:51,242 INFO  regex.RegexURLNormalizer - can't find rules
> for scope 'crawldb', using default
> 2014-06-05 10:33:51,399 INFO  crawl.FetchScheduleFactory - Using
> FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2014-06-05 10:33:51,399 INFO  crawl.AbstractFetchSchedule -
> defaultInterval=129600
> 2014-06-05 10:33:51,399 INFO  crawl.AbstractFetchSchedule -
> maxInterval=7776000
> 2014-06-05 10:33:51,537 INFO  crawl.CrawlDb - CrawlDb update: finished at
> 2014-06-05 10:33:51, elapsed: 00:00:01
> 2014-06-05 10:33:53,008 INFO  indexer.IndexingJob - Indexer: starting at
> 2014-06-05 10:33:53
> 2014-06-05 10:33:53,024 INFO  indexer.IndexingJob - Indexer: deleting gone
> documents: false
> 2014-06-05 10:33:53,025 INFO  indexer.IndexingJob - Indexer: URL filtering:
> false
> 2014-06-05 10:33:53,027 INFO  indexer.IndexingJob - Indexer: URL
> normalizing: false
> 2014-06-05 10:33:53,373 INFO  indexer.IndexWriters - Adding
> org.apache.nutch.indexwriter.solr.SolrIndexWriter
> 2014-06-05 10:33:53,385 INFO  indexer.IndexingJob - Active IndexWriters :
> SOLRIndexWriter
>         solr.server.url : URL of the SOLR instance (mandatory)
>         solr.commit.size : buffer size when sending to SOLR (default 1000)
>         solr.mapping.file : name of the mapping file for fields (default
> solrindex-mapping.xml)
>         solr.auth : use authentication (default false)
>         solr.auth.username : use authentication (default false)
>         solr.auth : username for authentication
>         solr.auth.password : password for authentication
> 
> 
> 2014-06-05 10:33:53,396 INFO  indexer.IndexerMapReduce - IndexerMapReduce:
> crawldb: /opt/searchengine/nutch/BWCrawl/crawldb
> 2014-06-05 10:33:53,396 INFO  indexer.IndexerMapReduce - IndexerMapReduces:
> adding segment: file:/opt/searchengine/nutch/BWCrawl/segments/20140605103340
> 2014-06-05 10:33:53,464 WARN  util.NativeCodeLoader - Unable to load
> native-hadoop library for your platform... using builtin-java classes where
> applicable
> 2014-06-05 10:33:54,214 INFO  anchor.AnchorIndexingFilter - Anchor
> deduplication is: off
> 2014-06-05 10:33:54,532 INFO  indexer.IndexWriters - Adding
> org.apache.nutch.indexwriter.solr.SolrIndexWriter
> 2014-06-05 10:33:54,589 INFO  solr.SolrMappingReader - source: content
> dest: content
> 2014-06-05 10:33:54,589 INFO  solr.SolrMappingReader - source: title dest:
> title
> 2014-06-05 10:33:54,589 INFO  solr.SolrMappingReader - source: author dest:
> author
> 2014-06-05 10:33:54,589 INFO  solr.SolrMappingReader - source: host dest:
> host
> 2014-06-05 10:33:54,589 INFO  solr.SolrMappingReader - source: segment
> dest: segment
> 2014-06-05 10:33:54,589 INFO  solr.SolrMappingReader - source: boost dest:
> boost
> 2014-06-05 10:33:54,589 INFO  solr.SolrMappingReader - source: digest dest:
> digest
> 2014-06-05 10:33:54,589 INFO  solr.SolrMappingReader - source: tstamp dest:
> tstamp
> 2014-06-05 10:33:54,589 INFO  solr.SolrMappingReader - source: url dest: id
> 2014-06-05 10:33:54,589 INFO  solr.SolrMappingReader - source: url dest: url
> 2014-06-05 10:33:54,941 INFO  solr.SolrMappingReader - source: content
> dest: content
> 2014-06-05 10:33:54,941 INFO  solr.SolrMappingReader - source: title dest:
> title
> 2014-06-05 10:33:54,941 INFO  solr.SolrMappingReader - source: author dest:
> author
> 2014-06-05 10:33:54,941 INFO  solr.SolrMappingReader - source: host dest:
> host
> 2014-06-05 10:33:54,941 INFO  solr.SolrMappingReader - source: segment
> dest: segment
> 2014-06-05 10:33:54,941 INFO  solr.SolrMappingReader - source: boost dest:
> boost
> 2014-06-05 10:33:54,941 INFO  solr.SolrMappingReader - source: digest dest:
> digest
> 2014-06-05 10:33:54,941 INFO  solr.SolrMappingReader - source: tstamp dest:
> tstamp
> 2014-06-05 10:33:54,941 INFO  solr.SolrMappingReader - source: url dest: id
> 2014-06-05 10:33:54,941 INFO  solr.SolrMappingReader - source: url dest: url
> 2014-06-05 10:33:55,063 INFO  indexer.IndexingJob - Indexer: finished at
> 2014-06-05 10:33:55, elapsed: 00:00:02
> 
> Result of nutch readdb:
> CrawlDb statistics start: BWCrawl/crawldb/
> Statistics for CrawlDb: BWCrawl/crawldb/
> TOTAL urls:     1
> retry 0:        1
> min score:      1.0
> avg score:      1.0
> max score:      1.0
> status 3 (db_gone):     1
> CrawlDb statistics: done
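
When comparing runs, it can help to pull the status counts out of the readdb output programmatically. The following is a small sketch that assumes the output keeps the "status N (name): count" layout shown above. The sample text is copied from this message, and the function name `status_counts` is hypothetical.

```python
import re

# Sample copied from the readdb output quoted above.
STATS = """\
TOTAL urls:     1
retry 0:        1
min score:      1.0
avg score:      1.0
max score:      1.0
status 3 (db_gone):     1
"""

# Matches lines such as "status 3 (db_gone):     1".
STATUS_LINE = re.compile(r"^status\s+(\d+)\s+\((\w+)\):\s+(\d+)\s*$")

def status_counts(text):
    """Map each status name to its URL count."""
    counts = {}
    for line in text.splitlines():
        match = STATUS_LINE.match(line.strip())
        if match:
            counts[match.group(2)] = int(match.group(3))
    return counts

if __name__ == "__main__":
    print(status_counts(STATS))
```

Here the only URL in the crawldb, the seed itself, ends up as db_gone. As I understand that status, the seed fetch did not succeed, which would leave nothing for the parse step to work on.
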
> 
> Following are some of the documents I've read:
> 
>    - http://wiki.apache.org/nutch/IntranetDocumentSearch
>    - http://wiki.apache.org/nutch/FAQ#How_do_I_index_my_local_file_system.3F
>    - http://lucene.472066.n3.nabble.com/Crawling-the-local-file-system-with-Nutch-Document-td607747.html
> 
> System: Ubuntu 14.04, nutch 1.8, Solr 4.8.0.
> I would really appreciate it if someone could share some hints or any
> "running-proof" references for this subject.
> 
> Thank you.
>