Posted to user@nutch.apache.org by "Drulea, Sherban" <sd...@rand.org> on 2015/10/02 03:39:32 UTC
nutch 2.3.1 doesn't crawl
Hi All,
Thanks for pointing me to the 2.3.1 release. It works without error but doesn’t crawl. I’m out of ideas why.
Here’s my environment:
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
SOLR 4.6.0
Mongo version 3.0.2.
Nutch 2.3.1
My regex-urlfilter.txt:
———————————————
+.
———————————————
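For context, Nutch's regex-urlfilter rules are evaluated top to bottom: each line's leading + or - decides accept or reject, and the first rule whose pattern matches wins. A toy Python model of that behavior (helper names are illustrative, not Nutch's actual code):

```python
import re

def url_filter(url, rules):
    """Mimic RegexURLFilter: the first matching rule decides.

    rules: strings like '+.' or r'-\.(gif|jpg)$'.
    Returns True (accept), False (reject), or None when no rule
    matched (which Nutch treats as reject).
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == '+'
    return None

# The single rule '+.' matches any URL with at least one character,
# so every seed should pass this filter.
assert url_filter('http://punklawyer.com/', ['+.']) is True
```

With only `+.` in the file, the filter should not be the reason nothing is crawled.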
nutch-site.xml
———————————————
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>nutch Mongo Solr Crawler</value>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.mongodb.store.MongoStore</value>
<description>Default class for storing data</description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
<description>Regular expression naming plugin directory names to include. </description>
</property>
</configuration>
———————————————
gora.properties:
———————————————
############################
# MongoDBStore properties #
############################
gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.override_hadoop_configuration=false
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=localhost:27017
gora.mongodb.db=method_centers
———————————————
Seed.txt
———————————————
http://punklawyer.com/
http://mail-archives.apache.org/mod_mbox/nutch-user/
http://hbase.apache.org/index.html
http://wiki.apache.org/nutch/FrontPage
http://www.aintitcool.com/
———————————————
Here are the results of the crawl command "./bin/crawl urls methods http://127.0.0.1:8983/solr/ 2"
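For reference, the bin/crawl wrapper takes a seed directory, a crawl id, a Solr URL, and a number of rounds: it runs inject once, then loops generate, fetch, parse, and updatedb with a fresh batch id each round before indexing. A rough Python sketch of that control flow (the real script is shell; step strings below are simplified):

```python
def crawl(seed_dir, crawl_id, solr_url, rounds):
    """Simplified control flow of bin/crawl (illustrative only)."""
    steps = [f"inject {seed_dir} -crawlId {crawl_id}"]
    for i in range(1, rounds + 1):
        batch_id = f"batch-{i}"  # the real script derives this from epoch time + a random suffix
        for job in ("generate", "fetch", "parse", "updatedb"):
            steps.append(f"{job} {batch_id} -crawlId {crawl_id}")
        steps.append(f"index -D solr.server.url={solr_url} -all -crawlId {crawl_id}")
    return steps

steps = crawl("urls", "methods", "http://127.0.0.1:8983/solr/", 2)
```

The important detail is that every job in a round is supposed to operate on the same batch id.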
Injecting seed URLs
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch inject urls -crawlId methods
InjectorJob: starting at 2015-10-01 18:27:23
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.mongodb.store.MongoStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 5
Injector: finished at 2015-10-01 18:27:26, elapsed: 00:00:02
Thu Oct 1 18:27:26 PDT 2015 : Iteration 1 of 2
Generating batchId
Generating a new fetchlist
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0 -crawlId methods -batchId 1443749246-29495
GeneratorJob: starting at 2015-10-01 18:27:26
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2015-10-01 18:27:29, time elapsed: 00:00:02
GeneratorJob: generated batch id: 1443749246-1282586680 containing 5 URLs
Fetching :
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch fetch -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D fetcher.timelimit.mins=180 1443749246-29495 -crawlId methods -threads 50
FetcherJob: starting at 2015-10-01 18:27:29
FetcherJob: batchId: 1443749246-29495
FetcherJob: threads: 50
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : 1443760049865
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 0 records. Hit by time limit :0
-finishing thread FetcherThread0, activeThreads=0
[... FetcherThread1 through FetcherThread49 finish the same way, activeThreads=0 ...]
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 0 records. Hit by time limit :0
-finishing thread FetcherThread0, activeThreads=0
[... FetcherThread1 through FetcherThread49 finish the same way, activeThreads=0 ...]
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
FetcherJob: finished at 2015-10-01 18:27:42, time elapsed: 00:00:12
Parsing :
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch parse -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D mapred.skip.attempts.to.start.skipping=2 -D mapred.skip.map.max.skip.records=1 1443749246-29495 -crawlId methods
ParserJob: starting at 2015-10-01 18:27:43
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: batchId: 1443749246-29495
ParserJob: success
ParserJob: finished at 2015-10-01 18:27:45, time elapsed: 00:00:02
CrawlDB update for methods
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 1443749246-29495 -crawlId methods
DbUpdaterJob: starting at 2015-10-01 18:27:46
DbUpdaterJob: batchId: 1443749246-29495
DbUpdaterJob: finished at 2015-10-01 18:27:48, time elapsed: 00:00:02
Indexing methods on SOLR index -> http://127.0.0.1:8983/solr/
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D solr.server.url=http://127.0.0.1:8983/solr/ -all -crawlId methods
IndexingJob: starting
Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : username for authentication
solr.auth.password : password for authentication
IndexingJob: done.
SOLR dedup -> http://127.0.0.1:8983/solr/
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch solrdedup -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true http://127.0.0.1:8983/solr/
Thu Oct 1 18:27:54 PDT 2015 : Iteration 2 of 2
Generating batchId
Generating a new fetchlist
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0 -crawlId methods -batchId 1443749274-17203
GeneratorJob: starting at 2015-10-01 18:27:55
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2015-10-01 18:27:57, time elapsed: 00:00:02
GeneratorJob: generated batch id: 1443749275-2050785747 containing 0 URLs
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now
So no errors but also no data. What else can I debug?
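One thing worth checking in the log above: generate was invoked with -batchId 1443749246-29495, but then reported "generated batch id: 1443749246-1282586680", and the fetch ran against 1443749246-29495. If the rows end up marked with one batch id while the fetcher asks for the other, the QueueFeeder finds 0 records, which is exactly what the output shows. A toy Python model of that selection (not Nutch code; names are illustrative):

```python
def feed_queue(rows, wanted_batch_id):
    """Toy model of the fetcher's QueueFeeder: only rows whose batch
    marker equals the requested batch id get queued for fetching."""
    return [url for url, batch in rows.items() if batch == wanted_batch_id]

# Suppose generate marked the 5 injected rows with the id it reported...
rows = {f"http://example{i}.com/": "1443749246-1282586680" for i in range(5)}

# ...while fetch asks for the id the wrapper script passed in:
queued = feed_queue(rows, "1443749246-29495")
assert queued == []  # matches "QueueFeeder finished: total 0 records"
```

If that mismatch is real, comparing the batch markers stored in Mongo against the id the fetch job receives would confirm it.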
I see some warnings in my hadoop.log but nothing alarming…
2015-10-01 18:19:29,430 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-10-01 18:19:29,441 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2015-10-01 18:19:29,441 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2015-10-01 18:19:29,442 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2015-10-01 18:19:30,326 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1900181322/.staging/job_local1900181322_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2015-10-01 18:19:30,327 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1900181322/.staging/job_local1900181322_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2015-10-01 18:19:30,405 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1900181322_0001/job_local1900181322_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2015-10-01 18:19:30,406 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1900181322_0001/job_local1900181322_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
….
2015-10-01 18:27:23,838 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-10-01 18:27:24,567 INFO crawl.InjectorJob - InjectorJob: Using class org.apache.gora.mongodb.store.MongoStore as the Gora storage class.
2015-10-01 18:27:24,969 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1182157052/.staging/job_local1182157052_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2015-10-01 18:27:24,971 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1182157052/.staging/job_local1182157052_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2015-10-01 18:27:25,050 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1182157052_0001/job_local1182157052_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2015-10-01 18:27:25,052 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1182157052_0001/job_local1182157052_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2015-10-01 18:27:30,288 INFO httpclient.Http - http.proxy.host = null
2015-10-01 18:27:30,288 INFO httpclient.Http - http.proxy.port = 8080
2015-10-01 18:27:30,288 INFO httpclient.Http - http.timeout = 10000
2015-10-01 18:27:30,288 INFO httpclient.Http - http.content.limit = 65536
2015-10-01 18:27:30,288 INFO httpclient.Http - http.agent = nutch Mongo Solr Crawler/Nutch-2.4-SNAPSHOT
2015-10-01 18:27:30,288 INFO httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
2015-10-01 18:27:30,288 INFO httpclient.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2015-10-01 18:27:30,292 INFO httpclient.Http - http.proxy.host = null
2015-10-01 18:27:30,292 INFO httpclient.Http - http.proxy.port = 8080
2015-10-01 18:27:30,292 INFO httpclient.Http - http.timeout = 10000
2015-10-01 18:27:30,292 INFO httpclient.Http - http.content.limit = 65536
2015-10-01 18:27:30,292 INFO httpclient.Http - http.agent = nutch Mongo Solr Crawler/Nutch-2.4-SNAPSHOT
2015-10-01 18:27:30,292 INFO httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
2015-10-01 18:27:30,292 INFO httpclient.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
I’ve been trying this for 3 days with no luck. I want to use nutch but may be forced to use another program.
My best guess is maybe something is borked with my plugin.includes:
<property>
<name>plugin.includes</name>
<value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
<description>Regular expression naming plugin directory names to include. </description>
</property>
Are these valid? Is there a more minimal set to try?
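For what it's worth, plugin.includes is a single regular expression matched against each plugin directory name under plugins/, so it can be sanity-checked offline. A quick Python check (the plugin names below are examples; compare against the actual plugins/ directory):

```python
import re

PLUGIN_INCLUDES = (
    r"protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|"
    r"index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|"
    r"scoring-opic|indexer-solr"
)

def included(plugin_dir):
    # The whole directory name must match the pattern, not just a prefix.
    return re.fullmatch(PLUGIN_INCLUDES, plugin_dir) is not None

assert included("protocol-http")
assert included("indexer-solr")
assert not included("parse-js")  # not listed, so it stays excluded
```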
Cheers,
Sherban
__________________________________________________________________________
This email message is for the sole use of the intended recipient(s) and
may contain confidential information. Any unauthorized review, use,
disclosure or distribution is prohibited. If you are not the intended
recipient, please contact the sender by reply email and destroy all copies
of the original message.
Re: nutch 2.3.1 doesn't crawl
Posted by "Drulea, Sherban" <sd...@rand.org>.
Seems like the problem is with the generator. It doesn't generate any
links to crawl. Is there any way to debug why the generator doesn't work?
On 10/1/15, 6:39 PM, "Drulea, Sherban" <sd...@rand.org> wrote:
>Hi All,
>
>Thanks for pointing me to the 2.3.1 release. It works without error but
>doesn¹t crawl. I¹m out of ideas why.
>
>Here¹s my environment:
>
>java version "1.8.0_60"
>
>Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
>
>Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
>
>SOLR 4.6.0
>Mongo version 3.0.2.
>Nutch 2.3.1
>
>My regex-urlfilter.txt:
>‹‹‹‹‹‹‹‹‹‹‹‹‹‹‹
>+.
>‹‹‹‹‹‹‹‹‹‹‹‹‹‹‹
>
>nutch-site.xml
>‹‹‹‹‹‹‹‹‹‹‹‹‹‹‹
><?xml version="1.0"?>
><?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
><!-- Put site-specific property overrides in this file. -->
>
><configuration>
>
> <property>
> <name>http.agent.name</name>
> <value>nutch Mongo Solr Crawler</value>
> </property>
>
> <property>
> <name>storage.data.store.class</name>
> <value>org.apache.gora.mongodb.store.MongoStore</value>
> <description>Default class for storing data</description>
> </property>
>
> <property>
> <name>plugin.includes</name>
>
><value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-
>(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr<
>/value>
> <description>Regular expression naming plugin directory names to
>include. </description>
> </property>
>
></configuration>
>
>‹‹‹‹‹‹‹‹‹‹‹‹‹‹‹
>
>gora.properties:
>‹‹‹‹‹‹‹‹‹‹‹‹‹‹‹
>############################
># MongoDBStore properties #
>############################
>gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
>gora.mongodb.override_hadoop_configuration=false
>gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
>gora.mongodb.servers=localhost:27017
>gora.mongodb.db=method_centers
>‹‹‹‹‹‹‹‹‹‹‹‹‹‹‹
>
>Seed.txt
>‹‹‹‹‹‹‹‹‹‹‹‹‹‹‹
>http://punklawyer.com/
>http://mail-archives.apache.org/mod_mbox/nutch-user/
>http://hbase.apache.org/index.html
>http://wiki.apache.org/nutch/FrontPage
>http://www.aintitcool.com/
>‹‹‹‹‹‹‹‹‹‹‹‹‹‹‹
>
>Here are the results of the crawl command " ./bin/crawl urls methods
>http://127.0.0.1:8983/solr/ 2²
>
>Injecting seed URLs
>
>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch inject urls
>-crawlId methods
>
>InjectorJob: starting at 2015-10-01 18:27:23
>
>InjectorJob: Injecting urlDir: urls
>
>InjectorJob: Using class org.apache.gora.mongodb.store.MongoStore as the
>Gora storage class.
>
>InjectorJob: total number of urls rejected by filters: 0
>
>InjectorJob: total number of urls injected after normalization and
>filtering: 5
>
>Injector: finished at 2015-10-01 18:27:26, elapsed: 00:00:02
>
>Thu Oct 1 18:27:26 PDT 2015 : Iteration 1 of 2
>
>Generating batchId
>
>Generating a new fetchlist
>
>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch generate -D
>mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>mapred.reduce.tasks.speculative.execution=false -D
>mapred.map.tasks.speculative.execution=false -D
>mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0
>-crawlId methods -batchId 1443749246-29495
>
>GeneratorJob: starting at 2015-10-01 18:27:26
>
>GeneratorJob: Selecting best-scoring urls due for fetch.
>
>GeneratorJob: starting
>
>GeneratorJob: filtering: false
>
>GeneratorJob: normalizing: false
>
>GeneratorJob: topN: 50000
>
>GeneratorJob: finished at 2015-10-01 18:27:29, time elapsed: 00:00:02
>
>GeneratorJob: generated batch id: 1443749246-1282586680 containing 5 URLs
>
>Fetching :
>
>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch fetch -D
>mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>mapred.reduce.tasks.speculative.execution=false -D
>mapred.map.tasks.speculative.execution=false -D
>mapred.compress.map.output=true -D fetcher.timelimit.mins=180
>1443749246-29495 -crawlId methods -threads 50
>
>FetcherJob: starting at 2015-10-01 18:27:29
>
>FetcherJob: batchId: 1443749246-29495
>
>FetcherJob: threads: 50
>
>FetcherJob: parsing: false
>
>FetcherJob: resuming: false
>
>FetcherJob : timelimit set for : 1443760049865
>
>Using queue mode : byHost
>
>Fetcher: threads: 50
>
>QueueFeeder finished: total 0 records. Hit by time limit :0
>
>-finishing thread FetcherThread0, activeThreads=0
>
>-finishing thread FetcherThread1, activeThreads=0
>
>-finishing thread FetcherThread2, activeThreads=0
>
>-finishing thread FetcherThread3, activeThreads=0
>
>-finishing thread FetcherThread4, activeThreads=0
>
>-finishing thread FetcherThread5, activeThreads=0
>
>-finishing thread FetcherThread6, activeThreads=0
>
>-finishing thread FetcherThread7, activeThreads=0
>
>-finishing thread FetcherThread8, activeThreads=0
>
>-finishing thread FetcherThread9, activeThreads=0
>
>-finishing thread FetcherThread10, activeThreads=0
>
>-finishing thread FetcherThread11, activeThreads=0
>
>-finishing thread FetcherThread12, activeThreads=0
>
>-finishing thread FetcherThread13, activeThreads=0
>
>-finishing thread FetcherThread14, activeThreads=0
>
>-finishing thread FetcherThread15, activeThreads=0
>
>-finishing thread FetcherThread16, activeThreads=0
>
>-finishing thread FetcherThread17, activeThreads=0
>
>-finishing thread FetcherThread18, activeThreads=0
>
>-finishing thread FetcherThread19, activeThreads=0
>
>-finishing thread FetcherThread20, activeThreads=0
>
>-finishing thread FetcherThread21, activeThreads=0
>
>-finishing thread FetcherThread22, activeThreads=0
>
>-finishing thread FetcherThread23, activeThreads=0
>
>-finishing thread FetcherThread25, activeThreads=0
>
>-finishing thread FetcherThread24, activeThreads=0
>
>-finishing thread FetcherThread26, activeThreads=0
>
>-finishing thread FetcherThread27, activeThreads=0
>
>-finishing thread FetcherThread28, activeThreads=0
>
>-finishing thread FetcherThread29, activeThreads=0
>
>-finishing thread FetcherThread30, activeThreads=0
>
>-finishing thread FetcherThread31, activeThreads=0
>
>-finishing thread FetcherThread32, activeThreads=0
>
>-finishing thread FetcherThread33, activeThreads=0
>
>-finishing thread FetcherThread34, activeThreads=0
>
>-finishing thread FetcherThread35, activeThreads=0
>
>-finishing thread FetcherThread36, activeThreads=0
>
>-finishing thread FetcherThread37, activeThreads=0
>
>-finishing thread FetcherThread38, activeThreads=0
>
>-finishing thread FetcherThread39, activeThreads=0
>
>-finishing thread FetcherThread40, activeThreads=0
>
>-finishing thread FetcherThread41, activeThreads=0
>
>-finishing thread FetcherThread42, activeThreads=0
>
>-finishing thread FetcherThread43, activeThreads=0
>
>-finishing thread FetcherThread44, activeThreads=0
>
>-finishing thread FetcherThread45, activeThreads=0
>
>-finishing thread FetcherThread46, activeThreads=0
>
>-finishing thread FetcherThread47, activeThreads=0
>
>-finishing thread FetcherThread48, activeThreads=0
>
>-finishing thread FetcherThread49, activeThreads=0
>
>Fetcher: throughput threshold: -1
>
>Fetcher: throughput threshold sequence: 5
>
>0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0
>URLs in 0 queues
>
>-activeThreads=0
>
>Using queue mode : byHost
>
>Fetcher: threads: 50
>
>QueueFeeder finished: total 0 records. Hit by time limit :0
>
>-finishing thread FetcherThread0, activeThreads=0
>
>-finishing thread FetcherThread1, activeThreads=0
>
>-finishing thread FetcherThread2, activeThreads=0
>
>-finishing thread FetcherThread3, activeThreads=0
>
>-finishing thread FetcherThread4, activeThreads=0
>
>-finishing thread FetcherThread5, activeThreads=0
>
>-finishing thread FetcherThread6, activeThreads=0
>
>-finishing thread FetcherThread7, activeThreads=0
>
>-finishing thread FetcherThread8, activeThreads=0
>
>-finishing thread FetcherThread9, activeThreads=0
>
>-finishing thread FetcherThread10, activeThreads=0
>
>-finishing thread FetcherThread11, activeThreads=0
>
>-finishing thread FetcherThread12, activeThreads=0
>
>-finishing thread FetcherThread13, activeThreads=0
>
>-finishing thread FetcherThread14, activeThreads=0
>
>-finishing thread FetcherThread15, activeThreads=0
>
>-finishing thread FetcherThread16, activeThreads=0
>
>-finishing thread FetcherThread17, activeThreads=0
>
>-finishing thread FetcherThread18, activeThreads=0
>
>-finishing thread FetcherThread19, activeThreads=0
>
>-finishing thread FetcherThread20, activeThreads=0
>
>-finishing thread FetcherThread21, activeThreads=0
>
>-finishing thread FetcherThread22, activeThreads=0
>
>-finishing thread FetcherThread23, activeThreads=0
>
>-finishing thread FetcherThread24, activeThreads=0
>
>-finishing thread FetcherThread25, activeThreads=0
>
>-finishing thread FetcherThread26, activeThreads=0
>
>-finishing thread FetcherThread27, activeThreads=0
>
>-finishing thread FetcherThread28, activeThreads=0
>
>-finishing thread FetcherThread29, activeThreads=0
>
>-finishing thread FetcherThread30, activeThreads=0
>
>-finishing thread FetcherThread31, activeThreads=0
>
>-finishing thread FetcherThread32, activeThreads=0
>
>-finishing thread FetcherThread33, activeThreads=0
>
>-finishing thread FetcherThread34, activeThreads=0
>
>-finishing thread FetcherThread35, activeThreads=0
>
>-finishing thread FetcherThread36, activeThreads=0
>
>-finishing thread FetcherThread37, activeThreads=0
>
>-finishing thread FetcherThread38, activeThreads=0
>
>-finishing thread FetcherThread39, activeThreads=0
>
>-finishing thread FetcherThread40, activeThreads=0
>
>-finishing thread FetcherThread41, activeThreads=0
>
>-finishing thread FetcherThread42, activeThreads=0
>
>-finishing thread FetcherThread43, activeThreads=0
>
>-finishing thread FetcherThread44, activeThreads=0
>
>-finishing thread FetcherThread45, activeThreads=0
>
>-finishing thread FetcherThread46, activeThreads=0
>
>-finishing thread FetcherThread47, activeThreads=0
>
>-finishing thread FetcherThread48, activeThreads=0
>
>Fetcher: throughput threshold: -1
>
>Fetcher: throughput threshold sequence: 5
>
>-finishing thread FetcherThread49, activeThreads=0
>
>0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0
>URLs in 0 queues
>
>-activeThreads=0
>
>FetcherJob: finished at 2015-10-01 18:27:42, time elapsed: 00:00:12
>
>Parsing :
>
>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch parse -D
>mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>mapred.reduce.tasks.speculative.execution=false -D
>mapred.map.tasks.speculative.execution=false -D
>mapred.compress.map.output=true -D
>mapred.skip.attempts.to.start.skipping=2 -D
>mapred.skip.map.max.skip.records=1 1443749246-29495 -crawlId methods
>
>ParserJob: starting at 2015-10-01 18:27:43
>
>ParserJob: resuming: false
>
>ParserJob: forced reparse: false
>
>ParserJob: batchId: 1443749246-29495
>
>ParserJob: success
>
>ParserJob: finished at 2015-10-01 18:27:45, time elapsed: 00:00:02
>
>CrawlDB update for methods
>
>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch updatedb -D
>mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>mapred.reduce.tasks.speculative.execution=false -D
>mapred.map.tasks.speculative.execution=false -D
>mapred.compress.map.output=true 1443749246-29495 -crawlId methods
>
>DbUpdaterJob: starting at 2015-10-01 18:27:46
>
>DbUpdaterJob: batchId: 1443749246-29495
>
>DbUpdaterJob: finished at 2015-10-01 18:27:48, time elapsed: 00:00:02
>
>Indexing methods on SOLR index -> http://127.0.0.1:8983/solr/
>
>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch index -D
>mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>mapred.reduce.tasks.speculative.execution=false -D
>mapred.map.tasks.speculative.execution=false -D
>mapred.compress.map.output=true -D
>solr.server.url=http://127.0.0.1:8983/solr/ -all -crawlId methods
>
>IndexingJob: starting
>
>Active IndexWriters :
>
>SOLRIndexWriter
>
>solr.server.url : URL of the SOLR instance (mandatory)
>
>solr.commit.size : buffer size when sending to SOLR (default 1000)
>
>solr.mapping.file : name of the mapping file for fields (default
>solrindex-mapping.xml)
>
>solr.auth : use authentication (default false)
>
>solr.auth.username : username for authentication
>
>solr.auth.password : password for authentication
>
>
>
>IndexingJob: done.
>
>SOLR dedup -> http://127.0.0.1:8983/solr/
>
>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch solrdedup -D
>mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>mapred.reduce.tasks.speculative.execution=false -D
>mapred.map.tasks.speculative.execution=false -D
>mapred.compress.map.output=true http://127.0.0.1:8983/solr/
>
>Thu Oct 1 18:27:54 PDT 2015 : Iteration 2 of 2
>
>Generating batchId
>
>Generating a new fetchlist
>
>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch generate -D
>mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>mapred.reduce.tasks.speculative.execution=false -D
>mapred.map.tasks.speculative.execution=false -D
>mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0
>-crawlId methods -batchId 1443749274-17203
>
>GeneratorJob: starting at 2015-10-01 18:27:55
>
>GeneratorJob: Selecting best-scoring urls due for fetch.
>
>GeneratorJob: starting
>
>GeneratorJob: filtering: false
>
>GeneratorJob: normalizing: false
>
>GeneratorJob: topN: 50000
>
>GeneratorJob: finished at 2015-10-01 18:27:57, time elapsed: 00:00:02
>
>GeneratorJob: generated batch id: 1443749275-2050785747 containing 0 URLs
>
>Generate returned 1 (no new segments created)
>
>Escaping loop: no more URLs to fetch now
>
>So no errors but also no data. What else can I debug?
>
>I see some warning in my hadoop.log but nothing alarming Š.
>
>2015-10-01 18:19:29,430 WARN util.NativeCodeLoader - Unable to load
>native-hadoop library for your platform... using builtin-java classes
>where applicable
>
>2015-10-01 18:19:29,441 INFO crawl.FetchScheduleFactory - Using
>FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>
>2015-10-01 18:19:29,441 INFO crawl.AbstractFetchSchedule -
>defaultInterval=2592000
>
>2015-10-01 18:19:29,442 INFO crawl.AbstractFetchSchedule -
>maxInterval=7776000
>
>2015-10-01 18:19:30,326 WARN conf.Configuration -
>file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1900181322/.staging/job_loc
>al1900181322_0001/job.xml:an attempt to override final parameter:
>mapreduce.job.end-notification.max.retry.interval; Ignoring.
>
>2015-10-01 18:19:30,327 WARN conf.Configuration -
>file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1900181322/.staging/job_loc
>al1900181322_0001/job.xml:an attempt to override final parameter:
>mapreduce.job.end-notification.max.attempts; Ignoring.
>
>2015-10-01 18:19:30,405 WARN conf.Configuration -
>file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1900181
>322_0001/job_local1900181322_0001.xml:an attempt to override final
>parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
>
>2015-10-01 18:19:30,406 WARN conf.Configuration -
>file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1900181
>322_0001/job_local1900181322_0001.xml:an attempt to override final
>parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
>
>…
>
>
>2015-10-01 18:27:23,838 WARN util.NativeCodeLoader - Unable to load
>native-hadoop library for your platform... using builtin-java classes
>where applicable
>
>2015-10-01 18:27:24,567 INFO crawl.InjectorJob - InjectorJob: Using
>class org.apache.gora.mongodb.store.MongoStore as the Gora storage class.
>
>2015-10-01 18:27:24,969 WARN conf.Configuration -
>file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1182157052/.staging/job_loc
>al1182157052_0001/job.xml:an attempt to override final parameter:
>mapreduce.job.end-notification.max.retry.interval; Ignoring.
>
>2015-10-01 18:27:24,971 WARN conf.Configuration -
>file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1182157052/.staging/job_loc
>al1182157052_0001/job.xml:an attempt to override final parameter:
>mapreduce.job.end-notification.max.attempts; Ignoring.
>
>2015-10-01 18:27:25,050 WARN conf.Configuration -
>file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1182157
>052_0001/job_local1182157052_0001.xml:an attempt to override final
>parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
>
>2015-10-01 18:27:25,052 WARN conf.Configuration -
>file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1182157
>052_0001/job_local1182157052_0001.xml:an attempt to override final
>parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
>
>
>2015-10-01 18:27:30,288 INFO httpclient.Http - http.proxy.host = null
>
>2015-10-01 18:27:30,288 INFO httpclient.Http - http.proxy.port = 8080
>
>2015-10-01 18:27:30,288 INFO httpclient.Http - http.timeout = 10000
>
>2015-10-01 18:27:30,288 INFO httpclient.Http - http.content.limit = 65536
>
>2015-10-01 18:27:30,288 INFO httpclient.Http - http.agent = nutch Mongo
>Solr Crawler/Nutch-2.4-SNAPSHOT
>
>2015-10-01 18:27:30,288 INFO httpclient.Http - http.accept.language =
>en-us,en-gb,en;q=0.7,*;q=0.3
>
>2015-10-01 18:27:30,288 INFO httpclient.Http - http.accept =
>text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
>
>2015-10-01 18:27:30,292 INFO httpclient.Http - http.proxy.host = null
>
>2015-10-01 18:27:30,292 INFO httpclient.Http - http.proxy.port = 8080
>
>2015-10-01 18:27:30,292 INFO httpclient.Http - http.timeout = 10000
>
>2015-10-01 18:27:30,292 INFO httpclient.Http - http.content.limit = 65536
>
>2015-10-01 18:27:30,292 INFO httpclient.Http - http.agent = nutch Mongo
>Solr Crawler/Nutch-2.4-SNAPSHOT
>
>2015-10-01 18:27:30,292 INFO httpclient.Http - http.accept.language =
>en-us,en-gb,en;q=0.7,*;q=0.3
>
>2015-10-01 18:27:30,292 INFO httpclient.Http - http.accept =
>text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
>
>I've been trying this for 3 days with no luck. I want to use nutch but
>may be forced to use another program.
>
>My best guess is maybe something is borked with my plugin.includes:
>
><property>
> <name>plugin.includes</name>
>
><value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
> <description>Regular expression naming plugin directory names to
>include. </description>
> </property>
>
>Are these valid? Is there a more minimal set to try?
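They look valid as far as the regex goes: plugin.includes is an unanchored regular expression that Nutch matches against plugin directory names. One way to sanity-check it is to run it against the ids you expect to load (the candidate list below is a hypothetical sample for illustration, not a listing of the plugins/ directory):

```python
import re

# plugin.includes value copied from nutch-site.xml above.
PLUGIN_INCLUDES = re.compile(
    r"protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|"
    r"index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|"
    r"scoring-opic|indexer-solr"
)

# Hypothetical sample of plugin directory names.
candidates = [
    "protocol-http", "parse-html", "parse-tika", "parse-tike",
    "indexer-solr", "scoring-opic", "lib-regex-filter",
]
included = [name for name in candidates if PLUGIN_INCLUDES.search(name)]
print(included)
# -> ['protocol-http', 'parse-html', 'parse-tika', 'indexer-solr', 'scoring-opic']
```

Note that a misspelled id such as "parse-tike" silently falls through the regex; a plugin name that fails to match is dropped without any error, which is consistent with a crawl that runs cleanly but produces nothing.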
>
>Cheers,
>Sherban
>
>
>__________________________________________________________________________
>
>This email message is for the sole use of the intended recipient(s) and
>may contain confidential information. Any unauthorized review, use,
>disclosure or distribution is prohibited. If you are not the intended
>recipient, please contact the sender by reply email and destroy all copies
>of the original message.
Re: nutch 2.3.1 doesn't crawl
Posted by "Drulea, Sherban" <sd...@rand.org>.
Fixed typo. Changed "parse-tike" to "parse-tika". Zero effect.
On 10/14/15, 12:24 PM, "Drulea, Sherban" <sd...@rand.org> wrote:
>No luck.
>
>I changed my parse-plugin.xml and still zero URLs parsed:
>
>parse-plugin.xml
>
>--------------------------------------
><?xml version="1.0" encoding="UTF-8"?>
>
><parse-plugins>
>
> <!-- by default if the mimeType is set to *, or
> if it can't be determined, use parse-tika -->
> <mimeType name="*">
> <plugin id="parse-tika" />
> </mimeType>
>
> <mimeType name="text/html">
> <plugin id="parse-tike" />
> </mimeType>
>
> <mimeType name="application/xhtml+xml">
> <plugin id="parse-tika" />
> </mimeType>
>
> <mimeType name="application/rss+xml">
> <plugin id="parse-tika" />
> <plugin id="feed" />
> </mimeType>
>
> <mimeType name="application/x-bzip2">
> <!-- try and parse it with the zip parser -->
> <plugin id="parse-zip" />
> </mimeType>
>
> <mimeType name="application/x-gzip">
> <!-- try and parse it with the zip parser -->
> <plugin id="parse-zip" />
> </mimeType>
>
> <mimeType name="application/x-javascript">
> <plugin id="parse-js" />
> </mimeType>
>
> <mimeType name="application/x-shockwave-flash">
> <plugin id="parse-swf" />
> </mimeType>
>
> <mimeType name="application/zip">
> <plugin id="parse-zip" />
> </mimeType>
>
> <mimeType name="text/xml">
> <plugin id="parse-tika" />
> <plugin id="feed" />
> </mimeType>
>
> <!-- Types for parse-ext plugin: required for unit tests to pass.
>-->
>
> <mimeType name="application/vnd.nutch.example.cat">
> <plugin id="parse-ext" />
> </mimeType>
>
> <mimeType name="application/vnd.nutch.example.md5sum">
> <plugin id="parse-ext" />
> </mimeType>
>
> <!-- alias mappings for parse-xxx names to the actual extension
>implementation
> ids described in each plugin's plugin.xml file -->
> <aliases>
> <alias name="parse-html"
> extension-id="org.apache.nutch.parse.html.HtmlParser" />
> <alias name="parse-tika"
> extension-id="org.apache.nutch.parse.tika.TikaParser" />
> <alias name="parse-ext" extension-id="ExtParser" />
> <alias name="parse-js" extension-id="JSParser" />
> <alias name="feed"
> extension-id="org.apache.nutch.parse.feed.FeedParser" />
> <alias name="parse-swf"
> extension-id="org.apache.nutch.parse.swf.SWFParser" />
> <alias name="parse-zip"
> extension-id="org.apache.nutch.parse.zip.ZipParser" />
> </aliases>
>
></parse-plugins>
>
>
>
>On 10/12/15, 8:34 PM, "cuongcm.inews" <cu...@tintuc.vn> wrote:
>
>>Have you tried changing parse-plugin.xml?
>><mimeType name="text/html">
>> <plugin id="parse-tika" />
>></mimeType>
>>it worked for me :)
>>
>>
>>
>>--
>>View this message in context:
>>http://lucene.472066.n3.nabble.com/nutch-2-3-1-doesn-t-crawl-tp4232374p4234192.html
>>Sent from the Nutch - User mailing list archive at Nabble.com.
>
>
Re: nutch 2.3.1 doesn't crawl
Posted by "Drulea, Sherban" <sd...@rand.org>.
No luck.
I changed my parse-plugin.xml and still zero URLs parsed:
parse-plugin.xml
--------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<parse-plugins>
<!-- by default if the mimeType is set to *, or
if it can't be determined, use parse-tika -->
<mimeType name="*">
<plugin id="parse-tika" />
</mimeType>
<mimeType name="text/html">
<plugin id="parse-tike" />
</mimeType>
<mimeType name="application/xhtml+xml">
<plugin id="parse-tika" />
</mimeType>
<mimeType name="application/rss+xml">
<plugin id="parse-tika" />
<plugin id="feed" />
</mimeType>
<mimeType name="application/x-bzip2">
<!-- try and parse it with the zip parser -->
<plugin id="parse-zip" />
</mimeType>
<mimeType name="application/x-gzip">
<!-- try and parse it with the zip parser -->
<plugin id="parse-zip" />
</mimeType>
<mimeType name="application/x-javascript">
<plugin id="parse-js" />
</mimeType>
<mimeType name="application/x-shockwave-flash">
<plugin id="parse-swf" />
</mimeType>
<mimeType name="application/zip">
<plugin id="parse-zip" />
</mimeType>
<mimeType name="text/xml">
<plugin id="parse-tika" />
<plugin id="feed" />
</mimeType>
<!-- Types for parse-ext plugin: required for unit tests to pass.
-->
<mimeType name="application/vnd.nutch.example.cat">
<plugin id="parse-ext" />
</mimeType>
<mimeType name="application/vnd.nutch.example.md5sum">
<plugin id="parse-ext" />
</mimeType>
<!-- alias mappings for parse-xxx names to the actual extension
implementation
ids described in each plugin's plugin.xml file -->
<aliases>
<alias name="parse-html"
extension-id="org.apache.nutch.parse.html.HtmlParser" />
<alias name="parse-tika"
extension-id="org.apache.nutch.parse.tika.TikaParser" />
<alias name="parse-ext" extension-id="ExtParser" />
<alias name="parse-js" extension-id="JSParser" />
<alias name="feed"
extension-id="org.apache.nutch.parse.feed.FeedParser" />
<alias name="parse-swf"
extension-id="org.apache.nutch.parse.swf.SWFParser" />
<alias name="parse-zip"
extension-id="org.apache.nutch.parse.zip.ZipParser" />
</aliases>
</parse-plugins>
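A one-character id typo like this is hard to spot by eye, so a small cross-check helps: every plugin id referenced by a mimeType should have a matching entry in the aliases section. A sketch, run against a trimmed excerpt of the file above:

```python
import xml.etree.ElementTree as ET

# Trimmed excerpt of the parse-plugin.xml above.
XML = """
<parse-plugins>
  <mimeType name="*"><plugin id="parse-tika"/></mimeType>
  <mimeType name="text/html"><plugin id="parse-tike"/></mimeType>
  <mimeType name="application/rss+xml">
    <plugin id="parse-tika"/><plugin id="feed"/>
  </mimeType>
  <aliases>
    <alias name="parse-tika" extension-id="org.apache.nutch.parse.tika.TikaParser"/>
    <alias name="feed" extension-id="org.apache.nutch.parse.feed.FeedParser"/>
  </aliases>
</parse-plugins>
"""

def unknown_plugin_ids(xml_text):
    """Return plugin ids referenced by a mimeType but missing from <aliases>."""
    root = ET.fromstring(xml_text)
    aliases = {a.get("name") for a in root.find("aliases")}
    used = {p.get("id") for m in root.findall("mimeType") for p in m.findall("plugin")}
    return sorted(used - aliases)

print(unknown_plugin_ids(XML))  # -> ['parse-tike']
```

An id with no alias is silently unresolvable, so flagging the difference catches exactly this kind of typo.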
On 10/12/15, 8:34 PM, "cuongcm.inews" <cu...@tintuc.vn> wrote:
>Have you tried changing parse-plugin.xml?
><mimeType name="text/html">
> <plugin id="parse-tika" />
></mimeType>
>it worked for me :)
>
>
>
Re: nutch 2.3.1 doesn't crawl
Posted by "cuongcm.inews" <cu...@tintuc.vn>.
Have you tried changing parse-plugin.xml?
<mimeType name="text/html">
<plugin id="parse-tika" />
</mimeType>
it worked for me :)