Posted to user@nutch.apache.org by "Drulea, Sherban" <sd...@rand.org> on 2015/10/02 03:39:32 UTC

nutch 2.3.1 doesn't crawl

Hi All,

Thanks for pointing me to the 2.3.1 release. It runs without errors but doesn’t actually crawl anything, and I’m out of ideas as to why.

Here’s my environment:

java version "1.8.0_60"

Java(TM) SE Runtime Environment (build 1.8.0_60-b27)

Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)

SOLR 4.6.0
MongoDB 3.0.2
Nutch 2.3.1

My regex-urlfilter.txt:
———————————————
+.
———————————————

nutch-site.xml
———————————————
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

    <property>
        <name>http.agent.name</name>
        <value>nutch Mongo Solr Crawler</value>
    </property>

    <property>
        <name>storage.data.store.class</name>
        <value>org.apache.gora.mongodb.store.MongoStore</value>
        <description>Default class for storing data</description>
    </property>

    <property>
        <name>plugin.includes</name>
        <value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
        <description>Regular expression naming plugin directory names to include. </description>
   </property>

</configuration>

———————————————

gora.properties:
———————————————
############################
# MongoDBStore properties  #
############################
gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.override_hadoop_configuration=false
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=localhost:27017
gora.mongodb.db=method_centers
———————————————
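
For what it's worth, a quick sanity check that Mongo is reachable with these settings (assuming the mongo shell is on the PATH; host and db taken from gora.properties above):

———————————————
mongo localhost:27017/method_centers --eval 'printjson(db.stats())'
———————————————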

Seed.txt
———————————————
http://punklawyer.com/
http://mail-archives.apache.org/mod_mbox/nutch-user/
http://hbase.apache.org/index.html
http://wiki.apache.org/nutch/FrontPage
http://www.aintitcool.com/
———————————————

Here are the results of the crawl command "./bin/crawl urls methods http://127.0.0.1:8983/solr/ 2":
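
As I understand the 2.3.1 crawl script, the positional arguments are the seed directory, crawl id, Solr URL, and number of rounds:

———————————————
# crawl <seedDir> <crawlId> <solrURL> <numberOfRounds>
./bin/crawl urls methods http://127.0.0.1:8983/solr/ 2
———————————————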

Injecting seed URLs

/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch inject urls -crawlId methods

InjectorJob: starting at 2015-10-01 18:27:23

InjectorJob: Injecting urlDir: urls

InjectorJob: Using class org.apache.gora.mongodb.store.MongoStore as the Gora storage class.

InjectorJob: total number of urls rejected by filters: 0

InjectorJob: total number of urls injected after normalization and filtering: 5

Injector: finished at 2015-10-01 18:27:26, elapsed: 00:00:02

Thu Oct 1 18:27:26 PDT 2015 : Iteration 1 of 2

Generating batchId

Generating a new fetchlist

/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0 -crawlId methods -batchId 1443749246-29495

GeneratorJob: starting at 2015-10-01 18:27:26

GeneratorJob: Selecting best-scoring urls due for fetch.

GeneratorJob: starting

GeneratorJob: filtering: false

GeneratorJob: normalizing: false

GeneratorJob: topN: 50000

GeneratorJob: finished at 2015-10-01 18:27:29, time elapsed: 00:00:02

GeneratorJob: generated batch id: 1443749246-1282586680 containing 5 URLs

Fetching :

/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch fetch -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D fetcher.timelimit.mins=180 1443749246-29495 -crawlId methods -threads 50

FetcherJob: starting at 2015-10-01 18:27:29

FetcherJob: batchId: 1443749246-29495

FetcherJob: threads: 50

FetcherJob: parsing: false

FetcherJob: resuming: false

FetcherJob : timelimit set for : 1443760049865

Using queue mode : byHost

Fetcher: threads: 50

QueueFeeder finished: total 0 records. Hit by time limit :0

-finishing thread FetcherThread0, activeThreads=0

….

-finishing thread FetcherThread49, activeThreads=0

Fetcher: throughput threshold: -1

Fetcher: throughput threshold sequence: 5

0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues

-activeThreads=0

Using queue mode : byHost

Fetcher: threads: 50

QueueFeeder finished: total 0 records. Hit by time limit :0

-finishing thread FetcherThread0, activeThreads=0

….

-finishing thread FetcherThread48, activeThreads=0

Fetcher: throughput threshold: -1

Fetcher: throughput threshold sequence: 5

-finishing thread FetcherThread49, activeThreads=0

0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues

-activeThreads=0

FetcherJob: finished at 2015-10-01 18:27:42, time elapsed: 00:00:12

Parsing :

/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch parse -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D mapred.skip.attempts.to.start.skipping=2 -D mapred.skip.map.max.skip.records=1 1443749246-29495 -crawlId methods

ParserJob: starting at 2015-10-01 18:27:43

ParserJob: resuming: false

ParserJob: forced reparse: false

ParserJob: batchId: 1443749246-29495

ParserJob: success

ParserJob: finished at 2015-10-01 18:27:45, time elapsed: 00:00:02

CrawlDB update for methods

/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 1443749246-29495 -crawlId methods

DbUpdaterJob: starting at 2015-10-01 18:27:46

DbUpdaterJob: batchId: 1443749246-29495

DbUpdaterJob: finished at 2015-10-01 18:27:48, time elapsed: 00:00:02

Indexing methods on SOLR index -> http://127.0.0.1:8983/solr/

/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D solr.server.url=http://127.0.0.1:8983/solr/ -all -crawlId methods

IndexingJob: starting

Active IndexWriters :

SOLRIndexWriter

solr.server.url : URL of the SOLR instance (mandatory)

solr.commit.size : buffer size when sending to SOLR (default 1000)

solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)

solr.auth : use authentication (default false)

solr.auth.username : username for authentication

solr.auth.password : password for authentication



IndexingJob: done.

SOLR dedup -> http://127.0.0.1:8983/solr/

/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch solrdedup -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true http://127.0.0.1:8983/solr/

Thu Oct 1 18:27:54 PDT 2015 : Iteration 2 of 2

Generating batchId

Generating a new fetchlist

/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0 -crawlId methods -batchId 1443749274-17203

GeneratorJob: starting at 2015-10-01 18:27:55

GeneratorJob: Selecting best-scoring urls due for fetch.

GeneratorJob: starting

GeneratorJob: filtering: false

GeneratorJob: normalizing: false

GeneratorJob: topN: 50000

GeneratorJob: finished at 2015-10-01 18:27:57, time elapsed: 00:00:02

GeneratorJob: generated batch id: 1443749275-2050785747 containing 0 URLs

Generate returned 1 (no new segments created)

Escaping loop: no more URLs to fetch now

So no errors but also no data. What else can I debug?
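
One thing I may try is turning up the log level for the generate and fetch jobs; a sketch, assuming the stock conf/log4j.properties layout where cmdstdout is the console appender:

———————————————
# in conf/log4j.properties, raise these loggers from INFO to DEBUG
log4j.logger.org.apache.nutch.crawl.GeneratorJob=DEBUG,cmdstdout
log4j.logger.org.apache.nutch.fetcher.FetcherJob=DEBUG,cmdstdout
———————————————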

I see some warnings in my hadoop.log but nothing alarming…

2015-10-01 18:19:29,430 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

2015-10-01 18:19:29,441 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule

2015-10-01 18:19:29,441 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000

2015-10-01 18:19:29,442 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000

2015-10-01 18:19:30,326 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1900181322/.staging/job_local1900181322_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.

2015-10-01 18:19:30,327 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1900181322/.staging/job_local1900181322_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.

2015-10-01 18:19:30,405 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1900181322_0001/job_local1900181322_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.

2015-10-01 18:19:30,406 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1900181322_0001/job_local1900181322_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.

….


2015-10-01 18:27:23,838 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

2015-10-01 18:27:24,567 INFO  crawl.InjectorJob - InjectorJob: Using class org.apache.gora.mongodb.store.MongoStore as the Gora storage class.

2015-10-01 18:27:24,969 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1182157052/.staging/job_local1182157052_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.

2015-10-01 18:27:24,971 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1182157052/.staging/job_local1182157052_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.

2015-10-01 18:27:25,050 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1182157052_0001/job_local1182157052_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.

2015-10-01 18:27:25,052 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1182157052_0001/job_local1182157052_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.


2015-10-01 18:27:30,288 INFO  httpclient.Http - http.proxy.host = null

2015-10-01 18:27:30,288 INFO  httpclient.Http - http.proxy.port = 8080

2015-10-01 18:27:30,288 INFO  httpclient.Http - http.timeout = 10000

2015-10-01 18:27:30,288 INFO  httpclient.Http - http.content.limit = 65536

2015-10-01 18:27:30,288 INFO  httpclient.Http - http.agent = nutch Mongo Solr Crawler/Nutch-2.4-SNAPSHOT

2015-10-01 18:27:30,288 INFO  httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3

2015-10-01 18:27:30,288 INFO  httpclient.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

2015-10-01 18:27:30,292 INFO  httpclient.Http - http.proxy.host = null

2015-10-01 18:27:30,292 INFO  httpclient.Http - http.proxy.port = 8080

2015-10-01 18:27:30,292 INFO  httpclient.Http - http.timeout = 10000

2015-10-01 18:27:30,292 INFO  httpclient.Http - http.content.limit = 65536

2015-10-01 18:27:30,292 INFO  httpclient.Http - http.agent = nutch Mongo Solr Crawler/Nutch-2.4-SNAPSHOT

2015-10-01 18:27:30,292 INFO  httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3

2015-10-01 18:27:30,292 INFO  httpclient.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

I’ve been trying this for 3 days with no luck. I want to use Nutch but may be forced to use another program.

My best guess is that something is borked in my plugin.includes:

<property>
        <name>plugin.includes</name>
        <value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
        <description>Regular expression naming plugin directory names to include. </description>
   </property>

Are these valid? Is there a more minimal set to try?
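
The most minimal set I can think of trying (a guess, untested) drops httpclient, tika, and the anchor indexer:

———————————————
<property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-html|index-basic|urlnormalizer-basic|scoring-opic|indexer-solr</value>
</property>
———————————————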

Cheers,
Sherban


__________________________________________________________________________

This email message is for the sole use of the intended recipient(s) and
may contain confidential information. Any unauthorized review, use,
disclosure or distribution is prohibited. If you are not the intended
recipient, please contact the sender by reply email and destroy all copies
of the original message.

Re: nutch 2.3.1 doesn't crawl

Posted by "Drulea, Sherban" <sd...@rand.org>.
Seems like the problem is with the generator. It doesn't generate any links to crawl. Is there any way to debug why the generator doesn't work?
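
One detail I notice in the first iteration's log: generate was invoked with -batchId 1443749246-29495, but GeneratorJob reported "generated batch id: 1443749246-1282586680 containing 5 URLs", and fetch then ran against 1443749246-29495. So it may be worth looking directly at the webpage collection in Mongo to see which batch markers actually got set on the injected rows; a sketch, assuming the default Gora mapping (collection named <crawlId>_webpage, with the marker fields stored under "markers"):

———————————————
mongo localhost:27017/method_centers --eval 'db.methods_webpage.count()'
mongo localhost:27017/method_centers --eval 'printjson(db.methods_webpage.findOne({}, {baseUrl: 1, status: 1, markers: 1}))'
———————————————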





Re: nutch 2.3.1 doesn't crawl

Posted by "Drulea, Sherban" <sd...@rand.org>.
Fixed typo. Changed "parse-tike" to "parse-tika". Zero effect.
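
If it helps, I can also check the parse step directly against one seed; my understanding is that 2.x builds ship a parsechecker tool:

———————————————
./bin/nutch parsechecker http://punklawyer.com/
———————————————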




Re: nutch 2.3.1 doesn't crawl

Posted by "Drulea, Sherban" <sd...@rand.org>.
No luck.

I changed my parse-plugins.xml and still zero URLs parsed:

parse-plugins.xml

--------------------------------------
<?xml version="1.0" encoding="UTF-8"?>

<parse-plugins>

  <!--  by default if the mimeType is set to *, or
        if it can't be determined, use parse-tika -->
	<mimeType name="*">
	  <plugin id="parse-tika" />
	</mimeType>

	<mimeType name="text/html">
		<plugin id="parse-tike" />
	</mimeType>

        <mimeType name="application/xhtml+xml">
		<plugin id="parse-tika" />
	</mimeType>

	<mimeType name="application/rss+xml">
	    <plugin id="parse-tika" />
	    <plugin id="feed" />
	</mimeType>

	<mimeType name="application/x-bzip2">
		<!--  try and parse it with the zip parser -->
		<plugin id="parse-zip" />
	</mimeType>

	<mimeType name="application/x-gzip">
		<!--  try and parse it with the zip parser -->
		<plugin id="parse-zip" />
	</mimeType>

	<mimeType name="application/x-javascript">
		<plugin id="parse-js" />
	</mimeType>

	<mimeType name="application/x-shockwave-flash">
		<plugin id="parse-swf" />
	</mimeType>

	<mimeType name="application/zip">
		<plugin id="parse-zip" />
	</mimeType>

	<mimeType name="text/xml">
		<plugin id="parse-tika" />
		<plugin id="feed" />
	</mimeType>

       <!-- Types for parse-ext plugin: required for unit tests to pass. -->

	<mimeType name="application/vnd.nutch.example.cat">
		<plugin id="parse-ext" />
	</mimeType>

	<mimeType name="application/vnd.nutch.example.md5sum">
		<plugin id="parse-ext" />
	</mimeType>

	<!--  alias mappings for parse-xxx names to the actual extension implementation
	ids described in each plugin's plugin.xml file -->
	<aliases>
		<alias name="parse-html"
			extension-id="org.apache.nutch.parse.html.HtmlParser" />
		<alias name="parse-tika"
			extension-id="org.apache.nutch.parse.tika.TikaParser" />
		<alias name="parse-ext" extension-id="ExtParser" />
		<alias name="parse-js" extension-id="JSParser" />
		<alias name="feed"
			extension-id="org.apache.nutch.parse.feed.FeedParser" />
		<alias name="parse-swf"
			extension-id="org.apache.nutch.parse.swf.SWFParser" />
		<alias name="parse-zip"
			extension-id="org.apache.nutch.parse.zip.ZipParser" />
	</aliases>
	
</parse-plugins>
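
I also plan to dump the webpage table to see what actually got stored; I believe the 2.x readdb command supports something like:

———————————————
./bin/nutch readdb -stats -crawlId methods
./bin/nutch readdb -dump dump_dir -crawlId methods
———————————————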







Re: nutch 2.3.1 doesn't crawl

Posted by "cuongcm.inews" <cu...@tintuc.vn>.
Have you tried changing parse-plugins.xml?
<mimeType name="text/html">
	<plugin id="parse-tika" />
</mimeType>
It worked for me :)



--
View this message in context: http://lucene.472066.n3.nabble.com/nutch-2-3-1-doesn-t-crawl-tp4232374p4234192.html
Sent from the Nutch - User mailing list archive at Nabble.com.