Posted to dev@nutch.apache.org by "Pham Tuan Minh (JIRA)" <ji...@apache.org> on 2010/07/13 21:29:53 UTC

[jira] Created: (NUTCH-852) parser not found for contentType=application/xhtml+xml

parser not found for contentType=application/xhtml+xml
------------------------------------------------------

                 Key: NUTCH-852
                 URL: https://issues.apache.org/jira/browse/NUTCH-852
             Project: Nutch
          Issue Type: Bug
         Environment: Windows XP SP3, Cygwin
            Reporter: Pham Tuan Minh
             Fix For: 2.0


I configured Nutch trunk to crawl a sample site (http://www.lucidimagination.com/) and post the results to a Solr server for indexing; however, I got the following error. It seems the Tika parser is not working properly, or the Tika libraries are not being picked up.
----------------------
$ bin/nutch-local crawl urls -solr http://127.0.0.1:8983/solr/ -dir crawl -depth 3 -topN 50
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=http://127.0.0.1:8983/solr/
topN = 50
Injector: starting at 2010-07-14 02:08:20
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2010-07-14 02:08:31, elapsed: 00:00:11
Generator: starting at 2010-07-14 02:08:32
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20100714020838
Generator: finished at 2010-07-14 02:08:42, elapsed: 00:00:10
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2010-07-14 02:08:42
Fetcher: segment: crawl/segments/20100714020838
Fetcher: threads: 10
QueueFeeder finished: total 1 records + hit by time limit :0
fetching http://www.lucidimagination.com/
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=9
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
Error parsing: http://www.lucidimagination.com/: org.apache.nutch.parse.ParseException: parser not found for contentType=application/xhtml+xml url=http://www.lucidimagination.com/
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:879)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:647)

-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2010-07-14 02:08:54, elapsed: 00:00:12
CrawlDb update: starting at 2010-07-14 02:08:54
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20100714020838]
CrawlDb update: additions allowed: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2010-07-14 02:09:01, elapsed: 00:00:07
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting at 2010-07-14 02:09:06
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/D:/work/workspace/nutch/runtime/local/crawl/segments/20100714014136
LinkDb: adding segment: file:/D:/work/workspace/nutch/runtime/local/crawl/segments/20100714015544
LinkDb: adding segment: file:/D:/work/workspace/nutch/runtime/local/crawl/segments/20100714020206
LinkDb: adding segment: file:/D:/work/workspace/nutch/runtime/local/crawl/segments/20100714020232
LinkDb: adding segment: file:/D:/work/workspace/nutch/runtime/local/crawl/segments/20100714020838
LinkDb: merging with existing linkdb: crawl/linkdb
LinkDb: finished at 2010-07-14 02:09:19, elapsed: 00:00:12
SolrIndexer: starting at 2010-07-14 02:09:19
SolrIndexer: finished at 2010-07-14 02:09:36, elapsed: 00:00:17
SolrDeleteDuplicates: starting at 2010-07-14 02:09:41
SolrDeleteDuplicates: Solr url: http://127.0.0.1:8983/solr/
SolrDeleteDuplicates: finished at 2010-07-14 02:09:45, elapsed: 00:00:04
crawl finished: crawl
----------------------
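
For context, the exception above is raised in org.apache.nutch.parse.ParseUtil when no active parsing plugin claims the fetched content type. Which parsers are active is governed by the plugin.includes property in conf/nutch-site.xml, while the mapping from content type to parser plugin lives in conf/parse-plugins.xml. An illustrative excerpt of that mapping, assuming a stock trunk checkout (the exact entries vary by version):

----------------------
<parse-plugins>
  <mimeType name="application/xhtml+xml">
    <plugin id="parse-tika" />
  </mimeType>
  <!-- wildcard fallback routed to the Tika parser -->
  <mimeType name="*">
    <plugin id="parse-tika" />
  </mimeType>
</parse-plugins>
----------------------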

Thanks

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (NUTCH-852) parser not found for contentType=application/xhtml+xml

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche resolved NUTCH-852.
---------------------------------

      Assignee: Julien Nioche
    Resolution: Cannot Reproduce

I ran the crawl command using the latest trunk and did not get the parse error. I also tried parsing directly with Tika and everything went fine. Could you check that your configuration does not differ from the trunk? Thanks 
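
A quick way to exercise the parse step in isolation, assuming a trunk checkout run from runtime/local: bin/nutch parsechecker drives org.apache.nutch.parse.ParserChecker against a single URL, so a misconfigured parser shows up without running a full crawl. To take Nutch out of the picture entirely, a saved copy of the page can be fed to the standalone tika-app jar (the jar version below is illustrative):

----------------------
# fetch and parse one URL through Nutch's configured parsers
$ bin/nutch parsechecker http://www.lucidimagination.com/

# parse a saved copy of the page with Tika alone
$ java -jar tika-app-0.7.jar --text page.html
----------------------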



[jira] Commented: (NUTCH-852) parser not found for contentType=application/xhtml+xml

Posted by "Pham Tuan Minh (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888448#action_12888448 ] 

Pham Tuan Minh commented on NUTCH-852:
--------------------------------------

Hi Julien,

Thanks for your support!

I checked again: I had missed some plugins in the plugin.includes property in nutch-site.xml. Currently, this file ships with no properties or suggestions, so it is quite difficult for end users. I will open a separate improvement issue for this.
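
Concretely, plugin.includes in conf/nutch-site.xml overrides the default from conf/nutch-default.xml, so it must itself list a parser for (X)HTML. A sketch of such a property follows; the value is illustrative, so copy the current default from nutch-default.xml and make sure parse-tika (or parse-html) is included:

----------------------
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
----------------------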

Thanks,
