Posted to dev@nutch.apache.org by "Pham Tuan Minh (JIRA)" <ji...@apache.org> on 2010/07/14 19:43:50 UTC

[jira] Commented: (NUTCH-852) parser not found for contentType=application/xhtml+xml

    [ https://issues.apache.org/jira/browse/NUTCH-852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888448#action_12888448 ] 

Pham Tuan Minh commented on NUTCH-852:
--------------------------------------

Hi Julien,

Thanks for your support!

I checked again; I had missed some plugins in the plugin.includes property in nutch-site.xml. As shipped, this file contains no properties and no suggestions, so it is quite difficult for end users. I will open a separate improvement issue to add them.
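
For anyone hitting the same error, a minimal nutch-site.xml override could look like this sketch (the plugin list mirrors the default from nutch-default.xml and is illustrative; the key point is that parse-tika is included, since it is the plugin that handles application/xhtml+xml):

<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- parse-tika must be listed here, or no parser is found for application/xhtml+xml -->
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>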

Thanks,

> parser not found for contentType=application/xhtml+xml
> ------------------------------------------------------
>
>                 Key: NUTCH-852
>                 URL: https://issues.apache.org/jira/browse/NUTCH-852
>             Project: Nutch
>          Issue Type: Bug
>         Environment: Windows XP SP3, Cygwin
>            Reporter: Pham Tuan Minh
>            Assignee: Julien Nioche
>             Fix For: 2.0
>
>
> I configured Nutch trunk to crawl a sample site (http://www.lucidimagination.com/) and post the results to a Solr server for indexing; however, I got the following error. It seems the Tika parser is not working properly, or the Tika libraries are not recognized! (See the parse-plugins.xml sketch after the log.)
> ----------------------
> $ bin/nutch-local crawl urls -solr http://127.0.0.1:8983/solr/ -dir crawl -depth 3 -topN 50
> crawl started in: crawl
> rootUrlDir = urls
> threads = 10
> depth = 3
> solrUrl=http://127.0.0.1:8983/solr/
> topN = 50
> Injector: starting at 2010-07-14 02:08:20
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2010-07-14 02:08:31, elapsed: 00:00:11
> Generator: starting at 2010-07-14 02:08:32
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 50
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: crawl/segments/20100714020838
> Generator: finished at 2010-07-14 02:08:42, elapsed: 00:00:10
> Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
> Fetcher: starting at 2010-07-14 02:08:42
> Fetcher: segment: crawl/segments/20100714020838
> Fetcher: threads: 10
> QueueFeeder finished: total 1 records + hit by time limit :0
> fetching http://www.lucidimagination.com/
> -finishing thread FetcherThread, activeThreads=4
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=2
> -finishing thread FetcherThread, activeThreads=3
> -finishing thread FetcherThread, activeThreads=5
> -finishing thread FetcherThread, activeThreads=7
> -finishing thread FetcherThread, activeThreads=8
> -finishing thread FetcherThread, activeThreads=9
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> Error parsing: http://www.lucidimagination.com/: org.apache.nutch.parse.ParseException: parser not found for contentType=application/xhtml+xml url=http://www.lucidimagination.com/
>         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
>         at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:879)
>         at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:647)
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2010-07-14 02:08:54, elapsed: 00:00:12
> CrawlDb update: starting at 2010-07-14 02:08:54
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20100714020838]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2010-07-14 02:09:01, elapsed: 00:00:07
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 50
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=1 - no more URLs to fetch.
> LinkDb: starting at 2010-07-14 02:09:06
> LinkDb: linkdb: crawl/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: file:/D:/work/workspace/nutch/runtime/local/crawl/segments/20100714014136
> LinkDb: adding segment: file:/D:/work/workspace/nutch/runtime/local/crawl/segments/20100714015544
> LinkDb: adding segment: file:/D:/work/workspace/nutch/runtime/local/crawl/segments/20100714020206
> LinkDb: adding segment: file:/D:/work/workspace/nutch/runtime/local/crawl/segments/20100714020232
> LinkDb: adding segment: file:/D:/work/workspace/nutch/runtime/local/crawl/segments/20100714020838
> LinkDb: merging with existing linkdb: crawl/linkdb
> LinkDb: finished at 2010-07-14 02:09:19, elapsed: 00:00:12
> SolrIndexer: starting at 2010-07-14 02:09:19
> SolrIndexer: finished at 2010-07-14 02:09:36, elapsed: 00:00:17
> SolrDeleteDuplicates: starting at 2010-07-14 02:09:41
> SolrDeleteDuplicates: Solr url: http://127.0.0.1:8983/solr/
> SolrDeleteDuplicates: finished at 2010-07-14 02:09:45, elapsed: 00:00:04
> crawl finished: crawl
> ----------------------
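>
> For reference, Nutch decides which parser to run for a given content type via conf/parse-plugins.xml, which maps MIME types to parser plugin ids (a plugin must also be activated through plugin.includes before the mapping takes effect). A minimal sketch of the entries that route XHTML to Tika:
>
> <parse-plugins>
>   <!-- route XHTML explicitly to the Tika parser plugin -->
>   <mimeType name="application/xhtml+xml">
>     <plugin id="parse-tika" />
>   </mimeType>
>   <!-- fallback: let Tika try any other content type -->
>   <mimeType name="*">
>     <plugin id="parse-tika" />
>   </mimeType>
>   <aliases>
>     <alias name="parse-tika" extension-id="org.apache.nutch.parse.tika.TikaParser" />
>   </aliases>
> </parse-plugins>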
> Thanks

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.