Posted to user@nutch.apache.org by "Meiping Wang(Amelia)" <me...@hengtiansoft.com> on 2013/02/04 06:11:41 UTC

nutch issue: error parsing

Hey:
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 2
solrUrl=null
topN = 5
Injector: starting at 2013-02-04 13:05:18
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-02-04 13:05:33, elapsed: 00:00:14
Generator: starting at 2013-02-04 13:05:33
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20130204130541
Generator: finished at 2013-02-04 13:05:48, elapsed: 00:00:15
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2013-02-04 13:05:48
Fetcher: segment: crawl/segments/20130204130541
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
fetching http://nutch.apache.org/ (queue crawl delay=5000ms)
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2013-02-04 13:05:58, elapsed: 00:00:10
ParseSegment: starting at 2013-02-04 13:05:58
ParseSegment: segment: crawl/segments/20130204130541
Error parsing: http://nutch.apache.org/: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
Parsed (15ms):http://nutch.apache.org/
ParseSegment: finished at 2013-02-04 13:06:05, elapsed: 00:00:07
CrawlDb update: starting at 2013-02-04 13:06:05
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20130204130541]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2013-02-04 13:06:18, elapsed: 00:00:13
Generator: starting at 2013-02-04 13:06:18
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting at 2013-02-04 13:06:25
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: file:/E:/SearchEngine/workspace/Nutch2.1/crawl/segments/20130204130541
LinkDb: finished at 2013-02-04 13:06:32, elapsed: 00:00:07
crawl finished: crawl

After running Nutch 2.1 in Eclipse (on Windows), some problems showed up, marked in red above. Can anybody give me the right instructions?

Best Regards
Amelia (Meiping Wang)

Re: nutch issue: error parsing

Posted by Tejas Patil <te...@gmail.com>.
Can you provide the stack trace from the logs? (In local mode it usually ends
up in logs/hadoop.log.)
Also, please share the value of the "plugin.includes" property from
nutch-site.xml and nutch-default.xml (ideally the latter should not be
modified and changes should be made to the former... but sometimes people do
it accidentally).
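For reference, an override in nutch-site.xml normally looks something like
this (the value below is only an illustration based on the stock default;
your actual list may differ):

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>

If parse-(html|tika) is missing from that regex, or the plugins are not being
loaded when you run from Eclipse (check the "plugin.folders" property points
at your plugins directory), a ParseException like the one above is the usual
symptom.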

Thanks,
Tejas Patil

