Posted to user@nutch.apache.org by Arcondo Dasilva <ar...@gmail.com> on 2012/12/27 22:26:01 UTC

Native Hadoop library not loaded and Cannot parse sites contents

Hello,

When I read the hadoop.log, a lot of things seem to be going wrong. I
googled the errors, but this kind of problem does not seem to be reported
very often. Could you please help me figure out how to get rid of them?

Thanks in advance for your help,

I have also attached nutch-site.xml and regex-urlfilter.txt.

My crawl command: *bin/nutch crawl urls -depth 3 -topN 500 -threads 10*

My hadoop.log:

uname -a : *Linux drupal7 2.6.32-5-686 #1 SMP Sun May 6 04:01:19 UTC 2012
i686 GNU/Linux*
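
The repeated NoClassDefFoundError for org/cyberneko/html/HTMLComponent in the log below usually means the CyberNeko HTML jar (typically named nekohtml-<version>.jar) is missing from the plugin tree. A quick way to check, using the plugin path the log reports (adjust if yours differs):

```shell
# Check whether any nekohtml jar exists under the Nutch plugin directory.
# The parse-html / lib-nekohtml plugins need it to load HTMLComponent.
PLUGIN_DIR="/opt/nutch21/runtime/local/plugins"   # path from the log below
hits=$(find "$PLUGIN_DIR" -name 'nekohtml*.jar' 2>/dev/null)
if [ -n "$hits" ]; then
  echo "nekohtml jar present:"
  echo "$hits"
else
  echo "no nekohtml jar under $PLUGIN_DIR - parse-html cannot load HTMLComponent"
fi
```

If the jar is absent, restoring it under the lib-nekohtml plugin directory (or rebuilding the runtime with `ant runtime`) would be the thing to try first.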

------------------------------------------------------------------------

2012-12-27 20:47:23,152 WARN  util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2012-12-27 20:47:23,239 WARN  snappy.LoadSnappy - Snappy native library not
loaded
2012-12-27 20:47:23,686 INFO  mapreduce.GoraRecordWriter -
gora.buffer.write.limit = 10000
2012-12-27 20:47:23,691 INFO  plugin.PluginRepository - Plugins: looking
in: /opt/nutch21/runtime/local/plugins
2012-12-27 20:47:24,035 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2012-12-27 20:47:24,035 INFO  plugin.PluginRepository - Registered Plugins:
2012-12-27 20:47:24,035 INFO  plugin.PluginRepository - the nutch core
extension points (nutch-extensionpoints)
2012-12-27 20:47:24,035 INFO  plugin.PluginRepository - Basic URL
Normalizer (urlnormalizer-basic)
2012-12-27 20:47:24,035 INFO  plugin.PluginRepository - Html Parse Plug-in
(parse-html)
2012-12-27 20:47:24,035 INFO  plugin.PluginRepository - Basic Indexing
Filter (index-basic)
2012-12-27 20:47:24,035 INFO  plugin.PluginRepository - HTTP Framework
(lib-http)
2012-12-27 20:47:24,035 INFO  plugin.PluginRepository - Pass-through URL
Normalizer (urlnormalizer-pass)
2012-12-27 20:47:24,035 INFO  plugin.PluginRepository - Regex URL Filter
(urlfilter-regex)
2012-12-27 20:47:24,035 INFO  plugin.PluginRepository - Http Protocol
Plug-in (protocol-http)
2012-12-27 20:47:24,035 INFO  plugin.PluginRepository - Regex URL
Normalizer (urlnormalizer-regex)
2012-12-27 20:47:24,035 INFO  plugin.PluginRepository - Tika Parser Plug-in
(parse-tika)
2012-12-27 20:47:24,035 INFO  plugin.PluginRepository - OPIC Scoring
Plug-in (scoring-opic)
2012-12-27 20:47:24,035 INFO  plugin.PluginRepository - CyberNeko HTML
Parser (lib-nekohtml)
2012-12-27 20:47:24,035 INFO  plugin.PluginRepository - Anchor Indexing
Filter (index-anchor)
2012-12-27 20:47:24,035 INFO  plugin.PluginRepository - Regex URL Filter
Framework (lib-regex-filter)
2012-12-27 20:47:24,035 INFO  plugin.PluginRepository - Registered
Extension-Points:
2012-12-27 20:47:24,035 INFO  plugin.PluginRepository - Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2012-12-27 20:47:24,035 INFO  plugin.PluginRepository - Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2012-12-27 20:47:24,035 INFO  plugin.PluginRepository - Parse Filter
(org.apache.nutch.parse.ParseFilter)
2012-12-27 20:47:24,035 INFO  plugin.PluginRepository - Nutch URL Filter
(org.apache.nutch.net.URLFilter)
2012-12-27 20:47:24,035 INFO  plugin.PluginRepository - Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2012-12-27 20:47:24,035 INFO  plugin.PluginRepository - Nutch Content
Parser (org.apache.nutch.parse.Parser)
2012-12-27 20:47:24,035 INFO  plugin.PluginRepository - Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2012-12-27 20:47:24,221 INFO  regex.RegexURLNormalizer - can't find rules
for scope 'inject', using default
2012-12-27 20:47:26,608 WARN  mapred.FileOutputCommitter - Output path is
null in cleanup
2012-12-27 20:47:27,796 INFO  mapreduce.GoraRecordReader -
gora.buffer.read.limit = 10000
2012-12-27 20:47:28,449 INFO  crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2012-12-27 20:47:28,449 INFO  crawl.AbstractFetchSchedule -
defaultInterval=2592000
2012-12-27 20:47:28,449 INFO  crawl.AbstractFetchSchedule -
maxInterval=7776000
2012-12-27 20:47:28,499 INFO  regex.RegexURLNormalizer - can't find rules
for scope 'generate_host_count', using default
2012-12-27 20:47:30,815 INFO  mapreduce.GoraRecordWriter -
gora.buffer.write.limit = 10000
2012-12-27 20:47:33,751 WARN  mapred.FileOutputCommitter - Output path is
null in cleanup
2012-12-27 20:47:34,741 INFO  fetcher.FetcherJob - FetcherJob: threads: 10
2012-12-27 20:47:34,741 INFO  fetcher.FetcherJob - FetcherJob: parsing:
false
2012-12-27 20:47:34,741 INFO  fetcher.FetcherJob - FetcherJob: resuming:
false
2012-12-27 20:47:34,741 INFO  fetcher.FetcherJob - FetcherJob : timelimit
set for : -1
2012-12-27 20:47:35,072 INFO  http.Http - http.proxy.host = null
2012-12-27 20:47:35,073 INFO  http.Http - http.proxy.port = 8080
2012-12-27 20:47:35,073 INFO  http.Http - http.timeout = 10000
2012-12-27 20:47:35,073 INFO  http.Http - http.content.limit = 65536
2012-12-27 20:47:35,073 INFO  http.Http - http.agent =
ABNutchSpider/Nutch-2.1
2012-12-27 20:47:35,073 INFO  http.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3
2012-12-27 20:47:35,073 INFO  http.Http - http.accept =
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2012-12-27 20:47:35,190 INFO  mapreduce.GoraRecordReader -
gora.buffer.read.limit = 10000
2012-12-27 20:47:38,178 INFO  mapreduce.GoraRecordWriter -
gora.buffer.write.limit = 10000
2012-12-27 20:47:38,180 INFO  fetcher.FetcherJob - Using queue mode : byHost
2012-12-27 20:47:38,180 INFO  fetcher.FetcherJob - Fetcher: threads: 10
2012-12-27 20:47:38,192 INFO  fetcher.FetcherJob - Fetcher: throughput
threshold: -1
2012-12-27 20:47:38,192 INFO  fetcher.FetcherJob - Fetcher: throughput
threshold sequence: 5
2012-12-27 20:47:38,195 INFO  fetcher.FetcherJob - fetching
http://www.financialtime.com/
2012-12-27 20:47:38,198 INFO  http.Http - http.proxy.host = null
2012-12-27 20:47:38,198 INFO  http.Http - http.proxy.port = 8080
2012-12-27 20:47:38,198 INFO  http.Http - http.timeout = 10000
2012-12-27 20:47:38,198 INFO  http.Http - http.content.limit = 65536
2012-12-27 20:47:38,198 INFO  http.Http - http.agent =
ABNutchSpider/Nutch-2.1
2012-12-27 20:47:38,198 INFO  http.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3
2012-12-27 20:47:38,198 INFO  http.Http - http.accept =
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2012-12-27 20:47:38,202 INFO  fetcher.FetcherJob - QueueFeeder finished:
total 5 records. Hit by time limit :0
2012-12-27 20:47:38,699 INFO  fetcher.FetcherJob - fetching
http://www.nytimes.com/
2012-12-27 20:47:38,700 INFO  fetcher.FetcherJob - fetching
http://www.booking.com/
2012-12-27 20:47:38,701 INFO  fetcher.FetcherJob - fetching
http://www.tripadvisor.com/
2012-12-27 20:47:39,070 INFO  regex.RegexURLNormalizer - can't find rules
for scope 'fetcher', using default
2012-12-27 20:47:39,236 INFO  regex.RegexURLNormalizer - can't find rules
for scope 'fetcher', using default
2012-12-27 20:47:43,193 INFO  fetcher.FetcherJob - 10/10
spinwaiting/active, 4 pages, 0 errors, 0.8 0.8 pages/s, 204 204 kb/s, 1
URLs in 1 queues
2012-12-27 20:47:43,193 INFO  fetcher.FetcherJob - * queue:
http://www.booking.com
2012-12-27 20:47:43,193 INFO  fetcher.FetcherJob -   maxThreads    = 1
2012-12-27 20:47:43,193 INFO  fetcher.FetcherJob -   inProgress    = 0
2012-12-27 20:47:43,193 INFO  fetcher.FetcherJob -   crawlDelay    = 5000
2012-12-27 20:47:43,193 INFO  fetcher.FetcherJob -   minCrawlDelay = 0
2012-12-27 20:47:43,193 INFO  fetcher.FetcherJob -   nextFetchTime =
1356641264070
2012-12-27 20:47:43,193 INFO  fetcher.FetcherJob -   now           =
1356641263193
2012-12-27 20:47:43,193 INFO  fetcher.FetcherJob -   0.
http://www.booking.com/index.en.html
2012-12-27 20:47:44,136 INFO  fetcher.FetcherJob - fetching
http://www.booking.com/index.en.html
2012-12-27 20:47:44,172 INFO  fetcher.FetcherJob - -finishing thread
FetcherThread9, activeThreads=9
2012-12-27 20:47:44,213 INFO  fetcher.FetcherJob - -finishing thread
FetcherThread7, activeThreads=8
2012-12-27 20:47:44,213 INFO  fetcher.FetcherJob - -finishing thread
FetcherThread8, activeThreads=7
2012-12-27 20:47:44,213 INFO  fetcher.FetcherJob - -finishing thread
FetcherThread3, activeThreads=6
2012-12-27 20:47:44,213 INFO  fetcher.FetcherJob - -finishing thread
FetcherThread4, activeThreads=5
2012-12-27 20:47:44,213 INFO  fetcher.FetcherJob - -finishing thread
FetcherThread6, activeThreads=4
2012-12-27 20:47:44,213 INFO  fetcher.FetcherJob - -finishing thread
FetcherThread5, activeThreads=3
2012-12-27 20:47:44,253 INFO  fetcher.FetcherJob - -finishing thread
FetcherThread0, activeThreads=2
2012-12-27 20:47:44,510 INFO  fetcher.FetcherJob - -finishing thread
FetcherThread2, activeThreads=1
2012-12-27 20:47:45,686 INFO  fetcher.FetcherJob - -finishing thread
FetcherThread1, activeThreads=0
2012-12-27 20:47:48,194 INFO  fetcher.FetcherJob - 0/0 spinwaiting/active,
5 pages, 0 errors, 0.5 0.2 pages/s, 153 102 kb/s, 0 URLs in 0 queues
2012-12-27 20:47:48,194 INFO  fetcher.FetcherJob - -activeThreads=0
2012-12-27 20:47:50,173 WARN  mapred.FileOutputCommitter - Output path is
null in cleanup
2012-12-27 20:47:51,168 INFO  parse.ParserJob - ParserJob: resuming: false
2012-12-27 20:47:51,168 INFO  parse.ParserJob - ParserJob: forced reparse:
false
2012-12-27 20:47:51,168 INFO  parse.ParserJob - ParserJob: parsing all
2012-12-27 20:47:51,705 INFO  crawl.SignatureFactory - Using Signature
impl: org.apache.nutch.crawl.MD5Signature
2012-12-27 20:47:51,822 INFO  mapreduce.GoraRecordReader -
gora.buffer.read.limit = 10000
2012-12-27 20:47:51,839 INFO  mapreduce.GoraRecordWriter -
gora.buffer.write.limit = 10000
2012-12-27 20:47:51,847 INFO  crawl.SignatureFactory - Using Signature
impl: org.apache.nutch.crawl.MD5Signature
2012-12-27 20:47:51,886 INFO  parse.ParserJob - Parsing
http://www.booking.com/
2012-12-27 20:47:51,890 INFO  parse.ParserJob - Parsing
http://www.booking.com/index.en.html
2012-12-27 20:47:51,967 WARN  parse.ParseUtil - Error parsing
http://www.booking.com/index.en.html
java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError:
org/cyberneko/html/HTMLComponent
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:232)
at java.util.concurrent.FutureTask.get(FutureTask.java:91)
at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:148)
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:129)
at org.apache.nutch.parse.ParseUtil.process(ParseUtil.java:176)
at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:129)
at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:78)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.lang.NoClassDefFoundError: org/cyberneko/html/HTMLComponent
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:295)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at
org.cyberneko.html.parsers.DOMFragmentParser.<init>(DOMFragmentParser.java:127)
at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:255)
at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:238)
at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:173)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:36)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.ClassNotFoundException:
org.cyberneko.html.HTMLComponent
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 24 more
2012-12-27 20:47:51,969 WARN  parse.ParseUtil - Unable to successfully
parse content http://www.booking.com/index.en.html of type text/html
2012-12-27 20:47:51,996 INFO  parse.ParserJob - Parsing
http://www.financialtime.com/
2012-12-27 20:47:51,996 WARN  parse.ParserJob -
http://www.financialtime.com/ skipped. Content of size 20 was truncated to 0
2012-12-27 20:47:51,999 INFO  parse.ParserJob - Parsing
http://www.nytimes.com/
2012-12-27 20:47:52,001 WARN  parse.ParseUtil - Error parsing
http://www.nytimes.com/
java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError:
org/cyberneko/html/HTMLComponent
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:232)
at java.util.concurrent.FutureTask.get(FutureTask.java:91)
at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:148)
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:129)
at org.apache.nutch.parse.ParseUtil.process(ParseUtil.java:176)
at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:129)
at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:78)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.lang.NoClassDefFoundError: org/cyberneko/html/HTMLComponent
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:295)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at
org.cyberneko.html.parsers.DOMFragmentParser.<init>(DOMFragmentParser.java:127)
at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:255)
at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:238)
at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:173)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:36)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.ClassNotFoundException:
org.cyberneko.html.HTMLComponent
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 24 more
2012-12-27 20:47:52,002 WARN  parse.ParseUtil - Unable to successfully
parse content http://www.nytimes.com/ of type text/html
2012-12-27 20:47:52,012 INFO  parse.ParserJob - Parsing
http://www.tripadvisor.com/
2012-12-27 20:47:52,018 WARN  parse.ParseUtil - Error parsing
http://www.tripadvisor.com/
java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError:
org/cyberneko/html/HTMLComponent
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:232)
at java.util.concurrent.FutureTask.get(FutureTask.java:91)
at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:148)
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:129)
at org.apache.nutch.parse.ParseUtil.process(ParseUtil.java:176)
at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:129)
at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:78)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.lang.NoClassDefFoundError: org/cyberneko/html/HTMLComponent
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:295)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at
org.cyberneko.html.parsers.DOMFragmentParser.<init>(DOMFragmentParser.java:127)
at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:255)
at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:238)
at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:173)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:36)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.ClassNotFoundException:
org.cyberneko.html.HTMLComponent
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 24 more
2012-12-27 20:47:52,019 WARN  parse.ParseUtil - Unable to successfully
parse content http://www.tripadvisor.com/ of type text/html
2012-12-27 20:47:54,792 WARN  mapred.FileOutputCommitter - Output path is
null in cleanup
2012-12-27 20:47:55,927 INFO  mapreduce.GoraRecordReader -
gora.buffer.read.limit = 10000
2012-12-27 20:47:58,897 INFO  mapreduce.GoraRecordWriter -
gora.buffer.write.limit = 10000
2012-12-27 20:47:58,897 INFO  crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2012-12-27 20:47:58,898 INFO  crawl.AbstractFetchSchedule -
defaultInterval=2592000
2012-12-27 20:47:58,898 INFO  crawl.AbstractFetchSchedule -
maxInterval=7776000
2012-12-27 20:48:01,894 WARN  mapred.FileOutputCommitter - Output path is
null in cleanup
2012-12-27 20:48:02,974 INFO  mapreduce.GoraRecordReader -
gora.buffer.read.limit = 10000
2012-12-27 20:48:03,076 INFO  crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2012-12-27 20:48:03,077 INFO  crawl.AbstractFetchSchedule -
defaultInterval=2592000
2012-12-27 20:48:03,077 INFO  crawl.AbstractFetchSchedule -
maxInterval=7776000
2012-12-27 20:48:05,963 INFO  mapreduce.GoraRecordWriter -
gora.buffer.write.limit = 10000
2012-12-27 20:48:08,959 WARN  mapred.FileOutputCommitter - Output path is
null in cleanup
2012-12-27 20:48:09,953 INFO  fetcher.FetcherJob - FetcherJob: threads: 10
2012-12-27 20:48:09,953 INFO  fetcher.FetcherJob - FetcherJob: parsing:
false
2012-12-27 20:48:09,953 INFO  fetcher.FetcherJob - FetcherJob: resuming:
false
2012-12-27 20:48:09,953 INFO  fetcher.FetcherJob - FetcherJob : timelimit
set for : -1
2012-12-27 20:48:09,956 INFO  http.Http - http.proxy.host = null
2012-12-27 20:48:09,956 INFO  http.Http - http.proxy.port = 8080
2012-12-27 20:48:09,956 INFO  http.Http - http.timeout = 10000
2012-12-27 20:48:09,956 INFO  http.Http - http.content.limit = 65536
2012-12-27 20:48:09,956 INFO  http.Http - http.agent =
ABNutchSpider/Nutch-2.1
2012-12-27 20:48:09,956 INFO  http.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3
2012-12-27 20:48:09,956 INFO  http.Http - http.accept =
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2012-12-27 20:48:10,029 INFO  mapreduce.GoraRecordReader -
gora.buffer.read.limit = 10000
2012-12-27 20:48:13,030 INFO  mapreduce.GoraRecordWriter -
gora.buffer.write.limit = 10000
2012-12-27 20:48:13,031 INFO  fetcher.FetcherJob - Using queue mode : byHost
2012-12-27 20:48:13,031 INFO  fetcher.FetcherJob - Fetcher: threads: 10
2012-12-27 20:48:13,036 INFO  fetcher.FetcherJob - Fetcher: throughput
threshold: -1
2012-12-27 20:48:13,036 INFO  fetcher.FetcherJob - Fetcher: throughput
threshold sequence: 5
2012-12-27 20:48:13,039 INFO  fetcher.FetcherJob - fetching
http://www.booking.com/index.en.html
2012-12-27 20:48:13,040 INFO  fetcher.FetcherJob - fetching
http://www.nytimes.com/
2012-12-27 20:48:13,044 INFO  http.Http - http.proxy.host = null
2012-12-27 20:48:13,044 INFO  http.Http - http.proxy.port = 8080
2012-12-27 20:48:13,044 INFO  http.Http - http.timeout = 10000
2012-12-27 20:48:13,044 INFO  http.Http - http.content.limit = 65536
2012-12-27 20:48:13,044 INFO  http.Http - http.agent =
ABNutchSpider/Nutch-2.1
2012-12-27 20:48:13,044 INFO  http.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3
2012-12-27 20:48:13,044 INFO  http.Http - http.accept =
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2012-12-27 20:48:13,044 INFO  fetcher.FetcherJob - QueueFeeder finished:
total 5 records. Hit by time limit :0
2012-12-27 20:48:13,046 INFO  http.Http - http.proxy.host = null
2012-12-27 20:48:13,046 INFO  http.Http - http.proxy.port = 8080
2012-12-27 20:48:13,047 INFO  http.Http - http.timeout = 10000
2012-12-27 20:48:13,047 INFO  http.Http - http.content.limit = 65536
2012-12-27 20:48:13,047 INFO  http.Http - http.agent =
ABNutchSpider/Nutch-2.1
2012-12-27 20:48:13,047 INFO  http.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3
2012-12-27 20:48:13,047 INFO  http.Http - http.accept =
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2012-12-27 20:48:13,541 INFO  fetcher.FetcherJob - fetching
http://www.financialtime.com/
2012-12-27 20:48:13,541 INFO  fetcher.FetcherJob - fetching
http://www.tripadvisor.com/
2012-12-27 20:48:14,093 INFO  regex.RegexURLNormalizer - can't find rules
for scope 'fetcher', using default
2012-12-27 20:48:18,036 INFO  fetcher.FetcherJob - 10/10
spinwaiting/active, 4 pages, 0 errors, 0.8 0.8 pages/s, 307 307 kb/s, 1
URLs in 1 queues
2012-12-27 20:48:18,037 INFO  fetcher.FetcherJob - * queue:
http://www.booking.com
2012-12-27 20:48:18,037 INFO  fetcher.FetcherJob -   maxThreads    = 1
2012-12-27 20:48:18,037 INFO  fetcher.FetcherJob -   inProgress    = 0
2012-12-27 20:48:18,037 INFO  fetcher.FetcherJob -   crawlDelay    = 5000
2012-12-27 20:48:18,037 INFO  fetcher.FetcherJob -   minCrawlDelay = 0
2012-12-27 20:48:18,037 INFO  fetcher.FetcherJob -   nextFetchTime =
1356641298695
2012-12-27 20:48:18,037 INFO  fetcher.FetcherJob -   now           =
1356641298037
2012-12-27 20:48:18,037 INFO  fetcher.FetcherJob -   0.
http://www.booking.com/
2012-12-27 20:48:18,703 INFO  fetcher.FetcherJob - fetching
http://www.booking.com/
2012-12-27 20:48:18,755 INFO  fetcher.FetcherJob - -finishing thread
FetcherThread0, activeThreads=9
2012-12-27 20:48:18,811 INFO  regex.RegexURLNormalizer - can't find rules
for scope 'fetcher', using default
2012-12-27 20:48:18,812 INFO  fetcher.FetcherJob - -finishing thread
FetcherThread9, activeThreads=8
2012-12-27 20:48:19,052 INFO  fetcher.FetcherJob - -finishing thread
FetcherThread5, activeThreads=7
2012-12-27 20:48:19,053 INFO  fetcher.FetcherJob - -finishing thread
FetcherThread1, activeThreads=6
2012-12-27 20:48:19,053 INFO  fetcher.FetcherJob - -finishing thread
FetcherThread4, activeThreads=5
2012-12-27 20:48:19,053 INFO  fetcher.FetcherJob - -finishing thread
FetcherThread6, activeThreads=4
2012-12-27 20:48:19,053 INFO  fetcher.FetcherJob - -finishing thread
FetcherThread3, activeThreads=3
2012-12-27 20:48:19,053 INFO  fetcher.FetcherJob - -finishing thread
FetcherThread2, activeThreads=2
2012-12-27 20:48:19,073 INFO  fetcher.FetcherJob - -finishing thread
FetcherThread7, activeThreads=1
2012-12-27 20:48:19,105 INFO  fetcher.FetcherJob - -finishing thread
FetcherThread8, activeThreads=0
2012-12-27 20:48:23,037 INFO  fetcher.FetcherJob - 0/0 spinwaiting/active,
5 pages, 0 errors, 0.5 0.2 pages/s, 153 0 kb/s, 0 URLs in 0 queues
2012-12-27 20:48:23,037 INFO  fetcher.FetcherJob - -activeThreads=0
2012-12-27 20:48:25,027 WARN  mapred.FileOutputCommitter - Output path is
null in cleanup
2012-12-27 20:48:26,024 INFO  parse.ParserJob - ParserJob: resuming: false
2012-12-27 20:48:26,024 INFO  parse.ParserJob - ParserJob: forced reparse:
false
2012-12-27 20:48:26,024 INFO  parse.ParserJob - ParserJob: parsing all
2012-12-27 20:48:26,041 INFO  crawl.SignatureFactory - Using Signature
impl: org.apache.nutch.crawl.MD5Signature
2012-12-27 20:48:26,115 INFO  mapreduce.GoraRecordReader -
gora.buffer.read.limit = 10000
2012-12-27 20:48:26,122 INFO  mapreduce.GoraRecordWriter -
gora.buffer.write.limit = 10000
2012-12-27 20:48:26,123 INFO  crawl.SignatureFactory - Using Signature
impl: org.apache.nutch.crawl.MD5Signature
2012-12-27 20:48:26,131 INFO  parse.ParserJob - Parsing
http://www.booking.com/
2012-12-27 20:48:26,136 INFO  parse.ParserJob - Parsing
http://www.booking.com/index.en.html
2012-12-27 20:48:26,140 WARN  parse.ParseUtil - Error parsing
http://www.booking.com/index.en.html
java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError:
org/cyberneko/html/HTMLComponent
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:232)
at java.util.concurrent.FutureTask.get(FutureTask.java:91)
at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:148)
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:129)
at org.apache.nutch.parse.ParseUtil.process(ParseUtil.java:176)
at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:129)
at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:78)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.lang.NoClassDefFoundError: org/cyberneko/html/HTMLComponent
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:295)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at
org.cyberneko.html.parsers.DOMFragmentParser.<init>(DOMFragmentParser.java:127)
at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:255)
at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:238)
at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:173)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:36)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.ClassNotFoundException:
org.cyberneko.html.HTMLComponent
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 24 more
2012-12-27 20:48:26,141 WARN  parse.ParseUtil - Unable to successfully
parse content http://www.booking.com/index.en.html of type text/html
2012-12-27 20:48:26,141 INFO  parse.ParserJob - Parsing
http://www.financialtime.com/
2012-12-27 20:48:26,142 WARN  parse.ParserJob -
http://www.financialtime.com/ skipped. Content of size 20 was truncated to 0
2012-12-27 20:48:26,142 INFO  parse.ParserJob - Parsing
http://www.nytimes.com/
2012-12-27 20:48:26,144 WARN  parse.ParseUtil - Error parsing
http://www.nytimes.com/
java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError:
org/cyberneko/html/HTMLComponent
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:232)
at java.util.concurrent.FutureTask.get(FutureTask.java:91)
at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:148)
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:129)
at org.apache.nutch.parse.ParseUtil.process(ParseUtil.java:176)
at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:129)
at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:78)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.lang.NoClassDefFoundError: org/cyberneko/html/HTMLComponent
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:295)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at
org.cyberneko.html.parsers.DOMFragmentParser.<init>(DOMFragmentParser.java:127)
at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:255)
at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:238)
at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:173)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:36)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.ClassNotFoundException:
org.cyberneko.html.HTMLComponent
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 24 more
2012-12-27 20:48:26,144 WARN  parse.ParseUtil - Unable to successfully
parse content http://www.nytimes.com/ of type text/html
2012-12-27 20:48:26,145 INFO  parse.ParserJob - Parsing
http://www.tripadvisor.com/
2012-12-27 20:48:26,147 WARN  parse.ParseUtil - Error parsing
http://www.tripadvisor.com/
java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError:
org/cyberneko/html/HTMLComponent
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:232)
at java.util.concurrent.FutureTask.get(FutureTask.java:91)
at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:148)
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:129)
at org.apache.nutch.parse.ParseUtil.process(ParseUtil.java:176)
at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:129)
at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:78)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.lang.NoClassDefFoundError: org/cyberneko/html/HTMLComponent
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:295)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at
org.cyberneko.html.parsers.DOMFragmentParser.<init>(DOMFragmentParser.java:127)
at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:255)
at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:238)
at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:173)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:36)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.ClassNotFoundException:
org.cyberneko.html.HTMLComponent
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 24 more
2012-12-27 20:48:26,147 WARN  parse.ParseUtil - Unable to successfully
parse content http://www.tripadvisor.com/ of type text/html
2012-12-27 20:48:29,094 WARN  mapred.FileOutputCommitter - Output path is
null in cleanup
2012-12-27 20:48:30,164 INFO  mapreduce.GoraRecordReader -
gora.buffer.read.limit = 10000
2012-12-27 20:48:33,158 INFO  mapreduce.GoraRecordWriter -
gora.buffer.write.limit = 10000
2012-12-27 20:48:33,158 INFO  crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2012-12-27 20:48:33,158 INFO  crawl.AbstractFetchSchedule -
defaultInterval=2592000
2012-12-27 20:48:33,158 INFO  crawl.AbstractFetchSchedule -
maxInterval=7776000
2012-12-27 20:48:36,153 WARN  mapred.FileOutputCommitter - Output path is
null in cleanup
2012-12-27 20:48:37,213 INFO  mapreduce.GoraRecordReader -
gora.buffer.read.limit = 10000
2012-12-27 20:48:37,330 INFO  crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2012-12-27 20:48:37,330 INFO  crawl.AbstractFetchSchedule -
defaultInterval=2592000
2012-12-27 20:48:37,330 INFO  crawl.AbstractFetchSchedule -
maxInterval=7776000
2012-12-27 20:48:40,205 INFO  mapreduce.GoraRecordWriter -
gora.buffer.write.limit = 10000
2012-12-27 20:48:43,200 WARN  mapred.FileOutputCommitter - Output path is
null in cleanup
2012-12-27 20:48:44,198 INFO  fetcher.FetcherJob - FetcherJob: threads: 10
2012-12-27 20:48:44,198 INFO  fetcher.FetcherJob - FetcherJob: parsing:
false
2012-12-27 20:48:44,198 INFO  fetcher.FetcherJob - FetcherJob: resuming:
false
2012-12-27 20:48:44,198 INFO  fetcher.FetcherJob - FetcherJob : timelimit
set for : -1
2012-12-27 20:48:44,200 INFO  http.Http - http.proxy.host = null
2012-12-27 20:48:44,200 INFO  http.Http - http.proxy.port = 8080
2012-12-27 20:48:44,200 INFO  http.Http - http.timeout = 10000
2012-12-27 20:48:44,201 INFO  http.Http - http.content.limit = 65536
2012-12-27 20:48:44,201 INFO  http.Http - http.agent =
ABNutchSpider/Nutch-2.1
2012-12-27 20:48:44,201 INFO  http.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3
2012-12-27 20:48:44,201 INFO  http.Http - http.accept =
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2012-12-27 20:48:44,292 INFO  mapreduce.GoraRecordReader -
gora.buffer.read.limit = 10000
2012-12-27 20:48:47,279 INFO  mapreduce.GoraRecordWriter -
gora.buffer.write.limit = 10000
2012-12-27 20:48:47,280 INFO  fetcher.FetcherJob - Using queue mode : byHost
2012-12-27 20:48:47,280 INFO  fetcher.FetcherJob - Fetcher: threads: 10
2012-12-27 20:48:47,285 INFO  fetcher.FetcherJob - Fetcher: throughput
threshold: -1
2012-12-27 20:48:47,285 INFO  fetcher.FetcherJob - Fetcher: throughput
threshold sequence: 5
2012-12-27 20:48:47,288 INFO  fetcher.FetcherJob - fetching
http://www.booking.com/
2012-12-27 20:48:47,290 INFO  http.Http - http.proxy.host = null
2012-12-27 20:48:47,292 INFO  fetcher.FetcherJob - QueueFeeder finished:
total 5 records. Hit by time limit :0
2012-12-27 20:48:47,290 INFO  http.Http - http.proxy.port = 8080
2012-12-27 20:48:47,292 INFO  http.Http - http.timeout = 10000
2012-12-27 20:48:47,292 INFO  http.Http - http.content.limit = 65536
2012-12-27 20:48:47,292 INFO  http.Http - http.agent =
ABNutchSpider/Nutch-2.1
2012-12-27 20:48:47,292 INFO  http.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3
2012-12-27 20:48:47,292 INFO  http.Http - http.accept =
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2012-12-27 20:48:47,523 INFO  regex.RegexURLNormalizer - can't find rules
for scope 'fetcher', using default
2012-12-27 20:48:47,524 INFO  fetcher.FetcherJob - fetching
http://www.financialtime.com/
2012-12-27 20:48:47,791 INFO  fetcher.FetcherJob - fetching
http://www.nytimes.com/
2012-12-27 20:48:47,792 INFO  fetcher.FetcherJob - fetching
http://www.tripadvisor.com/
2012-12-27 20:48:52,285 INFO  fetcher.FetcherJob - 10/10
spinwaiting/active, 4 pages, 0 errors, 0.8 0.8 pages/s, 204 204 kb/s, 1
URLs in 1 queues
2012-12-27 20:48:52,285 INFO  fetcher.FetcherJob - * queue:
http://www.booking.com
2012-12-27 20:48:52,285 INFO  fetcher.FetcherJob -   maxThreads    = 1
2012-12-27 20:48:52,285 INFO  fetcher.FetcherJob -   inProgress    = 0
2012-12-27 20:48:52,285 INFO  fetcher.FetcherJob -   crawlDelay    = 5000
2012-12-27 20:48:52,285 INFO  fetcher.FetcherJob -   minCrawlDelay = 0
2012-12-27 20:48:52,285 INFO  fetcher.FetcherJob -   nextFetchTime =
1356641332521
2012-12-27 20:48:52,285 INFO  fetcher.FetcherJob -   now           =
1356641332285
2012-12-27 20:48:52,285 INFO  fetcher.FetcherJob -   0.
http://www.booking.com/index.en.html
2012-12-27 20:48:52,541 INFO  fetcher.FetcherJob - fetching
http://www.booking.com/index.en.html
2012-12-27 20:48:52,565 INFO  fetcher.FetcherJob - -finishing thread
FetcherThread8, activeThreads=9
2012-12-27 20:48:52,566 INFO  fetcher.FetcherJob - -finishing thread
FetcherThread7, activeThreads=8
2012-12-27 20:48:52,802 INFO  fetcher.FetcherJob - -finishing thread
FetcherThread6, activeThreads=7
2012-12-27 20:48:52,802 INFO  fetcher.FetcherJob - -finishing thread
FetcherThread2, activeThreads=6
2012-12-27 20:48:52,802 INFO  fetcher.FetcherJob - -finishing thread
FetcherThread4, activeThreads=5
2012-12-27 20:48:52,802 INFO  fetcher.FetcherJob - -finishing thread
FetcherThread0, activeThreads=4
2012-12-27 20:48:52,802 INFO  fetcher.FetcherJob - -finishing thread
FetcherThread5, activeThreads=3
2012-12-27 20:48:52,802 INFO  fetcher.FetcherJob - -finishing thread
FetcherThread1, activeThreads=2
2012-12-27 20:48:52,802 INFO  fetcher.FetcherJob - -finishing thread
FetcherThread3, activeThreads=1
2012-12-27 20:48:53,419 INFO  fetcher.FetcherJob - -finishing thread
FetcherThread9, activeThreads=0
2012-12-27 20:48:57,286 INFO  fetcher.FetcherJob - 0/0 spinwaiting/active,
5 pages, 0 errors, 0.5 0.2 pages/s, 153 102 kb/s, 0 URLs in 0 queues
2012-12-27 20:48:57,286 INFO  fetcher.FetcherJob - -activeThreads=0
2012-12-27 20:48:59,279 WARN  mapred.FileOutputCommitter - Output path is
null in cleanup
2012-12-27 20:48:59,280 INFO  parse.ParserJob - ParserJob: resuming: false
2012-12-27 20:48:59,280 INFO  parse.ParserJob - ParserJob: forced reparse:
false
2012-12-27 20:48:59,280 INFO  parse.ParserJob - ParserJob: parsing all
2012-12-27 20:48:59,295 INFO  crawl.SignatureFactory - Using Signature
impl: org.apache.nutch.crawl.MD5Signature
2012-12-27 20:48:59,364 INFO  mapreduce.GoraRecordReader -
gora.buffer.read.limit = 10000
2012-12-27 20:48:59,371 INFO  mapreduce.GoraRecordWriter -
gora.buffer.write.limit = 10000
2012-12-27 20:48:59,372 INFO  crawl.SignatureFactory - Using Signature
impl: org.apache.nutch.crawl.MD5Signature
2012-12-27 20:48:59,384 INFO  parse.ParserJob - Parsing
http://www.booking.com/
2012-12-27 20:48:59,385 INFO  parse.ParserJob - Parsing
http://www.booking.com/index.en.html
2012-12-27 20:48:59,390 WARN  parse.ParseUtil - Error parsing
http://www.booking.com/index.en.html
java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError:
org/cyberneko/html/HTMLComponent
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:232)
at java.util.concurrent.FutureTask.get(FutureTask.java:91)
at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:148)
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:129)
at org.apache.nutch.parse.ParseUtil.process(ParseUtil.java:176)
at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:129)
at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:78)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.lang.NoClassDefFoundError: org/cyberneko/html/HTMLComponent
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:295)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at
org.cyberneko.html.parsers.DOMFragmentParser.<init>(DOMFragmentParser.java:127)
at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:255)
at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:238)
at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:173)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:36)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.ClassNotFoundException:
org.cyberneko.html.HTMLComponent
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 24 more
2012-12-27 20:48:59,391 WARN  parse.ParseUtil - Unable to successfully
parse content http://www.booking.com/index.en.html of type text/html
2012-12-27 20:48:59,391 INFO  parse.ParserJob - Parsing
http://www.financialtime.com/
2012-12-27 20:48:59,392 WARN  parse.ParserJob -
http://www.financialtime.com/ skipped. Content of size 20 was truncated to 0
2012-12-27 20:48:59,392 INFO  parse.ParserJob - Parsing
http://www.nytimes.com/
2012-12-27 20:48:59,394 WARN  parse.ParseUtil - Error parsing
http://www.nytimes.com/
java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError:
org/cyberneko/html/HTMLComponent
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:232)
at java.util.concurrent.FutureTask.get(FutureTask.java:91)
at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:148)
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:129)
at org.apache.nutch.parse.ParseUtil.process(ParseUtil.java:176)
at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:129)
at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:78)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.lang.NoClassDefFoundError: org/cyberneko/html/HTMLComponent
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:295)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at
org.cyberneko.html.parsers.DOMFragmentParser.<init>(DOMFragmentParser.java:127)
at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:255)
at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:238)
at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:173)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:36)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.ClassNotFoundException:
org.cyberneko.html.HTMLComponent
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 24 more
2012-12-27 20:48:59,394 WARN  parse.ParseUtil - Unable to successfully
parse content http://www.nytimes.com/ of type text/html
2012-12-27 20:48:59,395 INFO  parse.ParserJob - Parsing
http://www.tripadvisor.com/
2012-12-27 20:48:59,398 WARN  parse.ParseUtil - Error parsing
http://www.tripadvisor.com/
java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError:
org/cyberneko/html/HTMLComponent
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:232)
at java.util.concurrent.FutureTask.get(FutureTask.java:91)
at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:148)
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:129)
at org.apache.nutch.parse.ParseUtil.process(ParseUtil.java:176)
at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:129)
at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:78)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.lang.NoClassDefFoundError: org/cyberneko/html/HTMLComponent
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:295)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at
org.cyberneko.html.parsers.DOMFragmentParser.<init>(DOMFragmentParser.java:127)
at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:255)
at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:238)
at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:173)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:36)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.ClassNotFoundException:
org.cyberneko.html.HTMLComponent
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 24 more
2012-12-27 20:48:59,398 WARN  parse.ParseUtil - Unable to successfully
parse content http://www.tripadvisor.com/ of type text/html
2012-12-27 20:49:02,346 WARN  mapred.FileOutputCommitter - Output path is
null in cleanup
2012-12-27 20:49:03,416 INFO  mapreduce.GoraRecordReader -
gora.buffer.read.limit = 10000
2012-12-27 20:49:06,406 INFO  mapreduce.GoraRecordWriter -
gora.buffer.write.limit = 10000
2012-12-27 20:49:06,407 INFO  crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2012-12-27 20:49:06,407 INFO  crawl.AbstractFetchSchedule -
defaultInterval=2592000
2012-12-27 20:49:06,407 INFO  crawl.AbstractFetchSchedule -
maxInterval=7776000
2012-12-27 20:49:09,403 WARN  mapred.FileOutputCommitter - Output path is
null in cleanup


------------------------------------------------------------------------

RE: Native Hadoop library not loaded and Cannot parse sites contents

Posted by Markus Jelsma <ma...@openindex.io>.
Can't see them either.
 
-----Original message-----
> From:Tejas Patil <te...@gmail.com>
> Sent: Fri 04-Jan-2013 10:27
> To: user@nutch.apache.org
> Subject: Re: Native Hadoop library not loaded and Cannot parse sites contents
> 
> Is it just me, or can nobody else see the images attached inline by Arcondo?
> 
> 
> On Fri, Jan 4, 2013 at 1:18 AM, Tejas Patil <te...@gmail.com>wrote:
> 
> >
> >
> >
> > On Thu, Jan 3, 2013 at 10:38 PM, Arcondo Dasilva <
> > arcondo.dasilva@gmail.com> wrote:
> >
> >> Hi Lewis,
> >>
> >> Thanks for your feedback. I went through the process step by step and I'm
> >> still getting the error :
> >>
> >> my plugins folder looks like this :
> >>
> >> [image: Inline image 1]
> >>
> >> When I ran the parse job it gave me this :
> >>
> >> [image: Inline image 2]
> >>
> >> when I look at the log file, I get this :
> >>
> >> [image: Inline image 3]
> >>
> >> My nutch-site.xml contains this :
> >>
> >> <property>
> >>   <name>plugin.includes</name>
> >>
> >> <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
> >>  <description>Regular expression naming plugin directory names to
> >>   include.  Any plugin not matching this expression is excluded.
> >>   In any case you need at least include the nutch-extensionpoints plugin.
> >> By
> >>   default Nutch includes crawling just HTML and plain text via HTTP,
> >>   and basic indexing and search plugins. In order to use HTTPS please
> >> enable
> >>   protocol-httpclient, but be aware of possible intermittent problems
> >> with the
> >>   underlying commons-httpclient library.
> >>   </description>
> >> </property>
> >>
> >>
> >> am I missing something else ?
> >>
> >> Thanks for your precious help.
> >>
> >> Arcondo.
> >>
> >>
> >>
> >> On Thu, Jan 3, 2013 at 11:20 PM, Lewis John Mcgibbney <
> >> lewis.mcgibbney@gmail.com> wrote:
> >>
> >>> Hi Arcondo,
> >>>
> >>> The nekohtml jar should be version 0.9.5, and should reside in
> >>> build/plugins/lib-nekohtml once you build Nutch from source.
> >>> Once you use the default 'runtime' target, the corresponding plugins
> >>> folders should be copied into runtime/local/plugins
> >>> Can you check that the jar is copied to this directory before attempting
> >>> to
> >>> parse the URLs in your segment(s) if using 1.x.
> >>> I'm also assuming that you have parse-html included in the
> >>> plugin.includes
> >>> property within nutch-site.xml before building the source.
> >>>
> >>> Lewis
> >>>
> >>> On Thu, Jan 3, 2013 at 9:11 PM, Arcondo Dasilva
> >>> <ar...@gmail.com>wrote:
> >>>
> >>> > Thanks for the explanation. I'm more a functional guy with no solid
> >>> > background in Java.
> >>> > Could you give some details on how to enforce it manually ?
> >>> >
> >>> > Thanks in advance, Arcondo
> >>> >
> >>> >
> >>> >
> >>> > On Thu, Jan 3, 2013 at 2:49 PM, Lewis John Mcgibbney <
> >>> > lewis.mcgibbney@gmail.com> wrote:
> >>> >
> >>> > > the jar is not on the classpath
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>> *Lewis*
> >>>
> >>
> >>
> >
> 
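The build-and-check sequence Lewis describes above can be sketched as a small shell script (the paths and targets are the ones named in the thread for a Nutch 2.1 source checkout; run it from the checkout root after `ant runtime`):

```shell
#!/bin/sh
# Report whether the nekohtml plugin directories Lewis mentions exist, and
# whether parse-html is enabled in nutch-site.xml. It only reports, never fails.
MISSING=0
for d in build/plugins/lib-nekohtml runtime/local/plugins/lib-nekohtml; do
  if [ -d "$d" ]; then
    echo "present: $d"
    ls "$d"                      # the nekohtml jar should be listed here
  else
    echo "MISSING: $d (run 'ant runtime' first)"
    MISSING=$((MISSING + 1))
  fi
done
if grep -q 'parse-html' conf/nutch-site.xml 2>/dev/null; then
  echo "parse-html appears in conf/nutch-site.xml"
else
  echo "parse-html not found in conf/nutch-site.xml"
fi
```

If either directory is missing, the jar was never copied into the runtime, which matches the ClassNotFoundException seen in the log.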

Re: Native Hadoop library not loaded and Cannot parse sites contents

Posted by al...@aim.com.
Hi,

You can unjar the jar file and check whether the class the parser complains about is inside it. You can also try putting the contents of the jar file under local/lib; maybe there is some read restriction. If this does not help, I can only suggest starting again with a fresh copy of Nutch.
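The jar check Alex suggests can be sketched like this (the jar path and class name are taken from this thread, not verified; adjust them to your install):

```shell
#!/bin/sh
# Look inside the nekohtml jar for the class the parser reports missing.
# A jar is an ordinary zip archive, so unzip -l works without the JDK.
JAR=runtime/local/plugins/lib-nekohtml/nekohtml-0.9.5.jar
CLASS='org/cyberneko/html/HTMLComponent.class'

if [ ! -f "$JAR" ]; then
  RESULT=missing-jar
  echo "jar not found at $JAR"
elif unzip -l "$JAR" | grep -q "$CLASS"; then
  RESULT=class-present
  echo "$CLASS is inside $JAR"
else
  RESULT=class-absent
  echo "$CLASS is NOT inside $JAR -- the jar may be too old to contain it"
fi
```

If the class turns out to be absent from the jar, that would explain the NoClassDefFoundError regardless of any plugin configuration.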

Alex.

 

 

 

-----Original Message-----
From: Arcondo Dasilva <ar...@gmail.com>
To: user <us...@nutch.apache.org>
Sent: Sat, Jan 5, 2013 1:11 am
Subject: Re: Native Hadoop library not loaded and Cannot parse sites contents


Hi Alex,

I'm using Nutch 2.1 / HBase 0.90.6 / Solr 4.0.
Everything works fine except that I'm not able to parse the contents of my
URLs because of the "NekoHTML not found" error.

my plugins include looks like this :

<value>protocol-http|urlfilter-regex|parse-(xml|xhtml|html|tika|text|js)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|lib-nekohtml</value>

I added lib-nekohtml at the end of the allowed values, but that seems to have
no effect on the error.
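One plausible explanation (a sketch from memory of the Nutch source tree, worth verifying against your checkout): lib-* plugins such as lib-nekohtml are library plugins that other plugins pull in through the dependency declarations in their own plugin.xml, so listing lib-nekohtml in plugin.includes is usually redundant. What matters is that the importing plugin, parse-html, declares it, roughly like:

<plugin id="parse-html" ...>
   <requires>
      <import plugin="nutch-extensionpoints"/>
      <import plugin="lib-nekohtml"/>
   </requires>
   ...
</plugin>

If that import is present but the jar still cannot be loaded, the problem is in how the jar was deployed, not in plugin.includes.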

In my runtime/local/plugins/lib-nekohtml folder, I have the jar file
nekohtml-0.9.5.jar.

Is there something else I should look for besides this?

Thanks a lot for your help.

Kr, Arcondo


On Fri, Jan 4, 2013 at 11:33 PM, <al...@aim.com> wrote:

> Which version of Nutch is this? Did you follow the tutorial? I can help
> you if you provide all the steps you did, starting with downloading Nutch.
>
> Alex.
>
>
>
>
>
>
>
> -----Original Message-----
> From: Arcondo Dasilva <ar...@gmail.com>
> To: user <us...@nutch.apache.org>
> Sent: Fri, Jan 4, 2013 1:23 pm
> Subject: Re: Native Hadoop library not loaded and Cannot parse sites
> contents
>
>
> Hi Alex,
>
> I tried; that was the first thing I did, but without success.
> I don't understand why I'm obliged to use Neko instead of Tika. As far as I
> know, Tika can parse more than 1,200 different formats.
>
> Kr, Arcondo
>
>
> On Fri, Jan 4, 2013 at 7:47 PM, <al...@aim.com> wrote:
>
> > move or copy that jar file to local/lib and try again.
> >
> > hth.
> > Alex.
> >
> >
> >
> >
> >
> >
> >
> > -----Original Message-----
> > From: Arcondo <ar...@gmail.com>
> > To: user <us...@nutch.apache.org>
> > Sent: Fri, Jan 4, 2013 2:55 am
> > Subject: Re: Native Hadoop library not loaded and Cannot parse sites
> > contents
> >
> >
> > Hope that now you can see them
> >
> > Plugin folder
> > <http://lucene.472066.n3.nabble.com/file/n4030524/plugin_folder.png>
> >
> > Parse Job
> >
> > <http://lucene.472066.n3.nabble.com/file/n4030524/parse_job.png>
> >
> > Parse error : Hadoop.log
> >
> > <http://lucene.472066.n3.nabble.com/file/n4030524/parse_error.png>
> >
> > My nutch-site.xm (plugin includes)
> >
> > <property>
> > <name>plugin.includes</name>
> >
> >
> <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
> >  <description>Regular expression naming plugin directory names to
> >   include.  Any plugin not matching this expression is excluded.
> >   In any case you need at least include the nutch-extensionpoints plugin.
> >  By default Nutch includes crawling just HTML and plain text via HTTP,
> >    and basic indexing and search plugins. In order to use HTTPS please
> >  enable
> >    protocol-httpclient, but be aware of possible intermittent problems
> >  with the
> >   underlying commons-httpclient library.
> >   </description>
> >  </property>
> >
> >
> >
> >
> >
> >
> >
> >
> > --
> > View this message in context:
> >
> http://lucene.472066.n3.nabble.com/Native-Hadoop-library-not-loaded-and-Cannot-parse-sites-contents-tp4029542p4030524.html
> > Sent from the Nutch - User mailing list archive at Nabble.com.
> >
> >
> >
>
>
>

 

Re: Native Hadoop library not loaded and Cannot parse sites contents

Posted by Arcondo Dasilva <ar...@gmail.com>.
Hi Alex,

I'm using Nutch 2.1 / HBase 0.90.6 / Solr 4.0.
Everything works fine except that I'm not able to parse the contents of my
URLs because of the "NekoHTML not found" error.

my plugins include looks like this :

<value>protocol-http|urlfilter-regex|parse-(xml|xhtml|html|tika|text|js)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|lib-nekohtml</value>

I added lib-nekohtml at the end of the allowed values, but that seems to have
no effect on the error.

In my runtime/local/plugins/lib-nekohtml folder, I have the jar file
nekohtml-0.9.5.jar.

Is there something else I should look for besides this?

Thanks a lot for your help.

Kr, Arcondo


On Fri, Jan 4, 2013 at 11:33 PM, <al...@aim.com> wrote:

> Which version of Nutch is this? Did you follow the tutorial? I can help
> you if you provide all the steps you did, starting with downloading Nutch.
>
> Alex.
>
>
>
>
>
>
>
> -----Original Message-----
> From: Arcondo Dasilva <ar...@gmail.com>
> To: user <us...@nutch.apache.org>
> Sent: Fri, Jan 4, 2013 1:23 pm
> Subject: Re: Native Hadoop library not loaded and Cannot parse sites
> contents
>
>
> Hi Alex,
>
> I tried; that was the first thing I did, but without success.
> I don't understand why I'm obliged to use Neko instead of Tika. As far as I
> know, Tika can parse more than 1,200 different formats.
>
> Kr, Arcondo
>
>
> On Fri, Jan 4, 2013 at 7:47 PM, <al...@aim.com> wrote:
>
> > move or copy that jar file to local/lib and try again.
> >
> > hth.
> > Alex.
> >
> >
> >
> >
> >
> >
> >
> > -----Original Message-----
> > From: Arcondo <ar...@gmail.com>
> > To: user <us...@nutch.apache.org>
> > Sent: Fri, Jan 4, 2013 2:55 am
> > Subject: Re: Native Hadoop library not loaded and Cannot parse sites
> > contents
> >
> >
> > Hope that now you can see them
> >
> > Plugin folder
> > <http://lucene.472066.n3.nabble.com/file/n4030524/plugin_folder.png>
> >
> > Parse Job
> >
> > <http://lucene.472066.n3.nabble.com/file/n4030524/parse_job.png>
> >
> > Parse error : Hadoop.log
> >
> > <http://lucene.472066.n3.nabble.com/file/n4030524/parse_error.png>
> >
> > My nutch-site.xm (plugin includes)
> >
> > <property>
> > <name>plugin.includes</name>
> >
> >
> <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
> >  <description>Regular expression naming plugin directory names to
> >   include.  Any plugin not matching this expression is excluded.
> >   In any case you need at least include the nutch-extensionpoints plugin.
> >  By default Nutch includes crawling just HTML and plain text via HTTP,
> >    and basic indexing and search plugins. In order to use HTTPS please
> >  enable
> >    protocol-httpclient, but be aware of possible intermittent problems
> >  with the
> >   underlying commons-httpclient library.
> >   </description>
> >  </property>
> >
> > --
> > View this message in context:
> >
> http://lucene.472066.n3.nabble.com/Native-Hadoop-library-not-loaded-and-Cannot-parse-sites-contents-tp4029542p4030524.html
> > Sent from the Nutch - User mailing list archive at Nabble.com.
> >
> >
> >
>
>
>

Re: Native Hadoop library not loaded and Cannot parse sites contents

Posted by al...@aim.com.
Which version of Nutch is this? Did you follow the tutorial? I can help you if you provide all the steps you did, starting with downloading Nutch.

Alex.

-----Original Message-----
From: Arcondo Dasilva <ar...@gmail.com>
To: user <us...@nutch.apache.org>
Sent: Fri, Jan 4, 2013 1:23 pm
Subject: Re: Native Hadoop library not loaded and Cannot parse sites contents


Hi Alex,

I tried. That was the first thing I did but without success.
I don't understand why I'm obliged to use Neko instead of Tika. As far as I
know Tika can parse more than 1,200 different formats.

Kr, Arcondo


On Fri, Jan 4, 2013 at 7:47 PM, <al...@aim.com> wrote:

> move or copy that jar file to local/lib and try again.
>
> hth.
> Alex.
>
> -----Original Message-----
> From: Arcondo <ar...@gmail.com>
> To: user <us...@nutch.apache.org>
> Sent: Fri, Jan 4, 2013 2:55 am
> Subject: Re: Native Hadoop library not loaded and Cannot parse sites
> contents
>
>
> Hope that now you can see them
>
> Plugin folder
> <http://lucene.472066.n3.nabble.com/file/n4030524/plugin_folder.png>
>
> Parse Job
>
> <http://lucene.472066.n3.nabble.com/file/n4030524/parse_job.png>
>
> Parse error : Hadoop.log
>
> <http://lucene.472066.n3.nabble.com/file/n4030524/parse_error.png>
>
> My nutch-site.xml (plugin includes)
>
> <property>
> <name>plugin.includes</name>
>
> <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
>  <description>Regular expression naming plugin directory names to
>   include.  Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin.
>  By default Nutch includes crawling just HTML and plain text via HTTP,
>    and basic indexing and search plugins. In order to use HTTPS please
>  enable
>    protocol-httpclient, but be aware of possible intermittent problems
>  with the
>   underlying commons-httpclient library.
>   </description>
>  </property>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Native-Hadoop-library-not-loaded-and-Cannot-parse-sites-contents-tp4029542p4030524.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>
>

 

Re: Native Hadoop library not loaded and Cannot parse sites contents

Posted by Arcondo Dasilva <ar...@gmail.com>.
Hi Alex,

I tried. That was the first thing I did but without success.
I don't understand why I'm obliged to use Neko instead of Tika. As far as I
know Tika can parse more than 1,200 different formats.

Kr, Arcondo


On Fri, Jan 4, 2013 at 7:47 PM, <al...@aim.com> wrote:

> move or copy that jar file to local/lib and try again.
>
> hth.
> Alex.
>
> -----Original Message-----
> From: Arcondo <ar...@gmail.com>
> To: user <us...@nutch.apache.org>
> Sent: Fri, Jan 4, 2013 2:55 am
> Subject: Re: Native Hadoop library not loaded and Cannot parse sites
> contents
>
>
> Hope that now you can see them
>
> Plugin folder
> <http://lucene.472066.n3.nabble.com/file/n4030524/plugin_folder.png>
>
> Parse Job
>
> <http://lucene.472066.n3.nabble.com/file/n4030524/parse_job.png>
>
> Parse error : Hadoop.log
>
> <http://lucene.472066.n3.nabble.com/file/n4030524/parse_error.png>
>
> My nutch-site.xml (plugin includes)
>
> <property>
> <name>plugin.includes</name>
>
> <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
>  <description>Regular expression naming plugin directory names to
>   include.  Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin.
>  By default Nutch includes crawling just HTML and plain text via HTTP,
>    and basic indexing and search plugins. In order to use HTTPS please
>  enable
>    protocol-httpclient, but be aware of possible intermittent problems
>  with the
>   underlying commons-httpclient library.
>   </description>
>  </property>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Native-Hadoop-library-not-loaded-and-Cannot-parse-sites-contents-tp4029542p4030524.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>
>

Re: Native Hadoop library not loaded and Cannot parse sites contents

Posted by al...@aim.com.
move or copy that jar file to local/lib and try again.

hth.
Alex.

-----Original Message-----
From: Arcondo <ar...@gmail.com>
To: user <us...@nutch.apache.org>
Sent: Fri, Jan 4, 2013 2:55 am
Subject: Re: Native Hadoop library not loaded and Cannot parse sites contents


Hope that now you can see them

Plugin folder
<http://lucene.472066.n3.nabble.com/file/n4030524/plugin_folder.png> 

Parse Job

<http://lucene.472066.n3.nabble.com/file/n4030524/parse_job.png> 

Parse error : Hadoop.log

<http://lucene.472066.n3.nabble.com/file/n4030524/parse_error.png> 

My nutch-site.xml (plugin includes)

<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
 <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin.
 By default Nutch includes crawling just HTML and plain text via HTTP,
   and basic indexing and search plugins. In order to use HTTPS please
 enable
   protocol-httpclient, but be aware of possible intermittent problems
 with the
  underlying commons-httpclient library.
  </description>
 </property>

--
View this message in context: http://lucene.472066.n3.nabble.com/Native-Hadoop-library-not-loaded-and-Cannot-parse-sites-contents-tp4029542p4030524.html
Sent from the Nutch - User mailing list archive at Nabble.com.

 

Re: Native Hadoop library not loaded and Cannot parse sites contents

Posted by Arcondo <ar...@gmail.com>.
Hope that now you can see them

Plugin folder
<http://lucene.472066.n3.nabble.com/file/n4030524/plugin_folder.png> 

Parse Job

<http://lucene.472066.n3.nabble.com/file/n4030524/parse_job.png> 

Parse error : Hadoop.log

<http://lucene.472066.n3.nabble.com/file/n4030524/parse_error.png> 

My nutch-site.xml (plugin includes)

<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
 <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin.
 By default Nutch includes crawling just HTML and plain text via HTTP,
   and basic indexing and search plugins. In order to use HTTPS please
 enable
   protocol-httpclient, but be aware of possible intermittent problems
 with the
  underlying commons-httpclient library.
  </description>
 </property>

--
View this message in context: http://lucene.472066.n3.nabble.com/Native-Hadoop-library-not-loaded-and-Cannot-parse-sites-contents-tp4029542p4030524.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Native Hadoop library not loaded and Cannot parse sites contents

Posted by Tejas Patil <te...@gmail.com>.
Is it just me, or can nobody else see the images attached inline by Arcondo?


On Fri, Jan 4, 2013 at 1:18 AM, Tejas Patil <te...@gmail.com>wrote:

>
>
>
> On Thu, Jan 3, 2013 at 10:38 PM, Arcondo Dasilva <
> arcondo.dasilva@gmail.com> wrote:
>
>> Hi Lewis,
>>
>> Thanks for your feedback. I went through the process step by step and I'm
>> still getting the error :
>>
>> my plugins folder looks like this :
>>
>> [image: Inline image 1]
>>
>> When I ran the parse job it gave me this :
>>
>> [image: Inline image 2]
>>
>> when I look at the log file, I get this :
>>
>> [image: Inline image 3]
>>
>> My nutch-site.xml contains this :
>>
>> <property>
>>   <name>plugin.includes</name>
>>
>> <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
>>  <description>Regular expression naming plugin directory names to
>>   include.  Any plugin not matching this expression is excluded.
>>   In any case you need at least include the nutch-extensionpoints plugin.
>> By
>>   default Nutch includes crawling just HTML and plain text via HTTP,
>>   and basic indexing and search plugins. In order to use HTTPS please
>> enable
>>   protocol-httpclient, but be aware of possible intermittent problems
>> with the
>>   underlying commons-httpclient library.
>>   </description>
>> </property>
>>
>>
>> am I missing something else ?
>>
>> Thanks for your precious help.
>>
>> Arcondo.
>>
>>
>>
>> On Thu, Jan 3, 2013 at 11:20 PM, Lewis John Mcgibbney <
>> lewis.mcgibbney@gmail.com> wrote:
>>
>>> Hi Arcondo,
>>>
>>> The nekohtml jar should be version 0.9.5, and should reside in
>>> build/plugins/lib-nekohtml once you build Nutch from source.
>>> Once you use the default 'runtime' target, the corresponding plugins
>>> folders should be copied into runtime/local/plugins
>>> Can you check that the jar is copied to this directory before attempting
>>> to
>>> parse the URLs in your segment(s) if using 1.x.
>>> I'm also assuming that you have parse-html included in the
>>> plugin.includes
>>> property within nutch-site.xml before building the source.
>>>
>>> Lewis
>>>
>>> On Thu, Jan 3, 2013 at 9:11 PM, Arcondo Dasilva
>>> <ar...@gmail.com>wrote:
>>>
>>> > Thanks for the explanation. I'm more a functional guy with no solid
>>> > background in Java.
>>> > Could you give some details on how to enforce it manually ?
>>> >
>>> > Thanks in advance, Arcondo
>>> >
>>> >
>>> >
>>> > On Thu, Jan 3, 2013 at 2:49 PM, Lewis John Mcgibbney <
>>> > lewis.mcgibbney@gmail.com> wrote:
>>> >
>>> > > the jar is not on the classpath
>>> >
>>>
>>>
>>>
>>> --
>>> *Lewis*
>>>
>>
>>
>

Re: Native Hadoop library not loaded and Cannot parse sites contents

Posted by Tejas Patil <te...@gmail.com>.
On Thu, Jan 3, 2013 at 10:38 PM, Arcondo Dasilva
<ar...@gmail.com>wrote:

> Hi Lewis,
>
> Thanks for your feedback. I went through the process step by step and I'm
> still getting the error :
>
> my plugins folder looks like this :
>
> [image: Inline image 1]
>
> When I ran the parse job it gave me this :
>
> [image: Inline image 2]
>
> when I look at the log file, I get this :
>
> [image: Inline image 3]
>
> My nutch-site.xml contains this :
>
> <property>
>   <name>plugin.includes</name>
>
> <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
>  <description>Regular expression naming plugin directory names to
>   include.  Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin.
> By
>   default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins. In order to use HTTPS please
> enable
>   protocol-httpclient, but be aware of possible intermittent problems with
> the
>   underlying commons-httpclient library.
>   </description>
> </property>
>
>
> am I missing something else ?
>
> Thanks for your precious help.
>
> Arcondo.
>
>
>
> On Thu, Jan 3, 2013 at 11:20 PM, Lewis John Mcgibbney <
> lewis.mcgibbney@gmail.com> wrote:
>
>> Hi Arcondo,
>>
>> The nekohtml jar should be version 0.9.5, and should reside in
>> build/plugins/lib-nekohtml once you build Nutch from source.
>> Once you use the default 'runtime' target, the corresponding plugins
>> folders should be copied into runtime/local/plugins
>> Can you check that the jar is copied to this directory before attempting
>> to
>> parse the URLs in your segment(s) if using 1.x.
>> I'm also assuming that you have parse-html included in the plugin.includes
>> property within nutch-site.xml before building the source.
>>
>> Lewis
>>
>> On Thu, Jan 3, 2013 at 9:11 PM, Arcondo Dasilva
>> <ar...@gmail.com>wrote:
>>
>> > Thanks for the explanation. I'm more a functional guy with no solid
>> > background in Java.
>> > Could you give some details on how to enforce it manually ?
>> >
>> > Thanks in advance, Arcondo
>> >
>> >
>> >
>> > On Thu, Jan 3, 2013 at 2:49 PM, Lewis John Mcgibbney <
>> > lewis.mcgibbney@gmail.com> wrote:
>> >
>> > > the jar is not on the classpath
>> >
>>
>>
>>
>> --
>> *Lewis*
>>
>
>

Re: Native Hadoop library not loaded and Cannot parse sites contents

Posted by Arcondo Dasilva <ar...@gmail.com>.
Hi Lewis,

Thanks for your feedback. I went through the process step by step and I'm
still getting the error :

my plugins folder looks like this :

[image: Inline image 1]

When I ran the parse job it gave me this :

[image: Inline image 2]

when I look at the log file, I get this :

[image: Inline image 3]

My nutch-site.xml contains this :

<property>
  <name>plugin.includes</name>

<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
 <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with
the
  underlying commons-httpclient library.
  </description>
</property>
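For reference, Nutch treats the plugin.includes value as a regular expression and activates only plugin ids that match it. A rough Python analogue of that filtering (the exact matching semantics inside Nutch may differ slightly; this is an illustration, not Nutch's code):

```python
import re

# The plugin.includes value from the nutch-site.xml above.
PLUGIN_INCLUDES = (
    r"protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)"
    r"|urlnormalizer-(pass|regex|basic)|scoring-opic"
)
pattern = re.compile(PLUGIN_INCLUDES)

def included(plugin_id: str) -> bool:
    """True if the plugin id is activated by the includes regex."""
    # fullmatch mirrors Java's Matcher.matches(): the whole id must match.
    return pattern.fullmatch(plugin_id) is not None

print(included("parse-tika"))           # listed via parse-(html|tika)
print(included("protocol-httpclient"))  # not listed unless you add it
```

This makes it easy to see why, for example, protocol-httpclient stays disabled until it is explicitly added to the value.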


am I missing something else ?

Thanks for your precious help.

Arcondo.



On Thu, Jan 3, 2013 at 11:20 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi Arcondo,
>
> The nekohtml jar should be version 0.9.5, and should reside in
> build/plugins/lib-nekohtml once you build Nutch from source.
> Once you use the default 'runtime' target, the corresponding plugins
> folders should be copied into runtime/local/plugins
> Can you check that the jar is copied to this directory before attempting to
> parse the URLs in your segment(s) if using 1.x.
> I'm also assuming that you have parse-html included in the plugin.includes
> property within nutch-site.xml before building the source.
>
> Lewis
>
> On Thu, Jan 3, 2013 at 9:11 PM, Arcondo Dasilva
> <ar...@gmail.com>wrote:
>
> > Thanks for the explanation. I'm more a functional guy with no solid
> > background in Java.
> > Could you give some details on how to enforce it manually ?
> >
> > Thanks in advance, Arcondo
> >
> >
> >
> > On Thu, Jan 3, 2013 at 2:49 PM, Lewis John Mcgibbney <
> > lewis.mcgibbney@gmail.com> wrote:
> >
> > > the jar is not on the classpath
> >
>
>
>
> --
> *Lewis*
>

Re: Native Hadoop library not loaded and Cannot parse sites contents

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Arcondo,

The nekohtml jar should be version 0.9.5, and should reside in
build/plugins/lib-nekohtml once you build Nutch from source.
Once you use the default 'runtime' target, the corresponding plugins
folders should be copied into runtime/local/plugins
Can you check that the jar is copied to this directory before attempting to
parse the URLs in your segment(s) if using 1.x.
I'm also assuming that you have parse-html included in the plugin.includes
property within nutch-site.xml before building the source.
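The jar check above can be scripted. A small sketch that scans a deployed plugins directory for the nekohtml jar (the directory layout is assumed from this thread; adjust the path to your own checkout):

```python
from pathlib import Path

def find_jars(plugins_dir, name_fragment):
    """Return all jar paths under plugins_dir whose filename contains name_fragment."""
    root = Path(plugins_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("*.jar") if name_fragment in p.name)

# Example (path is an assumption based on this thread):
#   jars = find_jars("/opt/nutch21/runtime/local/plugins", "nekohtml")
# An empty result means the build did not copy the jar, and parsing will
# fail with a ClassNotFoundException like the one reported here.
```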

Lewis

On Thu, Jan 3, 2013 at 9:11 PM, Arcondo Dasilva
<ar...@gmail.com>wrote:

> Thanks for the explanation. I'm more a functional guy with no solid
> background in Java.
> Could you give some details on how to enforce it manually ?
>
> Thanks in advance, Arcondo
>
>
>
> On Thu, Jan 3, 2013 at 2:49 PM, Lewis John Mcgibbney <
> lewis.mcgibbney@gmail.com> wrote:
>
> > the jar is not on the classpath
>



-- 
*Lewis*

Re: Native Hadoop library not loaded and Cannot parse sites contents

Posted by Arcondo Dasilva <ar...@gmail.com>.
Thanks for the explanation. I'm more a functional guy with no solid
background in Java.
Could you give some details on how to enforce it manually?

Thanks in advance, Arcondo



On Thu, Jan 3, 2013 at 2:49 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> the jar is not on the classpath

Re: Native Hadoop library not loaded and Cannot parse sites contents

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Arcondo,

As Tejas pointed out, the jar is not on the classpath. This should be
automated by the Ant and Ivy configuration in Nutch however if it is not
then simply manually enforce it.

Lewis
On Wed, Jan 2, 2013 at 9:43 PM, Arcondo <ar...@gmail.com> wrote:

> Hello,
>
> I ran "ant clean" and then rebuilt, but I'm still getting the same issue.
> I checked my .ivy2 folder:
>
> <http://lucene.472066.n3.nabble.com/file/n4030135/nekohtml.png>
>
>
> and I'm still getting: java.lang.ClassNotFoundException:
> *org.cyberneko.html.HTMLComponent*
>
> any other insights ?
>
> Thanks,
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Native-Hadoop-library-not-loaded-and-Cannot-parse-sites-contents-tp4029542p4030135.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>



-- 
*Lewis*

Re: Native Hadoop library not loaded and Cannot parse sites contents

Posted by Arcondo <ar...@gmail.com>.
Hello,

I ran "ant clean" and then rebuilt, but I'm still getting the same issue.
I checked my .ivy2 folder:

<http://lucene.472066.n3.nabble.com/file/n4030135/nekohtml.png> 


and I'm still getting: java.lang.ClassNotFoundException:
*org.cyberneko.html.HTMLComponent*

any other insights ?

Thanks, 



--
View this message in context: http://lucene.472066.n3.nabble.com/Native-Hadoop-library-not-loaded-and-Cannot-parse-sites-contents-tp4029542p4030135.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Native Hadoop library not loaded and Cannot parse sites contents

Posted by Tejas Patil <te...@gmail.com>.
The exception indicates that the nekohtml jar is not present. In case you
are using the source distribution, do an "ant clean" and then build again
in shell. The nekohtml jar must be present at location
{$USER_HOME}/.ivy2/cache/nekohtml/nekohtml/jars.
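Because jar files are just zip archives, you can also verify that a candidate jar actually contains the missing class before copying it around. A minimal sketch (the example path in the comment is an assumption from this thread):

```python
import zipfile

def jar_has_class(jar_path, class_name):
    """True if the jar contains a .class entry for the fully qualified class name."""
    # Java class org.cyberneko.html.HTMLComponent lives at
    # org/cyberneko/html/HTMLComponent.class inside the jar.
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()

# Example:
#   jar_has_class(
#       os.path.expanduser("~/.ivy2/cache/nekohtml/nekohtml/jars/nekohtml-0.9.5.jar"),
#       "org.cyberneko.html.HTMLComponent")
```

If this returns False for every jar on the classpath, the ClassNotFoundException reported in this thread is expected.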

Thanks,
Tejas Patil

On Thu, Dec 27, 2012 at 1:26 PM, Arcondo Dasilva
<ar...@gmail.com>wrote:

> org/cyberneko/html/HTMLComponent
>