Posted to user@nutch.apache.org by Arcondo Dasilva <ar...@gmail.com> on 2012/12/27 22:26:01 UTC
Native Hadoop library not loaded and Cannot parse sites contents
Hello,
When I read hadoop.log, there are a lot of things going wrong. I
googled, but it seems these kinds of errors are not often reported. Could you
please help me figure out how to get rid of them?
Thanks in advance for your help.
I have also attached nutch-site.xml and regex-urlfilter.txt.
My crawl command: *bin/nutch crawl urls -depth 3 -topN 500 -threads 10*
My hadoop.log is below.
uname -a: *Linux drupal7 2.6.32-5-686 #1 SMP Sun May 6 04:01:19 UTC 2012
i686 GNU/Linux*
------------------------------------------------------------------------
2012-12-27 20:47:23,152 WARN util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2012-12-27 20:47:23,239 WARN snappy.LoadSnappy - Snappy native library not
loaded
2012-12-27 20:47:23,686 INFO mapreduce.GoraRecordWriter -
gora.buffer.write.limit = 10000
2012-12-27 20:47:23,691 INFO plugin.PluginRepository - Plugins: looking
in: /opt/nutch21/runtime/local/plugins
2012-12-27 20:47:24,035 INFO plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2012-12-27 20:47:24,035 INFO plugin.PluginRepository - Registered Plugins:
2012-12-27 20:47:24,035 INFO plugin.PluginRepository - the nutch core
extension points (nutch-extensionpoints)
2012-12-27 20:47:24,035 INFO plugin.PluginRepository - Basic URL
Normalizer (urlnormalizer-basic)
2012-12-27 20:47:24,035 INFO plugin.PluginRepository - Html Parse Plug-in
(parse-html)
2012-12-27 20:47:24,035 INFO plugin.PluginRepository - Basic Indexing
Filter (index-basic)
2012-12-27 20:47:24,035 INFO plugin.PluginRepository - HTTP Framework
(lib-http)
2012-12-27 20:47:24,035 INFO plugin.PluginRepository - Pass-through URL
Normalizer (urlnormalizer-pass)
2012-12-27 20:47:24,035 INFO plugin.PluginRepository - Regex URL Filter
(urlfilter-regex)
2012-12-27 20:47:24,035 INFO plugin.PluginRepository - Http Protocol
Plug-in (protocol-http)
2012-12-27 20:47:24,035 INFO plugin.PluginRepository - Regex URL
Normalizer (urlnormalizer-regex)
2012-12-27 20:47:24,035 INFO plugin.PluginRepository - Tika Parser Plug-in
(parse-tika)
2012-12-27 20:47:24,035 INFO plugin.PluginRepository - OPIC Scoring
Plug-in (scoring-opic)
2012-12-27 20:47:24,035 INFO plugin.PluginRepository - CyberNeko HTML
Parser (lib-nekohtml)
2012-12-27 20:47:24,035 INFO plugin.PluginRepository - Anchor Indexing
Filter (index-anchor)
2012-12-27 20:47:24,035 INFO plugin.PluginRepository - Regex URL Filter
Framework (lib-regex-filter)
2012-12-27 20:47:24,035 INFO plugin.PluginRepository - Registered
Extension-Points:
2012-12-27 20:47:24,035 INFO plugin.PluginRepository - Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2012-12-27 20:47:24,035 INFO plugin.PluginRepository - Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2012-12-27 20:47:24,035 INFO plugin.PluginRepository - Parse Filter
(org.apache.nutch.parse.ParseFilter)
2012-12-27 20:47:24,035 INFO plugin.PluginRepository - Nutch URL Filter
(org.apache.nutch.net.URLFilter)
2012-12-27 20:47:24,035 INFO plugin.PluginRepository - Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2012-12-27 20:47:24,035 INFO plugin.PluginRepository - Nutch Content
Parser (org.apache.nutch.parse.Parser)
2012-12-27 20:47:24,035 INFO plugin.PluginRepository - Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2012-12-27 20:47:24,221 INFO regex.RegexURLNormalizer - can't find rules
for scope 'inject', using default
2012-12-27 20:47:26,608 WARN mapred.FileOutputCommitter - Output path is
null in cleanup
2012-12-27 20:47:27,796 INFO mapreduce.GoraRecordReader -
gora.buffer.read.limit = 10000
2012-12-27 20:47:28,449 INFO crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2012-12-27 20:47:28,449 INFO crawl.AbstractFetchSchedule -
defaultInterval=2592000
2012-12-27 20:47:28,449 INFO crawl.AbstractFetchSchedule -
maxInterval=7776000
2012-12-27 20:47:28,499 INFO regex.RegexURLNormalizer - can't find rules
for scope 'generate_host_count', using default
2012-12-27 20:47:30,815 INFO mapreduce.GoraRecordWriter -
gora.buffer.write.limit = 10000
2012-12-27 20:47:33,751 WARN mapred.FileOutputCommitter - Output path is
null in cleanup
2012-12-27 20:47:34,741 INFO fetcher.FetcherJob - FetcherJob: threads: 10
2012-12-27 20:47:34,741 INFO fetcher.FetcherJob - FetcherJob: parsing:
false
2012-12-27 20:47:34,741 INFO fetcher.FetcherJob - FetcherJob: resuming:
false
2012-12-27 20:47:34,741 INFO fetcher.FetcherJob - FetcherJob : timelimit
set for : -1
2012-12-27 20:47:35,072 INFO http.Http - http.proxy.host = null
2012-12-27 20:47:35,073 INFO http.Http - http.proxy.port = 8080
2012-12-27 20:47:35,073 INFO http.Http - http.timeout = 10000
2012-12-27 20:47:35,073 INFO http.Http - http.content.limit = 65536
2012-12-27 20:47:35,073 INFO http.Http - http.agent =
ABNutchSpider/Nutch-2.1
2012-12-27 20:47:35,073 INFO http.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3
2012-12-27 20:47:35,073 INFO http.Http - http.accept =
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2012-12-27 20:47:35,190 INFO mapreduce.GoraRecordReader -
gora.buffer.read.limit = 10000
2012-12-27 20:47:38,178 INFO mapreduce.GoraRecordWriter -
gora.buffer.write.limit = 10000
2012-12-27 20:47:38,180 INFO fetcher.FetcherJob - Using queue mode : byHost
2012-12-27 20:47:38,180 INFO fetcher.FetcherJob - Fetcher: threads: 10
2012-12-27 20:47:38,192 INFO fetcher.FetcherJob - Fetcher: throughput
threshold: -1
2012-12-27 20:47:38,192 INFO fetcher.FetcherJob - Fetcher: throughput
threshold sequence: 5
2012-12-27 20:47:38,195 INFO fetcher.FetcherJob - fetching
http://www.financialtime.com/
2012-12-27 20:47:38,198 INFO http.Http - http.proxy.host = null
2012-12-27 20:47:38,198 INFO http.Http - http.proxy.port = 8080
2012-12-27 20:47:38,198 INFO http.Http - http.timeout = 10000
2012-12-27 20:47:38,198 INFO http.Http - http.content.limit = 65536
2012-12-27 20:47:38,198 INFO http.Http - http.agent =
ABNutchSpider/Nutch-2.1
2012-12-27 20:47:38,198 INFO http.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3
2012-12-27 20:47:38,198 INFO http.Http - http.accept =
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2012-12-27 20:47:38,202 INFO fetcher.FetcherJob - QueueFeeder finished:
total 5 records. Hit by time limit :0
2012-12-27 20:47:38,699 INFO fetcher.FetcherJob - fetching
http://www.nytimes.com/
2012-12-27 20:47:38,700 INFO fetcher.FetcherJob - fetching
http://www.booking.com/
2012-12-27 20:47:38,701 INFO fetcher.FetcherJob - fetching
http://www.tripadvisor.com/
2012-12-27 20:47:39,070 INFO regex.RegexURLNormalizer - can't find rules
for scope 'fetcher', using default
2012-12-27 20:47:39,236 INFO regex.RegexURLNormalizer - can't find rules
for scope 'fetcher', using default
2012-12-27 20:47:43,193 INFO fetcher.FetcherJob - 10/10
spinwaiting/active, 4 pages, 0 errors, 0.8 0.8 pages/s, 204 204 kb/s, 1
URLs in 1 queues
2012-12-27 20:47:43,193 INFO fetcher.FetcherJob - * queue:
http://www.booking.com
2012-12-27 20:47:43,193 INFO fetcher.FetcherJob - maxThreads = 1
2012-12-27 20:47:43,193 INFO fetcher.FetcherJob - inProgress = 0
2012-12-27 20:47:43,193 INFO fetcher.FetcherJob - crawlDelay = 5000
2012-12-27 20:47:43,193 INFO fetcher.FetcherJob - minCrawlDelay = 0
2012-12-27 20:47:43,193 INFO fetcher.FetcherJob - nextFetchTime =
1356641264070
2012-12-27 20:47:43,193 INFO fetcher.FetcherJob - now =
1356641263193
2012-12-27 20:47:43,193 INFO fetcher.FetcherJob - 0.
http://www.booking.com/index.en.html
2012-12-27 20:47:44,136 INFO fetcher.FetcherJob - fetching
http://www.booking.com/index.en.html
2012-12-27 20:47:44,172 INFO fetcher.FetcherJob - -finishing thread
FetcherThread9, activeThreads=9
2012-12-27 20:47:44,213 INFO fetcher.FetcherJob - -finishing thread
FetcherThread7, activeThreads=8
2012-12-27 20:47:44,213 INFO fetcher.FetcherJob - -finishing thread
FetcherThread8, activeThreads=7
2012-12-27 20:47:44,213 INFO fetcher.FetcherJob - -finishing thread
FetcherThread3, activeThreads=6
2012-12-27 20:47:44,213 INFO fetcher.FetcherJob - -finishing thread
FetcherThread4, activeThreads=5
2012-12-27 20:47:44,213 INFO fetcher.FetcherJob - -finishing thread
FetcherThread6, activeThreads=4
2012-12-27 20:47:44,213 INFO fetcher.FetcherJob - -finishing thread
FetcherThread5, activeThreads=3
2012-12-27 20:47:44,253 INFO fetcher.FetcherJob - -finishing thread
FetcherThread0, activeThreads=2
2012-12-27 20:47:44,510 INFO fetcher.FetcherJob - -finishing thread
FetcherThread2, activeThreads=1
2012-12-27 20:47:45,686 INFO fetcher.FetcherJob - -finishing thread
FetcherThread1, activeThreads=0
2012-12-27 20:47:48,194 INFO fetcher.FetcherJob - 0/0 spinwaiting/active,
5 pages, 0 errors, 0.5 0.2 pages/s, 153 102 kb/s, 0 URLs in 0 queues
2012-12-27 20:47:48,194 INFO fetcher.FetcherJob - -activeThreads=0
2012-12-27 20:47:50,173 WARN mapred.FileOutputCommitter - Output path is
null in cleanup
2012-12-27 20:47:51,168 INFO parse.ParserJob - ParserJob: resuming: false
2012-12-27 20:47:51,168 INFO parse.ParserJob - ParserJob: forced reparse:
false
2012-12-27 20:47:51,168 INFO parse.ParserJob - ParserJob: parsing all
2012-12-27 20:47:51,705 INFO crawl.SignatureFactory - Using Signature
impl: org.apache.nutch.crawl.MD5Signature
2012-12-27 20:47:51,822 INFO mapreduce.GoraRecordReader -
gora.buffer.read.limit = 10000
2012-12-27 20:47:51,839 INFO mapreduce.GoraRecordWriter -
gora.buffer.write.limit = 10000
2012-12-27 20:47:51,847 INFO crawl.SignatureFactory - Using Signature
impl: org.apache.nutch.crawl.MD5Signature
2012-12-27 20:47:51,886 INFO parse.ParserJob - Parsing
http://www.booking.com/
2012-12-27 20:47:51,890 INFO parse.ParserJob - Parsing
http://www.booking.com/index.en.html
2012-12-27 20:47:51,967 WARN parse.ParseUtil - Error parsing
http://www.booking.com/index.en.html
java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError:
org/cyberneko/html/HTMLComponent
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:232)
at java.util.concurrent.FutureTask.get(FutureTask.java:91)
at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:148)
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:129)
at org.apache.nutch.parse.ParseUtil.process(ParseUtil.java:176)
at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:129)
at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:78)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.lang.NoClassDefFoundError: org/cyberneko/html/HTMLComponent
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:295)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at
org.cyberneko.html.parsers.DOMFragmentParser.<init>(DOMFragmentParser.java:127)
at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:255)
at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:238)
at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:173)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:36)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.ClassNotFoundException:
org.cyberneko.html.HTMLComponent
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 24 more
2012-12-27 20:47:51,969 WARN parse.ParseUtil - Unable to successfully
parse content http://www.booking.com/index.en.html of type text/html
2012-12-27 20:47:51,996 INFO parse.ParserJob - Parsing
http://www.financialtime.com/
2012-12-27 20:47:51,996 WARN parse.ParserJob -
http://www.financialtime.com/ skipped. Content of size 20 was truncated to 0
2012-12-27 20:47:51,999 INFO parse.ParserJob - Parsing
http://www.nytimes.com/
2012-12-27 20:47:52,001 WARN parse.ParseUtil - Error parsing
http://www.nytimes.com/
java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError:
org/cyberneko/html/HTMLComponent
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:232)
at java.util.concurrent.FutureTask.get(FutureTask.java:91)
at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:148)
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:129)
at org.apache.nutch.parse.ParseUtil.process(ParseUtil.java:176)
at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:129)
at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:78)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.lang.NoClassDefFoundError: org/cyberneko/html/HTMLComponent
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:295)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at
org.cyberneko.html.parsers.DOMFragmentParser.<init>(DOMFragmentParser.java:127)
at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:255)
at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:238)
at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:173)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:36)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.ClassNotFoundException:
org.cyberneko.html.HTMLComponent
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 24 more
2012-12-27 20:47:52,002 WARN parse.ParseUtil - Unable to successfully
parse content http://www.nytimes.com/ of type text/html
2012-12-27 20:47:52,012 INFO parse.ParserJob - Parsing
http://www.tripadvisor.com/
2012-12-27 20:47:52,018 WARN parse.ParseUtil - Error parsing
http://www.tripadvisor.com/
java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError:
org/cyberneko/html/HTMLComponent
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:232)
at java.util.concurrent.FutureTask.get(FutureTask.java:91)
at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:148)
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:129)
at org.apache.nutch.parse.ParseUtil.process(ParseUtil.java:176)
at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:129)
at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:78)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.lang.NoClassDefFoundError: org/cyberneko/html/HTMLComponent
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:295)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at
org.cyberneko.html.parsers.DOMFragmentParser.<init>(DOMFragmentParser.java:127)
at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:255)
at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:238)
at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:173)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:36)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.ClassNotFoundException:
org.cyberneko.html.HTMLComponent
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 24 more
2012-12-27 20:47:52,019 WARN parse.ParseUtil - Unable to successfully
parse content http://www.tripadvisor.com/ of type text/html
2012-12-27 20:47:54,792 WARN mapred.FileOutputCommitter - Output path is
null in cleanup
2012-12-27 20:47:55,927 INFO mapreduce.GoraRecordReader -
gora.buffer.read.limit = 10000
2012-12-27 20:47:58,897 INFO mapreduce.GoraRecordWriter -
gora.buffer.write.limit = 10000
2012-12-27 20:47:58,897 INFO crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2012-12-27 20:47:58,898 INFO crawl.AbstractFetchSchedule -
defaultInterval=2592000
2012-12-27 20:47:58,898 INFO crawl.AbstractFetchSchedule -
maxInterval=7776000
2012-12-27 20:48:01,894 WARN mapred.FileOutputCommitter - Output path is
null in cleanup
2012-12-27 20:48:02,974 INFO mapreduce.GoraRecordReader -
gora.buffer.read.limit = 10000
2012-12-27 20:48:03,076 INFO crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2012-12-27 20:48:03,077 INFO crawl.AbstractFetchSchedule -
defaultInterval=2592000
2012-12-27 20:48:03,077 INFO crawl.AbstractFetchSchedule -
maxInterval=7776000
2012-12-27 20:48:05,963 INFO mapreduce.GoraRecordWriter -
gora.buffer.write.limit = 10000
2012-12-27 20:48:08,959 WARN mapred.FileOutputCommitter - Output path is
null in cleanup
2012-12-27 20:48:09,953 INFO fetcher.FetcherJob - FetcherJob: threads: 10
2012-12-27 20:48:09,953 INFO fetcher.FetcherJob - FetcherJob: parsing:
false
2012-12-27 20:48:09,953 INFO fetcher.FetcherJob - FetcherJob: resuming:
false
2012-12-27 20:48:09,953 INFO fetcher.FetcherJob - FetcherJob : timelimit
set for : -1
2012-12-27 20:48:09,956 INFO http.Http - http.proxy.host = null
2012-12-27 20:48:09,956 INFO http.Http - http.proxy.port = 8080
2012-12-27 20:48:09,956 INFO http.Http - http.timeout = 10000
2012-12-27 20:48:09,956 INFO http.Http - http.content.limit = 65536
2012-12-27 20:48:09,956 INFO http.Http - http.agent =
ABNutchSpider/Nutch-2.1
2012-12-27 20:48:09,956 INFO http.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3
2012-12-27 20:48:09,956 INFO http.Http - http.accept =
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2012-12-27 20:48:10,029 INFO mapreduce.GoraRecordReader -
gora.buffer.read.limit = 10000
2012-12-27 20:48:13,030 INFO mapreduce.GoraRecordWriter -
gora.buffer.write.limit = 10000
2012-12-27 20:48:13,031 INFO fetcher.FetcherJob - Using queue mode : byHost
2012-12-27 20:48:13,031 INFO fetcher.FetcherJob - Fetcher: threads: 10
2012-12-27 20:48:13,036 INFO fetcher.FetcherJob - Fetcher: throughput
threshold: -1
2012-12-27 20:48:13,036 INFO fetcher.FetcherJob - Fetcher: throughput
threshold sequence: 5
2012-12-27 20:48:13,039 INFO fetcher.FetcherJob - fetching
http://www.booking.com/index.en.html
2012-12-27 20:48:13,040 INFO fetcher.FetcherJob - fetching
http://www.nytimes.com/
2012-12-27 20:48:13,044 INFO http.Http - http.proxy.host = null
2012-12-27 20:48:13,044 INFO http.Http - http.proxy.port = 8080
2012-12-27 20:48:13,044 INFO http.Http - http.timeout = 10000
2012-12-27 20:48:13,044 INFO http.Http - http.content.limit = 65536
2012-12-27 20:48:13,044 INFO http.Http - http.agent =
ABNutchSpider/Nutch-2.1
2012-12-27 20:48:13,044 INFO http.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3
2012-12-27 20:48:13,044 INFO http.Http - http.accept =
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2012-12-27 20:48:13,044 INFO fetcher.FetcherJob - QueueFeeder finished:
total 5 records. Hit by time limit :0
2012-12-27 20:48:13,046 INFO http.Http - http.proxy.host = null
2012-12-27 20:48:13,046 INFO http.Http - http.proxy.port = 8080
2012-12-27 20:48:13,047 INFO http.Http - http.timeout = 10000
2012-12-27 20:48:13,047 INFO http.Http - http.content.limit = 65536
2012-12-27 20:48:13,047 INFO http.Http - http.agent =
ABNutchSpider/Nutch-2.1
2012-12-27 20:48:13,047 INFO http.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3
2012-12-27 20:48:13,047 INFO http.Http - http.accept =
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2012-12-27 20:48:13,541 INFO fetcher.FetcherJob - fetching
http://www.financialtime.com/
2012-12-27 20:48:13,541 INFO fetcher.FetcherJob - fetching
http://www.tripadvisor.com/
2012-12-27 20:48:14,093 INFO regex.RegexURLNormalizer - can't find rules
for scope 'fetcher', using default
2012-12-27 20:48:18,036 INFO fetcher.FetcherJob - 10/10
spinwaiting/active, 4 pages, 0 errors, 0.8 0.8 pages/s, 307 307 kb/s, 1
URLs in 1 queues
2012-12-27 20:48:18,037 INFO fetcher.FetcherJob - * queue:
http://www.booking.com
2012-12-27 20:48:18,037 INFO fetcher.FetcherJob - maxThreads = 1
2012-12-27 20:48:18,037 INFO fetcher.FetcherJob - inProgress = 0
2012-12-27 20:48:18,037 INFO fetcher.FetcherJob - crawlDelay = 5000
2012-12-27 20:48:18,037 INFO fetcher.FetcherJob - minCrawlDelay = 0
2012-12-27 20:48:18,037 INFO fetcher.FetcherJob - nextFetchTime =
1356641298695
2012-12-27 20:48:18,037 INFO fetcher.FetcherJob - now =
1356641298037
2012-12-27 20:48:18,037 INFO fetcher.FetcherJob - 0.
http://www.booking.com/
2012-12-27 20:48:18,703 INFO fetcher.FetcherJob - fetching
http://www.booking.com/
2012-12-27 20:48:18,755 INFO fetcher.FetcherJob - -finishing thread
FetcherThread0, activeThreads=9
2012-12-27 20:48:18,811 INFO regex.RegexURLNormalizer - can't find rules
for scope 'fetcher', using default
2012-12-27 20:48:18,812 INFO fetcher.FetcherJob - -finishing thread
FetcherThread9, activeThreads=8
2012-12-27 20:48:19,052 INFO fetcher.FetcherJob - -finishing thread
FetcherThread5, activeThreads=7
2012-12-27 20:48:19,053 INFO fetcher.FetcherJob - -finishing thread
FetcherThread1, activeThreads=6
2012-12-27 20:48:19,053 INFO fetcher.FetcherJob - -finishing thread
FetcherThread4, activeThreads=5
2012-12-27 20:48:19,053 INFO fetcher.FetcherJob - -finishing thread
FetcherThread6, activeThreads=4
2012-12-27 20:48:19,053 INFO fetcher.FetcherJob - -finishing thread
FetcherThread3, activeThreads=3
2012-12-27 20:48:19,053 INFO fetcher.FetcherJob - -finishing thread
FetcherThread2, activeThreads=2
2012-12-27 20:48:19,073 INFO fetcher.FetcherJob - -finishing thread
FetcherThread7, activeThreads=1
2012-12-27 20:48:19,105 INFO fetcher.FetcherJob - -finishing thread
FetcherThread8, activeThreads=0
2012-12-27 20:48:23,037 INFO fetcher.FetcherJob - 0/0 spinwaiting/active,
5 pages, 0 errors, 0.5 0.2 pages/s, 153 0 kb/s, 0 URLs in 0 queues
2012-12-27 20:48:23,037 INFO fetcher.FetcherJob - -activeThreads=0
2012-12-27 20:48:25,027 WARN mapred.FileOutputCommitter - Output path is
null in cleanup
2012-12-27 20:48:26,024 INFO parse.ParserJob - ParserJob: resuming: false
2012-12-27 20:48:26,024 INFO parse.ParserJob - ParserJob: forced reparse:
false
2012-12-27 20:48:26,024 INFO parse.ParserJob - ParserJob: parsing all
2012-12-27 20:48:26,041 INFO crawl.SignatureFactory - Using Signature
impl: org.apache.nutch.crawl.MD5Signature
2012-12-27 20:48:26,115 INFO mapreduce.GoraRecordReader -
gora.buffer.read.limit = 10000
2012-12-27 20:48:26,122 INFO mapreduce.GoraRecordWriter -
gora.buffer.write.limit = 10000
2012-12-27 20:48:26,123 INFO crawl.SignatureFactory - Using Signature
impl: org.apache.nutch.crawl.MD5Signature
2012-12-27 20:48:26,131 INFO parse.ParserJob - Parsing
http://www.booking.com/
2012-12-27 20:48:26,136 INFO parse.ParserJob - Parsing
http://www.booking.com/index.en.html
2012-12-27 20:48:26,140 WARN parse.ParseUtil - Error parsing
http://www.booking.com/index.en.html
java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError:
org/cyberneko/html/HTMLComponent
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:232)
at java.util.concurrent.FutureTask.get(FutureTask.java:91)
at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:148)
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:129)
at org.apache.nutch.parse.ParseUtil.process(ParseUtil.java:176)
at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:129)
at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:78)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.lang.NoClassDefFoundError: org/cyberneko/html/HTMLComponent
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:295)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at
org.cyberneko.html.parsers.DOMFragmentParser.<init>(DOMFragmentParser.java:127)
at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:255)
at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:238)
at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:173)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:36)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.ClassNotFoundException:
org.cyberneko.html.HTMLComponent
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 24 more
2012-12-27 20:48:26,141 WARN parse.ParseUtil - Unable to successfully
parse content http://www.booking.com/index.en.html of type text/html
2012-12-27 20:48:26,141 INFO parse.ParserJob - Parsing
http://www.financialtime.com/
2012-12-27 20:48:26,142 WARN parse.ParserJob -
http://www.financialtime.com/ skipped. Content of size 20 was truncated to 0
2012-12-27 20:48:26,142 INFO parse.ParserJob - Parsing
http://www.nytimes.com/
2012-12-27 20:48:26,144 WARN parse.ParseUtil - Error parsing
http://www.nytimes.com/
java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError:
org/cyberneko/html/HTMLComponent
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:232)
at java.util.concurrent.FutureTask.get(FutureTask.java:91)
at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:148)
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:129)
at org.apache.nutch.parse.ParseUtil.process(ParseUtil.java:176)
at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:129)
at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:78)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.lang.NoClassDefFoundError: org/cyberneko/html/HTMLComponent
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:295)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at
org.cyberneko.html.parsers.DOMFragmentParser.<init>(DOMFragmentParser.java:127)
at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:255)
at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:238)
at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:173)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:36)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.ClassNotFoundException:
org.cyberneko.html.HTMLComponent
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 24 more
2012-12-27 20:48:26,144 WARN parse.ParseUtil - Unable to successfully
parse content http://www.nytimes.com/ of type text/html
2012-12-27 20:48:26,145 INFO parse.ParserJob - Parsing
http://www.tripadvisor.com/
2012-12-27 20:48:26,147 WARN parse.ParseUtil - Error parsing
http://www.tripadvisor.com/
java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError:
org/cyberneko/html/HTMLComponent
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:232)
at java.util.concurrent.FutureTask.get(FutureTask.java:91)
at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:148)
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:129)
at org.apache.nutch.parse.ParseUtil.process(ParseUtil.java:176)
at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:129)
at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:78)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.lang.NoClassDefFoundError: org/cyberneko/html/HTMLComponent
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:295)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at
org.cyberneko.html.parsers.DOMFragmentParser.<init>(DOMFragmentParser.java:127)
at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:255)
at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:238)
at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:173)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:36)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.ClassNotFoundException:
org.cyberneko.html.HTMLComponent
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 24 more
2012-12-27 20:48:26,147 WARN parse.ParseUtil - Unable to successfully
parse content http://www.tripadvisor.com/ of type text/html
2012-12-27 20:48:29,094 WARN mapred.FileOutputCommitter - Output path is
null in cleanup
2012-12-27 20:48:30,164 INFO mapreduce.GoraRecordReader -
gora.buffer.read.limit = 10000
2012-12-27 20:48:33,158 INFO mapreduce.GoraRecordWriter -
gora.buffer.write.limit = 10000
2012-12-27 20:48:33,158 INFO crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2012-12-27 20:48:33,158 INFO crawl.AbstractFetchSchedule -
defaultInterval=2592000
2012-12-27 20:48:33,158 INFO crawl.AbstractFetchSchedule -
maxInterval=7776000
2012-12-27 20:48:36,153 WARN mapred.FileOutputCommitter - Output path is
null in cleanup
2012-12-27 20:48:37,213 INFO mapreduce.GoraRecordReader -
gora.buffer.read.limit = 10000
2012-12-27 20:48:37,330 INFO crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2012-12-27 20:48:37,330 INFO crawl.AbstractFetchSchedule -
defaultInterval=2592000
2012-12-27 20:48:37,330 INFO crawl.AbstractFetchSchedule -
maxInterval=7776000
2012-12-27 20:48:40,205 INFO mapreduce.GoraRecordWriter -
gora.buffer.write.limit = 10000
2012-12-27 20:48:43,200 WARN mapred.FileOutputCommitter - Output path is
null in cleanup
2012-12-27 20:48:44,198 INFO fetcher.FetcherJob - FetcherJob: threads: 10
2012-12-27 20:48:44,198 INFO fetcher.FetcherJob - FetcherJob: parsing:
false
2012-12-27 20:48:44,198 INFO fetcher.FetcherJob - FetcherJob: resuming:
false
2012-12-27 20:48:44,198 INFO fetcher.FetcherJob - FetcherJob : timelimit
set for : -1
2012-12-27 20:48:44,200 INFO http.Http - http.proxy.host = null
2012-12-27 20:48:44,200 INFO http.Http - http.proxy.port = 8080
2012-12-27 20:48:44,200 INFO http.Http - http.timeout = 10000
2012-12-27 20:48:44,201 INFO http.Http - http.content.limit = 65536
2012-12-27 20:48:44,201 INFO http.Http - http.agent =
ABNutchSpider/Nutch-2.1
2012-12-27 20:48:44,201 INFO http.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3
2012-12-27 20:48:44,201 INFO http.Http - http.accept =
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2012-12-27 20:48:44,292 INFO mapreduce.GoraRecordReader -
gora.buffer.read.limit = 10000
2012-12-27 20:48:47,279 INFO mapreduce.GoraRecordWriter -
gora.buffer.write.limit = 10000
2012-12-27 20:48:47,280 INFO fetcher.FetcherJob - Using queue mode : byHost
2012-12-27 20:48:47,280 INFO fetcher.FetcherJob - Fetcher: threads: 10
2012-12-27 20:48:47,285 INFO fetcher.FetcherJob - Fetcher: throughput
threshold: -1
2012-12-27 20:48:47,285 INFO fetcher.FetcherJob - Fetcher: throughput
threshold sequence: 5
2012-12-27 20:48:47,288 INFO fetcher.FetcherJob - fetching
http://www.booking.com/
2012-12-27 20:48:47,290 INFO http.Http - http.proxy.host = null
2012-12-27 20:48:47,292 INFO fetcher.FetcherJob - QueueFeeder finished:
total 5 records. Hit by time limit :0
2012-12-27 20:48:47,290 INFO http.Http - http.proxy.port = 8080
2012-12-27 20:48:47,292 INFO http.Http - http.timeout = 10000
2012-12-27 20:48:47,292 INFO http.Http - http.content.limit = 65536
2012-12-27 20:48:47,292 INFO http.Http - http.agent =
ABNutchSpider/Nutch-2.1
2012-12-27 20:48:47,292 INFO http.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3
2012-12-27 20:48:47,292 INFO http.Http - http.accept =
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2012-12-27 20:48:47,523 INFO regex.RegexURLNormalizer - can't find rules
for scope 'fetcher', using default
2012-12-27 20:48:47,524 INFO fetcher.FetcherJob - fetching
http://www.financialtime.com/
2012-12-27 20:48:47,791 INFO fetcher.FetcherJob - fetching
http://www.nytimes.com/
2012-12-27 20:48:47,792 INFO fetcher.FetcherJob - fetching
http://www.tripadvisor.com/
2012-12-27 20:48:52,285 INFO fetcher.FetcherJob - 10/10
spinwaiting/active, 4 pages, 0 errors, 0.8 0.8 pages/s, 204 204 kb/s, 1
URLs in 1 queues
2012-12-27 20:48:52,285 INFO fetcher.FetcherJob - * queue:
http://www.booking.com
2012-12-27 20:48:52,285 INFO fetcher.FetcherJob - maxThreads = 1
2012-12-27 20:48:52,285 INFO fetcher.FetcherJob - inProgress = 0
2012-12-27 20:48:52,285 INFO fetcher.FetcherJob - crawlDelay = 5000
2012-12-27 20:48:52,285 INFO fetcher.FetcherJob - minCrawlDelay = 0
2012-12-27 20:48:52,285 INFO fetcher.FetcherJob - nextFetchTime =
1356641332521
2012-12-27 20:48:52,285 INFO fetcher.FetcherJob - now =
1356641332285
2012-12-27 20:48:52,285 INFO fetcher.FetcherJob - 0.
http://www.booking.com/index.en.html
2012-12-27 20:48:52,541 INFO fetcher.FetcherJob - fetching
http://www.booking.com/index.en.html
2012-12-27 20:48:52,565 INFO fetcher.FetcherJob - -finishing thread
FetcherThread8, activeThreads=9
2012-12-27 20:48:52,566 INFO fetcher.FetcherJob - -finishing thread
FetcherThread7, activeThreads=8
2012-12-27 20:48:52,802 INFO fetcher.FetcherJob - -finishing thread
FetcherThread6, activeThreads=7
2012-12-27 20:48:52,802 INFO fetcher.FetcherJob - -finishing thread
FetcherThread2, activeThreads=6
2012-12-27 20:48:52,802 INFO fetcher.FetcherJob - -finishing thread
FetcherThread4, activeThreads=5
2012-12-27 20:48:52,802 INFO fetcher.FetcherJob - -finishing thread
FetcherThread0, activeThreads=4
2012-12-27 20:48:52,802 INFO fetcher.FetcherJob - -finishing thread
FetcherThread5, activeThreads=3
2012-12-27 20:48:52,802 INFO fetcher.FetcherJob - -finishing thread
FetcherThread1, activeThreads=2
2012-12-27 20:48:52,802 INFO fetcher.FetcherJob - -finishing thread
FetcherThread3, activeThreads=1
2012-12-27 20:48:53,419 INFO fetcher.FetcherJob - -finishing thread
FetcherThread9, activeThreads=0
2012-12-27 20:48:57,286 INFO fetcher.FetcherJob - 0/0 spinwaiting/active,
5 pages, 0 errors, 0.5 0.2 pages/s, 153 102 kb/s, 0 URLs in 0 queues
2012-12-27 20:48:57,286 INFO fetcher.FetcherJob - -activeThreads=0
2012-12-27 20:48:59,279 WARN mapred.FileOutputCommitter - Output path is
null in cleanup
2012-12-27 20:48:59,280 INFO parse.ParserJob - ParserJob: resuming: false
2012-12-27 20:48:59,280 INFO parse.ParserJob - ParserJob: forced reparse:
false
2012-12-27 20:48:59,280 INFO parse.ParserJob - ParserJob: parsing all
2012-12-27 20:48:59,295 INFO crawl.SignatureFactory - Using Signature
impl: org.apache.nutch.crawl.MD5Signature
2012-12-27 20:48:59,364 INFO mapreduce.GoraRecordReader -
gora.buffer.read.limit = 10000
2012-12-27 20:48:59,371 INFO mapreduce.GoraRecordWriter -
gora.buffer.write.limit = 10000
2012-12-27 20:48:59,372 INFO crawl.SignatureFactory - Using Signature
impl: org.apache.nutch.crawl.MD5Signature
2012-12-27 20:48:59,384 INFO parse.ParserJob - Parsing
http://www.booking.com/
2012-12-27 20:48:59,385 INFO parse.ParserJob - Parsing
http://www.booking.com/index.en.html
2012-12-27 20:48:59,390 WARN parse.ParseUtil - Error parsing
http://www.booking.com/index.en.html
java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError:
org/cyberneko/html/HTMLComponent
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:232)
at java.util.concurrent.FutureTask.get(FutureTask.java:91)
at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:148)
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:129)
at org.apache.nutch.parse.ParseUtil.process(ParseUtil.java:176)
at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:129)
at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:78)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.lang.NoClassDefFoundError: org/cyberneko/html/HTMLComponent
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:295)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at
org.cyberneko.html.parsers.DOMFragmentParser.<init>(DOMFragmentParser.java:127)
at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:255)
at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:238)
at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:173)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:36)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.ClassNotFoundException:
org.cyberneko.html.HTMLComponent
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 24 more
2012-12-27 20:48:59,391 WARN parse.ParseUtil - Unable to successfully
parse content http://www.booking.com/index.en.html of type text/html
2012-12-27 20:48:59,391 INFO parse.ParserJob - Parsing
http://www.financialtime.com/
2012-12-27 20:48:59,392 WARN parse.ParserJob -
http://www.financialtime.com/ skipped. Content of size 20 was truncated to 0
2012-12-27 20:48:59,392 INFO parse.ParserJob - Parsing
http://www.nytimes.com/
2012-12-27 20:48:59,394 WARN parse.ParseUtil - Error parsing
http://www.nytimes.com/
java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError:
org/cyberneko/html/HTMLComponent
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:232)
at java.util.concurrent.FutureTask.get(FutureTask.java:91)
at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:148)
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:129)
at org.apache.nutch.parse.ParseUtil.process(ParseUtil.java:176)
at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:129)
at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:78)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.lang.NoClassDefFoundError: org/cyberneko/html/HTMLComponent
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:295)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at
org.cyberneko.html.parsers.DOMFragmentParser.<init>(DOMFragmentParser.java:127)
at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:255)
at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:238)
at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:173)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:36)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.ClassNotFoundException:
org.cyberneko.html.HTMLComponent
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 24 more
2012-12-27 20:48:59,394 WARN parse.ParseUtil - Unable to successfully
parse content http://www.nytimes.com/ of type text/html
2012-12-27 20:48:59,395 INFO parse.ParserJob - Parsing
http://www.tripadvisor.com/
2012-12-27 20:48:59,398 WARN parse.ParseUtil - Error parsing
http://www.tripadvisor.com/
java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError:
org/cyberneko/html/HTMLComponent
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:232)
at java.util.concurrent.FutureTask.get(FutureTask.java:91)
at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:148)
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:129)
at org.apache.nutch.parse.ParseUtil.process(ParseUtil.java:176)
at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:129)
at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:78)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.lang.NoClassDefFoundError: org/cyberneko/html/HTMLComponent
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:295)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at
org.cyberneko.html.parsers.DOMFragmentParser.<init>(DOMFragmentParser.java:127)
at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:255)
at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:238)
at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:173)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:36)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.ClassNotFoundException:
org.cyberneko.html.HTMLComponent
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 24 more
2012-12-27 20:48:59,398 WARN parse.ParseUtil - Unable to successfully
parse content http://www.tripadvisor.com/ of type text/html
2012-12-27 20:49:02,346 WARN mapred.FileOutputCommitter - Output path is
null in cleanup
2012-12-27 20:49:03,416 INFO mapreduce.GoraRecordReader -
gora.buffer.read.limit = 10000
2012-12-27 20:49:06,406 INFO mapreduce.GoraRecordWriter -
gora.buffer.write.limit = 10000
2012-12-27 20:49:06,407 INFO crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2012-12-27 20:49:06,407 INFO crawl.AbstractFetchSchedule -
defaultInterval=2592000
2012-12-27 20:49:06,407 INFO crawl.AbstractFetchSchedule -
maxInterval=7776000
2012-12-27 20:49:09,403 WARN mapred.FileOutputCommitter - Output path is
null in cleanup
------------------------------------------------------------------------
RE: Native Hadoop library not loaded and Cannot parse sites contents
Posted by Markus Jelsma <ma...@openindex.io>.
Can't see them either.
-----Original message-----
> From:Tejas Patil <te...@gmail.com>
> Sent: Fri 04-Jan-2013 10:27
> To: user@nutch.apache.org
> Subject: Re: Native Hadoop library not loaded and Cannot parse sites contents
>
> Is it just me, or can nobody else see the images attached inline by Arcondo?
>
>
> On Fri, Jan 4, 2013 at 1:18 AM, Tejas Patil <te...@gmail.com>wrote:
>
> >
> >
> >
> > On Thu, Jan 3, 2013 at 10:38 PM, Arcondo Dasilva <
> > arcondo.dasilva@gmail.com> wrote:
> >
> >> Hi Lewis,
> >>
> >> Thanks for your feedback. I went through the process step by step and I'm
> >> still getting the error :
> >>
> >> my plugins folder looks like this :
> >>
> >> [image: Inline image 1]
> >>
> >> When I ran the parse job it gave me this :
> >>
> >> [image: Inline image 2]
> >>
> >> when I look at the log file, I get this :
> >>
> >> [image: Inline image 3]
> >>
> >> My nutch-site.xml contains this :
> >>
> >> <property>
> >> <name>plugin.includes</name>
> >>
> >> <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
> >> <description>Regular expression naming plugin directory names to
> >> include. Any plugin not matching this expression is excluded.
> >> In any case you need at least include the nutch-extensionpoints plugin.
> >> By
> >> default Nutch includes crawling just HTML and plain text via HTTP,
> >> and basic indexing and search plugins. In order to use HTTPS please
> >> enable
> >> protocol-httpclient, but be aware of possible intermittent problems
> >> with the
> >> underlying commons-httpclient library.
> >> </description>
> >> </property>
> >>
> >>
> >> am I missing something else ?
> >>
> >> Thanks for your precious help.
> >>
> >> Arcondo.
> >>
> >>
> >>
> >> On Thu, Jan 3, 2013 at 11:20 PM, Lewis John Mcgibbney <
> >> lewis.mcgibbney@gmail.com> wrote:
> >>
> >>> Hi Arcondo,
> >>>
> >>> The nekohtml jar should be version 0.9.5, and should reside in
> >>> build/plugins/lib-nekohtml once you build Nutch from source.
> >>> Once you use the default 'runtime' target, the corresponding plugins
> >>> folders should be copied into runtime/local/plugins
> >>> Can you check that the jar is copied to this directory before attempting
> >>> to
> >>> parse the URLs in your segment(s) if using 1.x.
> >>> I'm also assuming that you have parse-html included in the
> >>> plugin.includes
> >>> property within nutch-site.xml before building the source.
> >>>
> >>> Lewis
> >>>
> >>> On Thu, Jan 3, 2013 at 9:11 PM, Arcondo Dasilva
> >>> <ar...@gmail.com>wrote:
> >>>
> >>> > Thanks for the explanation. I'm more a functional guy with no solid
> >>> > background in Java.
> >>> > Could you give some details on how to enforce it manually ?
> >>> >
> >>> > Thanks in advance, Arcondo
> >>> >
> >>> >
> >>> >
> >>> > On Thu, Jan 3, 2013 at 2:49 PM, Lewis John Mcgibbney <
> >>> > lewis.mcgibbney@gmail.com> wrote:
> >>> >
> >>> > > the jar is not on the classpath
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>> *Lewis*
> >>>
> >>
> >>
> >
>
Re: Native Hadoop library not loaded and Cannot parse sites contents
Posted by al...@aim.com.
Hi,
You can unjar the jar file and check whether the class that the parse step complains about is inside it. You can also try putting the contents of the jar file under local/lib; maybe there is some read restriction. If this does not help, I can only suggest starting again with a fresh copy of Nutch.
Alex.
-----Original Message-----
From: Arcondo Dasilva <ar...@gmail.com>
To: user <us...@nutch.apache.org>
Sent: Sat, Jan 5, 2013 1:11 am
Subject: Re: Native Hadoop library not loaded and Cannot parse sites contents
Hi Alex,
I'm using version 2.1 / HBase 0.90.6 / Solr 4.0.
Everything works fine except that I'm not able to parse the contents of my
URLs because of the NekoHTML-not-found error.
My plugin.includes looks like this:
<value>protocol-http|urlfilter-regex|parse-(xml|xhtml|html|tika|text|js)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|lib-nekohtml</value>
I added lib-nekohtml at the end of the allowed values, but it seems to have
no effect on the error.
In my runtime/local/plugins/lib-nekohtml, I have the jar file
nekohtml-0.9.5.jar.
Is there something else I should look for?
Thanks a lot for your help.
Kr, Arcondo
On Fri, Jan 4, 2013 at 11:33 PM, <al...@aim.com> wrote:
> Which version of Nutch is this? Did you follow the tutorial? I can help
> you if you provide all the steps you did, starting with downloading Nutch.
>
> Alex.
>
> -----Original Message-----
> From: Arcondo Dasilva <ar...@gmail.com>
> To: user <us...@nutch.apache.org>
> Sent: Fri, Jan 4, 2013 1:23 pm
> Subject: Re: Native Hadoop library not loaded and Cannot parse sites
> contents
>
>
> Hi Alex,
>
> I tried; that was the first thing I did, but without success.
> I don't understand why I'm obliged to use Neko instead of Tika. As far as I
> know, Tika can parse more than 1,200 different formats.
>
> Kr, Arcondo
>
>
> On Fri, Jan 4, 2013 at 7:47 PM, <al...@aim.com> wrote:
>
> > move or copy that jar file to local/lib and try again.
> >
> > hth.
> > Alex.
> >
> > -----Original Message-----
> > From: Arcondo <ar...@gmail.com>
> > To: user <us...@nutch.apache.org>
> > Sent: Fri, Jan 4, 2013 2:55 am
> > Subject: Re: Native Hadoop library not loaded and Cannot parse sites
> > contents
> >
> >
> > I hope that you can see them now.
> >
> > Plugin folder
> > <http://lucene.472066.n3.nabble.com/file/n4030524/plugin_folder.png>
> >
> > Parse Job
> >
> > <http://lucene.472066.n3.nabble.com/file/n4030524/parse_job.png>
> >
> > Parse error: Hadoop.log
> >
> > <http://lucene.472066.n3.nabble.com/file/n4030524/parse_error.png>
> >
> > My nutch-site.xml (plugin includes)
> >
> > <property>
> > <name>plugin.includes</name>
> >
> >
> <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
> > <description>Regular expression naming plugin directory names to
> > include. Any plugin not matching this expression is excluded.
> > In any case you need at least include the nutch-extensionpoints plugin.
> > By default Nutch includes crawling just HTML and plain text via HTTP,
> > and basic indexing and search plugins. In order to use HTTPS please
> > enable
> > protocol-httpclient, but be aware of possible intermittent problems
> > with the
> > underlying commons-httpclient library.
> > </description>
> > </property>
> >
> > --
> > View this message in context:
> >
> http://lucene.472066.n3.nabble.com/Native-Hadoop-library-not-loaded-and-Cannot-parse-sites-contents-tp4029542p4030524.html
> > Sent from the Nutch - User mailing list archive at Nabble.com.
> >
> >
> >
>
>
>
Re: Native Hadoop library not loaded and Cannot parse sites contents
Posted by Arcondo Dasilva <ar...@gmail.com>.
Hi Alex,
I'm using version 2.1 / HBase 0.90.6 / Solr 4.0.
Everything works fine except that I'm not able to parse the contents of my
URLs because of the NekoHTML-not-found error.
My plugin.includes looks like this:
<value>protocol-http|urlfilter-regex|parse-(xml|xhtml|html|tika|text|js)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|lib-nekohtml</value>
I added lib-nekohtml at the end of the allowed values, but it seems to have
no effect on the error.
In my runtime/local/plugins/lib-nekohtml, I have the jar file
nekohtml-0.9.5.jar.
Is there something else I should look for?
Thanks a lot for your help.
Kr, Arcondo
On Fri, Jan 4, 2013 at 11:33 PM, <al...@aim.com> wrote:
> Which version of Nutch is this? Did you follow the tutorial? I can help
> you if you provide all the steps you did, starting with downloading Nutch.
>
> Alex.
>
> -----Original Message-----
> From: Arcondo Dasilva <ar...@gmail.com>
> To: user <us...@nutch.apache.org>
> Sent: Fri, Jan 4, 2013 1:23 pm
> Subject: Re: Native Hadoop library not loaded and Cannot parse sites
> contents
>
>
> Hi Alex,
>
> I tried; that was the first thing I did, but without success.
> I don't understand why I'm obliged to use Neko instead of Tika. As far as I
> know, Tika can parse more than 1,200 different formats.
>
> Kr, Arcondo
>
>
> On Fri, Jan 4, 2013 at 7:47 PM, <al...@aim.com> wrote:
>
> > move or copy that jar file to local/lib and try again.
> >
> > hth.
> > Alex.
> >
> > -----Original Message-----
> > From: Arcondo <ar...@gmail.com>
> > To: user <us...@nutch.apache.org>
> > Sent: Fri, Jan 4, 2013 2:55 am
> > Subject: Re: Native Hadoop library not loaded and Cannot parse sites
> > contents
> >
> >
> > I hope that you can see them now.
> >
> > Plugin folder
> > <http://lucene.472066.n3.nabble.com/file/n4030524/plugin_folder.png>
> >
> > Parse Job
> >
> > <http://lucene.472066.n3.nabble.com/file/n4030524/parse_job.png>
> >
> > Parse error: Hadoop.log
> >
> > <http://lucene.472066.n3.nabble.com/file/n4030524/parse_error.png>
> >
> > My nutch-site.xml (plugin includes)
> >
> > <property>
> > <name>plugin.includes</name>
> >
> >
> <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
> > <description>Regular expression naming plugin directory names to
> > include. Any plugin not matching this expression is excluded.
> > In any case you need at least include the nutch-extensionpoints plugin.
> > By default Nutch includes crawling just HTML and plain text via HTTP,
> > and basic indexing and search plugins. In order to use HTTPS please
> > enable
> > protocol-httpclient, but be aware of possible intermittent problems
> > with the
> > underlying commons-httpclient library.
> > </description>
> > </property>
> >
> > --
> > View this message in context:
> >
> http://lucene.472066.n3.nabble.com/Native-Hadoop-library-not-loaded-and-Cannot-parse-sites-contents-tp4029542p4030524.html
> > Sent from the Nutch - User mailing list archive at Nabble.com.
> >
> >
> >
>
>
>
Re: Native Hadoop library not loaded and Cannot parse sites contents
Posted by al...@aim.com.
Which version of Nutch is this? Did you follow the tutorial? I can help you if you provide all the steps you did, starting with downloading Nutch.
Alex.
-----Original Message-----
From: Arcondo Dasilva <ar...@gmail.com>
To: user <us...@nutch.apache.org>
Sent: Fri, Jan 4, 2013 1:23 pm
Subject: Re: Native Hadoop library not loaded and Cannot parse sites contents
Hi Alex,
I tried; that was the first thing I did, but without success.
I don't understand why I'm obliged to use Neko instead of Tika. As far as I
know, Tika can parse more than 1,200 different formats.
Kr, Arcondo
On Fri, Jan 4, 2013 at 7:47 PM, <al...@aim.com> wrote:
> move or copy that jar file to local/lib and try again.
>
> hth.
> Alex.
>
> -----Original Message-----
> From: Arcondo <ar...@gmail.com>
> To: user <us...@nutch.apache.org>
> Sent: Fri, Jan 4, 2013 2:55 am
> Subject: Re: Native Hadoop library not loaded and Cannot parse sites
> contents
>
>
> I hope that you can see them now.
>
> Plugin folder
> <http://lucene.472066.n3.nabble.com/file/n4030524/plugin_folder.png>
>
> Parse Job
>
> <http://lucene.472066.n3.nabble.com/file/n4030524/parse_job.png>
>
> Parse error: Hadoop.log
>
> <http://lucene.472066.n3.nabble.com/file/n4030524/parse_error.png>
>
> My nutch-site.xml (plugin includes)
>
> <property>
> <name>plugin.includes</name>
>
> <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
> <description>Regular expression naming plugin directory names to
> include. Any plugin not matching this expression is excluded.
> In any case you need at least include the nutch-extensionpoints plugin.
> By default Nutch includes crawling just HTML and plain text via HTTP,
> and basic indexing and search plugins. In order to use HTTPS please
> enable
> protocol-httpclient, but be aware of possible intermittent problems
> with the
> underlying commons-httpclient library.
> </description>
> </property>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Native-Hadoop-library-not-loaded-and-Cannot-parse-sites-contents-tp4029542p4030524.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>
>
Re: Native Hadoop library not loaded and Cannot parse sites contents
Posted by Arcondo Dasilva <ar...@gmail.com>.
Hi Alex,
I tried; that was the first thing I did, but without success.
I don't understand why I'm obliged to use Neko instead of Tika. As far as I
know, Tika can parse more than 1,200 different formats.
Kr, Arcondo
On Fri, Jan 4, 2013 at 7:47 PM, <al...@aim.com> wrote:
> move or copy that jar file to local/lib and try again.
>
> hth.
> Alex.
>
> -----Original Message-----
> From: Arcondo <ar...@gmail.com>
> To: user <us...@nutch.apache.org>
> Sent: Fri, Jan 4, 2013 2:55 am
> Subject: Re: Native Hadoop library not loaded and Cannot parse sites
> contents
>
>
> I hope that you can see them now.
>
> Plugin folder
> <http://lucene.472066.n3.nabble.com/file/n4030524/plugin_folder.png>
>
> Parse Job
>
> <http://lucene.472066.n3.nabble.com/file/n4030524/parse_job.png>
>
> Parse error: Hadoop.log
>
> <http://lucene.472066.n3.nabble.com/file/n4030524/parse_error.png>
>
> My nutch-site.xml (plugin includes)
>
> <property>
> <name>plugin.includes</name>
>
> <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
> <description>Regular expression naming plugin directory names to
> include. Any plugin not matching this expression is excluded.
> In any case you need at least include the nutch-extensionpoints plugin.
> By default Nutch includes crawling just HTML and plain text via HTTP,
> and basic indexing and search plugins. In order to use HTTPS please
> enable
> protocol-httpclient, but be aware of possible intermittent problems
> with the
> underlying commons-httpclient library.
> </description>
> </property>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Native-Hadoop-library-not-loaded-and-Cannot-parse-sites-contents-tp4029542p4030524.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>
>
Re: Native Hadoop library not loaded and Cannot parse sites contents
Posted by al...@aim.com.
Move or copy that jar file to local/lib and try again.
hth.
Alex.
-----Original Message-----
From: Arcondo <ar...@gmail.com>
To: user <us...@nutch.apache.org>
Sent: Fri, Jan 4, 2013 2:55 am
Subject: Re: Native Hadoop library not loaded and Cannot parse sites contents
I hope that you can see them now.
Plugin folder
<http://lucene.472066.n3.nabble.com/file/n4030524/plugin_folder.png>
Parse Job
<http://lucene.472066.n3.nabble.com/file/n4030524/parse_job.png>
Parse error: Hadoop.log
<http://lucene.472066.n3.nabble.com/file/n4030524/parse_error.png>
My nutch-site.xml (plugin includes)
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin.
By default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please
enable
protocol-httpclient, but be aware of possible intermittent problems
with the
underlying commons-httpclient library.
</description>
</property>
--
View this message in context: http://lucene.472066.n3.nabble.com/Native-Hadoop-library-not-loaded-and-Cannot-parse-sites-contents-tp4029542p4030524.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Native Hadoop library not loaded and Cannot parse sites contents
Posted by Arcondo <ar...@gmail.com>.
I hope that you can see them now.
Plugin folder
<http://lucene.472066.n3.nabble.com/file/n4030524/plugin_folder.png>
Parse Job
<http://lucene.472066.n3.nabble.com/file/n4030524/parse_job.png>
Parse error: Hadoop.log
<http://lucene.472066.n3.nabble.com/file/n4030524/parse_error.png>
My nutch-site.xml (plugin includes)
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin.
By default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please
enable
protocol-httpclient, but be aware of possible intermittent problems
with the
underlying commons-httpclient library.
</description>
</property>
--
View this message in context: http://lucene.472066.n3.nabble.com/Native-Hadoop-library-not-loaded-and-Cannot-parse-sites-contents-tp4029542p4030524.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Native Hadoop library not loaded and Cannot parse sites contents
Posted by Tejas Patil <te...@gmail.com>.
Is it just me, or can nobody else see the images attached inline by Arcondo?
On Fri, Jan 4, 2013 at 1:18 AM, Tejas Patil <te...@gmail.com>wrote:
>
>
>
> On Thu, Jan 3, 2013 at 10:38 PM, Arcondo Dasilva <
> arcondo.dasilva@gmail.com> wrote:
>
>> Hi Lewis,
>>
>> Thanks for your feedback. I went through the process step by step, and I'm
>> still getting the error:
>>
>> My plugins folder looks like this:
>>
>> [image: Inline image 1]
>>
>> When I ran the parse job, it gave me this:
>>
>> [image: Inline image 2]
>>
>> When I look at the log file, I get this:
>>
>> [image: Inline image 3]
>>
>> My nutch-site.xml contains this:
>>
>> <property>
>> <name>plugin.includes</name>
>>
>> <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
>> <description>Regular expression naming plugin directory names to
>> include. Any plugin not matching this expression is excluded.
>> In any case you need at least include the nutch-extensionpoints plugin.
>> By
>> default Nutch includes crawling just HTML and plain text via HTTP,
>> and basic indexing and search plugins. In order to use HTTPS please
>> enable
>> protocol-httpclient, but be aware of possible intermittent problems
>> with the
>> underlying commons-httpclient library.
>> </description>
>> </property>
>>
>>
>> Am I missing something else?
>>
>> Thanks for your precious help.
>>
>> Arcondo.
>>
>>
>>
>> On Thu, Jan 3, 2013 at 11:20 PM, Lewis John Mcgibbney <
>> lewis.mcgibbney@gmail.com> wrote:
>>
>>> Hi Arcondo,
>>>
>>> The nekohtml jar should be version 0.9.5, and should reside in
>>> build/plugins/lib-nekohtml once you build Nutch from source.
>>> Once you use the default 'runtime' target, the corresponding plugins
>>> folders should be copied into runtime/local/plugins
>>> Can you check that the jar is copied to this directory before attempting
>>> to
>>> parse the URLs in your segment(s) if using 1.x?
>>> I'm also assuming that you have parse-html included in the
>>> plugin.includes
>>> property within nutch-site.xml before building the source.
>>>
>>> Lewis
>>>
>>> On Thu, Jan 3, 2013 at 9:11 PM, Arcondo Dasilva
>>> <ar...@gmail.com>wrote:
>>>
>>> > Thanks for the explanation. I'm more a functional guy with no solid
>>> > background in Java.
>>> > Could you give some details on how to enforce it manually?
>>> >
>>> > Thanks in advance, Arcondo
>>> >
>>> >
>>> >
>>> > On Thu, Jan 3, 2013 at 2:49 PM, Lewis John Mcgibbney <
>>> > lewis.mcgibbney@gmail.com> wrote:
>>> >
>>> > > the jar is not on the classpath
>>> >
>>>
>>>
>>>
>>> --
>>> *Lewis*
>>>
>>
>>
>
Re: Native Hadoop library not loaded and Cannot parse sites contents
Posted by Tejas Patil <te...@gmail.com>.
On Thu, Jan 3, 2013 at 10:38 PM, Arcondo Dasilva
<ar...@gmail.com>wrote:
> Hi Lewis,
>
> Thanks for your feedback. I went through the process step by step, and I'm
> still getting the error:
>
> My plugins folder looks like this:
>
> [image: Inline image 1]
>
> When I ran the parse job, it gave me this:
>
> [image: Inline image 2]
>
> When I look at the log file, I get this:
>
> [image: Inline image 3]
>
> My nutch-site.xml contains this:
>
> <property>
> <name>plugin.includes</name>
>
> <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
> <description>Regular expression naming plugin directory names to
> include. Any plugin not matching this expression is excluded.
> In any case you need at least include the nutch-extensionpoints plugin.
> By
> default Nutch includes crawling just HTML and plain text via HTTP,
> and basic indexing and search plugins. In order to use HTTPS please
> enable
> protocol-httpclient, but be aware of possible intermittent problems with
> the
> underlying commons-httpclient library.
> </description>
> </property>
>
>
> Am I missing something else?
>
> Thanks for your precious help.
>
> Arcondo.
>
>
>
> On Thu, Jan 3, 2013 at 11:20 PM, Lewis John Mcgibbney <
> lewis.mcgibbney@gmail.com> wrote:
>
>> Hi Arcondo,
>>
>> The nekohtml jar should be version 0.9.5, and should reside in
>> build/plugins/lib-nekohtml once you build Nutch from source.
>> Once you use the default 'runtime' target, the corresponding plugins
>> folders should be copied into runtime/local/plugins
>> Can you check that the jar is copied to this directory before attempting
>> to
>> parse the URLs in your segment(s) if using 1.x?
>> I'm also assuming that you have parse-html included in the plugin.includes
>> property within nutch-site.xml before building the source.
>>
>> Lewis
>>
>> On Thu, Jan 3, 2013 at 9:11 PM, Arcondo Dasilva
>> <ar...@gmail.com>wrote:
>>
>> > Thanks for the explanation. I'm more a functional guy with no solid
>> > background in Java.
>> > Could you give some details on how to enforce it manually?
>> >
>> > Thanks in advance, Arcondo
>> >
>> >
>> >
>> > On Thu, Jan 3, 2013 at 2:49 PM, Lewis John Mcgibbney <
>> > lewis.mcgibbney@gmail.com> wrote:
>> >
>> > > the jar is not on the classpath
>> >
>>
>>
>>
>> --
>> *Lewis*
>>
>
>
Re: Native Hadoop library not loaded and Cannot parse sites contents
Posted by Arcondo Dasilva <ar...@gmail.com>.
Hi Lewis,
Thanks for your feedback. I went through the process step by step, and I'm
still getting the error:
My plugins folder looks like this:
[image: Inline image 1]
When I ran the parse job, it gave me this:
[image: Inline image 2]
When I look at the log file, I get this:
[image: Inline image 3]
My nutch-site.xml contains this:
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with
the
underlying commons-httpclient library.
</description>
</property>
Am I missing something else?
Thanks for your precious help.
Arcondo.
On Thu, Jan 3, 2013 at 11:20 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:
> Hi Arcondo,
>
> The nekohtml jar should be version 0.9.5, and should reside in
> build/plugins/lib-nekohtml once you build Nutch from source.
> Once you use the default 'runtime' target, the corresponding plugins
> folders should be copied into runtime/local/plugins
> Can you check that the jar is copied to this directory before attempting to
> parse the URLs in your segment(s) if using 1.x?
> I'm also assuming that you have parse-html included in the plugin.includes
> property within nutch-site.xml before building the source.
>
> Lewis
>
> On Thu, Jan 3, 2013 at 9:11 PM, Arcondo Dasilva
> <ar...@gmail.com>wrote:
>
> > Thanks for the explanation. I'm more a functional guy with no solid
> > background in Java.
> > Could you give some details on how to enforce it manually?
> >
> > Thanks in advance, Arcondo
> >
> >
> >
> > On Thu, Jan 3, 2013 at 2:49 PM, Lewis John Mcgibbney <
> > lewis.mcgibbney@gmail.com> wrote:
> >
> > > the jar is not on the classpath
> >
>
>
>
> --
> *Lewis*
>
Re: Native Hadoop library not loaded and Cannot parse sites contents
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Arcondo,
The nekohtml jar should be version 0.9.5, and should reside in
build/plugins/lib-nekohtml once you build Nutch from source.
Once you use the default 'runtime' target, the corresponding plugin
folders should be copied into runtime/local/plugins.
Can you check that the jar is copied to this directory before attempting to
parse the URLs in your segment(s) if using 1.x?
I'm also assuming that you have parse-html included in the plugin.includes
property within nutch-site.xml before building the source.
Lewis
On Thu, Jan 3, 2013 at 9:11 PM, Arcondo Dasilva
<ar...@gmail.com>wrote:
> Thanks for the explanation. I'm more a functional guy with no solid
> background in Java.
> Could you give some details on how to enforce it manually?
>
> Thanks in advance, Arcondo
>
>
>
> On Thu, Jan 3, 2013 at 2:49 PM, Lewis John Mcgibbney <
> lewis.mcgibbney@gmail.com> wrote:
>
> > the jar is not on the classpath
>
--
*Lewis*
Re: Native Hadoop library not loaded and Cannot parse sites contents
Posted by Arcondo Dasilva <ar...@gmail.com>.
Thanks for the explanation. I'm more a functional guy with no solid
background in Java.
Could you give some details on how to enforce it manually?
Thanks in advance, Arcondo
On Thu, Jan 3, 2013 at 2:49 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:
> the jar is not on the classpath
Re: Native Hadoop library not loaded and Cannot parse sites contents
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Arcondo,
As Tejas pointed out, the jar is not on the classpath. This should be
automated by the Ant and Ivy configuration in Nutch; however, if it is not,
you can simply enforce it manually, as sketched below.
Lewis
On Wed, Jan 2, 2013 at 9:43 PM, Arcondo <ar...@gmail.com> wrote:
> Hello,
>
> I did an "ant clean" and then rebuilt, and I'm still getting the same issue.
> I checked in my ivy2 folder:
>
> <http://lucene.472066.n3.nabble.com/file/n4030135/nekohtml.png>
>
>
> and I'm still getting: java.lang.ClassNotFoundException:
> *org.cyberneko.html.HTMLComponent*
>
> Any other insights?
>
> Thanks,
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Native-Hadoop-library-not-loaded-and-Cannot-parse-sites-contents-tp4029542p4030135.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
--
*Lewis*
Re: Native Hadoop library not loaded and Cannot parse sites contents
Posted by Arcondo <ar...@gmail.com>.
Hello,
I did an "ant clean" and then rebuilt, and I'm still getting the same issue.
I checked in my ivy2 folder:
<http://lucene.472066.n3.nabble.com/file/n4030135/nekohtml.png>
and I'm still getting: java.lang.ClassNotFoundException:
*org.cyberneko.html.HTMLComponent*
Any other insights?
Thanks,
--
View this message in context: http://lucene.472066.n3.nabble.com/Native-Hadoop-library-not-loaded-and-Cannot-parse-sites-contents-tp4029542p4030135.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Native Hadoop library not loaded and Cannot parse sites contents
Posted by Tejas Patil <te...@gmail.com>.
The exception indicates that the nekohtml jar is not present. In case you
are using the source distribution, do an "ant clean" and then build again
from the shell. The nekohtml jar must be present at
{$USER_HOME}/.ivy2/cache/nekohtml/nekohtml/jars.
Thanks,
Tejas Patil
On Thu, Dec 27, 2012 at 1:26 PM, Arcondo Dasilva
<ar...@gmail.com>wrote:
> org/cyberneko/html/HTMLComponent
>