Posted to user@nutch.apache.org by Nico Sabbi <ns...@officinedigitali.it> on 2008/08/06 12:36:03 UTC
How to use the summarizer and the highlighter?
Hi,
I'm using Nutch 0.9 as a replacement for htdig, so I'd like to have
all the features of htdig in the results, including the excerpt/summary.
I read that there's a summary-lucene plugin that should do what I need,
but I don't know whether I'm actually using it or have configured it
incorrectly, because I still can't get what I need.
I enabled (hopefully) the plugin in conf/nutch-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>plugin.includes</name><value>protocol-file|protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)|summary-lucene|summary-basic</value>
</property>
</configuration>
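A minimal sketch for checking which plugin ids the value above would select. It assumes that plugin.includes is applied as a regular expression that must match a whole plugin id; that is an assumption about Nutch's behaviour, not something stated in this post. The candidate ids and the helper names (plugin_includes, matching_ids) are illustrative only and would need to be replaced with the directory names actually present under plugins/.

```python
import re
import xml.etree.ElementTree as ET

# The <property> block from this post, reduced to what the sketch needs.
SITE_XML = """<configuration>
<property>
<name>plugin.includes</name><value>protocol-file|protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)|summary-lucene|summary-basic</value>
</property>
</configuration>"""

# Hypothetical candidate ids, for illustration only.
CANDIDATES = [
    "protocol-http",
    "parse-html",
    "parse-js",
    "summary-lucene",
    "urlfilter-regex",
    "scoring-opic",
]


def plugin_includes(xml_text):
    """Return the value of the plugin.includes property, or None if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == "plugin.includes":
            return prop.findtext("value")
    return None


def matching_ids(pattern, candidate_ids):
    """Return the ids that the pattern matches in full (assumed semantics)."""
    regex = re.compile(pattern)
    return [pid for pid in candidate_ids if regex.fullmatch(pid)]


if __name__ == "__main__":
    pattern = plugin_includes(SITE_XML)
    selected = matching_ids(pattern, CANDIDATES)
    for pid in CANDIDATES:
        status = "included" if pid in selected else "NOT included"
        print(pid, "->", status)
```

Because this property overrides the default value rather than extending it, any id that the pattern does not match would be left out under the assumption above.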
Then I ran my crawl command:
nico@nico2:~/nutch-0.9> ./bin/nutch crawl bin/urls -depth 3 -dir vdb
crawl started in: vdb
rootUrlDir = bin/urls
threads = 10
depth = 3
Injector: starting
Injector: crawlDb: vdb/crawldb
Injector: urlDir: bin/urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)
Without the <property> block in nutch-site.xml, nutch runs without
problems.
I also tried to run the plugin from the command line, but nutch fails,
complaining that the summarizer doesn't have a main().
Can you please tell me what I did wrong and post an exact command line?
I find the documentation quite confusing and I can't go any further just
reading it.
Thanks,
Nico