Posted to user@nutch.apache.org by Nico Sabbi <ns...@officinedigitali.it> on 2008/08/06 12:36:03 UTC

How to use the summarizer and the highlighter?

Hi,
I'm using nutch 0.9 as a replacement for htdig, so I'd like to have
all of htdig's features in the results, including the excerpt/summary.

I read that there's a summary-lucene plugin that should do what I need,
but I don't know whether I'm actually using it, or using it incorrectly,
because I still can't get the summaries I need.

I enabled (hopefully) the plugin in conf/nutch-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
      <property>
            <name>plugin.includes</name>
            <value>protocol-file|protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)|summary-lucene|summary-basic</value>
      </property>
</configuration>
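
For reference, here is a minimal sketch for checking which plugin ids a
plugin.includes value like the one above would select. It assumes that the
value is treated as a regular expression that must match a whole plugin id.
The list of candidate ids and the helper names (plugin_includes, selected)
are illustrative only, and the sketch is not meant to explain the job
failure shown in the crawl output.

```python
import re
import xml.etree.ElementTree as ET

# Same override as in the configuration above, embedded so the sketch is self-contained.
CONFIG = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-file|protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)|summary-lucene|summary-basic</value>
  </property>
</configuration>
"""


def plugin_includes(xml_text):
    """Return the plugin.includes value from a configuration document, or None."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == "plugin.includes":
            return prop.findtext("value").strip()
    return None


def selected(pattern, plugin_ids):
    """Keep the ids that the pattern matches in full (assumed matching rule)."""
    rx = re.compile(pattern)
    return [pid for pid in plugin_ids if rx.fullmatch(pid)]


# Illustrative candidate ids; not an exhaustive list of available plugins.
CANDIDATES = [
    "protocol-http",
    "parse-html",
    "summary-lucene",
    "summary-basic",
    "urlfilter-regex",
    "scoring-opic",
]

if __name__ == "__main__":
    for pid in selected(plugin_includes(CONFIG), CANDIDATES):
        print(pid)
```

Under that assumption, any id that the pattern does not match in full
(for example urlfilter-regex in the list above) is left out by this override.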

Then I ran my crawl command:

nico@nico2:~/nutch-0.9> ./bin/nutch crawl bin/urls -depth 3 -dir vdb
crawl started in: vdb
rootUrlDir = bin/urls
threads = 10
depth = 3
Injector: starting
Injector: crawlDb: vdb/crawldb
Injector: urlDir: bin/urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)

Without the <property> block in nutch-site.xml, the same crawl runs without
problems.

I also tried to run the plugin from the command line, but nutch fails,
complaining that the summarizer doesn't have a main().

Can you please tell me what I did wrong and post an exact command line?
I find the documentation quite confusing and I can't get any further just
by reading it.

Thanks,
    Nico