Posted to user@nutch.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2009/02/07 14:20:27 UTC

Re: Nutch Post-Processing

(moving this to nutch-user - nutch-agent is for reporting 
abuse/misbehavior of Nutch-based crawlers)

John Crepezzi wrote:
> I'm interested in writing an application that analyzes sources every
> time they are updated, and uses the parsedText, tags, title, etc to
> perform some operations and export the finished data to a database.
>
> I'm not sure where this application should be placed within nutch/lucene,
> so any advice anyone can offer would be greatly appreciated.
>
> I thought plugins would work for me, but I'm unable to find an extension
> point that will give me access to the parsed data and tag sets.

This issue comes up occasionally, but so far no one has been desperate
enough to work out a patch ;)

You can define an additional extension point (see how the existing
HtmlParseFilter extension point is designed; perhaps that extension is
all you need) and invoke the new extension point right after the content
is parsed, so that you can access both the content and the parsed
data/text even before they are recorded in a segment.

The best place to put this hook would be in the ParseUtil class, because
that's what the other Nutch tools use to parse content.
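
To make the idea concrete, here is a rough, self-contained sketch of such a
hook. It does not use the real Nutch API: ParsedPage, PostParseHook and
HookingParser are hypothetical names invented for illustration, and the toy
parse() method only stands in for the place in ParseUtil where the real
call would go.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: none of the types below are real Nutch classes.
final class PostParseHookSketch {

    /** Stand-in for the parsed data available right after parsing. */
    static final class ParsedPage {
        final String url;
        final String title;
        final String text;

        ParsedPage(String url, String title, String text) {
            this.url = url;
            this.title = title;
            this.text = text;
        }
    }

    /** The new extension point: called once per page, right after parsing. */
    interface PostParseHook {
        void afterParse(ParsedPage page);
    }

    /** Stand-in for the spot in ParseUtil where the hooks would be invoked. */
    static final class HookingParser {
        private static final Pattern TITLE = Pattern.compile(
            "<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

        private final List<PostParseHook> hooks;

        HookingParser(List<PostParseHook> hooks) {
            this.hooks = new ArrayList<>(hooks);
        }

        ParsedPage parse(String url, String rawHtml) {
            // Toy parsing only: pull out the title and strip the markup.
            Matcher m = TITLE.matcher(rawHtml);
            String title = m.find() ? m.group(1).trim() : "";
            String text = rawHtml.replaceAll("<[^>]*>", " ")
                                 .replaceAll("\\s+", " ")
                                 .trim();
            ParsedPage page = new ParsedPage(url, title, text);

            // The hook point: runs before anything is written to a segment.
            for (PostParseHook hook : hooks) {
                hook.afterParse(page);
            }
            return page;
        }
    }

    public static void main(String[] args) {
        PostParseHook exporter = page ->
            System.out.println("would export: " + page.url + " / " + page.title);
        HookingParser parser =
            new HookingParser(Collections.singletonList(exporter));
        parser.parse("http://example.com/",
            "<html><title>Hello</title><body>Some text</body></html>");
    }
}
```

In a real patch the hooks would presumably be loaded through the plugin
system rather than passed to a constructor, and the database export would
live inside an afterParse() implementation.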

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: Nutch Post-Processing

Posted by Doğacan Güney <do...@gmail.com>.

On Sat, Feb 7, 2009 at 3:20 PM, Andrzej Bialecki <ab...@getopt.org> wrote:
> (moving this to nutch-user - nutch-agent is for reporting abuse/misbehavior
> of Nutch-based crawlers)
>
> John Crepezzi wrote:
>>
>> I'm interested in writing an application that analyzes sources every
>> time they are updated, and uses the parsedText, tags, title, etc to
>> perform some operations and export the finished data to a database.
>>
>> I'm not sure where this application should be placed within nutch/lucene,
>> so any advice anyone can offer would be greatly appreciated.
>>
>> I thought plugins would work for me, but I'm unable to find an extension
>> point that will give me access to the parsed data and tag sets.
>
> This issue comes up occasionally, but so far no one has been desperate
> enough to work out a patch ;)
>
> You can define an additional extension point (see how the existing
> HtmlParseFilter extension point is designed; perhaps that extension is
> all you need) and invoke the new extension point right after the content
> is parsed, so that you can access both the content and the parsed
> data/text even before they are recorded in a segment.
>
> The best place to put this hook would be in the ParseUtil class, because
> that's what the other Nutch tools use to parse content.
>
>

Another way to do it is to use the new NutchIndexWriters. You can add a new
DBIndexWriter (alongside SolrIndexWriter and LuceneIndexWriter), then modify
the indexing process (or add a new Indexer) to push documents to your
database.
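
As a rough illustration, a writer along these lines might look like the
sketch below. It is self-contained and does not touch the real Nutch
classes: DocWriter is a hypothetical stand-in for an open/write/close style
writer contract (the actual NutchIndexWriter signatures may differ, so check
the source), and RowSink is a hypothetical placeholder for the database,
which in real code would probably wrap a JDBC connection.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: DocWriter and RowSink are NOT real Nutch types.
final class DbIndexWriterSketch {

    /** Assumed writer lifecycle: open once, write many documents, close. */
    interface DocWriter {
        void open(String name);
        void write(Map<String, String> doc);
        void close();
    }

    /** Placeholder for the database; real code might issue JDBC inserts. */
    interface RowSink {
        void insertBatch(List<Map<String, String>> rows);
    }

    /** Buffers documents and pushes them to the sink in batches. */
    static final class DBIndexWriter implements DocWriter {
        private final RowSink sink;
        private final int batchSize;
        private final List<Map<String, String>> pending = new ArrayList<>();
        private boolean open;

        DBIndexWriter(RowSink sink, int batchSize) {
            if (batchSize < 1) {
                throw new IllegalArgumentException("batchSize must be >= 1");
            }
            this.sink = sink;
            this.batchSize = batchSize;
        }

        @Override
        public void open(String name) {
            open = true;
        }

        @Override
        public void write(Map<String, String> doc) {
            if (!open) {
                throw new IllegalStateException("writer is not open");
            }
            pending.add(new LinkedHashMap<>(doc)); // defensive copy
            if (pending.size() >= batchSize) {
                flush();
            }
        }

        @Override
        public void close() {
            flush(); // push whatever is left in the buffer
            open = false;
        }

        private void flush() {
            if (pending.isEmpty()) {
                return;
            }
            sink.insertBatch(new ArrayList<>(pending));
            pending.clear();
        }
    }

    public static void main(String[] args) {
        RowSink printingSink =
            rows -> System.out.println("INSERT batch of " + rows.size());
        DBIndexWriter writer = new DBIndexWriter(printingSink, 2);
        writer.open("demo");
        for (int i = 0; i < 3; i++) {
            Map<String, String> doc = new LinkedHashMap<>();
            doc.put("url", "http://example.com/" + i);
            doc.put("title", "Page " + i);
            writer.write(doc);
        }
        writer.close();
    }
}
```

Batching is just one design choice here; writing each document as it
arrives would also work, at the cost of more round trips to the database.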

-- 
Doğacan Güney