Posted to user@nutch.apache.org by astro <al...@gmail.com> on 2015/07/18 16:18:35 UTC

[nutch] how to extract in 2 steps and observer

Hello,
I do not know whether this is achievable with Nutch, and if it is, I cannot
find out how. I looked into the plugins, but without success.


Goal:
Build a private information base covering a dozen sites to index.
Each site has hundreds of thousands of pages.

I want to store the original pages in a first index, then run an extraction
over those stored pages, based on certain tags, to build a second, working
index. That way I can change the extraction script and rebuild the second
index without contacting the source site again, which also avoids the
robots.txt compliance issue.

In addition, each indexed site has a different structure, so the extraction
script is different for each site.
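
The two-step idea above can be sketched roughly as follows. This is plain
Python, not Nutch code: raw_store, SITE_RULES and build_working_index are
hypothetical names, the sample pages are made up, and a "rule" is assumed to
be just one tag name per site.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class TagTextExtractor(HTMLParser):
    """Collects the text found inside every occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # > 0 while we are inside the wanted tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract(html, tag):
    """Step 2 for one page: pull the text of `tag` out of stored HTML."""
    parser = TagTextExtractor(tag)
    parser.feed(html)
    parser.close()
    return parser.chunks


# Step 1 (assumed already done by the crawler): original pages, stored as-is.
raw_store = {
    "http://site-a.example/p1": "<html><h1>Title A</h1><p>body</p></html>",
    "http://site-b.example/p1": "<html><h2>Title B</h2><p>body</p></html>",
}

# One extraction rule per site, because each site has its own structure.
SITE_RULES = {"site-a.example": "h1", "site-b.example": "h2"}


def build_working_index(store, rules):
    """Rebuild the second index from stored pages only (no network access)."""
    index = {}
    for url, html in store.items():
        tag = rules.get(urlparse(url).hostname)
        if tag is not None:
            index[url] = extract(html, tag)
    return index


working_index = build_working_index(raw_store, SITE_RULES)
```

Changing an entry in SITE_RULES and calling build_working_index again
rebuilds the second index from the stored pages, without contacting the sites.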

The storage will be a NoSQL database on Hadoop.
The second index will be synchronized with ElasticSearch.


Last point:
each site to index has a page that lists the latest additions, which serves
as the entry point to the most recently inserted or changed pages.
How can I scan only this page and index / re-index the pages that matter?
Otherwise the crawler will walk the whole tree on every pass (every 5 min)
even though only a few pages are dirty.
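
As a rough illustration of that last point (again plain Python rather than
Nutch, with hypothetical names such as listed_urls and urls_to_refetch): on
each pass, read only the "latest additions" page and keep the links that were
not on it during the previous pass. This assumes that a changed page reappears
on the listing as a new entry; a page that changes while it is still listed
would be missed by this simple comparison.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute link targets from <a href="..."> in page order."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            url = urljoin(self.base_url, href)  # resolve relative links
            if url not in self.links:           # skip duplicates
                self.links.append(url)


def listed_urls(listing_html, base_url):
    """All pages currently shown on the 'latest additions' page."""
    collector = LinkCollector(base_url)
    collector.feed(listing_html)
    collector.close()
    return collector.links


def urls_to_refetch(listing_html, base_url, previous_listing):
    """Pages listed now that were not listed on the previous pass."""
    return [u for u in listed_urls(listing_html, base_url)
            if u not in previous_listing]


# Made-up example: /p/2 was already listed on the previous pass, /p/3 is new.
listing = '<ul><li><a href="/p/3">3</a></li><li><a href="/p/2">2</a></li></ul>'
previous = {"http://site-a.example/p/2"}
dirty = urls_to_refetch(listing, "http://site-a.example/latest", previous)
```

Only the URLs in dirty would then need to be fetched and re-indexed on that
pass, instead of walking the whole site again.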

Thank you



--
View this message in context: http://lucene.472066.n3.nabble.com/nutch-how-to-extract-in-2-steps-and-observer-tp4218025.html
Sent from the Nutch - User mailing list archive at Nabble.com.