Posted to user@nutch.apache.org by Tigran Tsaturyan <ti...@gmail.com> on 2015/01/02 17:11:05 UTC

Nutch crawling advice

Hello,


I am a new user of Nutch and, though I have looked through several
manuals on the web, I still have questions. I hope you will be able to
answer them or point me to a suitable manual.

My questions:

- I intend to use Nutch to crawl several specific sites and, since I
know the structure of their pages, want to extract particular fields.
After some processing, I want to load the data into Elasticsearch or
SQL. Can you recommend a solution here? Plugins? I know there is a
plugin for extracting pieces of data based on XPath, but I am not sure
it will be flexible enough for my needs. My thought was to dump the raw
HTML into Solr and run a batch parser, in Python or another language,
that queries for the HTML, processes it, and then loads it into
Elasticsearch or SQL (a rough sketch of what I have in mind is below).

- How can I get the raw HTML? This is the top question according to a
Google search, and the answers vary: write your own plugin, grab the
data from the crawldb, or grab the links and then download the HTML
with additional software. What is your recommended way of doing this?
(The closest command I have found so far is sketched below.)

- In your presentation you mentioned that you cannot guarantee low
latency. Can a new page be crawled within, say, one day? Is that
doable? I am targeting 10 sites, each with more than 100k pages, that
are updated constantly. (The config knob I suspect is relevant is
sketched below.)
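
To make the first question concrete, here is the kind of batch step I
have in mind. This is only a sketch: the Solr URL, the "raw_html" and
"url" field names, the XPath expressions, and the "products" index are
made-up placeholders, not Nutch defaults.

    import pysolr                          # Solr client (assumed setup)
    from lxml import html                  # XPath-based extraction
    from elasticsearch import Elasticsearch

    solr = pysolr.Solr('http://localhost:8983/solr/nutch')
    es = Elasticsearch(['http://localhost:9200'])

    # Pull documents whose raw HTML was stored in a field named
    # "raw_html", extract site-specific fields with XPath, and index
    # the result into Elasticsearch.
    for doc in solr.search('*:*', rows=100):
        tree = html.fromstring(doc['raw_html'])
        record = {
            'url': doc['url'],
            # Placeholder expressions; each site would get its own.
            'title': tree.xpath('string(//h1)'),
            'price': tree.xpath('string(//span[@class="price"])'),
        }
        # doc_type applies to the Elasticsearch 1.x era.
        es.index(index='products', doc_type='page',
                 id=doc['url'], body=record)

Does this look like a sane division of labour, or is there a plugin
that would do the extraction inside Nutch itself?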
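
For the second question, the closest thing I have found is dumping a
segment and keeping only the fetched content. If I read the
SegmentReader options correctly, something like this should leave just
the raw pages (the segment name is made up):

    bin/nutch readseg -dump crawl/segments/20150102123456 dump_raw \
        -nofetch -nogenerate -noparse -noparsedata -noparsetext

Is that the recommended route, or is a custom plugin the better option?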
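
For the third question, my understanding is that the recrawl interval
is controlled by db.fetch.interval.default, which defaults to 30 days.
Would lowering it in conf/nutch-site.xml, roughly like this, get pages
re-fetched within a day, or is AdaptiveFetchSchedule the better tool
for constantly updated sites?

    <property>
      <name>db.fetch.interval.default</name>
      <!-- default is 2592000 seconds (30 days); 86400 = 1 day -->
      <value>86400</value>
    </property>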


Thank you for sharing your experience.


Best Regards, Tigran.