Posted to user@nutch.apache.org by "McGibbney, Lewis John" <Le...@gcu.ac.uk> on 2011/02/23 14:12:57 UTC

Database storage solution then DIH to Solr... or clean post to Solr from Nutch crawl

Hi list,

This is a question I hope will prompt some decent input. I have been building Nutch trunk recently and pondering whether to store large data in a back-end HBase or MySQL database and then use the DataImportHandler (DIH) to import it into Solr for search, or to pass the solrindex command within the crawl process and send data directly to Solr, removing back-end database storage altogether.

At this stage I do not know how 'large' the data will get, as the idea is still in development. To give an example: we wish to implement a prototype system that will crawl a local authority Intranet site; if reasonable results are achieved, we can progress to the other 31 local authorities throughout the country. In the latter case I expect the data volumes could be classed as huge.

I was wondering if anyone can provide insight into the pros and cons of both approaches and, if possible, any examples of production implementations for comparison. I realise there are several questions here and do not expect definitive answers to all of them, but any feedback on this topic would be great.
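For context, the two routes look roughly like this. All hostnames, paths, table names, and column names below are illustrative placeholders, not a tested setup; the solrindex argument order also varies between Nutch releases, so check the usage output of your build.

Route 1, direct post from the crawl (Nutch 1.x command line):

```shell
# Index the crawled segments straight into Solr; no intermediate database.
# Arguments: Solr URL, crawldb, linkdb, then one or more segments.
bin/nutch solrindex http://localhost:8983/solr crawl/crawldb crawl/linkdb crawl/segments/*
```

Route 2, back-end database plus DIH. A minimal data-config.xml for Solr's DataImportHandler pulling from MySQL might look like this (the pages table and its columns are assumptions for illustration):

```
<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/nutchdb"
              user="solr" password="secret"/>
  <document>
    <entity name="page" query="SELECT url, title, content FROM pages">
      <field column="url" name="id"/>
      <field column="title" name="title"/>
      <field column="content" name="text"/>
    </entity>
  </document>
</dataConfig>
```

With the handler registered in solrconfig.xml, an import is then triggered over HTTP:

```shell
curl 'http://localhost:8983/solr/dataimport?command=full-import'
```

The trade-off, as I understand it, is that route 1 is simpler and avoids a second storage tier, while route 2 keeps the crawled data queryable and re-indexable outside Solr at the cost of maintaining the database and the import pipeline.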

Thank you Lewis
