Posted to solr-user@lucene.apache.org by Arthur Yarwood <ar...@fubaby.com> on 2015/06/29 16:20:14 UTC

Architectural advice & questions on using Solr XML DataImport Handlers (and Nutch) for a Vertical Search engine.

Please bear with me here, I'm pretty new to Solr, with most of my DB 
experience being of the relational variety. I'm planning a new project 
which I believe Solr (and Nutch) will suit well. Although I've 
installed Solr 5.2 and Nutch 1.10 (on CentOS) and tinkered about a bit, 
I'd be grateful for advice and tips regarding my plan.

I'm looking to build a vertical search engine to cover a very specific 
and narrow dataset. Sources will number in the hundreds and will mostly 
be managed by hand; they will be a mixture of forums and product-based 
e-commerce sites. For some of these I was hoping to leverage the Solr 
DataImportHandler system with their RSS feeds, primarily for the ease 
of acquiring clean, reasonably sanitised and well-structured data. For 
the rest, I'm going to fall back to Nutch crawling them, with the URLs 
heavily restricted via regex filters. So to sum up: a Solr index 
populated in a couple of different ways, then searched via some custom 
user-facing PHP webpages. Finally, a cronjob script would delete any 
docs older than X weeks, to keep on top of data retention.
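
What I had in mind for the retention piece is roughly the sketch below, 
run daily from cron (the core name "vertical", the date field "pubDate" 
and the 8-week cutoff are all placeholders I've made up):

   #!/bin/sh
   # purge-old-docs.sh - called daily from cron.
   # Deletes docs whose pubDate is older than 8 weeks, then commits.
   # Core name "vertical" and field "pubDate" are placeholders.
   curl -s "http://localhost:8983/solr/vertical/update?commit=true" \
        -H "Content-Type: text/xml" \
        --data-binary "<delete><query>pubDate:[* TO NOW-56DAYS]</query></delete>"

If there's a more idiomatic way to expire old docs, I'm all ears.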

Does that sound sensible at all?

Regarding RSS feeds:-
Many only provide a limited number of recent items; however, I'd like 
to retain items for many weeks. I've already discovered the clean=false 
param on DataImport, after wondering why old RSS items vanished!
Question 1) Is there an easy way to filter which items to import in the 
URLDataSource entity? Or is it best to go down the route of XSLT 
preprocessing?
Question 2) Multiple URLDataSources: should I reference them all in one 
DataImport handler, or set up multiple DataImport handlers?
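
To make question 2 concrete, what I had pictured was a single handler 
whose data-config.xml lists each feed as its own entity, roughly like 
the sketch below (the feed URLs, field names and the "subcategory" 
field are placeholders I've invented, not anything from a real config):

   <dataConfig>
     <dataSource type="URLDataSource" />
     <document>
       <!-- One entity per RSS feed; URLs and field names are placeholders. -->
       <entity name="siteA" pk="link"
               url="http://site-a.example.com/rss"
               processor="XPathEntityProcessor"
               forEach="/rss/channel/item"
               transformer="DateFormatTransformer,TemplateTransformer">
         <field column="title"       xpath="/rss/channel/item/title" />
         <field column="link"        xpath="/rss/channel/item/link" />
         <field column="description" xpath="/rss/channel/item/description" />
         <field column="pubDate"     xpath="/rss/channel/item/pubDate"
                dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss z" />
         <!-- Static keyword per source - see my question further down. -->
         <field column="subcategory" template="Foo" />
       </entity>
       <entity name="siteB" pk="link"
               url="http://site-b.example.com/rss"
               processor="XPathEntityProcessor"
               forEach="/rss/channel/item">
         <!-- same field mappings as above -->
       </entity>
     </document>
   </dataConfig>

For question 1, the two options I've spotted are the xsl attribute on 
XPathEntityProcessor (so an XSLT stylesheet could drop unwanted items 
before the xpath mappings run) and a ScriptTransformer that sets 
$skipDoc on rows I don't want. Is either of those the sensible route?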

What's the best approach to supplementing imported data with additional 
static fields/keywords associated with the source feed or crawled 
site? E.g. all docs from sites A, B & C are of subcategory Foo. I'm 
guessing that with RSS feeds this would be straightforward via the XSLT 
preprocessor, but for Nutch-submitted docs I've no idea.
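
On the Nutch side, the only mechanism I've found so far is the 
index-static plugin, i.e. something like the snippet below in 
nutch-site.xml (the plugin list and the "subcategory:Foo" value are 
just my guesses at the configuration, and as far as I can tell it 
would mean a separate config per subcategory/crawl):

   <!-- Enable the index-static plugin (added to the stock plugin list). -->
   <property>
     <name>plugin.includes</name>
     <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|static)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
   </property>
   <!-- Static field(s) added to every doc indexed by this crawl. -->
   <property>
     <name>index.static</name>
     <value>subcategory:Foo</value>
   </property>

Is there a neater way to tag Nutch docs per site?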

Scheduling imports: do people just cron up a curl request (or a shell 
execution of the Nutch crawl script)? Or is there a more elegant 
solution available?
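
What I had pencilled in is plain cron, along the lines of the sketch 
below (the core name, the /opt/nutch paths and the crawl depth are all 
placeholders, and I'm going from my reading of the 1.10 crawl script, 
so corrections welcome):

   # Hourly DIH pull of the RSS feeds, keeping existing docs (clean=false).
   0 * * * *  curl -s "http://localhost:8983/solr/vertical/dataimport?command=full-import&clean=false"
   # Nightly Nutch crawl of the hand-managed seed list, indexing into the same core.
   30 2 * * * /opt/nutch/bin/crawl /opt/nutch/urls /opt/nutch/crawldb http://localhost:8983/solr/vertical 2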

Any other more general tips and advice on the above greatly appreciated.

-- 
Arthur Yarwood