Posted to solr-user@lucene.apache.org by Arthur Yarwood <ar...@fubaby.com> on 2015/06/29 16:20:14 UTC
Architectural advice & questions on using Solr XML DataImport Handlers (and Nutch) for a Vertical Search engine.
Please bear with me here, I'm pretty new to Solr with most of my DB
experience being of the relational variety. I'm planning a new project,
which I believe Solr (and Nutch) will solve well. Although I've
installed Solr 5.2 and Nutch 1.10 (on Centos) and tinkered about a bit,
I'd be grateful for advice and tips regarding my plan.
I'm looking to build a vertical search engine to cover a very specific
and narrow dataset. Sources will number in the hundreds and will mostly
be managed by hand; they'll be a mixture of forums and product-based
e-commerce sites. For some of these I was hoping to leverage the Solr
DataImportHandler system with their RSS feeds, primarily for the ease of
acquiring clean, reasonably sanitised and well-structured data. For the
rest, I'll fall back to crawling them with Nutch, heavily regulated by
regex filtering of URLs. So to sum up: a Solr index populated in a
couple of different ways, then searched via some custom user-facing PHP
webpages. Finally, a cron job would delete any docs older than X weeks,
to keep on top of data retention.
Does that sound sensible at all?
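The retention cron job I have in mind would be roughly the following
delete-by-query (the core name "rssimport" and the field "indexed_date"
are placeholders of mine; the date field would need populating at import
time):

```shell
# Delete anything indexed more than 6 weeks ago, using Solr date maths.
# Core name and field name are assumptions specific to my setup.
curl 'http://localhost:8983/solr/rssimport/update?commit=true' \
  -H 'Content-Type: text/xml' \
  --data-binary '<delete><query>indexed_date:[* TO NOW-42DAYS]</query></delete>'
```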
Regarding RSS feeds:-
Many feeds only provide a limited number of recent items, but I'd like
to retain items for many weeks. I've already discovered the clean=false
param on DataImport, after wondering why old RSS items vanished!
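For reference, the import request I've ended up with is just this (core
name and handler path are from my own setup):

```shell
# clean=false appends/updates rather than wiping the index first
curl 'http://localhost:8983/solr/rssimport/dataimport?command=full-import&clean=false'
```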
Question 1) is there an easy way to filter items to import in the
URLDataSource entity? Or is it best to go down route of XSLT
preprocessing?
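To illustrate question 1, the sort of filtering I've been experimenting
with is a ScriptTransformer that sets $skipDoc on unwanted rows (the
feed URL, field names, and the keyword test here are all placeholders):

```xml
<dataConfig>
  <dataSource type="URLDataSource" encoding="UTF-8"/>
  <script><![CDATA[
    function skipUnwanted(row) {
      var title = row.get('title');
      /* skip items whose title lacks the keyword I care about */
      if (title == null || title.indexOf('widget') == -1) {
        row.put('$skipDoc', 'true');
      }
      return row;
    }
  ]]></script>
  <document>
    <entity name="feed"
            url="http://example.com/feed.rss"
            processor="XPathEntityProcessor"
            forEach="/rss/channel/item"
            transformer="script:skipUnwanted">
      <field column="title" xpath="/rss/channel/item/title"/>
      <field column="link"  xpath="/rss/channel/item/link"/>
    </entity>
  </document>
</dataConfig>
```

It works, but it feels clunkier than doing the filtering in an XSLT
preprocessing step, hence the question.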
Question 2) Multiple URLDataSources: reference all in one DataImport
handler? Or have multiple DataImport handlers?
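For question 2, the single-handler variant I have in mind would look
something like this (URLs are placeholders):

```xml
<dataConfig>
  <dataSource type="URLDataSource" encoding="UTF-8"/>
  <document>
    <entity name="feedA" pk="link"
            url="http://site-a.example/rss"
            processor="XPathEntityProcessor"
            forEach="/rss/channel/item">
      <field column="title" xpath="/rss/channel/item/title"/>
      <field column="link"  xpath="/rss/channel/item/link"/>
    </entity>
    <entity name="feedB" pk="link"
            url="http://site-b.example/rss"
            processor="XPathEntityProcessor"
            forEach="/rss/channel/item">
      <field column="title" xpath="/rss/channel/item/title"/>
      <field column="link"  xpath="/rss/channel/item/link"/>
    </entity>
  </document>
</dataConfig>
```

My understanding is that a full-import runs all root entities unless one
is singled out with an entity=feedA request param, but I may be missing
a downside versus separate handlers.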
What's the best approach to supplement imported data with additional
static fields/keywords associated with the source feed or crawled site?
e.g. all docs from sites A, B & C are of subcategory Foo. I'm guessing
that with RSS feeds this would be straightforward via the XSLT
preprocessor. But for Nutch-submitted docs - I've no idea?
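On the RSS side, I did find that the TemplateTransformer can attach a
constant value per entity, e.g. (field name and value are placeholders):

```xml
<entity name="feedA"
        url="http://site-a.example/rss"
        processor="XPathEntityProcessor"
        forEach="/rss/channel/item"
        transformer="TemplateTransformer">
  <field column="title" xpath="/rss/channel/item/title"/>
  <!-- constant attached to every doc imported from this feed -->
  <field column="subcategory" template="Foo"/>
</entity>
```

For Nutch, I've seen mention of an index-static plugin that adds fixed
field:value pairs via an index.static property in nutch-site.xml, but I
haven't tried it, so I'd welcome confirmation that it's the right tool.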
Scheduling imports: do people just cron up a curl request (or a shell
execution of the Nutch crawl script)? Or is there a more elegant
solution available?
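In case it helps frame the question, the crontab I'd sketched is simply
this (paths, core name, and the Nutch crawl arguments are from memory
and would need checking against the script's usage message):

```shell
# hourly RSS import, keeping existing docs
0 * * * *  curl -s 'http://localhost:8983/solr/rssimport/dataimport?command=full-import&clean=false' >/dev/null
# nightly Nutch crawl posting into the same core
30 2 * * * /opt/nutch/bin/crawl /opt/nutch/urls /opt/nutch/crawl http://localhost:8983/solr/rssimport 2
```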
Any other more general tips and advice on the above greatly appreciated.
--
Arthur Yarwood