Posted to solr-user@lucene.apache.org by Becky Neil <be...@lovemachineinc.com> on 2012/01/25 23:38:28 UTC

Advice - evaluating Solr for categorization & keyword search

Hi all,
I've been tasked with evaluating whether Solr is the right solution for my
company's search needs.  If this isn't the right forum for this kind of
question, please let me know where to go instead!

We are currently using sql queries to find mysql db results that match a
single keyword in one short text field, so our search is pretty crude.

What we hope that Solr can do initially is:
1 enable more flexible search (booleans, more than one field
searched/matched, etc)
2 live search results (eg new records get added to the index upon creation)
3 search rankings (eg most relevant -> least relevant)
4 categorize our db (take records and at least group them, better if it
could assign a label to each record)
5 locate nearby results (geospatial search)

What I hope you can advise on is:
A How would you go about #2 - making sure that new documents are
added/indexed asap, based on new rows in the db? Is that as simple as a
setting in Solr, or does it take some coding (eg a listener object, a cron
job, etc.)?  I tried looking at the wiki & tutorial but wasn't able to find
answers - I couldn't make sense of how to use UpdateRequestProcessor to do
it. (http://wiki.apache.org/solr/UpdateRequestProcessor)
B What's the status of document clustering? The wiki says it's not been
fully implemented. Would we be able to achieve any of #4 yet? If not, what
else should we consider?
C Would you use Solr over say Google Maps api to run location aware
searches?
D How long should we expect it to take to configure Solr on our servers
with our db, get the initial index set up, and enable live search results?
 Are we talking one week, or one month? Our db is not tiny, but it's not
huge - say around 8k records in each of ~20 tables. Most tables have around
10 fields, including at least one large text field and then a variety of
dates, numbers, and small text.

I really appreciate any advice you can offer!
Cheers,
Becky
http://www.coffeeandpower.com

Re: Advice - evaluating Solr for categorization & keyword search

Posted by Erick Erickson <er...@gmail.com>.
See below...

On Wed, Jan 25, 2012 at 2:38 PM, Becky Neil <be...@lovemachineinc.com> wrote:
> Hi all,
> I've been tasked with evaluating whether Solr is the right solution for my
> company's search needs.  If this isn't the right forum for this kind of
> question, please let me know where to go instead!
>
> We are currently using sql queries to find mysql db results that match a
> single keyword in one short text field, so our search is pretty crude.
>
Be a little careful here. Often, when people come from a DB background
they think in terms of normalized data. If each of your tables is
independent of all other tables, then the simple "map the rows into
documents" approach works. More likely, you'll combine bits from
several tables into each Solr document and your reflexive distaste
for de-normalizing data will trip you up. Get over it <G>......
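
A minimal sketch of what "combining bits from several tables" can look like,
assuming a hypothetical tasks table and a task_tags child table (both names
and all field names are made up for illustration). Each parent row and its
child rows are folded into one flat document:

```python
from collections import defaultdict

# Hypothetical rows from two normalized tables: tasks and their tags.
tasks = [
    {"id": 1, "title": "Design a logo", "body": "Need a logo for a coffee shop"},
    {"id": 2, "title": "Fix login bug", "body": "Login fails on mobile"},
]
task_tags = [
    {"task_id": 1, "tag": "design"},
    {"task_id": 1, "tag": "branding"},
    {"task_id": 2, "tag": "php"},
]


def denormalize(tasks, task_tags):
    """Fold child rows into their parent so each task becomes one flat document."""
    tags_by_task = defaultdict(list)
    for row in task_tags:
        tags_by_task[row["task_id"]].append(row["tag"])
    return [
        {
            "id": "task-%d" % task["id"],  # prefix keeps ids unique across tables
            "title": task["title"],
            "body": task["body"],
            "tags": tags_by_task[task["id"]],  # would map to a multiValued field
        }
        for task in tasks
    ]


docs = denormalize(tasks, task_tags)
```

The tags list repeats data that a DB would keep in a separate table; in a
search index that repetition is the point.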

> What we hope that Solr can do initially is:
> 1 enable more flexible search (booleans, more than one field
> searched/matched, etc)
This is OOB functionality. But do note that Solr/Lucene query
parsing is not a true boolean process, see:
http://www.lucidimagination.com/blog/2011/12/28/why-not-and-or-and-not/

> 2 live search results (eg new records get added to the index upon creation)
As you indicated below, you'd need some process that notices that
your DB has changed and then indexes the changed records. Once the
records are indexed and committed, searches will see the changes
automatically, but you have to drive the indexing process from outside Solr.

> 3 search rankings (eg most relevant -> least relevant)
OOB functionality with lots of knobs to turn for tuning. See the
edismax query parser.
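
A minimal sketch of a multi-field, relevance-ranked query using edismax,
assuming a Solr instance at a hypothetical base URL and hypothetical fields
title and body. The defType, qf, fl and wt parameters are standard Solr
request parameters; the boost values are arbitrary starting points, not
recommendations:

```python
from urllib.parse import urlencode


def edismax_url(base_url, user_query, field_boosts):
    """Build a /select URL that searches several fields with per-field boosts."""
    params = {
        "q": user_query,
        "defType": "edismax",
        # qf lists the fields to search, e.g. "title^2 body^1"
        "qf": " ".join("%s^%s" % (f, b) for f, b in field_boosts.items()),
        "fl": "id,title,score",  # return the relevance score with each hit
        "wt": "json",
    }
    return base_url + "/select?" + urlencode(params)


# Hypothetical base URL and field names.
url = edismax_url("http://localhost:8983/solr", "coffee logo", {"title": 2, "body": 1})
```

Results come back sorted by score by default, which covers "most relevant ->
least relevant" before any tuning.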

> 4 categorize our db (take records and at least group them, better if it
> could assign a label to each record)
Depending on what the details are here, this may be OOB. See
faceting and grouping/field collapsing. See:
http://wiki.apache.org/solr/SolrFacetingOverview
http://wiki.apache.org/solr/FieldCollapsing
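
A minimal sketch of the request parameters for each approach, assuming a
hypothetical category field already stored on every document. Faceting
returns a count per category value; grouping returns the top few documents
under each value:

```python
from urllib.parse import urlencode


def facet_params(user_query, facet_field):
    """Count matching documents per value of facet_field (no documents returned)."""
    return [
        ("q", user_query),
        ("rows", "0"),
        ("facet", "true"),
        ("facet.field", facet_field),
        ("facet.mincount", "1"),  # hide values with no matches
        ("wt", "json"),
    ]


def group_params(user_query, group_field, per_group=3):
    """Return up to per_group documents for each value of group_field."""
    return [
        ("q", user_query),
        ("group", "true"),
        ("group.field", group_field),
        ("group.limit", str(per_group)),
        ("wt", "json"),
    ]


# "category" is a hypothetical field name.
facet_query = urlencode(facet_params("coffee", "category"))
group_query = urlencode(group_params("coffee", "category"))
```

Neither approach assigns a label by itself; both assume the label is already
a field value that was set at indexing time.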

> 5 locate nearby results (geospatial search)
OOB, although you need to store the lat/lon. See:
http://wiki.apache.org/solr/SpatialSearch
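
A minimal sketch of a "nearby results" query using the geofilt filter and
geodist() sort described on that wiki page, assuming a hypothetical field
named location that holds "lat,lon" values. The d parameter is a distance in
kilometers:

```python
from urllib.parse import urlencode


def nearby_params(lat, lon, radius_km, sfield="location"):
    """Match everything within radius_km of (lat, lon), nearest first."""
    return {
        "q": "*:*",
        "fq": "{!geofilt}",      # filter to a circle around pt
        "sfield": sfield,        # hypothetical lat/lon field name
        "pt": "%s,%s" % (lat, lon),
        "d": str(radius_km),
        "sort": "geodist() asc",
        "wt": "json",
    }


params = nearby_params(37.77, -122.42, 10)
query_string = urlencode(params)
```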
>
> What I hope you can advise on is:
> A How would you go about #2 - making sure that new documents are
> added/indexed asap, based on new rows in the db? Is that as simple as a
> setting in Solr, or does it take some coding (eg a listener object, a cron
> job, etc.)?  I tried looking at the wiki & tutorial but wasn't able to find
> answers - I couldn't make sense of how to use UpdateRequestProcessor to do
> it. (http://wiki.apache.org/solr/UpdateRequestProcessor)
What you'll be doing here is using either the Data Import Handler (DIH) or
SolrJ (the Java client) to push documents into Solr. This is
straightforward once you know the magic. A trivial SolrJ program
that indexes documents from a DB is maybe 100 lines, including
imports. It *uses* the update handler, but you don't see that; you see
something like solrServer.add(ListOfSolrInputDocuments);
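
The same "notice changed rows, then push them" loop can be sketched without
SolrJ. This is only an illustration under several assumptions: an in-memory
SQLite table stands in for the MySQL db, the tasks table and its updated_at
column are hypothetical, and the update URL and JSON array body depend on
the Solr version in use. The request is built but not sent, so the sketch
runs without a server:

```python
import json
import sqlite3
import urllib.request


def changed_rows(conn, since):
    """Return rows modified after the given watermark, oldest first."""
    cur = conn.execute(
        "SELECT id, title, body, updated_at FROM tasks "
        "WHERE updated_at > ? ORDER BY updated_at",
        (since,),
    )
    cols = [c[0] for c in cur.description]
    return [dict(zip(cols, row)) for row in cur.fetchall()]


def to_solr_docs(rows):
    """Map DB rows to flat documents keyed by a unique id."""
    return [
        {"id": "task-%d" % r["id"], "title": r["title"], "body": r["body"]}
        for r in rows
    ]


def build_update_request(solr_url, docs):
    """Build (but do not send) a POST carrying the documents as JSON."""
    return urllib.request.Request(
        solr_url + "/update?commit=true",  # assumed endpoint; varies by version
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


# Stand-in database with one old row and one recently changed row.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tasks (id INTEGER PRIMARY KEY, title TEXT, body TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO tasks VALUES (?, ?, ?, ?)",
    [
        (1, "Design a logo", "Coffee shop logo", "2012-01-25 10:00:00"),
        (2, "Fix login bug", "Login fails on mobile", "2012-01-25 23:00:00"),
    ],
)

watermark = "2012-01-25 12:00:00"  # time of the previous indexing run
rows = changed_rows(conn, watermark)
docs = to_solr_docs(rows)
request = build_update_request("http://localhost:8983/solr", docs)
if rows:
    watermark = rows[-1]["updated_at"]  # remember where this run stopped
# urllib.request.urlopen(request) would send it to a running Solr.
```

Run from a cron job or a small daemon, a loop like this gets new rows into
the index within one polling interval. Deleted rows need separate handling,
which this sketch does not cover.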

> B What's the status of document clustering? The wiki says it's not been
> fully implemented. Would we be able to achieve any of #4 yet? If not, what
> else should we consider?
I don't think you're really thinking about document clustering here. I suspect
that grouping and/or faceting will be where you start. At least I'd look at
that first although clustering may be exactly what you want. Half the battle
is learning the right vocabulary <G>....

> C Would you use Solr over say Google Maps api to run location aware
> searches?
*shrugs*

> D How long should we expect it to take to configure Solr on our servers
> with our db, get the initial index set up, and enable live search results?
>  Are we talking one week, or one month? Our db is not tiny, but it's not
> huge - say around 8k records in each of ~20 tables. Most tables have around
> 10 fields, including at least one large text field and then a variety of
> dates, numbers, and small text.
Too many variables to count on any estimate, but:
*If* you can use the Data Import Handler and are starting from scratch,
probably a week. For someone who already knows Solr, maybe a day. But
whenever I start something new, I usually chase a number of blind alleys.

Once set up, indexing your entire corpus will probably be a matter of
less than an hour (and I'm being quite conservative here. On my laptop,
Solr can index 7K documents/second from the English wiki dump). But
at times the database connection is the limiting factor....
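
As a rough check on that estimate, using the figures from this thread (about
8k records in each of ~20 tables, and 7K documents/second on one laptop with
a different corpus), and assuming one document per row:

```python
# Back-of-the-envelope only: real throughput depends on document size,
# analysis chains, and how fast the database can supply rows.
records_per_table = 8000
tables = 20
docs_per_second = 7000  # laptop figure quoted for the English wiki dump

docs_total = records_per_table * tables
seconds = docs_total / docs_per_second
```

That works out to 160,000 documents and well under a minute of pure indexing
time, which is why "less than an hour" is conservative even if the database
is far slower than Solr.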

By the way, if DIH starts getting hard to use, especially due to the
relationships between tables, I recommend jumping to SolrJ sooner
rather than later.

Your index size is pretty small by Solr standards, so you probably won't have
to shard or do some of the other complex kinds of things that come up when
you have lots of data.

Note that this is *just* for setting up Solr and being able to query
through, say, the admin page. It does not include all the work for the UI
you'll need to front the app. Count on tweaking your configuration files
(e.g. schema.xml and solrconfig.xml) and re-indexing multiple times before
you're satisfied. Count on a couple of days at least getting a preliminary
understanding of the analysis chains and what using various filters and
tokenizers means to the search process. Using them is trivial, just a config
file edit. But knowing why you'd use WordDelimiterFilterFactory, for instance
(one of dozens of possible filters), can cause some head scratching. Count on
some time spent tuning relevancy. Count on your product manager
complaining <G>....

At this point, you'll be getting back XML versions of the documents
from Solr (or JSON, or several other formats). They're easy to parse,
but that's outside Solr.
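
A minimal sketch of that parsing step for the JSON format (wt=json). The
sample body below is hand-written to mirror the usual response layout, with
hypothetical ids and titles; it is not output captured from a real server:

```python
import json

# Hand-written sample in the shape of a Solr JSON response.
sample = """
{"responseHeader": {"status": 0, "QTime": 3},
 "response": {"numFound": 2, "start": 0, "docs": [
   {"id": "task-1", "title": "Design a logo"},
   {"id": "task-2", "title": "Fix login bug"}]}}
"""


def extract_hits(raw_json):
    """Return the total match count and the list of returned documents."""
    response = json.loads(raw_json)["response"]
    return response["numFound"], response["docs"]


total, hits = extract_hits(sample)
titles = [doc["title"] for doc in hits]
```

Note that numFound is the total number of matches, which can be larger than
the number of documents in one page of results.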

Note that many of my cautions are NOT peculiar to Solr. The search space
is quite different from the RDBMS space and has its own special
gotchas. You'll do lots and lots of head scratching no matter what you try
to use to search. Personally I think Solr is comparatively easy to make
do its tricks, but I'm a bit biased......

Best
Erick
