Posted to solr-user@lucene.apache.org by Ikhsvaku S <ik...@gmail.com> on 2011/09/25 22:00:09 UTC

Seek your wisdom for implementing 12 million docs..

Hi List,

We are pretty new to Solr & Lucene and have just started indexing a few 10K
documents using Solr. Before we attempt anything bigger we want to see what
the best approach should be.

Documents: We have close to 12 million XML docs of varying sizes, averaging
about 20 KB. Each document has roughly 150 fields, all of which should be
indexed and searchable. Over 80% of the fields are fixed-length strings, and a
few of the strings are multivalued (e.g. title, headline, id, submitter,
reviewers, suggested-titles); another 15% are date fields (added-on,
reviewed-on, etc.). The rest are multivalued text fields (e.g. description,
summary, comments, notes). Some documents have a large number of these text
fields, so we are leaning against storing them in the index. Approximately
6000 documents are updated and 400-800 new ones are added each day.
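
Roughly, we imagine schema.xml entries along these lines (field and type names
here are only illustrative guesses at our data, not a real schema; we used
underscores since hyphenated field names seem awkward to query, and the
string/text/date types are the ones from the example schema):

   <field name="id"          type="string" indexed="true" stored="true" required="true"/>
   <field name="submitter"   type="string" indexed="true" stored="true"/>
   <field name="reviewers"   type="string" indexed="true" stored="false" multiValued="true"/>
   <field name="added_on"    type="date"   indexed="true" stored="false"/>
   <field name="title"       type="text"   indexed="true" stored="true"  multiValued="true"/>
   <field name="description" type="text"   indexed="true" stored="false" multiValued="true"/>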

Queries: About 60% of queries are on the string fields. A simple one would be:
find the ids of documents whose author is XYZ, submitted between [X-Z], whose
status is reviewed or pending review, and whose title contains a given string;
the results of such queries are fairly exact (e.g. 300 docs found). The rest
of the searches also hit the text fields, where users search for quoted
snippets or phrases. Almost all queries have multiple operators. Each query
also wants to grab as many result rows as possible (we are limiting this to
2000), but the output only needs to contain 1-5 fields. (No highlighting etc.
needed.)
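
For instance, a query along these lines (simplified, not URL-encoded, field
names made up) is typical of what we expect to run:

   http://localhost:8983/solr/select
       ?q=author:XYZ AND submitted_on:[2011-01-01T00:00:00Z TO 2011-06-30T00:00:00Z]
          AND (status:reviewed OR status:"pending review") AND title:"some phrase"
       &fl=id,title,status
       &rows=2000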

Available hardware:
The existing hardware we could find consists of 4 boxes with ~96 GB RAM each,
each attached to ~300 GB of SAN storage. We also have a couple of older HP
DL380s (mainly intended for offline indexing). All of this is on 10G Ethernet.

Questions:
Our priority is to return results fast, and new or updated documents should be
searchable within 2 hours. Users are also known to run complex queries for
data mining. Given all this, are there any recommendations for indexing the
data and fields?
How do we scale, and what architecture should we follow here? Master/slave
servers? Any possible issues we may hit?

Thanks

Re: Seek your wisdom for implementing 12 million docs..

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Sun, 2011-09-25 at 22:00 +0200, Ikhsvaku S wrote:
> Documents: We have close to 12 million XML docs of varying sizes, averaging
> about 20 KB. Each document has roughly 150 fields, all of which should be
> indexed and searchable. [...] Approximately 6000 documents are updated and
> 400-800 new ones are added each day.
>
> Queries: [...] Each query also wants to grab as many result rows as possible
> (we are limiting this to 2000), but the output only needs to contain 1-5 fields.

Except for the result rows (which I guess is equal to returned documents
in Solr-world), nothing you say raises any alarms. It actually sounds
very much like our local index (~10M documents, ~100 fields, 10.000+
updates/day) at the State and University Library, Denmark.

> Available hardware:
> The existing hardware we could find consists of 4 boxes with ~96 GB RAM each,
> each attached to ~300 GB of SAN storage. We also have a couple of older HP
> DL380s (mainly intended for offline indexing). All of this is on 10G Ethernet.

Yikes! We only use two mirrored machines for fallback, not performance.
They have 16GB each and handle index updates as well as searches. The
indexes (~60GB) reside on local SSDs.

> Questions:
> Our priority is to provide results fast, [...]

What is fast in milliseconds and how many queries/second do you
anticipate? From what you describe, your hardware looks like overkill.
However, as Erick says, your mileage may vary: try stuffing all your data
into your mock-up and see what happens - it shouldn't take long and you
might discover that your test machine is perfectly capable of handling
it all alone.
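
If your source data can be converted to Solr's XML update format, stuffing a
sample in is a one-liner against a stock Solr instance (URL and file name are
placeholders):

   curl 'http://localhost:8983/solr/update?commit=true' \
        -H 'Content-Type: text/xml' --data-binary @sample-docs.xml

where sample-docs.xml holds entries like

   <add>
     <doc>
       <field name="id">DOC-0001</field>
       <field name="title">An example title</field>
     </doc>
   </add>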


Re: Seek your wisdom for implementing 12 million docs..

Posted by Erick Erickson <er...@gmail.com>.
Round N + 1 of "it depends" <G>. This isn't a very big
index as Solr indexes go; my first guess would be that
you can easily fit it on the machines you're talking
about. But, as always, how you implement things may
prove me wrong.

Really, about the only thing you can do is try it. Be
aware that "the size of the index" is a tricky concept.
For instance, if you store your data (stored="true"), the
size of your index directory will NOT reflect the actual memory
requirements, since verbatim copies of your fields are
held in the *.fdt files and those don't really affect searching speed.

Here's what I claim:
1> you can index these 12M documents in a reasonable time. I
     index 1.9M documents (a Wikipedia dump) on my MacBook Pro
     in just a few minutes (< 10 as I remember). So you can
     "just try things".
2> use a Master/Slave architecture. You can control how fast
     the updates are available by the polling interval on the slave
     and how fast you commit. 2 hours is easy. 10 minutes is
     a reasonable goal here (see the replication sketch after this list).
3> Consider edismax-style handlers. The point here is that
     they allow you to tune relevance much more finely than
     a "bag of words" approach in which you index many fields
     into a single text field (sketched after this list).
4> You only really need to store the fields you intend to display
      as part of your search results. Assuming you're going to
      your system-of-record for the full document, your stored
      data may be very small.
5> Be aware that the first few queries will often be much slower
     than later queries, as there are certain caches that need to
     be filled up. See the various warming parameters on the
     caches and the "firstSearcher" and "newSearcher" entries
     in the config files (example after this list).
6> Create a mix of queries and use something like jMeter or
     SolrMeter to determine where your target hardware
     falls down. You have to take some care to create a reasonable
     query set, not just the same query over and over or you'll
     just get cached results. Fire enough queries at the searcher
     that it starts to perform poorly and tweak from there.
7> Really, really get familiar with two things:
    a> the admin/analysis page for understanding the analysis
          process.
    b> adding &debugQuery=on to your queries when you don't
         understand what's happening. In particular, that will show
         you the parsed queries; you can defer digging into the
         scoring explanations until later.
8> string types aren't what you want very often. They're really
     suitable for things like IDs, serial numbers, etc. But they are
     NOT tokenized. So if your input is "some stuff" and you search
     for "stuff", you won't get a match. This often confuses people.
     For tokenized processing, you'll probably want one of the
     "text" variants. String types are even case sensitive... (see the
     fieldType sketch after this list).

But all in all, I don't see what you've described as particularly
difficult, although you'll doubtlessly run into things you don't
expect.

Hope that helps
Erick
