
Posted to solr-user@lucene.apache.org by Colin R <co...@dasmail.co.uk> on 2014/03/19 11:55:46 UTC

Newbie Question: Master Index or 100s Small Index

We run a central database of 14M (and growing) photos with dates, captions,
keywords, etc. 

We are currently upgrading from old Lucene servers to the latest Solr, running
on a couple of dedicated servers (6 core, 36GB, 500SSD). We are planning on
using SolrCloud.

We take in thousands of changes each day (big and small) so indexing may be
a bigger problem than searching.

My question is an architecture one.

These photos are currently indexed and searched in three ways.

1: The 14M pictures from above are split into a few hundred indexes that
feed a single website. This means index sizes of between 100 and 500,000
entries each.

2: 95% of these same photos are also wanted for searching on a global site.
Index size of 12M plus.

3: 80% of these same photos are also required for smaller group sites. Index
sizes of between 400K and 4M.

We currently make changes to the single indexes and then merge into groups and
global. Given the size of the numbers, is it worth changing or not?

Is it quicker/better to just have one big 14M index and filter the
complexities for each website, or is it better to still maintain hundreds of
indexes so we are searching smaller ones? Bear in mind, we get thousands of
changes a day PLUS very busy search servers.

Thanks

Col



--
View this message in context: http://lucene.472066.n3.nabble.com/Newbie-Question-Master-Index-or-100s-Small-Index-tp4125407.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Newbie Question: Master Index or 100s Small Index

Posted by Erick Erickson <er...@gmail.com>.

Oh my. 2.(something) is ancient; I second your move
to scrap the current situation and start over. I'm
really curious what the _reasons_ for such a complex
setup are/were.

I second Toke's comments. This is actually
quite small by modern Solr/Lucene standards.

Personally I would index them all to a single index,
include something like a 'source' field that allowed
one to restrict the returned documents by a filter
query (fq) clause.
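
A minimal sketch of that single-index approach, in Python with only the
standard library. The host, the collection name "photos" and the field
names "source" and "caption" are assumptions for illustration; none of
them come from this thread. The code only builds the request URL and
does not contact a server.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: host and collection name are assumptions.
SOLR_SELECT = "http://localhost:8983/solr/photos/select"


def build_site_query(text, source, rows=10):
    """Build a select URL that searches the one big index but only
    returns documents whose (assumed) 'source' field matches a site."""
    params = [
        ("q", text),                     # the user's search terms
        ("fq", 'source:"%s"' % source),  # filter query restricting the site
        ("rows", str(rows)),
        ("wt", "json"),
    ]
    # urlencode percent-escapes the colon and the quotes for us.
    return SOLR_SELECT + "?" + urlencode(params)


url = build_site_query("caption:sunset", "site42")
print(url)
```

The site restriction lives in fq rather than in q, so it does not take
part in scoring. The sketch does not escape quotes inside the source
value; real code would need to.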

Toke makes the point that you will get subtly different
search results because the tf/idf calculations are
slightly different across your entire corpus than
within various sub-sections, but I suspect that you
won't notice it. Test and see, you can change later.

One thing to look at is the new hard/soft commit
distinction, see:
http://searchhub.org/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

The short form is that you want your hard autocommit
interval to be fairly short (maybe 1 minute?) with
openSearcher=false for durability, and your soft
commit interval set to whatever latency you need
for being able to search the newly-added docs.
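
As a rough illustration, the Python sketch below generates the matching
updateHandler fragment for solrconfig.xml. The 60000 ms hard commit
follows the "maybe 1 minute" suggestion above; the 5000 ms soft commit
is purely an assumed value and should be replaced by whatever search
latency is actually needed.

```python
import xml.etree.ElementTree as ET


def commit_settings(hard_ms=60000, soft_ms=5000):
    """Return an <updateHandler> element with hard and soft autocommit.

    hard_ms: hard commit interval (durability, no new searcher opened).
    soft_ms: soft commit interval (visibility); 5000 is an assumption.
    """
    handler = ET.Element("updateHandler",
                         {"class": "solr.DirectUpdateHandler2"})

    hard = ET.SubElement(handler, "autoCommit")
    ET.SubElement(hard, "maxTime").text = str(hard_ms)
    ET.SubElement(hard, "openSearcher").text = "false"

    soft = ET.SubElement(handler, "autoSoftCommit")
    ET.SubElement(soft, "maxTime").text = str(soft_ms)
    return handler


# Print the fragment so it can be compared with an existing solrconfig.xml.
print(ET.tostring(commit_settings(), encoding="unicode"))
```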

I don't know how you're feeding docs to Solr, but
if you're using the ExtractingRequestHandler,
you are
1> transmitting the entire document over the wire,
only to throw most of it away. I'm guessing your 1.5K
of data is just a few percent of the total file size.
2> you're putting the extraction work on the same
box running Solr.

If that machine is overloaded, consider moving the Tika
processing over to one or more clients and only
sending the data you actually want to index over to Solr.
See:
http://searchhub.org/2012/02/14/indexing-with-solrj/
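
The linked article uses SolrJ (Java). The sketch below shows the same
idea in Python instead: the client keeps only the fields it wants
indexed and prepares a JSON update request. The URL, collection name
and field names are assumptions, and the request is built but not sent.

```python
import json
from urllib.request import Request

# Hypothetical endpoint and field list; both are assumptions.
SOLR_UPDATE = "http://localhost:8983/solr/photos/update"
WANTED = ("id", "caption", "keywords", "date", "source")


def build_update_request(records):
    """Strip each record down to the wanted fields and wrap the result
    in a JSON POST request for Solr's update handler."""
    docs = [{k: r[k] for k in WANTED if k in r} for r in records]
    body = json.dumps(docs).encode("utf-8")
    return Request(SOLR_UPDATE, data=body,
                   headers={"Content-Type": "application/json"},
                   method="POST")


req = build_update_request([{
    "id": "p1",
    "caption": "Harbour at dusk",
    "date": "2014-03-19T11:55:46Z",
    "thumbnail": "large binary payload, never sent to Solr",
}])
# urllib.request.urlopen(req) would send it; left out of this sketch.
print(req.data.decode("utf-8"))
```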

Best,
Erick

On Wed, Mar 19, 2014 at 7:02 AM, Colin R <co...@dasmail.co.uk> wrote:
> Hi Toke
>
> Our current configuration is Lucene 2.(something) with RAILO/CFML app server.
>
> 10K drives, Quad Core, 16GB, Two servers. But the indexing and searching are
> starting to fail and our developer is no longer with us so it is quicker to
> rebuild than fix all the code.
>
> Our existing config is lots of indexes with merges into the larger ones.
>
> They are still running very fast but indexing is causing us issues.
>
> Thanks

Re: Newbie Question: Master Index or 100s Small Index

Posted by Colin R <co...@dasmail.co.uk>.

Hi Toke

Our current configuration is Lucene 2.(something) with RAILO/CFML app server.

10K drives, Quad Core, 16GB, Two servers. But the indexing and searching are
starting to fail and our developer is no longer with us so it is quicker to
rebuild than fix all the code.

Our existing config is lots of indexes with merges into the larger ones.

They are still running very fast but indexing is causing us issues.

Thanks




Re: Newbie Question: Master Index or 100s Small Index

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.

On Wed, 2014-03-19 at 13:28 +0100, Colin R wrote:
> My question is really regarding index architecture. One big or many small
> (with merged big ones)

One difference is that having a single index/collection gives you better
ranked searches within each collection. If you only use date/filename
sorting, that is of course irrelevant.

> In terms of bytes, each photo has up to 1.5KB of data.

So about 20GB for the full index?
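
The arithmetic behind that estimate, as a small Python sketch. It treats
1.5KB as an upper bound per photo and measures raw field data only; the
real on-disk index size depends on analysis and stored fields, so this
is a ballpark under those assumptions.

```python
photos = 14_000_000
bytes_per_photo = 1.5 * 1024  # "up to 1.5KB" per photo, taken as 1536 bytes

total_bytes = photos * bytes_per_photo
total_gib = total_bytes / 1024 ** 3  # convert bytes to GiB

print(f"{total_gib:.1f} GiB of raw data at most")
```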

> Special requirements are search by date range, text, date range and text.
> Plus some boolean filtering. All results can be sorted by date or filename.

With no faceting, grouping or similar aggregating processing,
(re)opening of an index searcher should be very fast. The only thing
that takes a moment is the initial date or filename sorting. Asking for
minute-level data updates is thus very modest. With the information you
have given, you could aim for a few seconds.

None of the things you have said gives any cause for concern about
performance, and even though you have an existing system running and are
upgrading to a presumably faster one, you sound concerned. Do you
currently have performance problems, and if so, what is your current
hardware?

- Toke Eskildsen, State and University Library, Denmark

Re: Newbie Question: Master Index or 100s Small Index

Posted by Colin R <co...@dasmail.co.uk>.

Hi Toke

Thanks for replying.

My question is really regarding index architecture. One big or many small
(with merged big ones)

We probably get 5-10K photos added each day. Others are updated, some are
deleted.

Updates need to happen quite fast (e.g. within minutes of our Databases
receiving them).

In terms of bytes, each photo has up to 1.5KB of data.

Special requirements are search by date range, text, date range and text.
Plus some boolean filtering. All results can be sorted by date or filename.




Re: Newbie Question: Master Index or 100s Small Index

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.

On Wed, 2014-03-19 at 11:55 +0100, Colin R wrote:
> We run a central database of 14M (and growing) photos with dates, captions,
> keywords, etc. 
> 
> We are currently upgrading from old Lucene servers to the latest Solr, running
> on a couple of dedicated servers (6 core, 36GB, 500SSD). We are planning on
> using SolrCloud.

What hardware are your past experiences based on? If it has fewer
cores, less memory and spinning drives, I foresee that your question
can be reduced to which architecture you prefer from a logistical point of
view, rather than performance.

> We take in thousands of changes each day (big and small) so indexing may be
> a bigger problem than searching.

Thousands of updates in a day is a very low number. Do you have hard
requirements for update time, do you perform heavy faceting, or do you do
anything special that would make this a cause for concern?

> Is it quicker/better to just have one big 14M index and filter the
> complexities for each website, or is it better to still maintain hundreds of
> indexes so we are searching smaller ones?

All else being equal, a search in a specific small index will be faster
than filtering on the large one. But as we know, all else is never
equal. A 14M document index in itself is not really a challenge for
Lucene/Solr, but this depends a lot on your specific setup. How large is
the 14M index in terms of bytes?

> Bear in mind, we get thousands of changes a day PLUS very busy search servers.

How many queries/second are we talking about here? What is a typical
query (faceting, grouping, special processing...)?

Regards,
Toke Eskildsen, State and University Library, Denmark

Re: Newbie Question: Master Index or 100s Small Index

Posted by Shawn Heisey <so...@elyograg.org>.

On 3/19/2014 4:55 AM, Colin R wrote:
> My question is an architecture one.
>
> These photos are currently indexed and searched in three ways.
>
> 1: The 14M pictures from above are split into a few hundred indexes that
> feed a single website. This means index sizes of between 100 and 500,000
> entries each.
>
> 2: 95% of these same photos are also wanted for searching on a global site.
> Index size of 12M plus.
>
> 3: 80% of these same photos are also required for smaller group sites. Index
> sizes of between 400K and 4M.
>
> We currently make changes to the single indexes and then merge into groups and
> global. Given the size of the numbers, is it worth changing or not?
>
> Is it quicker/better to just have one big 14M index and filter the
> complexities for each website, or is it better to still maintain hundreds of
> indexes so we are searching smaller ones? Bear in mind, we get thousands of
> changes a day PLUS very busy search servers.

My primary use for Solr is an archive of 92 million documents, most of 
which are photos.  We have thousands of new photos every day.  I haven't 
been cleared to mention what company it's for.

This screenshot of my status servlet page answers tons of questions 
about my index, but if you have additional questions, ask:

https://www.dropbox.com/s/6p1puq1gq3j8nln/solr-status-servlet.png

Here are some details about each host that you cannot see in the
screenshot: 6 SATA disks in RAID10 with 3TB of usable space.  64GB of
RAM.  Dual quad-core Intel E54xx series CPUs.  Chain A is running Solr
4.2.1 on Java 6, chain B is running Solr 4.6.1 on Java 7, with some
additional plugin software that increases the index size.  There is one
Solr process per host, with a 6GB heap.

As long as you index fields that can be used to filter searches
according to what a user is allowed to see, I don't see any problem with
putting all of your data into one index.  The main thing you'll want to be
sure of is that you have enough RAM to effectively cache your index.
Because you have SSD, you probably don't need to have enough RAM to
cache ALL of the index data, but it wouldn't hurt.  With 36GB of RAM per
machine, you will probably have enough.

Thanks,
Shawn