Posted to solr-user@lucene.apache.org by Colin R <co...@dasmail.co.uk> on 2014/03/19 11:55:46 UTC
Newbie Question: Master Index or 100s Small Index
We run a central database of 14M (and growing) photos with dates, captions,
keywords, etc.
We are currently upgrading from old Lucene servers to the latest Solr, running on a
couple of dedicated servers (6 core, 36GB, 500SSD). Planning on using Solr
Cloud.
We take in thousands of changes each day (big and small) so indexing may be
a bigger problem than searching.
My question is an architecture one.
These photos are currently indexed and searched in three ways.
1: The 14M pictures from above are split into a few hundred indexes that
feed a single website. This means index sizes of between 100 and 500,000
entries each.
2: 95% of these same photos are also wanted for searching on a global site.
Index size of 12M plus.
3: 80% of these same photos are also required for smaller group sites. Index
sizes of between 400K and 4M.
We currently make changes to the single indexes and then merge them into the group
and global indexes. Given the numbers involved, is it worth changing this or not?
Is it quicker/better to just have one big 14M index and filter the
complexities for each website, or is it better to still maintain hundreds of
indexes so we are searching smaller ones? Bear in mind, we get thousands of
changes a day PLUS very busy search servers.
Thanks
Col
--
View this message in context: http://lucene.472066.n3.nabble.com/Newbie-Question-Master-Index-or-100s-Small-Index-tp4125407.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Newbie Question: Master Index or 100s Small Index
Posted by Erick Erickson <er...@gmail.com>.
Oh My. 2(something) is ancient. I second your move
to scrap the current situation and start over. I'm
really curious what the _reasons_ for such a complex
setup are/were.
I second Toke's comments. This is actually
quite small by modern Solr/Lucene standards.
Personally I would index them all to a single index
and include something like a 'source' field that allows
you to restrict the returned documents with a filter
query (fq) clause.
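A minimal sketch of what such a filtered request could look like, built as a plain URL. The collection name "photos", the host, and the source values are assumptions for illustration; only the idea of a 'source' field and the fq parameter come from the advice above.

```python
from urllib.parse import urlencode

# Hypothetical names: the "photos" collection and the "source" field are
# illustrative; the thread only suggests "something like a 'source' field".
SOLR_SELECT = "http://localhost:8983/solr/photos/select"


def build_site_query(user_text, sources):
    """Return a /select URL that restricts results to the given sources."""
    # Quote each value so names containing spaces stay a single term.
    source_filter = " OR ".join('"{}"'.format(s) for s in sources)
    params = [
        ("q", user_text),                             # scored main query
        ("fq", "source:({})".format(source_filter)),  # filter, does not affect scoring
        ("wt", "json"),
    ]
    return SOLR_SELECT + "?" + urlencode(params)


url = build_site_query("sunset beach", ["agency_a", "agency_b"])
print(url)
```

Each website would then pass its own list of sources, so the hundreds of small indexes become hundreds of filter values against one index.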
Toke makes the point that you will get subtly different
search results because the tf/idf calculations are
slightly different across your entire corpus than
within various sub-sections, but I suspect that you
won't notice it. Test and see, you can change later.
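To get a feel for the size of that effect, here is a rough sketch using the idf formula from Lucene's classic TF-IDF similarity, 1 + ln(N / (df + 1)). The document counts for the term are made up for illustration and do not come from any real index.

```python
import math


def classic_idf(doc_freq, num_docs):
    """idf as in Lucene's classic TF-IDF similarity: 1 + ln(N / (df + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Made-up counts for one term, e.g. "harbour":
# common within one small source index, rare across the whole corpus.
idf_small = classic_idf(doc_freq=5_000, num_docs=50_000)
idf_global = classic_idf(doc_freq=20_000, num_docs=14_000_000)

print("idf in small index: %.2f" % idf_small)
print("idf in global index: %.2f" % idf_global)
```

The same term carries a different weight depending on which corpus the statistics come from, which is why relevance ordering can shift slightly. It makes no difference when results are sorted by date or filename.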
One thing to look at is the new hard/soft commit
distinction, see:
http://searchhub.org/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
The short form is you want to define your hard
autocommit to be fairly short (maybe 1 minute?)
with openSearcher=false for durability, and set your
soft commit to whatever latency you need for being
able to search the newly-added docs.
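As a sketch only: on Solr versions that provide the Config API (newer than the 4.x releases current when this was written), these settings can be expressed as a set-property payload. In solrconfig.xml the same values live in the autoCommit and autoSoftCommit elements. The soft commit interval below is an assumption; pick whatever latency you need.

```python
import json

# Interval values are assumptions for illustration.
HARD_COMMIT_MS = 60_000   # "maybe 1 minute" from the advice above
SOFT_COMMIT_MS = 5_000    # hypothetical visibility latency; pick your own

payload = {
    "set-property": {
        # Hard commit: flushes to disk for durability, no new searcher.
        "updateHandler.autoCommit.maxTime": HARD_COMMIT_MS,
        "updateHandler.autoCommit.openSearcher": False,
        # Soft commit: makes newly added docs visible to searches.
        "updateHandler.autoSoftCommit.maxTime": SOFT_COMMIT_MS,
    }
}

body = json.dumps(payload, indent=2)
print(body)
```

The printed JSON would be POSTed to the collection's /config endpoint; nothing is sent by this snippet.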
I don't know how you're feeding docs to Solr, but
if you're using the ExtractingRequestHandler,
you are
1> transmitting the entire document over the wire,
only to throw most of it away. I'm guessing your 1.5K
of data is just a few percent of the total file size.
2> putting the extraction work on the same
box running Solr.
If that machine is overloaded, consider moving the Tika
processing over to one or more clients and only
sending the data you actually want to index over to Solr.
See:
http://searchhub.org/2012/02/14/indexing-with-solrj/
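The linked article uses SolrJ; the idea of sending only the fields you want indexed can be sketched in Python as well. The field names and the sample record below are hypothetical, and the extraction step itself is assumed to have already happened on the client.

```python
import json

# Hypothetical field names; use whatever your schema defines.
INDEXED_FIELDS = ("id", "caption", "keywords", "date", "source")


def to_solr_doc(record):
    """Keep only the fields that should be indexed, dropping everything else."""
    return {k: record[k] for k in INDEXED_FIELDS if k in record}


# A made-up record as it might come out of client-side extraction.
record = {
    "id": "photo-000123",
    "caption": "Fishing boats at dawn",
    "keywords": ["harbour", "boats"],
    "date": "2014-03-19T06:12:00Z",
    "source": "agency_a",
    "raw_bytes": b"\xff\xd8...",   # the image itself never goes to Solr
}

# Solr's JSON update format accepts a list of documents.
update_body = json.dumps([to_solr_doc(record)])
print(update_body)
```

The resulting body is what a client would POST to the /update handler with a JSON content type, so only a kilobyte or so per photo crosses the wire.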
Best,
Erick
On Wed, Mar 19, 2014 at 7:02 AM, Colin R <co...@dasmail.co.uk> wrote:
> Hi Toke
>
> Our current configuration Lucene 2.(something) with RAILO/CFML app server.
>
> 10K drives, Quad Core, 16GB, Two servers. But the indexing and searching are
> starting to fail and our developer is no longer with us so it is quicker to
> rebuild than fix all the code.
>
> Our existing config is lots of indexes with merges into the larger ones.
>
> They are still running very fast but indexing is causing us issues.
>
> Thanks
Re: Newbie Question: Master Index or 100s Small Index
Posted by Colin R <co...@dasmail.co.uk>.
Hi Toke
Our current configuration is Lucene 2.(something) with a RAILO/CFML app server.
10K drives, Quad Core, 16GB, Two servers. But the indexing and searching are
starting to fail and our developer is no longer with us so it is quicker to
rebuild than fix all the code.
Our existing config is lots of indexes with merges into the larger ones.
They are still running very fast but indexing is causing us issues.
Thanks
Re: Newbie Question: Master Index or 100s Small Index
Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Wed, 2014-03-19 at 13:28 +0100, Colin R wrote:
> My question is really regarding index architecture. One big or many small
> (with merged big ones)
One difference is that having a separate index per collection gives you
better-ranked searches within each collection, because the term statistics
come from that collection alone. If you only use date/filename
sorting, that is of course irrelevant.
> In terms of bytes, each photo has a up to 1.5KB of data.
So about 20GB for the full index?
> Special requirements are search by date range, text, date range and text.
> Plus some boolean filtering. All results can be sorted by date or filename.
With no faceting, grouping or similar aggregating processing,
(re)opening of an index searcher should be very fast. The only thing
that takes a moment is the initial date or filename sorting. Asking for
minute-level data updates is thus very modest. With the information you
have given, you could aim for a few seconds.
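The requirements quoted above (date range, text, or both, sorted by date or filename) map onto ordinary Solr request parameters. A minimal sketch, assuming hypothetical field names "date" and "filename":

```python
from urllib.parse import urlencode


def build_search_params(text=None, date_from=None, date_to=None,
                        sort_field="date", descending=True):
    """Combine optional text and an optional date range into Solr parameters."""
    # Match everything when no text is given.
    params = [("q", text if text else "*:*")]
    if date_from or date_to:
        # "*" leaves that end of the range open.
        params.append(("fq", "date:[{} TO {}]".format(date_from or "*",
                                                      date_to or "*")))
    params.append(("sort", "{} {}".format(sort_field,
                                          "desc" if descending else "asc")))
    return urlencode(params)


print(build_search_params("harbour", "2014-01-01T00:00:00Z", "2014-03-19T23:59:59Z"))
print(build_search_params(sort_field="filename", descending=False))
```

Putting the date range in fq rather than q keeps it out of the relevance score and lets Solr cache it separately.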
None of the things you have said gives any cause for concern about
performance, and even though you have an existing system running and are
upgrading to a presumably faster one, you sound concerned. Do you
currently have performance problems, and if so, what is your current
hardware?
- Toke Eskildsen, State and University Library, Denmark
Re: Newbie Question: Master Index or 100s Small Index
Posted by Colin R <co...@dasmail.co.uk>.
Hi Toke
Thanks for replying.
My question is really regarding index architecture. One big or many small
(with merged big ones)
We probably get 5-10K photos added each day. Others are updated, some are
deleted.
Updates need to happen quite fast (e.g. within minutes of our databases
receiving them).
In terms of bytes, each photo has up to 1.5KB of data.
Special requirements are search by date range, text, date range and text.
Plus some boolean filtering. All results can be sorted by date or filename.
Re: Newbie Question: Master Index or 100s Small Index
Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Wed, 2014-03-19 at 11:55 +0100, Colin R wrote:
> We run a central database of 14M (and growing) photos with dates, captions,
> keywords, etc.
>
> We currently upgrading from old Lucene Servers to latest Solr running with a
> couple of dedicated servers (6 core, 36GB, 500SSD). Planning on using Solr
> Cloud.
What hardware are your past experiences based on? If it has fewer
cores, less memory and spinning drives, I foresee that your question
can be reduced to which architecture you prefer from a logistical point of
view, rather than performance.
> We take in thousands of changes each day (big and small) so indexing may be
> a bigger problem than searching.
Thousands of updates in a day is a very low number. Do you have hard
requirements for update time, perform heavy faceting or do anything
special for this to be a cause of concern?
> Is it quicker/better to just have one big 14M index and filter the
> complexities for each website or is it better to still maintain hundreds of
> indexes so we are searching smaller one.
All else being equal, a search in a specific small index will be faster
than filtering on the large one. But as we know, all else is never
equal. A 14M document index in itself is not really a challenge for
Lucene/Solr, but this depends a lot on your specific setup. How large is
the 14M index in terms of bytes?
> Bear in mind, we get thousands of changes a day PLUS very busy search servers.
How many queries/second are we talking about here? What is a typical
query (faceting, grouping, special processing...)?
Regards,
Toke Eskildsen, State and University Library, Denmark
Re: Newbie Question: Master Index or 100s Small Index
Posted by Shawn Heisey <so...@elyograg.org>.
On 3/19/2014 4:55 AM, Colin R wrote:
> My question is an architecture one.
>
> These photos are currently indexed and searched in three ways.
>
> 1: The 14M pictures from above are split into a few hundred indexes that
> feed a single website. This means index sizes of between 100 and 500,000
> entries each.
>
> 2: 95% of these same photos are also wanted for searching on a global site.
> Index size of 12M plus.
>
> 3: 80% of these same photos are also required for smaller group sites. Index
> sizes of between 400K and 4M.
>
> We currently make changes the single indexes and then merge into groups and
> global. Due to the size of the numbers, is it worth changing or not.
>
> Is it quicker/better to just have one big 14M index and filter the
> complexities for each website or is it better to still maintain hundreds of
> indexes so we are searching smaller one. Bear in mind, we get thousands of
> changes a day PLUS very busy search servers.
My primary use for Solr is an archive of 92 million documents, most of
which are photos. We have thousands of new photos every day. I haven't
been cleared to mention what company it's for.
This screenshot of my status servlet page answers tons of questions
about my index, but if you have additional questions, ask:
https://www.dropbox.com/s/6p1puq1gq3j8nln/solr-status-servlet.png
Here are some details about each host that you cannot see in the
screenshot: 6 SATA disks in RAID10 with 3TB of usable space. 64GB of
RAM. Dual quad-core Intel E54xx series CPUs. Chain A is running Solr
4.2.1 on Java 6, chain B is running Solr 4.6.1 on Java 7, with some
additional plugin software that increases the index size. There is one
Solr process per host, with a 6GB heap.
As long as you index fields that can be used to filter searches
according to what a user is allowed to see, I don't see any problem with
putting all of your data into one index. The main thing you'll want to be
sure of is that you have enough RAM to effectively cache your index.
Because you have SSD, you probably don't need to have enough RAM to
cache ALL of the index data, but it wouldn't hurt. With 36GB of RAM per
machine, you will probably have enough.
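A back-of-envelope check of that claim, using the figures given elsewhere in this thread (14M photos, up to 1.5KB each, 36GB per server). The 6GB heap is an assumption borrowed from the setup described above, and raw data size is only a rough stand-in for the on-disk index size, which depends on the schema.

```python
# Figures from the thread: 14M photos, up to 1.5KB each, 36GB RAM per server.
NUM_DOCS = 14_000_000
BYTES_PER_DOC = 1_500
RAM_GB = 36

# Assumption: a 6GB Java heap, borrowed from the setup described above.
HEAP_GB = 6

raw_data_gb = NUM_DOCS * BYTES_PER_DOC / 1e9   # decimal gigabytes
page_cache_gb = RAM_GB - HEAP_GB               # ignores OS and other processes

print("raw data: %.1f GB" % raw_data_gb)
print("RAM left for the OS disk cache: about %d GB" % page_cache_gb)
print("fits in cache:", raw_data_gb <= page_cache_gb)
```

Whatever memory is not taken by the Java heap is what the operating system can use to cache index files, so it is worth keeping the heap no larger than it needs to be.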
Thanks,
Shawn